
Optimizing Shell Scripts

Filed under: Shell Script Sundays — TheLinuxBlog.com at 6:30 pm on Sunday, January 23, 2011


I’ll be honest: I’m no expert on optimizing shell scripts, and I’m hoping that readers will chime in with their own tips and experiences. That said, I do have a few tricks up my sleeve from hands-on experience with code optimization in other languages.

Use time to get a baseline

Any performance testing normally starts with a baseline; it is hard to tell which direction you need to go when you don’t know where you started. Using the Linux time command, you can get a baseline against which to track any performance improvements.
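
For example, a minimal way to baseline a script (myscript.sh here is just a placeholder for whatever you are measuring):

time ./myscript.sh

# average several runs for less noisy numbers
for run in 1 2 3; do time ./myscript.sh; done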

Consider changing types of loops

I’m not sure how much of a difference this makes in bash, but it can make a huge difference in other languages. It is also important to make sure there is no repetition of operations within the loop condition, because that work too gets repeated on every iteration.

time for i in `seq 1 10000`; do echo $i; done;
real    0m0.410s
user    0m0.327s
sys     0m0.077s

time seq 1 10000 | while read i; do echo $i; done;
real    0m0.626s
user    0m0.472s
sys     0m0.164s
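
Another variant worth timing: bash’s brace expansion and C-style for loop avoid spawning seq as a separate process at all. A quick sketch:

# brace expansion, handled by bash itself
time for i in {1..10000}; do echo $i; done
# C-style arithmetic loop, also a bash builtin construct
time for ((i=1; i<=10000; i++)); do echo $i; done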

Remove unneeded output

time for i in `seq 1 100000`; do echo $i; done;
real    0m3.172s
user    0m2.480s
sys     0m0.502s

time for i in `seq 1 100000`; do true; done;
real    0m1.105s
user    0m1.087s
sys     0m0.014s
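
If the loop body has to run but nobody needs to see its output, redirecting the whole loop to /dev/null captures most of the same saving without deleting the echo:

time for i in `seq 1 100000`; do echo $i; done > /dev/null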

Backgrounding

If you have processes that may take a while, you can put them in the background while you perform other operations. This may or may not help, depending on the situation.
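
A minimal sketch, where slow_task and do_other_work are placeholders for your own commands:

slow_task &        # start the long-running job in the background
bgpid=$!           # remember its process ID
do_other_work      # get on with something else meanwhile
wait $bgpid        # block until the background job finishes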

Change Shell

csh:
real    0m0.409s
user    0m0.340s
sys     0m0.065s

zsh:
real    0m0.408s
user    0m0.324s
sys     0m0.078s

ksh:
real    0m0.21s
user    0m0.05s
sys     0m0.01s

Dash:
real    0m0.409s
user    0m0.328s
sys     0m0.078s
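
To reproduce this kind of comparison yourself, save the test loop to a file and invoke it under each interpreter (loop.sh is a placeholder name):

time bash loop.sh
time dash loop.sh
time ksh loop.sh
time zsh loop.sh
# csh has a different loop syntax, so it needs its own version of the script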

Remove unneeded Comments and lines

Most compiled languages discard comments so that they don’t appear in the binary; since bash is an interpreted language, this is not the case. If there is a huge number of comments in a script, it can cause some sluggishness as the interpreter reads each line.

Use sed to remove them: keep another copy or branch for development, and distribute the de-commented version.
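
A minimal sketch of this, assuming GNU sed and full-line comments only (inline trailing comments would need a more careful expression); it drops comment and blank lines but keeps the shebang on line 1:

sed -e '1!{/^[[:space:]]*#/d;}' -e '/^[[:space:]]*$/d' script.sh > script.stripped.sh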

Database vs. Files

Using a database versus flat files can give you a performance boost. Think about it: writing files takes processing time, and it then takes more time to read those files back. Processing and performing lookups on data in text files is slow; using a database such as MySQL or SQLite can get you the same results faster.

Inserting into SQLite can take longer than writing to a file (at least the way I was doing it), which may or may not be a problem depending on what you’re trying to do. The resulting database file is also larger than a plain text file, which I assume is due to indexes.
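
If insert speed is the bottleneck, one well-known SQLite trick is to wrap the inserts in a single transaction so they share one disk sync; a sketch, assuming the two-column test table used below:

sqlite sqlite.db <<'EOF'
BEGIN TRANSACTION;
INSERT INTO test VALUES (1, 'Hello');
INSERT INTO test VALUES (2, 'World');
COMMIT;
EOF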

Here is an example of reading from a text file (a plain cat) versus reading from SQLite:

cat bash.txt
sqlite sqlite.db "select * from test";

While this may not seem significant, try selecting lines 1, 10, 100, 150, 200, and 9000-10000, delimited by pipes.

Commands Used:

time awk 'NR==1;NR==10;NR==100;NR==150;NR==200;NR==9000,NR==10000;' bash.txt | sed 's/ /|/'
time sqlite sqlite.db "SELECT * from test WHERE one IN (1,10,100,150,200) OR one BETWEEN 9000 AND 10000"

awk:
real    0m0.023s
user    0m0.020s
sys     0m0.004s

SQLite:
real    0m0.019s
user    0m0.020s
sys     0m0.000s

Not much of a difference here, but if you’re working with real-world data and need to select certain rows, it can make a huge difference. Take the example of looking for the string Hello within 10,000 rows:

time grep Hello bash.txt
time sqlite sqlite.db "SELECT * from test where two LIKE '%Hello%'"

grep:
real    0m0.157s
user    0m0.024s
sys     0m0.028s

SQLite:
real    0m0.065s
user    0m0.000s
sys     0m0.048s

Use the right tool for the job

You can try to drive a screw with a hammer, but a screwdriver or drill will work much better. Using the correct application is key; knowing which tool is best for which purpose is the tricky part.

Take a look at cut vs sed vs awk:

time apt-cache search python | cut -d \  -f 1
real    0m0.766s
user    0m0.692s
sys     0m0.040s

time apt-cache search python | awk '{print $1}'
real    0m0.759s
user    0m0.680s
sys     0m0.052s

time apt-cache search python | sed 's/\(.*\?\)\ \-\ \(.*\)/\1/'
real    0m0.864s
user    0m0.804s
sys     0m0.012s

Perhaps this one isn’t fair: the sed expression doesn’t do it properly. If someone with uber sed-fu can do it a better way, let me know in the comments and I’ll benchmark it. This is the closest to awk and cut that I came up with, so it is what’s represented for now.

Use Better Syntax

Passing the filename straight to the program avoids an unnecessary cat process and an extra trip through a pipe:

time wc -l file.txt
real    0m0.009s
user    0m0.000s
sys     0m0.008s

time cat file.txt | wc -l
real    0m0.018s
user    0m0.001s
sys     0m0.025s

The same goes for a lot of other programs, including grep:

time grep test file.txt
real    0m0.009s

time cat file.txt | grep test
real    0m0.017s
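
The same principle applies to command substitution: bash can read a file into a variable without spawning cat at all.

contents=$(cat file.txt)   # spawns an extra cat process
contents=$(< file.txt)     # bash redirection builtin, no extra process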

Process in Parallel

If performance is a huge concern, you could consider performing operations in parallel. Projects like distcc achieve awesomely fast compile times by distributing load over a number of hosts, and you’d think that using this kind of technique with shell scripts would result in a considerable performance boost. From my testing it produces varying results; you can use the techniques outlined here: http://pebblesinthesand.wordpress.com/2008/05/22/a-srcipt-for-running-processes-in-parallel-in-bash/ to see if your scripts can benefit from parallel processing.
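
One simple way to experiment is GNU xargs with -P, which caps how many jobs run at once; a sketch, where process_one.sh stands in for a hypothetical per-item worker script:

# run up to 4 copies of the worker in parallel, one file each
find . -name '*.log' -print0 | xargs -0 -n 1 -P 4 ./process_one.sh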

Consider Changing Languages

This may be sacrilegious to die-hard scripters, but I’ve written before about when not to script it. That post was more along the lines of “why make something so complex when a shell script can do it?” When performance matters is the answer.

A simple test: echo Hello 10,000 times.

C++:
real    0m0.039s
user    0m0.004s
sys     0m0.024s

Bash (time for i in `seq 1 10000`; do echo Hello; done;):
real    0m0.202s
user    0m0.144s
sys     0m0.040s

Python:
real    0m0.058s
user    0m0.032s
sys     0m0.024s

PHP:
real    0m0.057s
user    0m0.028s
sys     0m0.016s

Perl:
real    0m0.043s
user    0m0.000s
sys     0m0.028s

Java:
real    0m0.212s
user    0m0.124s
sys     0m0.052s
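
You don’t always have to leave the shell entirely, either: delegating the hot loop to a single awk process is often a reasonable middle ground (a sketch of the same test):

time awk 'BEGIN { for (i = 1; i <= 10000; i++) print "Hello" }'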

Man Pages for commands in this post:

time
sed
awk
grep

3 Comments

Comment by georges

January 24, 2011 @ 11:08 am

Thanks for that post!
Another thing I’ve done before when optimizing code is use an analyzer that tells me which part of the code takes the most time.
Often you find that 80% of the time is spent on 20% of the code.
Then you focus on optimizing that part of the code. When you’re done optimizing that, redo the analysis, and again focus on the top routine.

For the C language, the tools were called pixie and prof. I don’t know the equivalents in shell languages, though.

You also need to remove human interaction from a benchmark to get reliable results, i.e. do not take input from the keyboard, but from a text file.

Also, you need a reference output result. Then after each run, compare your output to the reference one. If the outputs match, then you can compare times and make decisions. Otherwise, fix the bug you just introduced and start again.

Comment by TheLinuxBlog.com

January 24, 2011 @ 11:10 am

Great tips!


Comment by sathish

January 21, 2013 @ 2:36 am

It’s really good information.

