Optimizing Shell Scripts
I'll be honest, I'm no expert on optimizing shell scripts, and I'm hoping readers will chime in with their own tips and experiences. That said, I do have a few tricks up my sleeve from hands-on experience optimizing code in other languages.
Use time to get a baseline
Any performance testing normally starts with a baseline; it's hard to tell which direction you need to go when you don't know where you started. Using the Linux time command, you can get a baseline against which to track any performance improvements.
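For example, timing a script before touching anything (myscript.sh here is just a stand-in for whatever you're working on):

# record a baseline; run it a few times, since caching and
# system load can skew a single measurement
time ./myscript.sh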
Consider changing types of loops
I'm not sure how much of a difference this makes in bash, but it can make a huge difference in other languages. It's also important to make sure no operation inside the loop is repeated needlessly, because it will be re-run on every iteration (there's a sketch of this after the timings below).
time for i in `seq 1 10000`; do echo $i; done
real 0m0.410s  user 0m0.327s  sys 0m0.077s

time seq 1 10000 | while read i; do echo $i; done
real 0m0.626s  user 0m0.472s  sys 0m0.164s
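As a quick made-up example of the second point, this date call doesn't depend on the loop variable, so there's no reason to run it 10000 times:

# slow: date is re-run on every iteration
for i in `seq 1 10000`; do echo "$(date +%F) $i"; done

# faster: compute the unchanging value once, outside the loop
today=$(date +%F)
for i in `seq 1 10000`; do echo "$today $i"; done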
Remove unneeded output
A loop that echoes each number versus one that just runs true:

time for i in `seq 1 100000`; do echo $i; done
real 0m3.172s  user 0m2.480s  sys 0m0.502s

time for i in `seq 1 100000`; do true; done
real 0m1.105s  user 0m1.087s  sys 0m0.014s
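If you can't drop the output entirely, redirecting it away from the terminal usually still helps, since writing to a terminal is comparatively slow (some_command is a placeholder):

# discard stdout and stderr rather than printing them to the screen
some_command > /dev/null 2>&1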
Backgrounding
If you have processes that may take a while, you can put them in the background while you perform other operations. Whether this helps depends on the situation; it only pays off if there is other useful work to do while you wait.
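A minimal sketch of the idea, with slow_task and other_work standing in for your own commands:

# start the slow job in the background
slow_task &

# get on with other work while it runs
other_work

# wait blocks until all background jobs have finished
wait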
Change Shell
csh:  real 0m0.409s  user 0m0.340s  sys 0m0.065s
zsh:  real 0m0.408s  user 0m0.324s  sys 0m0.078s
ksh:  real 0m0.21s   user 0m0.05s   sys 0m0.01s
Dash: real 0m0.409s  user 0m0.328s  sys 0m0.078s
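If a script sticks to POSIX features, trying a lighter shell can be as simple as changing the shebang. A sketch, assuming dash is installed at /bin/dash and the script avoids bash-only features (arrays, [[ ]], and so on):

#!/bin/dash
# same script, lighter interpreter
echo hello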
Remove unneeded comments and lines
Most compiled languages strip comments so that they don't appear in the binary; since bash is an interpreted language, this is not the case. If there is a huge number of comments in a script, it can cause some sluggishness as the shell reads each line.
Use sed to remove them: keep another copy or branch for development and distribute the de-commented version.
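Something along these lines should do it, assuming GNU sed and full-line comments only (it keeps the shebang, but won't handle trailing comments or a # inside a string):

# delete whole-line comments (except line 1, the shebang) and blank lines
sed '2,${/^[[:space:]]*#/d}; /^[[:space:]]*$/d' script.sh > script.stripped.sh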
Database Vs. Files
Using a database versus flat files can give you a performance boost. Think about it: writing files takes up processing time, and it then takes time to read those files back. Processing and performing lookups on data in text files is slow; a database such as MySQL or SQLite can return the same results much faster, especially for lookups.
Inserting into SQLite can take longer than writing to a file (at least the way I was doing it) which may or may not be a problem depending on what you’re trying to do. The resulting file is also larger than a plain text file, which I assume is due to indexes.
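For reference, here's roughly how test data like this could be loaded; I'm assuming the sqlite3 shell here, and the table name, column names and file name (test, one, two, bash.txt) are simply taken from the queries further down:

sqlite3 sqlite.db <<'EOF'
CREATE TABLE test (one INTEGER, two TEXT);
.mode list
.separator |
.import bash.txt test
EOF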
Here is an example of reading from a text file versus reading from sqlite:
sqlite sqlite.db "select * from test";
While this may not seem significant, try selecting lines 1, 10, 100, 150, 200 and 9000-10000, with the output delimited by pipes.
Commands Used:
time awk 'NR==1;NR==10;NR==100;NR==150;NR==200;NR==9000,NR==10000;' bash.txt | sed 's/ /\|/'
time sqlite sqlite.db "SELECT * from test WHERE one IN (1,10,100,150,200) OR one BETWEEN 9000 AND 10000"
AWK:    real 0m0.023s  user 0m0.020s  sys 0m0.004s
SQLite: real 0m0.019s  user 0m0.020s  sys 0m0.000s
Not much of a difference here, but if you're working with real-world data and need to select certain rows, it can make a huge difference. Take the example of looking for the string Hello within 10000 rows:
time grep Hello bash.txt
time sqlite sqlite.db "SELECT * from test where two LIKE '%Hello%'"
grep:   real 0m0.157s  user 0m0.024s  sys 0m0.028s
SQLite: real 0m0.065s  user 0m0.000s  sys 0m0.048s
Use the right tool for the job
You can drive a screw with a hammer, but a screwdriver or drill will work much better. Using the correct tool is key; knowing which tool is best for which purpose is the tricky part.
Take a look at cut vs sed vs awk:
All three read from the same apt-cache output:

time apt-cache search python | cut -d ' ' -f 1
real 0m0.766s  user 0m0.692s  sys 0m0.040s

time apt-cache search python | awk '{print $1}'
real 0m0.759s  user 0m0.680s  sys 0m0.052s

time apt-cache search python | sed 's/\(.*\?\)\ \-\ \(.*\)/\1/'
real 0m0.864s  user 0m0.804s  sys 0m0.012s
Perhaps this one isn't fair: the sed expression doesn't do it properly. If someone with uber sed-fu can do it a better way, let me know in the comments and I'll benchmark it. This is the closest to awk and cut that I came up with, so that's what's represented for now.
Use Better Syntax
Passing the file name straight to the command avoids an extra cat process and a pipe:

wc -l file.txt
real 0m0.009s  user 0m0.000s  sys 0m0.008s

cat file.txt | wc -l
real 0m0.018s  user 0m0.001s  sys 0m0.025s
The same goes for a lot of other programs including grep:
grep test file.txt
real 0m0.009s

cat file.txt | grep test
real 0m0.017s
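The general pattern: if a command can open the file itself (or read from a redirect), skip the cat. A couple of made-up examples:

awk '{print $1}' file.txt      # rather than: cat file.txt | awk '{print $1}'
sort < file.txt                # rather than: cat file.txt | sort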
Process in Parallel
If performance is a huge concern, you could consider performing operations in parallel. Projects like distcc achieve awesomely fast compile times by distributing the load over a number of hosts, so you'd think the same kind of technique would give shell scripts a considerable boost. From my testing it produces varying results; you can use the techniques outlined here: http://pebblesinthesand.wordpress.com/2008/05/22/a-srcipt-for-running-processes-in-parallel-in-bash/ to see if your scripts benefit from parallel processing.
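This isn't the linked technique, but as a simple sketch of the idea, xargs with -P (GNU or BSD xargs) can fan work out over several processes; slow_task and the .dat files are placeholders:

# run slow_task on each .dat file, at most 4 jobs at a time
printf '%s\0' *.dat | xargs -0 -n 1 -P 4 ./slow_task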
Consider Changing Languages
This may be sacrilegious to die-hard scripters, but I've written before about when not to script it. That post asked why you'd make something so complex when a shell script can do it; when performance matters is the answer.
Simple test to echo hello 10,000 times
C++:    real 0m0.039s  user 0m0.004s  sys 0m0.024s
Bash:   real 0m0.202s  user 0m0.144s  sys 0m0.040s   (time for i in `seq 1 10000`; do echo Hello; done)
Python: real 0m0.058s  user 0m0.032s  sys 0m0.024s
PHP:    real 0m0.057s  user 0m0.028s  sys 0m0.016s
Perl:   real 0m0.043s  user 0m0.000s  sys 0m0.028s
Java:   real 0m0.212s  user 0m0.124s  sys 0m0.052s