Shell scripts can come in handy for processing or re-formatting data that is available on the web, and there are plenty of tools for automating the fetching of pages instead of downloading each one by hand.
The first two programs I'm demonstrating for fetching are links and lynx. Both are shell browsers, meaning they need no graphical user interface to operate.
curl is a program used to transfer data to or from a server. It supports many protocols, but for the purposes of this article I will only be showing HTTP.
The last method (shown in other blog posts) is wget. wget also fetches files over many protocols. The difference between curl and wget is that curl by default dumps the data to stdout, whereas wget by default saves the file under the remote filename.
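As an aside, each tool can be coaxed into behaving like the other. A minimal sketch (curl's -O needs a URL that ends in a filename, and wget's -O - writes to stdout):

# Save under the remote filename with curl, wget-style:
curl -O http://www.thelinuxblog.com/index.html

# Dump the page to stdout with wget, curl-style (-q silences the progress output):
wget -q -O - http://www.thelinuxblog.com > wget-stdout.html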
Essentially, the following commands all do exactly the same thing:
owen@linux-blog-:~$ lynx http://www.thelinuxblog.com -source > lynx-source.html
owen@linux-blog-:~$ links http://www.thelinuxblog.com -source > links-source.html
owen@linux-blog-:~$ curl http://www.thelinuxblog.com > curl.html
Apart from the shell browser interface, links and lynx also have some differences that may not be visible to the end user. Both can re-format the HTML they receive into rendered plain text; the option for this is -dump. They each format the output differently, so I would recommend using whichever one is easier for you to parse. Take the following:
owen@linux-blog-:~$ lynx -dump http://www.thelinuxblog.com > lynx-dump.html
owen@linux-blog-:~$ links -dump http://www.thelinuxblog.com > links-dump.html
owen@linux-blog-:~$ md5sum links-dump.html
8685d0beeb68c3b25fba20ca4209645e links-dump.html
owen@linux-blog-:~$ md5sum lynx-dump.html
beb4f9042a236c6b773a1cd8027fe252 lynx-dump.html
The md5 sums show that the dumped output differs between the two.
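This rendered output is what makes the dump mode useful in scripts. As a minimal sketch, lynx can reduce a page to just its link list with -listonly, which is then easy to filter:

# Print only the References (link) section of the dump, then pull out the URLs:
lynx -dump -listonly http://www.thelinuxblog.com | grep -Eo 'https?://[^ ]+'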
wget does the same thing as curl, links -source, and lynx -source, but creates the local file with the remote filename, like so:
owen@linux-blog-:~$ wget http://www.thelinuxblog.com
--17:51:21-- http://www.thelinuxblog.com/
=> `index.html'
Resolving www.thelinuxblog.com... 72.9.151.51
Connecting to www.thelinuxblog.com|72.9.151.51|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
    [ <=> ] 41,045        162.48K/s
17:51:22 (162.33 KB/s) - `index.html' saved [41045]
owen@linux-blog-:~$ ls
index.html
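Since the point of all this is automation, these commands usually end up inside a loop. Here is a minimal sketch that fetches several pages in one pass; the page names are hypothetical:

# Fetch a few pages and save each under its own name (-q keeps the output clean):
for page in index about contact; do
    wget -q -O "${page}.html" "http://www.thelinuxblog.com/${page}"
done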
Here is the result of running md5sum on all of the files in the directory:
owen@linux-blog-:~$ for i in $(ls); do md5sum $i; done;
a791a9baff48dfda6eb85e0e6200f80f curl.html
a791a9baff48dfda6eb85e0e6200f80f index.html
8685d0beeb68c3b25fba20ca4209645e links-dump.html
a791a9baff48dfda6eb85e0e6200f80f links-source.html
beb4f9042a236c6b773a1cd8027fe252 lynx-dump.html
a791a9baff48dfda6eb85e0e6200f80f lynx-source.html
Note: index.html is wget's output.
Wherever the sums match, the output is identical.
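Incidentally, md5sum accepts multiple filenames, so the loop above can be shortened (and made safer, since parsing the output of ls breaks on filenames with spaces):

# Same result as the loop, without parsing ls:
md5sum *.html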
What do I like to use?
Although all of the methods (excluding -dump) produce the same results, I personally like to use curl because I am familiar with its syntax. It handles variables, cookies, encryption, and compression extremely well, and the user agent is easy to change. The last winning point for me is that it has a PHP extension, which is nice for avoiding system calls out to the other tools.
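For example, changing the user agent is a single flag and cookie handling is two more; the filenames here are just examples:

# Masquerade as a browser; -A sets the User-Agent header:
curl -A "Mozilla/5.0 (X11; Linux x86_64)" http://www.thelinuxblog.com > curl-ua.html

# Save cookies from one request (-c) and send them back with the next (-b):
curl -c cookies.txt http://www.thelinuxblog.com > first.html
curl -b cookies.txt http://www.thelinuxblog.com > second.html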