Shell scripts come in handy for processing or reformatting data that is available on the web. Plenty of tools can automate the fetching, so you don't have to download each page by hand.
The first two programs I’m demonstrating for fetching are links and lynx. They are both shell browsers, meaning that they need no graphical user interface to operate.
Curl is a program that is used to transfer data to or from a server. It supports many protocols, but for the purpose of this article I will only be showing HTTP.
The last method (shown in other blog posts) is wget. wget also fetches files over many protocols. The difference between curl and wget is that curl by default dumps the data to stdout, whereas wget by default writes it to a local file named after the remote file.
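Each tool has an option that flips its default, so the difference is mostly one of convenience. Here is a minimal sketch, assuming both tools are installed. The function names and the FETCH_DRY_RUN switch are made up for this example and are not part of either program:

```shell
# Hypothetical helper: when FETCH_DRY_RUN=1, print the command instead of running it.
run() {
    if [ "${FETCH_DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# curl prints to stdout by default; -o writes to the named local file instead.
# $1 is the URL, $2 is the local filename.
curl_to_file() {
    run curl -s -o "$2" "$1"
}

# wget saves to a file by default; -O - sends the page to stdout instead.
# -q suppresses the progress output so only the page itself is printed.
wget_to_stdout() {
    run wget -q -O - "$1"
}
```

Setting FETCH_DRY_RUN=1 before calling either function shows the command that would be run without touching the network.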
Essentially the following do the exact same thing:
owen@linux-blog-:~$ lynx http://www.thelinuxblog.com -source > lynx-source.html
owen@linux-blog-:~$ links http://www.thelinuxblog.com -source > links-source.html
owen@linux-blog-:~$ curl http://www.thelinuxblog.com > curl.html
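Since the three commands are interchangeable for this purpose, a script can use whichever one happens to be installed. The following is only a sketch: fetch_source is a hypothetical name, and the order in which the tools are tried is an arbitrary choice of mine:

```shell
# Hypothetical helper: print the raw source of a URL on stdout,
# using the first fetching tool found on the PATH.
fetch_source() {
    if command -v curl >/dev/null 2>&1; then
        curl -s "$1"
    elif command -v lynx >/dev/null 2>&1; then
        lynx -source "$1"
    elif command -v links >/dev/null 2>&1; then
        links -source "$1"
    elif command -v wget >/dev/null 2>&1; then
        wget -q -O - "$1"
    else
        echo "no fetching tool found" >&2
        return 1
    fi
}
```

It would be used just like the commands above, for example: fetch_source http://www.thelinuxblog.com > source.html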
Apart from the shell browser interface, links and lynx also have some differences that may not be visible to the end user.
Both lynx and links can also render the HTML they receive as formatted plain text. The option for doing this is -dump. Each formats the result differently, so I would recommend using whichever one is easier for you to parse. Take the following:
owen@linux-blog-:~$ lynx -dump http://www.thelinuxblog.com > lynx-dump.html
owen@linux-blog-:~$ links -dump http://www.thelinuxblog.com > links-dump.html
owen@linux-blog-:~$ md5sum links-dump.html
owen@linux-blog-:~$ md5sum lynx-dump.html
The md5 sums differ, which indicates that the two dumps are not the same.
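A checksum only says whether two files differ, not where. To see the actual differences, cmp and diff can be combined. This is a rough sketch; compare_files is a made-up name, and limiting the output to ten lines is an arbitrary choice:

```shell
# Hypothetical helper: report whether two files have the same content,
# and show the start of the diff when they do not.
compare_files() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
        # Only the first ten lines of the diff, to keep the output short.
        diff "$1" "$2" | head -n 10
    fi
}
```

For the two dumps above it would be called as: compare_files lynx-dump.html links-dump.html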
wget does the same thing (as curl, links -source and lynx -source) but creates a local file named after the remote file, like so:
owen@linux-blog-:~$ wget http://www.thelinuxblog.com
Resolving www.thelinuxblog.com… 184.108.40.206
Connecting to www.thelinuxblog.com|220.127.116.11|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: unspecified [text/html]
[ <=> ] 41,045 162.48K/s
17:51:22 (162.33 KB/s) - `index.html' saved
Here is the result of running md5sum on all of the files in the directory:
owen@linux-blog-:~$ for i in *; do md5sum "$i"; done
Note: index.html is wget’s output.
Wherever the sums match, the output is the same.
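Spotting matching sums by eye gets tedious with more than a handful of files. Here is a small sketch that sorts the checksums and prints only the files that share a sum with at least one other file. The function names sum_dir and show_matches are hypothetical, and the approach assumes filenames without newlines:

```shell
# Hypothetical helper: print "checksum  filename" for every regular file
# in a directory (default: current directory), sorted by checksum.
sum_dir() {
    for f in "${1:-.}"/*; do
        [ -f "$f" ] && md5sum "$f"
    done | sort
}

# Hypothetical helper: print only the lines whose checksum occurs more than once.
show_matches() {
    sum_dir "${1:-.}" | awk '
        { count[$1]++; lines[NR] = $0; sums[NR] = $1 }
        END {
            for (i = 1; i <= NR; i++)
                if (count[sums[i]] > 1) print lines[i]
        }'
}
```

Running show_matches in the directory holding the fetched files lists the groups of files with identical content next to each other.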
What do I like to use?
Although all of the methods (excluding dump) produce the same results, I personally like to use curl because I am familiar with the syntax. It handles variables, cookies, encryption and compression extremely well. The user agent is easy to change. The last winning point for me is that it has a PHP extension, which is nice because it avoids making system calls to the other tools.
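To give an idea of what that looks like in practice, here is a sketch that sets a user agent, keeps cookies between requests and asks for a compressed response. The function name, the agent string, the cookies.txt filename and the FETCH_DRY_RUN switch are all placeholders of mine; the curl options themselves (-A, -b, -c, --compressed) are standard:

```shell
# Hypothetical helper: fetch a page the way a browser might.
# With FETCH_DRY_RUN=1 the command is printed instead of executed.
fetch_as_browser() {
    # Build the command line: -A sets the user agent, -b reads cookies
    # from a file, -c writes received cookies back to it, and
    # --compressed requests a compressed response and decompresses it.
    set -- curl -s \
        -A "Mozilla/5.0 (example agent)" \
        -b cookies.txt -c cookies.txt \
        --compressed \
        "$1"
    if [ "${FETCH_DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

As with the other methods, the output goes to stdout and can be redirected: fetch_as_browser http://www.thelinuxblog.com > curl.html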