Linux Blog

Fetching Online Data From Command Line

Filed under: Shell Script Sundays — at 6:12 pm on Sunday, December 2, 2007

Shell scripts come in handy for processing or re-formatting data that is available on the web. Several tools can automate the fetching of pages, so you don't have to download each page by hand.

The first two programs I’m demonstrating for fetching are links and lynx. They are both text-mode (shell) browsers, meaning that they need no graphical user interface to operate.

curl is a program used to transfer data to or from a server. It supports many protocols, but for the purpose of this article I will only be showing HTTP.

The last method (shown in other blog posts) is wget. wget also fetches files over many protocols. The difference between curl and wget is that curl by default dumps the data to stdout, whereas wget by default saves it to a local file named after the remote file.
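Either default can be flipped with a flag. The following is a minimal sketch, assuming both curl and wget are installed; the function names are my own placeholders and are not part of either tool.

```shell
#!/bin/sh
# Hypothetical helper names; $1 is the URL, $2 (where used) is the local filename.

# Print the page to stdout with either tool.
fetch_stdout_curl() { curl -s "$1"; }         # curl writes to stdout by default
fetch_stdout_wget() { wget -q -O - "$1"; }    # -O - sends wget's output to stdout

# Save the page to a named local file with either tool.
fetch_file_curl() { curl -s -o "$2" "$1"; }   # -o FILE writes to FILE instead of stdout
fetch_file_wget() { wget -q -O "$2" "$1"; }   # -O FILE overrides the default local name
```

For example, `fetch_stdout_wget http://example.com/ | head` would pipe a page straight into another command (example.com is only a stand-in URL). The -s and -q flags just silence the progress output so it doesn't mix with the data.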

Essentially the following do the exact same thing:

owen@linux-blog-:~$ lynx -source > lynx-source.html
owen@linux-blog-:~$ links -source > links-source.html
owen@linux-blog-:~$ curl > curl.html

Apart from their interfaces, links and lynx also differ in ways that may not be visible to the end user.
Both lynx and links can render the page they receive as formatted plain text instead of raw HTML. The option for this is -dump. Each formats the result differently, so I would recommend using whichever one is easier for you to parse. Take the following:

owen@linux-blog-:~$ lynx -dump > lynx-dump.html
owen@linux-blog-:~$ links -dump > links-dump.html
owen@linux-blog-:~$ md5sum links-dump.html
8685d0beeb68c3b25fba20ca4209645e  links-dump.html
owen@linux-blog-:~$ md5sum lynx-dump.html
beb4f9042a236c6b773a1cd8027fe252  lynx-dump.html

The differing MD5 sums show that the two dumps are not the same.
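Once a page has been dumped to text, ordinary text tools can pick it apart. The sketch below pulls the URLs out of a saved dump. It assumes the dump contains a numbered link list like the "References" section that lynx -dump prints, and a grep that supports -o (GNU and BSD grep do). The function name and the sample file contents are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print every http(s) URL found in a dumped text file,
# one per line, with duplicates removed.
dump_links() {
    grep -Eo 'https?://[^[:space:]]+' "$1" | sort -u
}

# Made-up sample resembling the tail of a lynx -dump; not real output from this post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
   Example page text with a link [1]here and [2]there.

References

   1. http://example.com/here
   2. http://example.com/there
   3. http://example.com/here
EOF

dump_links "$sample"
rm -f "$sample"
```

Run against the sample, this prints the two distinct URLs. On a real dump you would call it as `dump_links lynx-dump.html`.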

wget does the same thing (as curl, links -source and lynx -source) but will create a local file named after the remote file, like so:

owen@linux-blog-:~$ wget
=> `index.html'
Connecting to||:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[  <=>                                ] 41,045       162.48K/s
17:51:22 (162.33 KB/s) - `index.html' saved [41045]
owen@linux-blog-:~$ ls

Here is the result of running md5sum on all of the files in the directory:

owen@linux-blog-:~$ for i in *; do md5sum "$i"; done
a791a9baff48dfda6eb85e0e6200f80f  curl.html
a791a9baff48dfda6eb85e0e6200f80f  index.html
8685d0beeb68c3b25fba20ca4209645e  links-dump.html
a791a9baff48dfda6eb85e0e6200f80f  links-source.html
beb4f9042a236c6b773a1cd8027fe252  lynx-dump.html
a791a9baff48dfda6eb85e0e6200f80f  lynx-source.html

Note: index.html is wget’s output.
Wherever the sums match, the output is the same.
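Rather than scanning the sums by eye, the md5sum output can be sorted and grouped so that identical files end up under one heading. This is a rough sketch, assuming GNU md5sum's output format (a 32-character hash, two separator characters, then the filename) and filenames without newlines or backslashes. The function name is my own.

```shell
#!/bin/sh
# Hypothetical helper: list the given files grouped by identical MD5 sum.
group_by_sum() {
    md5sum "$@" | sort | awk '
        # Start a new group whenever the hash changes.
        $1 != prev { if (NR > 1) print ""; print $1 ":"; prev = $1 }
        # The filename starts after the 32-char hash plus two separator chars.
        { print "  " substr($0, 35) }
    '
}
```

Calling `group_by_sum *.html` in the directory above would put the four matching source files under one hash and each dump under its own.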

What do I like to use?
Although all of the methods (excluding -dump) produce the same results, I personally like to use curl because I am familiar with the syntax. It handles variables, cookies, encryption and compression extremely well. The user agent is easy to change. The last winning point for me is that it has a PHP extension, which avoids having to make system calls to the other tools.
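To illustrate a few of those features, here is a hedged sketch of a single curl call that sets the user agent, keeps cookies between runs and asks for a compressed response. The wrapper name, the user agent string and the default cookie file name are placeholders of mine; the flags themselves are standard curl options.

```shell
#!/bin/sh
# Hypothetical wrapper around curl; $1 is the URL, $2 an optional cookie file.
fetch_page() {
    url=$1
    jar=${2:-cookies.txt}   # placeholder default cookie file
    # -s           silent, no progress meter
    # -L           follow redirects
    # -A           set the User-Agent header (placeholder string below)
    # -b / -c      read cookies from, and write cookies back to, the same file
    # --compressed request a compressed response and decompress it on arrival
    curl -s -L \
         -A "my-script/0.1" \
         -b "$jar" -c "$jar" \
         --compressed \
         "$url"
}
```

Encryption needs no extra flag in the common case: passing an https:// URL is enough, provided curl was built with TLS support.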