Fetching Online Data From the Command Line
Shell scripts can come in handy for processing or re-formatting data that is available on the web. There are plenty of tools available to automate fetching pages instead of downloading each page individually.
The first two programs I'm demonstrating for fetching are links and lynx. Both are shell browsers, meaning they need no graphical user interface to operate.
curl is a program used to transfer data to or from a server. It supports many protocols, but for the purposes of this article I will only be showing HTTP.
The last method (shown in other blog posts) is wget, which also fetches files over many protocols. The difference between curl and wget is that curl by default dumps the data to stdout, whereas wget by default saves the file under the remote filename.
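Each default can be flipped with a single flag: wget's -O - writes to stdout, and curl's -o writes to a file of your choosing. A quick sketch (the output filenames here are just examples):

owen@linux-blog-:~$ wget -O - http://www.thelinuxblog.com > wget.html
owen@linux-blog-:~$ curl -o curl.html http://www.thelinuxblog.com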
Essentially, the following all do the exact same thing:
owen@linux-blog-:~$ lynx http://www.thelinuxblog.com -source > lynx-source.html
owen@linux-blog-:~$ links http://www.thelinuxblog.com -source > links-source.html
owen@linux-blog-:~$ curl http://www.thelinuxblog.com > curl.html
Apart from their browser interfaces, links and lynx also have some differences that may not be visible to the end user.
Both lynx and links can re-format the received HTML into rendered plain text; the option for doing this is -dump. Each formats the output differently, so I would recommend using whichever one is easier for you to parse. Take the following:
owen@linux-blog-:~$ lynx -dump http://www.thelinuxblog.com > lynx-dump.html
owen@linux-blog-:~$ links -dump http://www.thelinuxblog.com > links-dump.html
owen@linux-blog-:~$ md5sum links-dump.html
8685d0beeb68c3b25fba20ca4209645e  links-dump.html
owen@linux-blog-:~$ md5sum lynx-dump.html
beb4f9042a236c6b773a1cd8027fe252  lynx-dump.html
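Because -dump produces rendered text rather than raw HTML, it pipes nicely into the usual text tools. As a rough sketch (the search term here is just a placeholder), you could pull matching lines straight out of a page:

owen@linux-blog-:~$ lynx -dump http://www.thelinuxblog.com | grep -i "linux"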
The differing md5 sums show that the dumped output is not the same.
wget does the same thing as curl, links -source and lynx -source, but creates a local file with the remote filename, like so:
owen@linux-blog-:~$ wget http://www.thelinuxblog.com
--17:51:21--  http://www.thelinuxblog.com/
           => `index.html'
Resolving www.thelinuxblog.com... 72.9.151.51
Connecting to www.thelinuxblog.com|72.9.151.51|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [ <=> ] 41,045        162.48K/s

17:51:22 (162.33 KB/s) - `index.html' saved [41045]

owen@linux-blog-:~$ ls
index.html
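If you would rather pick the filename yourself, wget's -O flag overrides that default (wget.html here is just an example name):

owen@linux-blog-:~$ wget -O wget.html http://www.thelinuxblog.com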
Here is the result of running md5sum on all of the files in the directory:
owen@linux-blog-:~$ for i in $(ls); do md5sum $i; done;
a791a9baff48dfda6eb85e0e6200f80f  curl.html
a791a9baff48dfda6eb85e0e6200f80f  index.html
8685d0beeb68c3b25fba20ca4209645e  links-dump.html
a791a9baff48dfda6eb85e0e6200f80f  links-source.html
beb4f9042a236c6b773a1cd8027fe252  lynx-dump.html
a791a9baff48dfda6eb85e0e6200f80f  lynx-source.html
Note: index.html is wget's output.
Wherever the sums match, the output is the same.
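A quick way to spot the matches (a small convenience, not from the original run) is to sort the checksums so identical files group together:

owen@linux-blog-:~$ md5sum *.html | sort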
What do I like to use?
Although all of the methods (excluding -dump) produce the same results, I personally like to use curl because I am familiar with its syntax. It handles request variables, cookies, encryption and compression extremely well, and the user agent is easy to change. The last winning point for me is that PHP has a curl extension, which is nice because it avoids making system calls to the other tools.
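As a rough illustration of those points (the user-agent string and cookie-jar filename are just placeholders), a single curl invocation can set a custom user agent, keep cookies across requests and request compressed transfers:

owen@linux-blog-:~$ curl --compressed -A "MyCustomAgent/1.0" -b cookies.txt -c cookies.txt http://www.thelinuxblog.com > curl.html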