Linux Blog

Using BASH to sort a book collection. ISBN Data Mining – Part 2

Filed under: Shell Script Sundays — TheLinuxBlog.com at 8:49 pm on Sunday, September 23, 2007

Last week's article used a bit of data mining to grab information on ISBNs from Barnes and Noble and dumped it into text files. This week we'll take those text files and format the data for insertion into a MySQL database. This tutorial assumes that you are familiar with MySQL.

The data that was written to the text files was in the following format:

Title: DNS and BIND, Third Edition
Title:
Author(s): Paul Albitz, Mike Loukides (Editor)
Publisher: O’Reilly Media, Incorporated
Pub. Date: September 1998
ISBN: 1565925122
Format: Paperback, 480pp
Edition Number: 3
Edition Description: Older Edition
Customer Rating:

This can easily be parsed and formatted for insertion into a MySQL table.
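For example, since each field sits on its own labelled line, a single grep pulls one of them back out of the file from last week (saved as 1565925122.txt, the ISBN being the file name):

grep '^ISBN:' 1565925122.txt
# prints: ISBN: 1565925122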

First a database has to be created, and then a table structure has to be decided upon. Since this example already has labels for each field, I'll just use similar ones for the column names.
Create a database called book_info:

mysqladmin -u root create book_info;

and now create a table within the book_info database that is to contain all of the data:

CREATE TABLE `books` (
`ISBN` INT( 10 ) NOT NULL ,
`title` VARCHAR( 50 ) NOT NULL ,
`author` VARCHAR( 80 ) NOT NULL ,
`publisher` VARCHAR( 50 ) NOT NULL ,
`pub_date` VARCHAR( 50 ) NOT NULL ,
`format` VARCHAR( 30 ) NOT NULL ,
`edition` INT( 2 ) NOT NULL ,
INDEX ( `ISBN` )
) ENGINE = MYISAM ;

This isn't the best MySQL table structure ever, but it will do for the purposes of this article, and besides, it can always be tuned later.
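For instance, if ISBN-13s ever need to fit (they are thirteen digits and can begin with a zero), the ISBN column could be switched to a string type later on. A sketch of that kind of tune-up:

mysql -u root -e 'ALTER TABLE `book_info`.`books` CHANGE `ISBN` `ISBN` VARCHAR(13) NOT NULL;'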

With a directory full of the .txt files from the last issue of Shell Script Sundays, the following can be run to create a text file called bookQuery.sql:

for i in $(ls); do
echo "INSERT INTO \`book_info\`.\`books\` (\`ISBN\`, \`title\`, \`author\`, \`publisher\`, \`pub_date\`, \`format\`, \`edition\`) VALUES ('$(cat $i | grep ISBN:)', '$(cat $i | grep Title | sed 'q' | sed "s/'/\\\\'/")', '$(cat $i | grep Author\(s\): | sed "s/'/\\\\'/")', '$(cat $i | grep Publisher: | sed "s/'/\\\\'/")', '$(cat $i | grep Date: | sed "s/'/\\\\'/")', '$(cat $i | grep Format: | sed "s/'/\\\\'/")', '$(cat $i | grep Edition\ Number: | sed "s/'/\\\\'/")');" >> bookQuery.sql
done

In turn this file can be imported into the table that was created by running the following:

mysql -u root < bookQuery.sql
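A quick query confirms the rows made it into the table:

mysql -u root -e 'SELECT ISBN, title FROM book_info.books LIMIT 5;'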

What's happening is pretty simple: cat reads the file and grep finds the line of text we want to import. After that sed is used. On the title field it appears twice: the first sed ('q') keeps only the first Title line from the text file, and the second, as on every other field, escapes single quotes so they do not break the query. This example does not strip the labels out of each line, but that could easily be done with cut.
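For example, inside the same loop the title value could be trimmed down to just the text after the label, along these lines (a rough sketch assuming the same "Title:" layout):

grep '^Title:' $i | sed 'q' | cut -d: -f2- | sed 's/^ //'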

It's easy to import text files into MySQL with shell scripting, but the language I feel is best suited for this task is PHP. Sometime I'll go over how to do this with PHP.

Using BASH to sort a book collection. ISBN Data Mining – Part 1

Filed under: General Linux,Shell Script Sundays — TheLinuxBlog.com at 2:47 am on Sunday, September 16, 2007

Many problems can be solved with a little bit of shell scripting.
This week I plan to show you a script that does a little data mining from Barnes and Noble.
I have a lot of books and wanted cataloged information on them. Each book has a unique identifier called an ISBN. So I collected all of my ISBN numbers and wrote a simple loop that wraps around a script a friend of mine made to find basic information.
Here is his script:

#!/bin/bash
ISBN="$1"

function fetchInfo () {
  ### Using barnesandnoble.com to fetch info...
  lynx -source "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ISBN=${ISBN}" |\
  tr -d '[:cntrl:]' | sed 's/></>\n</g' | while read -a lineArray; do

    ### Parsing book title (the <h1> match is an assumption about the page markup).
    if [[ "${lineArray[0]}" == "<h1"* ]]; then
      echo "4|Title: ${lineArray[*]}" | sed 's/<[^>]*>//g;s/ ([^)]*)//g'

    ### Parsing book author.
    elif [ "$(echo ${lineArray[*]} | grep "id=\"contributor\"")" ]; then
      echo "3|Author(s): ${lineArray[*]}" | sed 's/by //;s/<[^>]*>//g'

    ### Parsing additional data (assumes the details sit in <li> items, skipping bullet/title lines).
    elif [[ "${lineArray[0]}" == "<li"* ]] &&
      [ "$(echo ${lineArray[*]} | grep -ve "bullet" -ve "title")" ]; then
      echo "1|${lineArray[*]}" | sed 's/<[^>]*>//g;s/:/: /;s/  / /'
    fi
  done | sort -ur | awk -F\| '{print $2}' | grep ":"
}

if [ "${#ISBN}" -ge "10" ]; then
  fetchInfo
fi

The script should be called as follows (assuming the script name is eBook.sh):

sh eBook.sh ISBNNUMBER

The first step checks whether the ISBN is at least 10 characters long; if it is, the fetchInfo() function is called.
fetchInfo takes the given ISBN number and searches the barnesandnoble.com site for any matches. To do this lynx is used; the -source option tells lynx to output the page source instead of opening its browsing mode. The output of lynx is piped to tr and sed: tr deletes all line breaks (control characters) from the source, and the sed expression adds a line break between adjacent HTML tags. The while loop then loops over each line produced by the piped lynx, tr and sed.
Within the loop is where anything from the output of the search page can be pulled out. This script pulls out the book title, the author and additional data.
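To see what the tr and sed stage produces, you can feed it a scrap of HTML by hand (the tag-splitting expression is how I read the technique, and GNU sed is assumed for the \n in the replacement):

echo '<div><h1>DNS and BIND</h1><li>ISBN: 1565925122</li></div>' | tr -d '[:cntrl:]' | sed 's/></>\n</g'

Each tag then starts on its own line, which is what lets the while loop pick the lines apart word by word.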

I formatted my ISBNs in a text list and used the following loop to fetch information on my books and save it with each ISBN as the file name.

for i in $(cat list.txt); do sh eBook.sh $i > $i.txt; done;
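If the list is long it may be worth pausing between requests so the site isn't hit too quickly; a slightly gentler version of the same loop might look like this:

for i in $(cat list.txt); do sh eBook.sh $i > $i.txt; sleep 2; done;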

In the next issue I plan to expand on this to format the data in an even more presentable manner.
Used applications
tr, lynx, sed, awk, sort, grep