SQL Database

One of the most common things Perl is used for when it comes to raw data files is importing them into databases. To import into databases in Perl, you need a special module. This is normally the DBI module, and you’ll access it by putting a “use” line at the top of your script. This tells Perl that you want to access an external module for more functionality.

use DBI;

$dbh = DBI->connect("dbi:SQLite:dbname=TESTDB");

Put this just below your first set of comments, and then run

./show

If it shows you your help as normal, then you have both DBI and SQLite installed on your system.

If it gives you the error “Can’t locate DBI.pm in ...”, you’ll need to install DBI and SQLite.

If it gives you the error “install_driver(SQLite) failed: Can't locate DBD/SQLite.pm in @INC...”, this means that you have DBI but you’ll need to install SQLite.

Installing from CPAN

Fortunately, Perl has an easy way to install modules. Unfortunately, it isn’t so easy if you aren’t the system administrator. If you aren’t the system administrator, you’ll need to ask your system administrator to install it for you. If you are, you can do it yourself.

Your system must have a compiler in order to install these Perl modules. You don’t have to know how to use it. Perl will handle that for you. But you will need it. Many Unix-like operating systems come with the compiler pre-installed. If you are using Mac OS X, you’ll need to install XCode from your install CD or DVD.

These instructions assume that your account is an administrative one, where “administrative” means that it can use the “sudo” command. Type the following from the command line to enter the Perl shell:

sudo perl -MCPAN -e shell

The first time you enter CPAN it is going to ask a whole bunch of questions. You can choose the default for all of them. Choose the default by pressing the return key. Note that some questions, such as “Fetching with Net::FTP:” will take a while. It is downloading information from the Comprehensive Perl Archive Network. This is a centrally distributed archive for Perl modules (among other things).

The ones you will have to answer are your continent, your country, and a list of URLs to download from. (Make sure you choose the URLs by number, and separate them by spaces. Just read the directions for that question when it comes up.)

Once cpan is done configuring itself, we can install the modules we need. If you need to install the DBI module, type:

install DBI

Press return after you type that. If you need SQLite, then, when DBI has finished installing type:

install DBD::SQLite

Perl will attempt to install what it needs for these modules. It will occasionally ask for permission to download prerequisites. Generally you’ll want to say no, but read http://sial.org/howto/perl/life-with-cpan/ for specifics. You shouldn’t need anything except DBI and DBD::SQLite for this section of the tutorial.

See http://search.cpan.org/dist/DBI/lib/DBD/DBM.pm for more information about the DBM section of the DBI module for Perl.

Using SQLite

Go ahead and remove that “$dbh = ...” line that we added for testing. Leave “use DBI;” there.

What we would like to do is import this data from its file into a database. First, add the switch:

} elsif ($switch eq "import") {
if ($importDB = shift) {
$importHandle = DBI->connect("dbi:SQLite:dbname=$importDB");
if (!$importHandle) {
print "Unable to open database $importDB: $!\n";
}
} else {
help("The import switch requires a database name to import to.");
}

The --import switch will require the name of a database. As soon as we get the database name, we try to open it:

$importHandle = DBI->connect("dbi:SQLite:dbname=$importDB");

This is a bit of a different format than we’ve seen before. DBI is an object. This is a special programming tool that contains code, called methods, that we can access on that object. One of the methods is the connect method. It connects to the database. Methods are very much like the subroutines we’ve already been using. They are almost always accessed using object->method(parameters).

If DBI can’t open the database, we exit and try to print an error message.

Add the help:

print "\t--import <database name>: import data into a named database\n";

You might go ahead and type “./show --import” and make sure that the help is displayed as expected.

Creating tables

Now we know how to open the database. SQL databases require tables. We need to create one if it does not already exist. Let’s go ahead and just make a subroutine to do this.

#expects a handle to a database connection, a name for a table, and an associative
#array of field names and field types passed as a reference
sub createTable {
my($dbHandle) = shift;
my($tableName) = shift;
my($tableFields) = shift;

my(@fieldCreators, $createQuery, $queryHandle, $fieldType, $fieldName);

@tables = $importHandle->tables();
if (!grep(/^"$dbTable"$/, @tables)) {
#construct query to create table
foreach $fieldName (keys %$tableFields) {
$fieldType = $$tableFields{$fieldName};
$fieldCreators[$#fieldCreators+1] = "$fieldName $fieldType";
}
$createQuery = "CREATE TABLE $tableName (" . join(", ", @fieldCreators) . ")";

#tell database to create the table
if ($queryHandle = $dbHandle->prepare($createQuery)) {
if (!$queryHandle->execute) {
print "Unable to execute $createQuery: $!\n";
}
} else {
print "Unable to prepare $createQuery: $!\n";
exit;
}
}
}

There are two sections to this subroutine. The first one creates the query that will create the table. The second one hands that query off to the database. When you use a SQL query in Perl, you usually hand it off in two steps: first, you prepare it on the handle. This creates a “statement handle” for the query statement. Then, you execute it from that statement handle. Since both of these handles are objects with methods, we use the handle->method syntax to prepare and to execute the query.

Pay close attention to the third parameter that this subroutine is expecting: it’s an associative array. We can’t pass associative arrays to subroutines in Perl. We can only pass scalar variables and lists of scalar variables. What we’ll need to do when we call this subroutine is pass that associative array by reference.

If you recall when we were sorting, Perl handed our sort subroutine associative arrays by hard reference. Well, we can hand hard references out to our subroutines, too.

Change the “$importHandle=” section of the switch for imports. We’ll call the createTable subroutine as soon as we open the database.

if ($importHandle = DBI->connect("dbi:SQLite:dbname=$importDB")) {
createTable($importHandle, $dbTable, \%dbFields);
} else {
print "Unable to open database $importDB: $!\n";
}

When you want to pass a variable as a hard reference, precede the variable name with a backslash.

We also need to create the %dbFields associative array. At the top, add the following defaults:

#name of table
$dbTable = "music";
#what kind of fields need to be in the database?
%dbFields = (
"ID"=>"INTEGER PRIMARY KEY AUTOINCREMENT",
"song"=>"TEXT",
"duration"=>"TEXT",
"artist"=>"TEXT",
"album"=>"TEXT",
"year"=>"INTEGER",
"rating"=>"INTEGER",
"ripdate"=>"TEXT",
"genre"=>"TEXT"
);

Note that we are putting this “single line” of Perl code onto several lines. Perl only cares about that semicolon. That’s how Perl sees the end of a line of code. Separating the associations into separate lines makes it easier for you, as programmer, to read them.

Also, this is the first time we’ve created an associative array by specifying the keys and values all at once. It is similar to how we create simple arrays, but each item is a key and a value, separated by “=>”. The key is on the left and the value is on the right.

Inserting data

Okay, we’ve opened the database and we’ve created the table. Now we need to insert data into the table. It is time to add functionality to our --import switch.

First, let’s add a subroutine called “insertRow”. This subroutine will be similar to the “createTable” subroutine, but it will insert rows of data. We’ll pass it the same data: a handle to the database, the name of the table that will receive the data, and the associative array of fieldnames as a hard reference.

This subroutine will also expect that the field names in that associative array match an existing scalar variable with the same name. So if there is a field named “artist”, this subroutine will look in $artist for that field’s value.

#expects a handle to a database connection, a name for a table, and an associative
#array of field names and field types passed as a reference
#also expects that the field names match a currently-existing scalar variable
sub insertRow {
my($dbHandle) = shift;
my($tableName) = shift;
my($tableFields) = shift;

my(@fieldNames, @fieldValues, $fieldNames, $fieldValues, $fieldValue);
my($insertQuery, $queryHandle);

#create the field list and value list
@fieldNames = keys %$tableFields;
foreach $fieldName (@fieldNames) {
$fieldValue = $$fieldName;
#set $fieldValue to an appropriate SQL value
if ($fieldValue eq "") {
$fieldValue = NULL;
} else {
$fieldValue = '$dbHandle->quote($$fieldname);
}
$fieldValues[$#fieldValues+1] = $fieldValue;
}
$fieldNames = join(", ", @fieldNames);
$fieldValues = join(", ", @fieldValues);
$insertQuery = "INSERT INTO $tableName ($fieldNames) VALUES ($fieldValues)";

#insert into the database
doQuery($dbHandle, $insertQuery);
}

If you aren’t familiar with SQL, this will create queries such as “INSERT INTO music (artist, album, year) VALUES ("Foreigner", "4", NULL);”. If a field doesn’t have a value--if it equals the empty string ""--then we set the value to NULL. Otherwise, we tell the database to correctly quote this piece of text. We need to do this because SQL requires text to be surrounded by quotes. But some text contains quotes. The quote method on the database handle understands this, and fixes the text accordingly.

Once we have the query created, executing the query is the same as executing the CREATE TABLE query. Rather than have the same code duplicated for each kind of query, we can make a subroutine to handle it. The subroutine will need the database handle and the query.

sub doQuery {
my($dbHandle) = shift;
my($query) = shift;

if ($queryHandle = $dbHandle->prepare($query)) {
if (!$queryHandle->execute) {
print "Unable to execute $query: $!\n";
}
} else {
print "Unable to prepare $query: $!\n";
exit;
}
}

You should go replace the “if $queryHandle =” section of the createTable subroutine so that it calls

#tell database to create the table
doQuery($dbHandle, $createQuery);

Now would be a good time to try “./show” with no parameters to see if you have any errors. It should just give you the help message.

Finally, we need to call this insertRow subroutine for every matching row. At the end of the “if ($matched) {” section, add:

#import into a database?
if ($importDB) {
insertRow($importHandle, $dbTable, \%dbFields);
}

Let’s see if we can make a database of love songs:

./show --song love --import LoveSongs songs.txt

This works, but it is slow. SQLite can be much faster than this. What we need to tell it to do is to wait until we’re done feeding it rows before it writes them all out to the disk. After the createTable call in our switches area, add:

$importHandle->{AutoCommit} = 0;

This tells SQLite not to automatically write rows out to the disk as soon as it receives them. We’ll need to explicitly tell it when to “commit” our changes. Just in front of “if (%artists) {” and just at the end of the loop through all lines, add:

#commit insertions if we are importing into a database
$importHandle->commit;

Go ahead and remove the file LoveSongs, and run that again:

./show --song love --import LoveSongs songs.txt

It should go much faster.

How do you know the data has been inserted? If you have sqlite3 on your system, you can type “sqlite3 databasename” to open the database by hand. Then, type:

SELECT * FROM music;

You should see a long list of all of the data that’s been imported. Use “.quit” to exit.

If you don’t have sqlite3 installed on your system, you’ll just have to wait until the next section.

The final script

This is the end of this script. We’ll start from scratch on the next one, and display our data.

#!/usr/bin/perl
#Search for songs in a file of the following tab-separated data:
# title, duration, artist, album, year, rating, rip date, track position, genre

use DBI;

#options for the --format switch
@validFormats = ("raw", "simple", "html", "summary");
$validFormats = englishJoin(", ", "or", @validFormats);

#options for fields to search in
@validFields = ("artist", "album", "song", "genre");
$validFields = englishJoin(", ", "and", @validFields);

#database information
#name of table
$dbTable = "music";
#what kind of fields need to be in the database?
%dbFields = (
"ID"=>"INTEGER PRIMARY KEY AUTOINCREMENT",
"song"=>"TEXT",
"duration"=>"TEXT",
"artist"=>"TEXT",
"album"=>"TEXT",
"year"=>"INTEGER",
"rating"=>"INTEGER",
"ripdate"=>"TEXT",
"genre"=>"TEXT"
);


#strip off the command-line switches and act on or remember them
while ($ARGV[0] =~ /^--(.+)/) {
$switch = $1;

#pull this switch off of the front of the list
shift;

#if they ask for help, do it and exit
if ($switch eq "help") {
help();
} elsif ($switch eq "case") {
$sensitive = 1;
} elsif ($switch eq "reverse") {
$reverse = 1;
} elsif ($switch eq "limit") {
$limit = shift;
if ($limit !~ /^[1-9][0-9]*$/) {
help("You must limit to a number, such as '33' or '2'.");
}
} elsif ($switch eq "format") {
$format = shift;
if (!grep(/^$format$/, @validFormats)) {
help("Format must be $validFormats.");
}
} elsif (grep(/^$switch$/, @validFields)) {
if ($searchText = shift) {
$searches{$switch} = $searchText;
} else {
help("Searching in $switch requires text to search on.");
}
} elsif ($switch eq "sort") {
$sortby = shift;
if (!grep(/^$sortby$/, @validFields)) {
help("I can only sort by $validFields.");
}
} elsif ($switch eq "exact") {
$exact = 1;
} elsif ($switch eq "export") {
$exportField = shift;
if (!grep(/^$exportField$/, @validFields)) {
help("I can only export by $validFields.");
}
} elsif ($switch eq "folder") {
if ($exportFolder = shift) {
if (-e $exportFolder) {
#if the folder exists, it needs to be a folder
help("$exportFolder already exists and is not a folder.") if !-d $exportFolder;
}
} else {
help("The folder option requires a folder name.");
}
} elsif ($switch eq "keep-time") {
$keepTime = 1;
} elsif ($switch eq "import") {
if ($importDB = shift) {
if ($importHandle = DBI->connect("dbi:SQLite:dbname=$importDB")) {
createTable($importHandle, $dbTable, \%dbFields);
$importHandle->{AutoCommit} = 0;
} else {
print "Unable to open database $importDB: $!\n";
}
} else {
help("The import switch requires a database name to import to.");
}
} else {
help("I do not understand the option '$switch'.");
}
}

#the first item on the command line is what we're searching for
if (%searches) {
#if we're looking for exact matches, set them up ahead of time
if ($exact) {
foreach $search (keys %searches) {
$searchText = $searches{$search};
$searches{$search} = "^$searchText\$";
}
}
while (<>) {
#split out the song information
chomp;
($song, $duration, $artist, $album, $year, $rating, $ripdate, $track, $genre) = split("\t");

foreach $searchField (keys %searches) {
$needle = $searches{$searchField};
$haystack = $$searchField;
$matched = match($haystack, $needle);
last if !$matched;
}

#reverse the match if we want non-matching lines
if ($reverse) {
$matched = !$matched;
}

#print the information if this line is one we want
if ($matched) {
#maintain the timestamp if we need it
if ($keepTime && $lastFile ne $ARGV) {
@fileInfo = stat($ARGV);
$fileMod = $fileInfo[9];
$lastModified = $fileMod if $fileMod > $lastModified;
$lastFile = $ARGV;
}

$matches++;
if ($format eq "raw") {
$text = "$_\n";
} elsif ($format eq "html") {
$text = "<tr><td>$song</td><td>$album</td><td>$artist</td></tr>\n";
} elsif ($format eq "summary") {
$artists{$artist}++;
} else {
$text = "$song ($album, by $artist)\n";
}

#store or print the display text and the sort text
if ($sortby || $exportField) {
$matches[$#matches+1]{'text'} = $text;
$matches[$#matches]{'sort'} = $$sortby if $sortby;
if ($exportField) {
$filename = $$exportField;
if ($filename ne "") {
$filename = replace($filename, "/", "-");
} else {
$filename = "Unknown $exportField";
}
$matches[$#matches]{'file'} = $filename;
}
} else {
print $text;
}

#import into a database?
if ($importDB) {
insertRow($importHandle, $dbTable, \%dbFields);
}
}
last if $limit && $matches >= $limit;
}

#commit insertions if we are importing into a database
$importHandle->commit;

if (%artists) {
@artists = keys %artists;
@artists = sort byArtistCount @artists;
foreach $artist (@artists) {
$artistCount = $artists{$artist};
print "$artist: $artistCount\n";
}
} elsif (@matches) {
@matches = sort byCustom @matches if $sortby;

#create a folder if necessary, and move into it
if ($exportField && $exportFolder) {
if (!-e $exportFolder) {
if (!mkdir($exportFolder)) {
print "Unable to create $exportFolder: $!\n";
exit;
}
}
if (!chdir($exportFolder)) {
print "Unable to move into $exportFolder: $!\n";
exit;
}
}

foreach $match (@matches) {
if ($exportField) {
$filename = $$match{'file'};
#open the file if we haven't already
if (!$files{$filename}) {
if (!open($files{$filename}, ">$filename")) {
print "Unable to open $filename: $!\n";
exit;
}
}
$filehandle = $files{$filename};
print $filehandle $$match{'text'};
} else {
print $$match{'text'};
}
}

#close all open files
foreach $filename (keys %files) {
$filehandle = $files{$filename};
close($filehandle);
utime($lastModified, time(), $filename);
}
}
} else {
help();
}

#describe how this script is used
sub help {
#if there is an error message, print it out separated from the rest
if (my($message) = shift) {
print "\n$message\n\n";
}

print "Syntax: show [options] [song files]\n";
print "\tSearch for some text in the song file. If no song file is specified\n";
print "\t'show' will expect it on standard input.\n";
print "\tA song file is a tab-delimited file with:\n";
print "\ttitle, duration, artist, album, year, rating, rip date, track position, genre\n";
print "\t--help: print this help text\n";
print "\t--case: be sensitive to upper and lower case\n";
print "\t--reverse: filter out songs that contain the search text\n";
print "\t--limit x: limit to x results\n";
print "\t--format <$validFormats>: choose format for results\n";
print "\t--$validFields <searchtext>: search in the $validFields field\n";
print "\t--sort <$validFields>: sort by specified field\n";
print "\t--exact: the search text must match exactly\n";
print "\t--export <$validFields>: export to files named after the specified field\n";
print "\t--folder <foldername>: export files are created in the specified folder\n";
print "\t--keep-time: keep the input file's timestamp on any exported files\n";
print "\t--import <database name>: import data into a named database\n";
print "At least one of the $validFields search requests must be specified.\n";

exit;
}

sub byArtistCount {
return $artists{$b} <=> $artists{$a};
}

sub englishJoin {
my($punctuation) = shift;
my($conjunction) = shift;
my(@items) = @_;

my($joined, $finalItem);

if ($#items == -1) {
$joined = "";
} elsif ($#items == 0) {
$joined = $items[0];
} elsif ($#items == 1) {
$joined = "$items[0] $conjunction $items[1]";
} else {
$finalItem = pop(@items);
$joined = join($punctuation, @items) . "$punctuation$conjunction $finalItem";
}

return $joined;
}

sub match {
my($searchIn) = shift;
my($searchFor) = shift;
my($matched) = 0;

if ($sensitive) {
$matched = $searchIn =~ /$searchFor/;
} else {
$matched = $searchIn =~ /$searchFor/i;
}

return $matched;
}

sub byCustom {
if ($sensitive) {
return $$a{'sort'} cmp $$b{'sort'};
} else {
return lc($$a{'sort'}) cmp lc($$b{'sort'});
}
}

sub replace {
my($text, $from, $to) = (shift, shift, shift);

$text =~ s/$from/$to/g;

return $text;
}

#expects a handle to a database connection, a name for a table, and an associative
#array of field names and field types passed as a reference
sub createTable {
my($dbHandle) = shift;
my($tableName) = shift;
my($tableFields) = shift;

my(@fieldCreators, $createQuery, $queryHandle, $fieldType);

@tables = $importHandle->tables();
if (!grep(/^"$dbTable"$/, @tables)) {
#construct query to create table
foreach $fieldName (keys %$tableFields) {
$fieldType = $$tableFields{$fieldName};
$fieldCreators[$#fieldCreators+1] = "$fieldName $fieldType";
}
$createQuery = "CREATE TABLE $tableName (" . join(", ", @fieldCreators) . ")";

#tell database to create the table
doQuery($dbHandle, $createQuery);
}
}

#expects a handle to a database connection, a name for a table, and an associative
#array of field names and field types passed as a reference
#also expects that the field names match a currently-existing scalar variable
sub insertRow {
my($dbHandle) = shift;
my($tableName) = shift;
my($tableFields) = shift;

my(@fieldNames, @fieldValues, $fieldNames, $fieldValues, $fieldValue);
my($insertQuery, $queryHandle);

#create the field list and value list
@fieldNames = keys %$tableFields;
foreach $fieldName (@fieldNames) {
$fieldValue = $$fieldName;
#set $fieldValue to an appropriate SQL value
if ($fieldValue eq "") {
$fieldValue = NULL;
} else {
$fieldValue = '$dbHandle->quote($$fieldname);
}
$fieldValues[$#fieldValues+1] = $fieldValue;
}
$fieldNames = join(", ", @fieldNames);
$fieldValues = join(", ", @fieldValues);
$insertQuery = "INSERT INTO $tableName ($fieldNames) VALUES ($fieldValues)";

#insert into the database
doQuery($dbHandle, $insertQuery);
}

sub doQuery {
my($dbHandle) = shift;
my($query) = shift;

if ($queryHandle = $dbHandle->prepare($query)) {
if (!$queryHandle->execute) {
print "Unable to execute $query: $!\n";
}
} else {
print "Unable to prepare $query: $!\n";
exit;
}
}