Unix-like operating systems provide an easy means of creating files from any program that has an output. Often, you won’t even need to worry about creating files, you’ll just redirect to a file and let the operating system handle it for you.
./show --exact --artist foreigner --format raw songs.txt > foreigner.txt
Because you can pipe directly from one program to another on the command line, you sometimes won’t even need to create files to store temporary data. If you want to count up how many songs Foreigner has in songs.txt, you can:
./show --exact --artist foreigner songs.txt | wc -l
Or, one of my favorites,
./show --exact --artist foreigner songs.txt | rev
But sometimes we do need to create our own files, and Perl makes this easy. Suppose we wanted to be able to create multiple files, perhaps one for each album, or one for each artist?
We can add a switch for this easily enough.
} elsif ($switch eq "export") {
$exportField = shift;
if (!grep(/^$exportField$/, @validFields)) {
print "\nI can only export by $validFields.\n\n";
help();
exit;
}
This switch is exactly like our sort switch. It accepts a valid field; if the user tries to export by something other than a valid field, the script will warn them and exit.
If the data is being sorted, we are going to have to wait until the end to export the files. So to make it easier, we’ll simply always wait until the end to export the files. This lets us re-use some of the code for sorting. Change:
if ($sortby) {
$matches[$#matches+1]{'text'} = $text;
$matches[$#matches]{'sort'} = $$sortby;
} else {
to:
if ($sortby || $exportField) {
$matches[$#matches+1]{'text'} = $text;
$matches[$#matches]{'sort'} = $$sortby if $sortby;
$matches[$#matches]{'file'} = $$exportField if $exportField;
} else {
The script will now remember the matches if either $sortby or $exportField has something in it. We only store the ‘sort’ association if $sortby has something in it, and we only store the ‘file’ association if $exportField has something in it. If $exportField is “album” and $album is “Head Games”, ‘file’ will associate with “Head Games” for this record.
So now we need to change the code that deals with @matches. Change this:
} elsif (@matches) {
@matches = sort byCustom @matches;
foreach $match (@matches) {
print $$match{'text'};
}
}
to:
} elsif (@matches) {
@matches = sort byCustom @matches if $sortby;
foreach $match (@matches) {
if ($exportField) {
$filename = $$match{'file'};
#open the file if we haven't already
if (!$files{$filename}) {
if (!open($files{$filename}, ">$filename")) {
print "Unable to open $filename: $!\n";
exit;
}
}
$filehandle = $files{$filename};
print $filehandle $$match{'text'};
} else {
print $$match{'text'};
}
}
#close all open files
foreach $filehandle (values %files) {
close($filehandle);
}
}
Note that in the second line we now only sort if $sortby has something in it. Otherwise, there’s nothing to sort on.
We’ve added a new section for “if ($exportField)”, so that if $exportField has something in it we will print to a file instead of to the “standard output” (usually the screen).
Before writing to a file, the file has to be “opened”. We need to get a “handle” on the file. Since we need to have a number of files opened it makes sense to store the file handles in an array. This script stores them in an associative array called %files, associating them with the filename.
Before opening the file with that filename, the script checks to see if there is already a handle associated with that filename in %files. The script only opens the file if there is not an existing handle associated with that filename.
If a file needs to be opened, the script opens it within an if, so that if there’s an error opening the file it can print an error and exit. Perl always stores the most recent error in a special variable called “$!”. So, if there’s a problem opening $filename, we have the script print “Unable to open $filename” and then “$!”. The error message is often very useful. For example, if you don’t have permission to open a file, the error message will say this.
The important new part is “open($files{$filename}, ">$filename")”. The open subroutine accepts two parameters. The first is the variable where we want to store the handle to the file. The second is the name of (or path to) the file we want to open. If we want to be able to write to the file, we need to prepend a greater than symbol to the filename. (We can also append to files by prepending two greater than symbols to the filename.)
So, if the script can successfully open the file, we now have a handle to it in $files{$filename}. All that remains is to get it (with “$filehandle = $files{$filename}”) and print to it.
If you look at some of the previous print commands, they have multiple variables or multiple pieces of text, separated by commas. Print can accept any number of pieces of text, separated by commas. However, if the first variable is not separated by the rest of the variables or text by a comma, print assumes that this is a handle to a file, and redirects its output to that file handle.
That’s why there is only a space between $filehandle and $$match{'text'} in “print $filehandle $$match{'text'}”.
Finally, after looping through all matches, we grab every value out of %files--each of which is a file handle--and close that file. The phrase “values %files” is the same as “keys %files” except that it gets a simple array of %file’s values, rather than a simple array of %file’s keys.
Perl will close files for us automatically as soon as the script ends or exits. But I like to close them explicitly as soon as they are no longer needed. Otherwise they hang around, open, until the script ends. Here that’s not a big deal but later on we might alter this script and add more functionality at the end. If that functionality involves opening files too, we might run up against the operating system’s limit: most operating systems limit the number of files any one program can open.
Having done all of this, we can now grab, say, all albums by foreigner and create a separate file for each one:
./show --exact --artist foreigner --export album songs.txt
Of course, you’re going to want to make sure that no album has the same name as a file you don’t want to erase: every time Perl opens a file, it will happily erase an existing file with the same name. We’ll see if we can do something about that in the next section.
And, of course, add this to the help subroutine:
print "\t--export <$validFields>: export to files named after the specified field\n"
You probably don’t want to play around too much making export files. It will be very easy to create hundreds of files in your current directory. We’ll fix this next.
It’s easy enough to change directory when exporting files in order to ensure that the new files go into a specific folder, but if you’re using this as part of a cron job it will be easier if you can tell the script which folder you want the export files to go to.
} elsif ($switch eq "folder") {
if ($exportFolder = shift) {
if (-e $exportFolder) {
#if the folder exists, it needs to be a folder
help("$exportFolder already exists and is not a folder.") if !-d $exportFolder;
}
} else {
help("The folder option requires a folder name.");
}
What’s with the new use of help()? Every single time we use help, we also exit. Every time except one, we print out an error message above the call to help. It’s about time we automated this. It can often be difficult to make the decision to change a function when only a minor change is needed; every time we’ve called help so far, it’s been a simple effort to add the extra line above and below the help call. But for simplicity’s sake it is time we combined those two to three lines into a single subroutine call.
At the top of the help subroutine, just below “sub help {“, add:
#if there is an error message, print it out separated from the rest
if (my($message) = shift) {
print "\n$message\n\n";
}
And at the very end, just before “}”, add:
exit;
At some point, you’ll want to go through and find every use of:
print "some text";
help();
exit;
and replace it with:
help("some text");
If you don’t do it now, add a comment to the top of your script reminding you to do it later.
The other part that’s new is “-e $exportFolder”. There are several tests you can perform on filenames. They all begin with a dash. This is the exists test. It is “true” if there is a file with that name. Remember that in Unix, folders are also files.
If that name is already being used, that’s fine if it’s a folder. So we need to make sure it’s a folder. The “-d” file test tests for that. So we call help if not “-d $exportFolder”. If $exportFolder is not a directory, the script will print the help, which with the above change will automatically exit the script.
Okay, add this option to our help subroutine:
print "\t--folder <foldername>: export files are created in the specified folder\n";
Now it’s time to implement it. Between “@matches = sort byCustom @matches if $sortby;” and “foreach $match (@matches) {” add:
#create a folder if necessary, and move into it
if ($exportField && $exportFolder) {
if (!-e $exportFolder) {
if (!mkdir($exportFolder)) {
print "Unable to create $exportFolder: $!\n";
exit;
}
}
if (!chdir($exportFolder)) {
print "Unable to move into $exportFolder: $!\n";
exit;
}
}
If there is something in $exportField (that is, if we are exporting into some files) and if there is something in $exportFolder (that is, if we are doing this into a specific folder), we need to ensure that the folder exists and that we are in it.
Step one checks to see if $exportFolder already exists, using the -e file test. If it doesn’t, the script tries to create it using “mkdir()”. This stands for make directory. If that works, fine, but if not (watch the exclamation point) the script prints the error and exits.
Whether we just created the folder or it already existed, the next step is to get into it. This is the “chdir()” function. It stands for change directory. It doesn’t mean change a directory, but change into a directory. It moves the script into that directory so that any files the script creates from now on will be created in that directory.
If chdir is not successful, the script prints an error and exits.
We can now export the foreigner albums into their own folder:
./show --exact --artist foreigner --export album --folder Foreigner songs.txt
This makes it easier for us to export to multiple files without cluttering up the current directory.
If you play around with export now, you’ll find that some exports don’t work. Go ahead and try:
./show --artist afroman --export genre songs.txt
What you should get back is:
Unable to open Hip Hop/Rap
: No such file or directory
There are two problems here. One, why is that error message separated onto two lines? It looks like we’re printing a new line between the export file name and the colon, but if you look in the code there is no such new line. The second, why would it tell us that there is no such file? We know that: that’s why we’re trying to create it.
The first problem is an easy one to fix. That new line is in the genre name itself. The genre is the last field on the line. When we get a line of text from a file in Perl, Perl includes the line break at the end of the line. When we split the line on tabs, the last item gets this line break, and that last item is the genre.
In front of the split line where we get song, duration, artist, etc., add:
chomp;
This will chomp any line endings off of the end of the current line. If you want to chomp the line endings off of a specific variable, you can use “chomp($variable)”. If you want to chomp the line endings off of a list of variables, you can use “chomp(@list)” or “chomp($variable1, $variable2, etc.)”.
Because we’re chomping the line endings off of the current line, we need to change the raw format as well. Change “$text = $_;” to:
$text = "$_\n";
Now the error message is easier to read:
Unable to open Hip Hop/Rap: No such file or directory
The problem here is that Unix uses the slash to separate directory names from each other and from the file name. Perl thinks the script wants to create a file called “Rap” in a folder called “Hip Hop”. There is no folder called “Hip Hop”, so this fails with that error.
We need to get rid of that slash. There is another form of regular expression that does this for us. Let’s add a general subroutine for replacing characters in a piece of text:
sub replace {
my($text, $from, $to) = (shift, shift, shift);
$text =~ s/$from/$to/g;
return $text;
}
Remember that we already know what shift does. Here, we’re just doing it three times in a row to grab each of the three items we will send this subroutine: the text we want to change, the characters we want to look for in that text, and the characters we want to replace it with. If we call “replace("omega man", "m", "d")”, we would expect to get back “odega dan”.
$text =~ s/$from/$to/g;
The “=~” means that this is a regular expression. The “s” in front of the first slash means that this is a substitution. By default regular expressions only match, they don’t perform any changes. A substitution will. In order to substitute, it needs to know what to substitute. That’s between the second slash and the new third slash. What we have here as $to will replace $from in $text.
By default, substitutions will only make one substitution. If we want the regular expression to affect all occurrences of $from, we need to specify that this is a global replace. The “g” after the final slash does this. The “g” is a modifier much like “i” for case insensitive matches.
If we want this to also be case insensitive, we could also add the “i” there too:
$text =~ s/$from/$to/gi;
What we want is to replace slashes with dashes, so replace “$matches[$#matches]{'file'} = $$exportField if $exportField;” with:
if ($exportField) {
$filename = $$exportField;
$filename = replace($filename, "/", "-");
$matches[$#matches]{'file'} = $filename;
}
If we run the Afroman export again:
./show --artist afroman --export genre songs.txt
We now have a file called “Hip Hop-Rap”.
You might choose a different character to replace slashes. It must be a valid character for your operating system or you’ll continue to get some sort of error.
One of the most important skills to learn when you’re programming is learning how to break your scripts. You’ll want to do lots of tests with lots of different kinds of data, but tests can only find errors that you test on. You will also need to think about where will this break? and fix those errors before they happen. We’ve done a little of this already, without calling it that. This is why we put “open(...)” in an “if” statement and match it with an “else” that displays any errors that crop up.
It is especially important to think about how new functionality can break your script. So we recently added the ability to search, sort, and export by new fields. We can add any field we want to @validFields and search, sort, and export by that field. How can that break our script?
One way it could break our script is if we try to export on that field but the song doesn’t have anything entered for it.
In fact, that might be the case even with our current fields. How can we find fields that don’t have anything in them?
We’re doing our search by regular expression. We haven’t blocked regular expression characters from the search text. Try:
./show --artist ^Night songs.txt
./show --artist Night$ songs.txt
The first one will show only those songs by artists whose name begins with “night”. The second shows artists whose names end with “night”. Because that’s the regular expression character for anchoring to the beginning or end of a text.
There’s also a regular expression for any character. If we search on that, and then get the reverse, we can find fields that have no character in them.
./show --artist . --reverse songs.txt
./show --song . --reverse songs.txt
./show --album . --reverse songs.txt
./show --genre . --reverse songs.txt
There are artists, albums, and genres that are completely empty. Try to export on artist and you’ll get an error:
./show --artist . --reverse --export artist songs.txt
Unable to open : No such file or directory
There’s no way to open a file that has no name. We need to check for empty filenames and give them some other name, such as “Unknown artist”.
In the “if ($exportField) area that we just changed, change it again:
if ($exportField) {
$filename = $$exportField;
if ($filename ne "") {
$filename = replace($filename, "/", "-");
} else {
$filename = "Unknown $exportField";
}
$matches[$#matches]{'file'} = $filename;
}
Now, if $filename ends up being not equal to an empty string, we do the replace as normal. Otherwise, we assign “Unknown $exportField” to $filename. If we are exporting by artist, it will say “Unknown artist”.
The “if ($filename ne "") {” is a little different from “if ($filename = $$exportField) {” which we could have done. The former will only match if $filename is empty. The latter would also match if $filename was zero. We might imagine wanting to export based on rating, so that we have a list of songs each in a file named after their rating. A rating of zero would get a filename of “Unknown rating” if we used the latter form, but will get a filename of “0” with the one we used.
Some data is time-sensitive. The file came in at a specific time, and you want the exported files to keep that timestamp. Under Unix, you can see a file’s last modified time using “ls -l”. If you look at songs.txt you’ll probably see that it was last modified on April 25, 2005. If you look at the export files you’ve created, their last modified time is today, or the day you exported them.
First, add the switch:
} elsif ($switch eq "keep-time") {
$keepTime = 1;
and then the help:
print "\t--keep-time: keep the input file's timestamp on any exported files\n";
If we’re going to stamp the files we create so that they have the same timestamp as the file the data came from, we need to get the timestamp of that file. So far we haven’t cared what file that is. In fact, our script is designed to allow multiple files to be specified on the command line. We might imagine exporting raw artist files of all Rock songs, for example, and then searching through the files for multiple artists.
So the first step is purely on our part, with no coding. If more than one file is specified, what is the correct timestamp? Do we want the most recent one? The oldest one? Some sort of average? I’m going to assume that we want the most recent one.
The second problem is that in order to get the timestamp for a file, we need to know the file’s name. So far we haven’t cared. We’ve let Perl handle the file input for us. Fortunately, there’s no need to change that. Perl can also tell us the name of the current file. When a script loops through file input, Perl puts the current filename in a special variable called $ARGV.
Below the “if ($matched) {” line, add:
if ($keepTime) {
@fileInfo = stat($ARGV);
$fileMod = $fileInfo[9];
$lastModified = $fileMod if $fileMod > $lastModified;
}
Simple enough. If $keepTime has something in it, we grab the information for the file called $ARGV. The stat() function returns a bunch of information about a file; we want the ninth piece. That’s the last modified time of the file.
Then, we set $lastModified to be this file’s modification time if $fileMod is larger than (more recent than) the current $lastModified. The first time around, $lastModified has nothing in it, so anything will be greater than it. After that, $lastModified only gets changed if the current file is newer than the previous newest file.
One minor problem with this is that it is checking the current file every time we go through the loop. File system access is usually very fast, but if we’re exporting thousands of records from a handful of files that’s thousands of stat calls we don’t really need. What we can do is keep track of the filename, and only get the last modified when $ARGV no longer matches the previously current filename:
if ($keepTime && $lastFile ne $ARGV) {
@fileInfo = stat($ARGV);
$fileMod = $fileInfo[9];
$lastModified = $fileMod if $fileMod > $lastModified;
$lastFile = $ARGV;
}
So, now we have the timestamp we need, we just need to set each file to have that timestamp. The easiest place to handle this is after we close each file. The script already goes through each file one by one to close it. We can set the last modified time during that loop. Change the entire “#close all open files” section to:
#close all open files
foreach $filename (keys %files) {
$filehandle = $files{$filename};
close($filehandle);
utime($lastModified, time(), $filename);
}
Instead of just grabbing the values (file handles) out of %files, we need the keys as well. The keys are the filenames. So, we grab the keys and then grab the values using the key as normal. We close $filehandle just as we always did, and then we run utime on $filename. Each file has two times that are commonly used: the last modified time and the last accessed time. The utime function requires both of them, so we’ll set the first one (the last modified time) to the saved $lastModified from the input file(s). We’ll set second (last access time) to the current time, since that’s when the file was last accessed.
This script is getting pretty big, but we’re almost done with it. Here is how it stands so far.
#!/usr/bin/perl
#Search for songs in a file of the following tab-separated data:
# title, duration, artist, album, year, rating, rip date, track position, genre
#options for the --format switch
@validFormats = ("raw", "simple", "html", "summary");
$validFormats = englishJoin(", ", "or", @validFormats);
#options for fields to search in
@validFields = ("artist", "album", "song", "genre");
$validFields = englishJoin(", ", "and", @validFields);
#strip off the command-line switches and act on or remember them
while ($ARGV[0] =~ /^--(.+)/) {
$switch = $1;
#pull this switch off of the front of the list
shift;
#if they ask for help, do it and exit
if ($switch eq "help") {
help();
} elsif ($switch eq "case") {
$sensitive = 1;
} elsif ($switch eq "reverse") {
$reverse = 1;
} elsif ($switch eq "limit") {
$limit = shift;
if ($limit !~ /^[1-9][0-9]*$/) {
help("You must limit to a number, such as '33' or '2'.");
}
} elsif ($switch eq "format") {
$format = shift;
if (!grep(/^$format$/, @validFormats)) {
help("Format must be $validFormats.");
}
} elsif (grep(/^$switch$/, @validFields)) {
if ($searchText = shift) {
$searches{$switch} = $searchText;
} else {
help("Searching in $switch requires text to search on.");
}
} elsif ($switch eq "sort") {
$sortby = shift;
if (!grep(/^$sortby$/, @validFields)) {
help("I can only sort by $validFields.");
}
} elsif ($switch eq "exact") {
$exact = 1;
} elsif ($switch eq "export") {
$exportField = shift;
if (!grep(/^$exportField$/, @validFields)) {
help("I can only export by $validFields.");
}
} elsif ($switch eq "folder") {
if ($exportFolder = shift) {
if (-e $exportFolder) {
#if the folder exists, it needs to be a folder
help("$exportFolder already exists and is not a folder.") if !-d $exportFolder;
}
} else {
help("The folder option requires a folder name.");
}
} elsif ($switch eq "keep-time") {
$keepTime = 1;
} else {
help("I do not understand the option '$switch'.");
}
}
#the first item on the command line is what we're searching for
if (%searches) {
#if we're looking for exact matches, set them up ahead of time
if ($exact) {
foreach $search (keys %searches) {
$searchText = $searches{$search};
$searches{$search} = "^$searchText\$";
}
}
while (<>) {
#split out the song information
chomp;
($song, $duration, $artist, $album, $year, $rating, $ripdate, $track, $genre) = split("\t");
foreach $searchField (keys %searches) {
$needle = $searches{$searchField};
$haystack = $$searchField;
$matched = match($haystack, $needle);
last if !$matched;
}
#reverse the match if we want non-matching lines
if ($reverse) {
$matched = !$matched;
}
#print the information if this line is one we want
if ($matched) {
#maintain the timestamp if we need it
if ($keepTime && $lastFile ne $ARGV) {
@fileInfo = stat($ARGV);
$fileMod = $fileInfo[9];
$lastModified = $fileMod if $fileMod > $lastModified;
$lastFile = $ARGV;
}
$matches++;
if ($format eq "raw") {
$text = "$_\n";
} elsif ($format eq "html") {
$text = "<tr><td>$song</td><td>$album</td><td>$artist</td></tr>\n";
} elsif ($format eq "summary") {
$artists{$artist}++;
} else {
$text = "$song ($album, by $artist)\n";
}
#store or print the display text and the sort text
if ($sortby || $exportField) {
$matches[$#matches+1]{'text'} = $text;
$matches[$#matches]{'sort'} = $$sortby if $sortby;
if ($exportField) {
$filename = $$exportField;
if ($filename ne "") {
$filename = replace($filename, "/", "-");
} else {
$filename = "Unknown $exportField";
}
$matches[$#matches]{'file'} = $filename;
}
} else {
print $text;
}
}
last if $limit && $matches >= $limit;
}
if (%artists) {
@artists = keys %artists;
@artists = sort byArtistCount @artists;
foreach $artist (@artists) {
$artistCount = $artists{$artist};
print "$artist: $artistCount\n";
}
} elsif (@matches) {
@matches = sort byCustom @matches if $sortby;
#create a folder if necessary, and move into it
if ($exportField && $exportFolder) {
if (!-e $exportFolder) {
if (!mkdir($exportFolder)) {
print "Unable to create $exportFolder: $!\n";
exit;
}
}
if (!chdir($exportFolder)) {
print "Unable to move into $exportFolder: $!\n";
exit;
}
}
foreach $match (@matches) {
if ($exportField) {
$filename = $$match{'file'};
#open the file if we haven't already
if (!$files{$filename}) {
if (!open($files{$filename}, ">$filename")) {
print "Unable to open $filename: $!\n";
exit;
}
}
$filehandle = $files{$filename};
print $filehandle $$match{'text'};
} else {
print $$match{'text'};
}
}
#close all open files
foreach $filename (keys %files) {
$filehandle = $files{$filename};
close($filehandle);
utime($lastModified, time(), $filename);
}
}
} else {
help();
}
#describe how this script is used
sub help {
#if there is an error message, print it out separated from the rest
if (my($message) = shift) {
print "\n$message\n\n";
}
print "Syntax: show [options] [song files]\n";
print "\tSearch for some text in the song file. If no song file is specified\n";
print "\t'show' will expect it on standard input.\n";
print "\tA song file is a tab-delimited file with:\n";
print "\ttitle, duration, artist, album, year, rating, rip date, track position, genre\n";
print "\t--help: print this help text\n";
print "\t--case: be sensitive to upper and lower case\n";
print "\t--reverse: filter out songs that contain the search text\n";
print "\t--limit x: limit to x results\n";
print "\t--format <$validFormats>: choose format for results\n";
print "\t--$validFields <searchtext>: search in the $validFields field\n";
print "\t--sort <$validFields>: sort by specified field\n";
print "\t--exact: the search text must match exactly\n";
print "\t--export <$validFields>: export to files named after the specified field\n";
print "\t--folder <foldername>: export files are created in the specified folder\n";
print "\t--keep-time: keep the input file's timestamp on any exported files\n";
print "At least one of the $validFields search requests must be specified.\n";
exit;
}
sub byArtistCount {
return $artists{$b} <=> $artists{$a};
}
sub englishJoin {
my($punctuation) = shift;
my($conjunction) = shift;
my(@items) = @_;
my($joined, $finalItem);
if ($#items == -1) {
$joined = "";
} elsif ($#items == 0) {
$joined = $items[0];
} elsif ($#items == 1) {
$joined = "$items[0] $conjunction $items[1]";
} else {
$finalItem = pop(@items);
$joined = join($punctuation, @items) . "$punctuation$conjunction $finalItem";
}
return $joined;
}
sub match {
my($searchIn) = shift;
my($searchFor) = shift;
my($matched) = 0;
if ($sensitive) {
$matched = $searchIn =~ /$searchFor/;
} else {
$matched = $searchIn =~ /$searchFor/i;
}
return $matched;
}
sub byCustom {
if ($sensitive) {
return $$a{'sort'} cmp $$b{'sort'};
} else {
return lc($$a{'sort'}) cmp lc($$b{'sort'});
}
}
sub replace {
my($text, $from, $to) = (shift, shift, shift);
$text =~ s/$from/$to/g;
return $text;
}