Smarter scripts

Capturing errors

It isn’t that difficult to trip up the script we’ve got so far. If you just type ./show and press return, not only does it wait on the command line for you to type something, it doesn’t even know that you didn’t tell it to search for anything.

Often, when you can identify command line arguments that you know are wrong, you will want to check for those arguments, and print out an instruction text when someone types something unexpected.

In this case, if there is nothing to search for, the person using the script probably doesn’t know how to use the script. We can tell them how to use it when they do something wrong. Change the script by adding an “if” line above the “while”, indenting everything, and then adding several lines at the bottom:

#!/usr/bin/perl
#Search for songs in a file of the following tab-separated data:
# title, duration, artist, album, year, rating, rip date, track position, genre

#the first item on the command line is what we're searching for
if ($searchFor = shift) {
    while (<>) {
        #split out the song, duration, artist, and album
        ($song, $duration, $artist, $album) = split("\t");

        #print the information if this line contains our search text
        print "$song ($album, by $artist)\n" if /$searchFor/i;
    }
} else {
    help();
}

#describe how this script is used
sub help {
    print "Syntax: show <search text> [song files]\n";
    print "\tSearch for <search text> in the song file. If no song file is specified\n";
    print "\t'show' will expect it on standard input.\n";
    print "\tA song file is a tab-delimited file with:\n";
    print "\ttitle, duration, artist, album, year, rating, rip date, track position, genre\n";
}

The word “if” starts a block very much like the word “while” does. Unlike while, however, an if block is performed at most once. Otherwise it is very similar. If the expression inside the parentheses of the if line returns something (anything other than zero or nothing), the if block is performed. Otherwise, it isn’t. Some if blocks have a corresponding else block. If so, the else block is only performed if the if block is not performed.

Notice how the while block is indented further beyond the indentation of the if block. I indented it further once I placed the if block around it. You should do so also. As your scripts become more and more complex, failure to indent will make it practically impossible to fix errors.

The word “sub” also starts a block. Unlike while and if, however, a sub block is never performed unless asked. The word that follows sub is the name of this subroutine. It is how we ask Perl to perform this block. Anywhere where we have that name followed by two parentheses, Perl will perform the sub block corresponding to that name.

In our case, if the shift does not assign something into $searchFor, we call the help subroutine. The term subroutine is somewhat archaic. We almost never use the term routine anymore, and even subroutine is fading from use. But that’s the origin of the word sub to mark these blocks of Perl lines.

Help!

One of the advantages of subroutines is that partitioning off some Perl lines allows us to call those lines from multiple places without having to retype the lines. This improves the readability of our script and also the reliability. If we make a mistake in the subroutine, we can fix it in the subroutine.
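As a rough illustration of calling one subroutine from more than one place, here is a small sketch that is separate from the show script. The subroutine name notice and the variable $calls are made up for this example.

```perl
#!/usr/bin/perl
#a made-up script, separate from show, that calls one subroutine from two places
$calls = 0;

if ($calls) {
    print "This line is skipped because \$calls is still zero.\n";
} else {
    #first call, from inside an else block
    notice();
}

#second call, from somewhere else in the script
notice();

#count and report each call; this block is skipped until it is asked for by name
sub notice {
    $calls = $calls + 1;
    print "notice has been called $calls time(s)\n";
}
```

The lines inside notice are typed once but performed twice. If the message needs to change, there is only one place to change it.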

We’ve got this subroutine called help but currently the only way to see it is to do something wrong. It might be nice to ask for the help without having to do something wrong.

When we want to alter the way a program works from the command line, we usually use switches. In Unix, switches usually begin with a single dash if they are a single character, or double dashes if they are a word. We’ll use words here just to make them easier to read. For example, to display the “help” message, we might use “./show --help”.

Add the following five lines above the “first item on the command line” comment:

#if they ask for help, do it and exit
if ($ARGV[0] eq "--help") {
    help();
    exit;
}

Now, type:

./show --help

And you should see the help message displayed. It doesn’t matter what else you type on the command line: as long as the first argument is “--help”, you’ll get the help message and that’s it.

The important new section is the one that checks $ARGV[0]. @ARGV is the list of all command-line arguments. In Perl, lists--often called arrays, or simple arrays--begin with the @ character. If, however, you want an item in the array, you preface it with the dollar sign.

$ARGV[0] is the first item in the list called ARGV. Perl, like many programming languages, starts counting from zero rather than from one. The first item in a list is item 0, not item 1. The second item is $ARGV[1] (if there is one), and so on.
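Here is a minimal sketch of pulling items out of a list by number. It uses a made-up list called @args in place of @ARGV, so that it behaves the same no matter what is typed on the command line.

```perl
#!/usr/bin/perl
#a stand-in for @ARGV; these values are invented for the example
@args = ("--help", "yellow", "songs.txt");

#the whole list is written with @, but a single item is written
#with $ and the item's number in square brackets
$first = $args[0];
$second = $args[1];
$third = $args[2];

print "item 0: $first\n";
print "item 1: $second\n";
print "item 2: $third\n";
```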

What we’re checking is whether or not the first argument is equal to “--help”. If it is, we call the help subroutine and then exit. In Perl, exit will end the script completely. It doesn’t matter what else comes after the exit line; Perl ends the script and returns you to the command line.

Finally, don’t forget to add a line to the help text describing how to get help:

print "\t--help: print this help text\n";

You’ll always want to update your help subroutine whenever you add new features to your script, or modify existing features.

Command-line switches

So now we have a script with one command-line switch, but switches are like potato chips: once you start, you can’t have just one. Here’s an example: we’ve currently made our script case-insensitive, so that we don’t have to worry about remembering the exact case of the text we’re looking for. But what if we want it to be case-sensitive? Let’s add a case switch.

To do this, though, we’re going to need to “generalize” our search for command-line switches. If we just have a series of ifs, that will mean that either we can’t have more than one command-line switch, or we have to put them in an exact order. That will always be too difficult to remember, especially when we have eight or more switch possibilities.

One way of doing this is to loop through the beginning arguments as long as the argument is a switch. Stop looping when it is no longer a switch. We can use a while block for this. Replace our help switch’s five lines with:

#strip off the command-line switches and act on or remember them
while ($ARGV[0] =~ /^--(.+)/) {
    $switch = $1;

    #pull this switch off of the front of the list
    shift;

    #if they ask for help, do it and exit
    if ($switch eq "help") {
        help();
        exit;
    }
}

This snippet does the same thing as the previous “help-only” snippet, but it will allow us to add any switches we want.

while ($ARGV[0] =~ /^--(.+)/) {

The “=~” is new. It is used to match a scalar variable against a regular expression. The variable goes on the left, and the regular expression goes on the right.

And what a regular expression! Let’s take it piece by piece.

It begins with a caret, or “hat” character. The caret marks the beginning of a piece of text. Whatever comes next in the regular expression will only match if it comes at the beginning of the text. So, since the next two characters are two dashes, two dashes will only match if the two dashes are at the beginning. This differs from our previous regular expressions, where the text we specified could occur anywhere on the line.

After the two dashes, we have “(.+)”. The parentheses are easy: they tell Perl to remember that part of the match, whatever it is. We’ll see what that means in a moment.

The period, or “dot”, matches a single character. It can be any character.

The plus sign matches one or more of the previous piece of the regular expression. The previous piece is the dot, so the dot and the plus means one or more of any character.

Taken as a whole, this regular expression will match --help, --switch, --q, --rain, or even --planet-99x. It will not match help--, switch--station, or anything else that does not begin with two dashes.
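One way to check those claims is to run the pattern against each example. This is only a sketch: the list @tries and the counter $hits are invented for it, and it uses foreach, which performs its block once for each item in a list.

```perl
#!/usr/bin/perl
#the examples from the text; the last two should not match
@tries = ("--help", "--switch", "--q", "--rain", "--planet-99x", "help--", "switch--station");

$hits = 0;

#foreach performs its block once for each item in the list
foreach $try (@tries) {
    if ($try =~ /^--(.+)/) {
        $hits = $hits + 1;
        print "$try is a switch\n";
    } else {
        print "$try is not a switch\n";
    }
}

print "$hits of the examples matched\n";
```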

$switch = $1;

The next line assigns the value of the variable $1 to the variable $switch. After a regular expression matches, Perl remembers any items in parentheses, and it remembers them by putting them into $1, $2, $3, $4, etc., on up to however many sets of parentheses were in the regular expression. We only have one set of parentheses, so we only get $1.

Because our parentheses were after the two dashes, $switch will now contain the part of that switch not including the initial two dashes.
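A short sketch of how the remembered pieces come back out. The variable $argument is a made-up stand-in for $ARGV[0], and the second pattern, with two sets of parentheses, is here only to show $2; the show script doesn’t use it.

```perl
#!/usr/bin/perl
#a made-up argument standing in for $ARGV[0]
$argument = "--planet-99x";

#one set of parentheses: everything after the two dashes lands in $1
if ($argument =~ /^--(.+)/) {
    $switch = $1;
    print "switch: $switch\n";
}

#two sets of parentheses: the text on either side of the last dash
#lands in $1 and $2
if ($argument =~ /^--(.+)-(.+)/) {
    $name = $1;
    $number = $2;
    print "name: $name, number: $number\n";
}
```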

#pull this switch off of the front of the list
shift;

We’ve already used shift. It shifts an item off of the front of a list. By default, it shifts it off of the front of the list of command-line arguments. Since @ARGV is the list of arguments, shift shifts the first argument off of @ARGV. Once shifted off, that first argument is gone. What used to be the second argument is now the first argument. This gets us ready for the next turn through the loop. $ARGV[0] is now the next argument.
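To watch shift at work without touching the real command line, here is a sketch that names the list explicitly. The list @args and its contents are invented stand-ins for @ARGV.

```perl
#!/usr/bin/perl
#a stand-in for @ARGV; these values are invented for the example
@args = ("--case", "--limit", "12", "yellow");

#shift can be told which list to work on; it hands back the item it removed
$gone = shift @args;

print "removed: $gone\n";
print "new first item: $args[0]\n";
print "what is left: @args\n";
```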

#if they ask for help, do it and exit
if ($switch eq "help") {
    help();
    exit;
}

This section looks very familiar. The only part that’s changed is the if line. Instead of checking to see if $ARGV[0] is equal to “--help”, we’re checking to see if $switch is equal to “help”. If it is, we call the help subroutine and exit the script.

Case sensitivity

So after all that, we still haven’t added case sensitivity to the script. But now, we can add pretty much anything we want. Our switch is going to be called case. What we’ll do is set $sensitive if we want the search to be case sensitive.

Replace the closing curly bracket where we’re checking for the help option with:

} elsif ($switch eq "case") {
    $sensitive = 1;
}

The whole switch section should look like:

if ($switch eq "help") {
    help();
    exit;
} elsif ($switch eq "case") {
    $sensitive = 1;
}

This is pretty simple, so far. If the command-line switch is “--case” then assign the number 1 to the variable $sensitive.

Replace the line that prints out the song information with:

if ($sensitive) {
    $matched = /$searchFor/;
} else {
    $matched = /$searchFor/i;
}
#print the information if this line contains our search text
print "$song ($album, by $artist)\n" if $matched;

If $sensitive has something in it, we match without the case-insensitivity modifier. Otherwise, we match with the case-insensitivity modifier. We assign the result of that match to $matched, and then if $matched has something in it we print the song information. (You might ask why we don’t just create a variable that either has “i” in it or not, and put that variable after the closing slash. The answer is that we can’t do that. A variable won’t work in that part of a regular expression.)
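That said, Perl does let a variable switch case-insensitivity on from inside the pattern: the text “(?i)” at the front of a regular expression has the same effect as the “i” after the closing slash. The rest of this tutorial sticks with the if/else version, so treat this as a side sketch only. The names $modifier, $line, and setModifier are invented for it.

```perl
#!/usr/bin/perl
#invented sample data
$searchFor = "yellow";
$line = "Yellow Submarine";

#without --case: "(?i)" goes in front of the search text
$sensitive = 0;
setModifier();
$loose = ($line =~ /$modifier$searchFor/);

#with --case: nothing goes in front of the search text
$sensitive = 1;
setModifier();
$strict = ($line =~ /$modifier$searchFor/);

print "without --case: matched\n" if $loose;
print "with --case: no match\n" if !$strict;

#choose the text to put in front of the pattern
sub setModifier {
    if ($sensitive) {
        $modifier = "";
    } else {
        $modifier = "(?i)";
    }
}
```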

Now, you can search for:

./show --case Yellow songs.txt

./show --case yellow songs.txt

And get different results. Add the help line to the help subroutine and we’re done with this option:

print "\t--case: be sensitive to upper and lower case\n";

Boolean logic

So far our search has been for things that match. But sometimes filters are useful to filter out rather than filter in. We can add a switch that will cause the script to print out only those songs that don’t match the search. To do this, we’re going to need to understand Boolean logic. In case you’ve forgotten your Boolean logic from high school, it is basically about true and false. Perl treats items that contain something other than zero as true. It treats items that contain zero or nothing as false.

Code in your if or while parentheses is treated by Perl as a Boolean expression, as true or false.

You can reverse something from false to true or true to false with NOT, which in Perl is the exclamation point.
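Here is a quick sketch of what Perl counts as true and false, and of NOT flipping one into the other. The list @values and the other variable names are made up for the example.

```perl
#!/usr/bin/perl
#a made-up mix of values: zero, one, empty text, the text "0", a word, a number
@values = (0, 1, "", "0", "yellow", 42);

$trueCount = 0;

#foreach performs its block once for each item in the list
foreach $value (@values) {
    if ($value) {
        $trueCount = $trueCount + 1;
        print "($value) is true\n";
    } else {
        print "($value) is false\n";
    }
}

#the exclamation point turns false into true and true into false
$flipped = !0;
$flippedBack = !$flipped;

print "!0 is true\n" if $flipped;
print "!!0 is false\n" if !$flippedBack;
```

Only 1, “yellow”, and 42 count as true here. Zero, empty text, and the text “0” are all false.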

First, add the switch to our list of switches:

} elsif ($switch eq "reverse") {
    $reverse = 1;
}

Then, add some new code to the while loop, after we assign a match or lack thereof to $matched, but before we print the song information:

#reverse the match if we want non-matching lines
if ($reverse) {
    $matched = !$matched;
}

You might also want to change the comment in front of the print line:

#print the information if this line is one we want

And, of course, add a line to the help subroutine:

print "\t--reverse: filter out songs that contain the search text\n";

So, now, if you want to see every song that does not mention best anywhere, use:

./show --reverse best songs.txt

Well. That showed a lot. Let’s see if we can do something about that.

Exiting loops ahead of time

Often when we’re testing we don’t really want to see everything, we just want to see that it worked. The first several results will let us know that. Let’s put in a switch to limit the number of results to a specified maximum. We’ll call this switch limit and it will be followed by a number.

} elsif ($switch eq "limit") {
    $limit = shift;
    if ($limit !~ /^[1-9][0-9]*$/) {
        print "\nYou must limit to a number, such as '33' or '2'.\n\n";
        help();
        exit;
    }
}

First, we assign the result of the shift to a variable--our $limit variable. The next item on the command line after --limit should be the number of lines we want to limit to. Just to make sure, we check:

if ($limit !~ /^[1-9][0-9]*$/) {

This is a different form of regular expression match. Instead of “=~” it is “!~”. This is true if the regular expression doesn’t match the text on the left. Remember, an exclamation point often stands in for the word not in Perl. Other than that, this regular expression is the same as any other.

We’ve already met the caret. When it is at the beginning of a regular expression, it matches the beginning of the text. The dollar sign, when it appears at the end of a regular expression, matches the end of the text. Square brackets match a list or range of characters. Here we’re using them to match a range. The first character must be a digit from 1 to 9. The second character must be a digit from 0 to 9. Normally, this would mean that the limit would have to be 10 to 99. However, immediately following the “[0-9]” there is an asterisk. The asterisk is just like the plus symbol in regular expressions except that it matches zero or more occurrences of the preceding piece of text instead of one or more.

Since the preceding character is any digit from 0 to 9, the combination of “[0-9]” and an asterisk matches zero or more digits. Matches would include 100, 1, 9, 19, 55, 637. Non-matches would include 01, 99X, Buffalo99. Any text that includes non-numbers or that begins with a zero will not match.
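To see the pattern sort those examples, here is a small sketch. The list @tries and the two counters are invented for it, and foreach simply performs its block once for each item in the list.

```perl
#!/usr/bin/perl
#the examples from the text: six that should pass, three that should not
@tries = ("100", "1", "9", "19", "55", "637", "01", "99X", "Buffalo99");

$accepted = 0;
$rejected = 0;

foreach $try (@tries) {
    #same test as the --limit switch: true when the text is NOT a usable number
    if ($try !~ /^[1-9][0-9]*$/) {
        $rejected = $rejected + 1;
        print "$try is rejected\n";
    } else {
        $accepted = $accepted + 1;
        print "$try is accepted\n";
    }
}
```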

If the text following the --limit switch isn’t a number, the script warns them that it needs a number there, calls the help subroutine, and exits.

So, that’s a little bit more complicated of a switch. How do we handle implementing it?

Replace the line that prints the song information with:

#print the information if this line is one we want
if ($matched) {
    $matches++;
    print "$song ($album, by $artist)\n";
}
last if $limit && $matches >= $limit;

We’ve moved the if off of the print line and instead created an if block. We have Perl perform this block if $matched has something in it, that is, if it is true. If we have a match, the first thing we do is increment the variable $matches by 1. That’s what the “++” does. When “++” follows a variable, Perl will add one to that variable. If the variable doesn’t exist yet, Perl treats it as 0 and sets it to 1. (Text that isn’t a number can behave differently with “++”, so use it only on variables that hold a number or nothing at all.)

Thus, $matches will count up the number of matches we have hit so far.
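A tiny sketch of that counting, on its own. The variable $afterOne is made up so we can look at the count after the first increment.

```perl
#!/usr/bin/perl
#$matches has never been assigned anything, so the first ++ counts up from 0
$matches++;
$afterOne = $matches;
print "after one match: $matches\n";

#each further ++ adds one more
$matches++;
$matches++;
print "after three matches: $matches\n";
```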

Outside of the if block, we have a new command: last. The last command exits the current loop, even if the loop wouldn’t otherwise be finished.

We have an if following the last command, however, so the last only gets performed if “$limit && $matches >= $limit”.

In other words, if $limit has something in it AND if $matches is greater than or equal to $limit.

If $limit doesn’t have anything in it--if we didn’t specify a limit--the last never gets performed. If $limit does have something in it, the last will get performed if $matches ever equals or exceeds the limit we specified on the command line.
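Here is a sketch of the same last line at work on a short made-up list instead of a song file. The list @songs, the subroutine showSongs, and the two result variables are invented for the example.

```perl
#!/usr/bin/perl
#an invented list standing in for the lines of a song file
@songs = ("Dream On", "Yellow", "Walk This Way", "Help!", "Angel");

#no limit set yet: the last never gets performed
showSongs();
$withoutLimit = $matches;

#with a limit of 3, the loop ends after the third song
$limit = 3;
showSongs();
$withLimit = $matches;

print "shown without a limit: $withoutLimit\n";
print "shown with a limit of $limit: $withLimit\n";

#print songs until the list runs out or the limit is reached
sub showSongs {
    $matches = 0;
    foreach $song (@songs) {
        $matches++;
        print "$song\n";
        last if $limit && $matches >= $limit;
    }
}
```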

Remember to add the help line:

print "\t--limit x: limit to x results\n";

You can now do searches and limit the results. Try the non-best-of search again, with a limit of 10:

./show --reverse --limit 10 best songs.txt

./show --reverse --limit 12 aerosmith songs.txt

And the screen no longer fills up with the thousands of non-matching songs.

Multiple options

Some switches will have a small list of options to choose from. For example, we modified our script to display the song information in a more human-readable format. But what if we want to keep the raw format under some circumstances? Maybe we want to take the raw song listing, filter out some albums we no longer have, and then create a new raw listing from that filter.

If we want that, it makes sense to create a --format switch that can take only two options: raw and (say) simple. Add the following lines to the switch section:

} elsif ($switch eq "format") {
    $format = shift;
    if ($format ne "raw" && $format ne "simple") {
        print "\nFormat must be raw or simple.\n\n";
        help();
        exit;
    }
}

Pretty normal stuff here. The letters “ne” stand for “not equal to”. So, if the user specifies a format that is not raw and that is not simple, the script displays the help and exits.
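A sketch of that check run against a few made-up formats. The list @tries and the counters are invented for the example. Note that “ne”, like “eq”, pays attention to case, so RAW is not the same as raw.

```perl
#!/usr/bin/perl
#made-up format names: two good ones, a misspelling, and the wrong case
@tries = ("raw", "simple", "fancy", "RAW");

$accepted = 0;
$rejected = 0;

foreach $format (@tries) {
    #true only when the text is neither of the two allowed words
    if ($format ne "raw" && $format ne "simple") {
        $rejected = $rejected + 1;
        print "$format: Format must be raw or simple.\n";
    } else {
        $accepted = $accepted + 1;
        print "$format: okay\n";
    }
}
```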

Now, replace the line that prints the song information with:

if ($format eq "raw") {
    print;
} else {
    print "$song ($album, by $artist)\n";
}

What we’re really doing here is printing the current line if the user specified a format of raw, and printing the simpler information in every other case. But this may change later if we add more formats. You will usually want to include the default as an option, just in case you make changes later.

./show --limit 12 --format raw aerosmith songs.txt

./show --limit 12 --format simple aerosmith songs.txt

If you were filtering out information to a new file, you might redirect the output to that file:

./show --format raw aerosmith songs.txt > aerosmith.txt

./show love aerosmith.txt

You can also type “more aerosmith.txt” to verify that it has what you expect: every line that mentions Aerosmith.

You’ll want to add format to the help subroutine:

print "\t--format <raw or simple>: choose format for results\n";

Script confusion

What happens if you misspell a switch? Try:

./show --limitt 12 aerosmith songs.txt

What is it doing? Those lines don’t contain “aerosmith”. It’s understandable that the script wouldn’t stop at 12 because we misspelled limit, but what is it showing us? Try:

./show --limitt 12 aerosmith songs.txt | more

and then scroll up a line:

Can't open aerosmith: No such file or directory at ./show line 32.

It isn’t looking for lines containing “aerosmith”. It thinks “aerosmith” is a file that it needs to search through. What is it looking for? All lines that contain the mention of “12”. That’s because the script saw --limitt as a possible switch, and shifted it off the argument list. But it did not see 12 as a possible switch so it left it on. Our script grabs the first item on the argument list as what to search for. In this case, that was 12.
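To make that easier to follow, here is a cut-down sketch of the switch loop as it stands before the fix, run against a made-up list @args instead of the real @ARGV. Only the limit switch is included, to keep it short.

```perl
#!/usr/bin/perl
#a stand-in for @ARGV, holding the misspelled command line
@args = ("--limitt", "12", "aerosmith", "songs.txt");

#the switch loop, with nothing to catch switches it doesn't know
while ($args[0] =~ /^--(.+)/) {
    $switch = $1;

    #pull this switch off of the front of the list
    shift @args;

    if ($switch eq "limit") {
        $limit = shift @args;
    }
    #"limitt" matches nothing above, so it is silently thrown away
}

#whatever is first now becomes the search text
$searchFor = shift @args;

print "search text: $searchFor\n";
print "files to search: @args\n";
```

The search text comes out as 12, and aerosmith is left behind as if it were the name of a song file.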

What we need to do is have the script stop when it hits something it doesn’t understand. That’s easy enough to do. Add a final else to the end of the switch section:

} else {
    print "\nI do not understand the option '$switch'.\n\n";
    help();
    exit;
}

This section must always be the final section of the switch area. If we’re in the switch area it is because the script saw a double-dash. If we get to the final “else”, that is because none of our known switches matched the text following the double-dash. That’s going to be either because the user misspelled it or because the user doesn’t understand what this script does.

So, we have the script tell the user this, call the help subroutine, and exit.

The current script

Just so we’re on the same page, here is what the script currently looks like:

#!/usr/bin/perl
#Search for songs in a file of the following tab-separated data:
# title, duration, artist, album, year, rating, rip date, track position, genre

#strip off the command-line switches and act on or remember them
while ($ARGV[0] =~ /^--(.+)/) {
    $switch = $1;

    #pull this switch off of the front of the list
    shift;

    #if they ask for help, do it and exit
    if ($switch eq "help") {
        help();
        exit;
    } elsif ($switch eq "case") {
        $sensitive = 1;
    } elsif ($switch eq "reverse") {
        $reverse = 1;
    } elsif ($switch eq "limit") {
        $limit = shift;
        if ($limit !~ /^[1-9][0-9]*$/) {
            print "\nYou must limit to a number, such as '33' or '2'.\n\n";
            help();
            exit;
        }
    } elsif ($switch eq "format") {
        $format = shift;
        if ($format ne "raw" && $format ne "simple") {
            print "\nFormat must be raw or simple.\n\n";
            help();
            exit;
        }
    } else {
        print "\nI do not understand the option '$switch'.\n\n";
        help();
        exit;
    }
}

#the first item on the command line is what we're searching for
if ($searchFor = shift) {
    while (<>) {
        #split out the song, duration, artist, and album
        ($song, $duration, $artist, $album) = split("\t");

        if ($sensitive) {
            $matched = /$searchFor/;
        } else {
            $matched = /$searchFor/i;
        }

        #reverse the match if we want non-matching lines
        if ($reverse) {
            $matched = !$matched;
        }

        #print the information if this line is one we want
        if ($matched) {
            $matches++;
            if ($format eq "raw") {
                print;
            } else {
                print "$song ($album, by $artist)\n";
            }
        }
        last if $limit && $matches >= $limit;
    }
} else {
    help();
}

#describe how this script is used
sub help {
    print "Syntax: show <search text> [song files]\n";
    print "\tSearch for <search text> in the song file. If no song file is specified\n";
    print "\t'show' will expect it on standard input.\n";
    print "\tA song file is a tab-delimited file with:\n";
    print "\ttitle, duration, artist, album, year, rating, rip date, track position, genre\n";
    print "\t--help: print this help text\n";
    print "\t--case: be sensitive to upper and lower case\n";
    print "\t--reverse: filter out songs that contain the search text\n";
    print "\t--limit x: limit to x results\n";
    print "\t--format <raw or simple>: choose format for results\n";
}