Arrays and functions

We’ve done a little bit with an array already: the list of arguments to the script is a simple array. We’ve only ever referenced the first item in that array, shifting that first item out so that the next item is now first. We can do quite a bit more with arrays in Perl.

Besides simple arrays, there are also associative arrays. An associative array is one which, instead of using numbers to reference the values in the array, uses keys. It associates a key with a value. So instead of asking for the first, second, or third item in the list, you can ask for the value that corresponds to “The Band”, or the value that corresponds to “Jane Jensen”.
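
As a minimal sketch of the difference, the following stand-alone script looks up one value by position and one by key. The variable names and the song counts are made up for illustration; they are not part of the “show” script.

```perl
#!/usr/bin/perl

# a simple array: values are looked up by position, counting from 0
@formats = ("raw", "simple", "summary");
$firstFormat = $formats[0];

# an associative array: values are looked up by key
# (these counts are invented for the example; => behaves like a comma here)
%songCount = ("The Band" => 12, "Jane Jensen" => 3);
$bandCount = $songCount{"The Band"};

print "First format: $firstFormat\n";
print "Songs by The Band: $bandCount\n";
```

The whole associative array is written with a percent sign, but a single value from it is written with a dollar sign and curly brackets.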

For example, we might want to create a third format, one that summarizes songs by artist, showing how many songs each artist has in the matches.

If we’re going to have a bunch of formats, it will be easier to keep a list of them. Add the following lines just above the “strip off the command-line switches” section:

#options for the --format switch
@validFormats = ("raw", "simple", "summary");
$validFormats = join(", ", @validFormats);

The first line (below the comment) assigns a simple array of three items: raw, simple, and summary. I mentioned it in passing earlier, but all simple arrays begin with the @ symbol.

The second line assigns the result of the “join” function to a scalar variable called $validFormats. The “join” function combines an array into a scalar, using the first argument as its glue. Here, we specify a comma and a space as the “glue”, so $validFormats will be “raw, simple, summary”.
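
To see how the glue affects the result, here is a small sketch that joins the same list twice. The variable names $withCommas and $withSlashes are hypothetical, chosen only for this example.

```perl
#!/usr/bin/perl

# the same three formats used in the script
@validFormats = ("raw", "simple", "summary");

# the first argument is the glue; everything after it is the list to combine
$withCommas = join(", ", @validFormats);
$withSlashes = join("/", @validFormats);

print "$withCommas\n";
print "$withSlashes\n";
```

The glue only goes between items, never before the first one or after the last one.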

Functions are like subroutines, but they are built in to Perl.

Don’t get confused by the fact that the scalar variable $validFormats and the simple array @validFormats have the same text for their name. They are not the same variable, and as far as Perl is concerned they are completely unrelated.

Now, inside the switches area, change the “if ($format ne...” line and the print following it to:

if (!grep(/^$format$/, @validFormats)) {
    print "\nFormat must be $validFormats.\n\n";

The second line is simple enough: instead of us typing the valid formats, we’re using the automatically-created variable that holds them as a piece of text.

The first line uses the grep function to check whether or not $format exists in the array @validFormats. Like join, grep takes two arguments, and the second one is a list. The first one, however, is a regular expression. So in that line, grep is checking to see if any of the items in @validFormats both begins and ends with $format, which means the item has to match $format in full: the caret anchors $format to the beginning, and the dollar sign anchors it to the end.
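
A short sketch can show why both anchors matter. When its result is assigned to a scalar, grep gives the number of items that matched. The test value “sum” and the variable names $anchored and $unanchored are assumptions made up for this example.

```perl
#!/usr/bin/perl

@validFormats = ("raw", "simple", "summary");

# "sum" is only part of a valid format name
$format = "sum";

# assigned to a scalar, grep returns how many items matched
$anchored = grep(/^$format$/, @validFormats);
$unanchored = grep(/$format/, @validFormats);

print "Anchored matches for '$format': $anchored\n";
print "Unanchored matches for '$format': $unanchored\n";
```

Without the anchors, “sum” would be accepted because it appears inside “summary”. With them, it is rejected.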

Go ahead and try a few options and see how they work. Both ‘simple’ and ‘summary’ will currently do the same thing, since we haven’t added any code for ‘summary’.

./show --format unknown girl aerosmith.txt

./show --format raw girl aerosmith.txt

./show --format summary girl aerosmith.txt

So the next step is to handle the summary format. Where the script prints out the song information, between the raw and simple formats, add:

} elsif ($format eq "summary") {
    $artists{$artist}++;

That section should now be:

if ($format eq "raw") {
    print;
} elsif ($format eq "summary") {
    $artists{$artist}++;
} else {
    print "$song ($album, by $artist)\n";
}

Go ahead and try for a summary:

./show --format summary girl songs.txt

Nothing should happen. When we ask for a summary, we are no longer printing anything, but only keeping track by incrementing... what are we incrementing?

$artists{$artist}++;

The “++” we’ve already met: it increments the variable to the left of it. The variable to the left looks vaguely like a value from a simple array, except that instead of using square brackets we’re using curly brackets. That’s how you tell the difference between a simple array and an associative array. Simple arrays use square brackets to get at their individual values, and associative arrays use curly brackets to get at their individual values.

If $artist contains “Eurythmics”, this will add one to the value of $artists{"Eurythmics"}. If that value didn’t previously exist, it is assumed to be 0 and now is 1. If it was 1, it is now 2, and so on.
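
The counting can be tried on its own, outside of the “show” script. This sketch sets $artist by hand instead of reading it from a song file; the artist names are only sample values.

```perl
#!/usr/bin/perl

# sample values standing in for artists read from a song file
$artist = "Eurythmics";
$artists{$artist}++;    # no previous value, so this becomes 1
$artists{$artist}++;    # now 2

$artist = "Taco";
$artists{$artist}++;    # a different key gets its own count

print "Eurythmics: ", $artists{"Eurythmics"}, "\n";
print "Taco: ", $artists{"Taco"}, "\n";
```

Each key keeps its own running total, and a key springs into existence the first time it is incremented.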

Finally, just outside of the end of the while block that loops through the song information, we can print out the summary:

if (%artists) {
    @artists = keys %artists;
    @artists = sort @artists;
    foreach $artist (@artists) {
        $artistCount = $artists{$artist};
        print "$artist: $artistCount\n";
    }
}

If the associative array %artists has anything in it--that is, if we’ve been keeping track of how many songs each artist has--we’ll perform the rest of this if block.

The first line inside the block gets the keys out of the %artists associative array. The keys are a simple list, so they go into @artists.

The second line sorts @artists, and then assigns the sorted @artists back to itself.

The next block is a foreach block. Very much like a while block, it loops through its lines for as long as it has something to loop through. The difference is that foreach gets its things to loop through from a simple array, in this case @artists. Foreach places each piece into the first item, in this case the scalar variable $artist.

So if there are three matching artists, Pink Floyd, Warren Zevon, and Stillwater, then after the sort the first time through $artist will contain “Pink Floyd”, the second time through “Stillwater”, and the third time through “Warren Zevon”.

Inside the foreach block, the first line assigns the artist’s total songs to the variable $artistCount, and the second line prints out the artist’s name and count.
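
Here is the same keys, sort, and foreach sequence as a stand-alone sketch. The counts are invented, and the script fills %artists directly rather than counting matches from a song file.

```perl
#!/usr/bin/perl

# invented counts for three artists
%artists = ("Pink Floyd" => 4, "Warren Zevon" => 2, "Stillwater" => 1);

# keys come back in no particular order, so sort them before looping
@artists = keys %artists;
@artists = sort @artists;

foreach $artist (@artists) {
    $artistCount = $artists{$artist};
    print "$artist: $artistCount\n";
}
```

The lines come out in alphabetical order by artist name, whatever order the keys were stored in.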

./show --format summary stand songs.txt

You should get several lines, including that Bing Crosby has 23 songs, Taco 11, and William S. Burroughs 1 matching “stand”.

Change the help subroutine to reflect the new format:

print "\t--format <$validFormats>: choose format for results\n";

Sort numerically

By default, “sort” will sort alphabetically by value. But if we’re willing to write our own subroutine we can sort by pretty much any criteria we want. Create a “byArtistCount” subroutine:

sub byArtistCount {
    return $artists{$b} <=> $artists{$a};
}

Add this to the sort line:

@artists = sort byArtistCount @artists;

And run the command again:

./show --format summary stand songs.txt

The top four artists should be Judy Garland, Bing Crosby, The Lennon Sisters, and Linda Ronstadt, at 51 songs, 23 songs, 14 songs, and 12 songs, respectively.

This subroutine is a special one for sorts. When sort calls a sort subroutine it is asking that subroutine which of two items should come first. Perl automatically puts the first item in $a and the second item in $b. If the subroutine returns a negative number, sort puts $a first. If the subroutine returns a positive number, sort puts $b first. If the subroutine returns 0, sort treats the two as equal and may put either one first.

The “<=>” operator is useful in sort subroutines, because it returns -1 if the number on the left is lower, 1 if the number on the right is lower, and 0 if both numbers are the same.

Which means that this subroutine ends up sorting the artist names according to their count in %artists; and because I’ve put $b on the left and $a on the right, it sorts in descending order.

If you’re comparing two pieces of text rather than two numbers, the “cmp” operator does the same thing for text that “<=>” does for numbers. Here’s a quick script that lets you play around with compares:

#!/usr/bin/perl

$item1 = shift;
$item2 = shift;

print "Text compare: ", $item1 cmp $item2, "\n";
print "Number compare: ", $item1 <=> $item2, "\n";

Call it “compare”, make it executable with chmod u+x, and play around with giving it two items:

./compare hello world

./compare 3 5
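
The two operators can also be combined in one sort subroutine. The following is only a sketch, not part of the “show” script: the subroutine name byCountThenName and the counts are made up. It sorts by count in descending order, and falls back to the artist name when two counts are tied.

```perl
#!/usr/bin/perl

# invented counts; two of the artists are tied at 2
%artists = ("Pink Floyd" => 2, "Warren Zevon" => 5, "Stillwater" => 2);

# <=> returns 0 on a tie, and 0 counts as false,
# so the name comparison after || is only used for ties
sub byCountThenName {
    return $artists{$b} <=> $artists{$a} || $a cmp $b;
}

@artists = sort byCountThenName keys %artists;

foreach $artist (@artists) {
    print "$artist: $artists{$artist}\n";
}
```

Warren Zevon comes first with the highest count, and the two tied artists follow in alphabetical order.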

A smarter join

Go back and ask for some format that doesn’t exist:

./show --format wriggling stand songs.txt

Format must be raw, simple, summary.

That should really be raw, simple, or summary. It’s grating to read otherwise. We can make our own subroutine that joins lists together but accepts a conjunction as well as a simple separator.

sub englishJoin {
    my($punctuation) = shift;
    my($conjunction) = shift;
    my(@items) = @_;

    my($joined, $finalItem);

    if ($#items == -1) {
        $joined = "";
    } elsif ($#items == 0) {
        $joined = $items[0];
    } elsif ($#items == 1) {
        $joined = "$items[0] $conjunction $items[1]";
    } else {
        $finalItem = pop(@items);
        $joined = join($punctuation, @items) . "$punctuation$conjunction $finalItem";
    }

    return $joined;
}

This subroutine is expecting that the first parameter it gets is the punctuation (the comma and space, in our case), the second parameter is the conjunction (“or”), and the rest of the parameters are the list that needs to be joined. The symbols @_ in a subroutine mean the list of parameters the subroutine has received, much like @ARGV means the list of command-line arguments. Inside of a subroutine, shift automatically shifts items out of @_ instead of @ARGV.

Subroutines, by default, have access to all of the variables that the script uses. We used this to our advantage in the byArtistCount sort subroutine. However, most of the time we want to make sure that the variables we use in a subroutine don’t accidentally clobber the other variables used in the script.

Any variable inside of a my() is “local” to the current subroutine. If another variable outside of the subroutine has the same name, that other variable won’t affect the “my” variable, and the “my” variable won’t affect that wider variable.

It is always a good idea to automatically “my” any variables a subroutine uses, unless you specifically want to be referencing outside variables.
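
A small sketch shows what “clobbering” looks like. Both subroutines here are hypothetical and exist only for this example: one uses “my” and one does not.

```perl
#!/usr/bin/perl

$count = 10;

# hypothetical subroutine that keeps its own private $count
sub careful {
    my($count) = 0;
}

# hypothetical subroutine that forgets the "my"
sub careless {
    $count = 0;
}

careful();
$afterCareful = $count;
print "After careful: $afterCareful\n";

careless();
$afterCareless = $count;
print "After careless: $afterCareless\n";
```

After careful() the script’s $count is still 10. After careless() it has been reset to 0, because that subroutine changed the script’s variable instead of one of its own.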

The characters “$#” in front of an array’s name give you the number of the current highest item in that simple array. Because the numbering starts at 0, that is always one less than the number of items. If the array currently has three items in it, the current highest item number is 2, and that’s what “$#” will give you. If the array has one item in it, the current highest item number is 0. If the array is empty, “$#” will give you -1.

So we have different if blocks depending on whether there are no items in the list (negative one), one item, two items, or three or more items.
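
The relationship between the highest item number and the number of items can be checked with a few lines. This sketch uses Perl’s built-in scalar function, which the tutorial text does not otherwise use, to get the item count; the variable names are made up.

```perl
#!/usr/bin/perl

@items = ("raw", "simple", "summary");
$highest = $#items;           # highest item number
$howMany = scalar(@items);    # number of items

@empty = ();
$highestEmpty = $#empty;      # nothing in the array at all

print "Highest item number: $highest\n";
print "Number of items: $howMany\n";
print "Highest item number of an empty array: $highestEmpty\n";
```

The three lines report 2, 3, and -1.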

Instead of using “eq” to check what $#items is equal to, we are using two equal signs. Perl uses “eq” and “ne” for comparing text. It uses “==” and “!=” for comparing numbers. This is important because Perl doesn’t care whether a variable is text or is a number until you ask it to make the comparison. Go back to your “compare” script and type:

./compare 10 2

You should get:

Text compare: -1
Number compare: 1

Alphabetically, 10 comes before 2. Numerically, 2 comes before 10. With a text compare “10.0” will not equal “10”. But numerically, 10.0 will equal 10. Use the correct operator depending on whether you want to compare as text or compare as a number.

Here, we are comparing as numbers.
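
The same difference shows up with “eq” and “==”. This sketch compares “10.0” and “10” both ways; the variable names are made up for the example.

```perl
#!/usr/bin/perl

$first = "10.0";
$second = "10";

# eq compares character by character; == compares the numeric values
$textEqual = ($first eq $second);
$numberEqual = ($first == $second);

if ($textEqual) {
    print "Same as text\n";
} else {
    print "Different as text\n";
}

if ($numberEqual) {
    print "Same as numbers\n";
} else {
    print "Different as numbers\n";
}
```

The script reports that the two values are different as text but the same as numbers.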

The final “else” has a few new things in it also. The pop function is the same as shift except that it takes an item off of the end of the array instead of the beginning.

Those are periods between the “join(...)” function and the text in quotes. If you want to add two numbers together, you use “+”. But if you want to add two strings to each other you use a period. This is also sometimes called concatenation.
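
Here is a brief sketch of pop, shift, and the period working together. It is separate from englishJoin, and the variable names are only for illustration.

```perl
#!/usr/bin/perl

@items = ("raw", "simple", "summary");

$finalItem = pop(@items);      # takes "summary" off the end
$firstItem = shift(@items);    # takes "raw" off the beginning

# periods glue the pieces of text together
$sentence = "First: " . $firstItem . ", last: " . $finalItem;

print "$sentence\n";
print "Left over: @items\n";
```

This prints “First: raw, last: summary”, and the only item left in @items is “simple”.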

Change

$validFormats = join(", ", @validFormats);

to

$validFormats = englishJoin(", ", "or", @validFormats);

And now:

./show --format wriggling stand songs.txt

Format must be raw, simple, or summary.

So, now it works, and it will work for any future formats that we add. We also have a new subroutine available if we need a more readable join for any list.
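
To check each branch of englishJoin, the sketch below copies the subroutine into a stand-alone script and calls it with zero, one, two, and three items. The conjunction “and” and the variable names $none, $one, $two, and $three are choices made only for this example.

```perl
#!/usr/bin/perl

$none = englishJoin(", ", "and");
$one = englishJoin(", ", "and", "raw");
$two = englishJoin(", ", "and", "raw", "simple");
$three = englishJoin(", ", "and", "raw", "simple", "summary");

print "No items: '$none'\n";
print "One item: '$one'\n";
print "Two items: '$two'\n";
print "Three items: '$three'\n";

# copied from the subroutine above
sub englishJoin {
    my($punctuation) = shift;
    my($conjunction) = shift;
    my(@items) = @_;

    my($joined, $finalItem);

    if ($#items == -1) {
        $joined = "";
    } elsif ($#items == 0) {
        $joined = $items[0];
    } elsif ($#items == 1) {
        $joined = "$items[0] $conjunction $items[1]";
    } else {
        $finalItem = pop(@items);
        $joined = join($punctuation, @items) . "$punctuation$conjunction $finalItem";
    }

    return $joined;
}
```

Two items are joined by the conjunction alone, and three or more get the punctuation plus the conjunction before the final item.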

Format conversions

It is now very easy to add new formats. One common use of Perl is to convert data into HTML. Our song listings could just as easily be turned into HTML table rows for insertion into an HTML table.

First, add a new format called “html” to @validFormats.

@validFormats = ("raw", "simple", "html", "summary");

Second, add a new “elsif” to the part of the script that displays the data:

} elsif ($format eq "html") {
    print "<tr><td>$song</td><td>$album</td><td>$artist</td></tr>\n";

Now, repeat some of your previous searches, but ask for the format to be html instead. The data will be displayed in rows that could be included as part of a web page:

./show --album yellow --song girl --format html songs.txt

<tr><td>Young Girl Blues</td><td>Mellow Yellow</td><td>Donovan</td></tr>
<tr><td>Dirty little girl</td><td>Goodbye Yellow Brick Road</td><td>Elton John</td></tr>
<tr><td>All the girls love Alice</td><td>Goodbye Yellow Brick Road</td><td>Elton John</td></tr>

If your web server supports server-side includes, you can automatically include this in your web page. Write it to a file using “>” redirection and include that file.

The current script

#!/usr/bin/perl
#Search for songs in a file of the following tab-separated data:
# title, duration, artist, album, year, rating, rip date, track position, genre


#options for the --format switch
@validFormats = ("raw", "simple", "html", "summary");
$validFormats = englishJoin(", ", "or", @validFormats);

#strip off the command-line switches and act on or remember them
while ($ARGV[0] =~ /^--(.+)/) {
    $switch = $1;

    #pull this switch off of the front of the list
    shift;

    #if they ask for help, do it and exit
    if ($switch eq "help") {
        help();
        exit;
    } elsif ($switch eq "case") {
        $sensitive = 1;
    } elsif ($switch eq "reverse") {
        $reverse = 1;
    } elsif ($switch eq "limit") {
        $limit = shift;
        if ($limit !~ /^[1-9][0-9]*$/) {
            print "\nYou must limit to a number, such as '33' or '2'.\n\n";
            help();
            exit;
        }
    } elsif ($switch eq "format") {
        $format = shift;
        if (!grep(/^$format$/, @validFormats)) {
            print "\nFormat must be $validFormats.\n\n";
            help();
            exit;
        }
    } else {
        print "\nI do not understand the option '$switch'.\n\n";
        help();
        exit;
    }
}

#the first item on the command line is what we're searching for
if ($searchFor = shift) {
    while (<>) {
        #split out the song, duration, artist, and album
        ($song, $duration, $artist, $album) = split("\t");

        if ($sensitive) {
            $matched = /$searchFor/;
        } else {
            $matched = /$searchFor/i;
        }

        #reverse the match if we want non-matching lines
        if ($reverse) {
            $matched = !$matched;
        }

        #print the information if this line is one we want
        if ($matched) {
            $matches++;
            if ($format eq "raw") {
                print;
            } elsif ($format eq "html") {
                print "<tr><td>$song</td><td>$album</td><td>$artist</td></tr>\n";
            } elsif ($format eq "summary") {
                $artists{$artist}++;
            } else {
                print "$song ($album, by $artist)\n";
            }
        }
        last if $limit && $matches >= $limit;
    }

    if (%artists) {
        @artists = keys %artists;
        @artists = sort byArtistCount @artists;
        foreach $artist (@artists) {
            $artistCount = $artists{$artist};
            print "$artist: $artistCount\n";
        }
    }
} else {
    help();
}

#describe how this script is used
sub help {
    print "Syntax: show <search text> [song files]\n";
    print "\tSearch for <search text> in the song file. If no song file is specified\n";
    print "\t'show' will expect it on standard input.\n";
    print "\tA song file is a tab-delimited file with:\n";
    print "\ttitle, duration, artist, album, year, rating, rip date, track position, genre\n";
    print "\t--help: print this help text\n";
    print "\t--case: be sensitive to upper and lower case\n";
    print "\t--reverse: filter out songs that contain the search text\n";
    print "\t--limit x: limit to x results\n";
    print "\t--format <$validFormats>: choose format for results\n";
}

sub byArtistCount {
    return $artists{$b} <=> $artists{$a};
}

sub englishJoin {
    my($punctuation) = shift;
    my($conjunction) = shift;
    my(@items) = @_;

    my($joined, $finalItem);

    if ($#items == -1) {
        $joined = "";
    } elsif ($#items == 0) {
        $joined = $items[0];
    } elsif ($#items == 1) {
        $joined = join(" $conjunction ", @items);
    } else {
        $finalItem = pop(@items);
        $joined = join($punctuation, @items) . "$punctuation$conjunction $finalItem";
    }

    return $joined;
}