Regular Expressions

Unix and the Mac

Before we go any further, you’ll have to understand some Unix terms, and some Macintosh “internals”.

Filepath

A “filepath” is the “path” to a “file”. The path names the file, the folder the file is in, the folder that folder is in, and so on up to the hard drive, written from the outermost level down to the file. Each folder is separated from the next by a colon. If you have a file called “Stacey’s Workbook” in a folder called “Stacey” on your hard drive, which is called “My Macintosh”, the filepath for Stacey’s Workbook is My Macintosh:Stacey:Stacey’s Workbook.

Under the Macintosh, the “base” of the filepath is the volume. If you have a “DPF” volume, any files inside “DPF” will be referenced as “DPF:filename”.

If you have to deal with Unix (and you do, because its conventions are used throughout the web), you’ll see slashes instead of colons. For example, all the “referers” you see will look like “http://www.somewhere.com/this/is/a/filepath.html”.
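As a rough illustration of the two conventions, here is a minimal Python sketch that takes the colon-separated example path apart and rewrites it with slashes. The variable names are made up for this example, and the slash version only shows the change of separator; it is not meant as a real Unix location.

```python
# The Macintosh-style filepath from the example above.
mac_path = "My Macintosh:Stacey:Stacey’s Workbook"

# Each colon separates one level of the path from the next.
parts = mac_path.split(":")

volume = parts[0]      # the "base" of the path: the volume (hard drive)
folders = parts[1:-1]  # any folders between the volume and the file
filename = parts[-1]   # the file itself

# The same levels written with Unix-style slashes (illustration only).
unix_style = "/".join(parts)

print(volume)
print(folders)
print(filename)
print(unix_style)
```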

Quoting Special Characters

Regular expressions use special characters to mean special things. For example, you’ll use a period (.) to mean any single character. But sometimes a period is just a period. To tell the expression not to use the special meaning of a special character, precede the character with a backslash. That’s called “quoting” the character.

If you want a period to really be a period inside your regular expression, type it as “\.”.
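The difference between a quoted and an unquoted period can be sketched with Python’s re module. This is only a stand-in: the program you are using may differ in details, and the file names below are made up for illustration.

```python
import re

# Hypothetical file names, for illustration only.
names = ["logo.gif", "logo-gif", "logoXgif"]

# Unquoted period: "." stands for any single character,
# so "-gif" and "Xgif" match as well as ".gif".
unquoted = [n for n in names if re.search(r".gif", n)]

# Quoted period: "\." stands for a literal period and nothing else.
quoted = [n for n in names if re.search(r"\.gif", n)]

print(unquoted)  # all three names
print(quoted)    # only "logo.gif"
```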

Regular Expressions

Imagining the Unknown

What is a regular expression?

A “regular expression” is a way of saying more than one thing at a time--talking out of both sides of your face. Using regular expressions, you can tell Paul Bunyan to only look at the files ending in “.gif”, for example, or to ignore any hosts beginning with “192.55.87”.

In most Macintosh programs, regular expressions are case insensitive (BBEdit is an exception). If you ever use this knowledge in Unix, remember that the default there is case sensitive, meaning that “.gif” is different from “.GIF”. Scripts written in Perl use “Perl” regular expressions; see the Perl Regular Expressions for more detailed information about regular expressions in Perl. Here, I’m only going to cover what you’re likely to need for basic analysis.
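To illustrate the case-sensitivity point, here is a small Python sketch. Python’s re module is case sensitive by default, like the Unix tools mentioned above, and has a flag to ignore case. The page name is hypothetical.

```python
import re

# A hypothetical page name with an upper-case extension.
page = "images/banner.GIF"

# Case sensitive (the default in Python, as on Unix):
# ".gif" is not the same as ".GIF", so this does not match.
sensitive = re.search(r"\.gif$", page) is not None

# Case insensitive: the IGNORECASE flag makes ".gif" match ".GIF".
insensitive = re.search(r"\.gif$", page, re.IGNORECASE) is not None

print(sensitive)    # False
print(insensitive)  # True
```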

Regular expressions use “special characters” to describe which hosts or pages match, and therefore which ones should be used or ignored.

Regular Expression Special Characters

The Alpha and the Omega

Two of the most important special characters you’ll be using are the character for “beginning” and the character for “end”. Normally, when you specify a string of characters, the expression “matches” them if they occur anywhere in the page or host. If you want the expression to match only if the string occurs at the beginning of the page or host, begin your regular expression with the caret, “^”. If you want the expression to match only if the string occurs at the end of the page or host, end your expression with the dollar sign, “$”.

Yes, you can use both of them: then the page has to both begin and end with your strings. In other words, it has to be your string.

  • ^:DPF:Members\.html$ matches only the page “:DPF:Members.html”--because the regular expression specifies that this string of characters has to be both at the beginning and at the end. (The period is quoted so that it matches only a real period.)
  • ^:DPF: matches any page inside the “DPF” folder. This regular expression requires only that the string of characters “:DPF:” appear at the beginning of the filepath in question.
  • \.html$ matches any page that ends in “.html”. (Note the “quoting” of the period--otherwise, it would have meant any page ending in any single character followed by “html”, such as “Members-html”.)
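The three expressions in the list can be tried out with a short Python sketch. Python’s re module is used here only as a stand-in for the program’s own matching, the filepaths are hypothetical, and the period is quoted in every pattern.

```python
import re

# Hypothetical filepaths in the colon-separated style used above.
pages = [
    ":DPF:Members.html",
    ":DPF:Images:logo.gif",
    ":Other:DPF:Members.html",
    ":DPF:Members.html.bak",
]

def matching(pattern):
    """Return the pages in which the regular expression matches."""
    return [p for p in pages if re.search(pattern, p)]

# Both anchors: the page has to be exactly this string.
exact = matching(r"^:DPF:Members\.html$")

# Caret only: the page has to begin with ":DPF:".
in_dpf = matching(r"^:DPF:")

# Dollar sign only: the page has to end in ".html".
html = matching(r"\.html$")

# No anchors: ":DPF:" may occur anywhere in the page.
anywhere = matching(r":DPF:")

print(exact)
print(in_dpf)
print(html)
print(anywhere)
```

Without the caret, “:Other:DPF:Members.html” also matches “:DPF:”, which is why the anchor matters when you mean “inside the DPF folder at the top level”.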

Any Single Character

.

Any Number of Characters

?, *, +

Classes of Characters

[]

This or That

|