More About Regular Expressions
Contents:
Character Classes
General Quantifiers
Anchors
Memory Parentheses
Precedence
Exercises
In the previous chapter, we saw the beginnings of what regular expressions can do. Here we'll see some of their other common features.
Character Classes
A character class, a list of possible characters inside square brackets ([]), matches any single character from within the class. It matches just one single character, but that one character may be any of the ones listed.
For example, the character class [abcwxyz] may match any one of those seven characters. For convenience, you may specify a range of characters with a hyphen (-), so that class may also be written as [a-cw-z]. That didn't save much typing, but it's more usual to make a character class like [a-zA-Z], to match any one letter out of that set of 52.[173] You may use the same character shortcuts as in any double-quotish string to define a character, so the class [\000-\177] matches any seven-bit ASCII character.[174]
[173]Notice that those 52 don't include letters like Å and É and Î and Ø and Ü. The range [a-zA-Z] covers only the unaccented ASCII letters, even when Unicode processing is available.
[174]At least, if you use ASCII and not EBCDIC.
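As a minimal sketch of the range notation described above, the following compares the fully listed class with its hyphenated form on a few sample characters. The variable names (@samples, %matched) are made up for this illustration.

```perl
use strict;
use warnings;

# Sample characters: some inside the class, some outside it.
my @samples = ('a', 'c', 'd', 'w', 'z', 'M');

my %matched;
for my $char (@samples) {
    my $listed = $char =~ /[abcwxyz]/ ? 1 : 0;   # every character spelled out
    my $ranged = $char =~ /[a-cw-z]/  ? 1 : 0;   # same class, written with ranges
    $matched{$char} = [$listed, $ranged];
    print "$char: listed=$listed ranged=$ranged\n";
}
```

Both spellings should agree on every character, since they describe the same set of seven.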
Of course, a character class will be just part of a full pattern; it will never stand on its own in Perl. For example, you might see code that says something like this:
    $_ = "The HAL-9000 requires authorization to continue.";
    if (/HAL-[0-9]+/) {
      print "The string mentions some model of HAL computer.\n";
    }
Sometimes, it's easier to specify the characters left out, rather than the ones within the character class. A caret (^) at the start of the character class negates it. That is, [^def] will match any single character except one of those three. And [^n\-z] matches any character except for n, hyphen, or z. (Note that the hyphen is backslashed, because it's special inside a character class. But the first hyphen in /HAL-[0-9]+/ doesn't need a backslash, because hyphens aren't special outside a character class.)
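A short sketch of negated classes follows. Keep in mind that a pattern like /[^def]/ succeeds as soon as the string contains at least one character outside the class, so it fails only when every character is one of d, e, or f. The test strings and variable names are invented for this example.

```perl
use strict;
use warnings;

# "fed" contains only d, e, and f, so no character satisfies [^def].
my $only_def = "fed" =~ /[^def]/ ? 1 : 0;

# "fad" contains an a, which is outside the class.
my $has_other = "fad" =~ /[^def]/ ? 1 : 0;

# The backslashed hyphen is a literal hyphen, not a range from n to z.
my $only_n_hyphen_z = "n-z" =~ /[^n\-z]/ ? 1 : 0;

# The letter o is not n, hyphen, or z, so this one matches.
my $has_o = "n-o-z" =~ /[^n\-z]/ ? 1 : 0;

print "fed: $only_def, fad: $has_other, n-z: $only_n_hyphen_z, n-o-z: $has_o\n";
```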
Character Class Shortcuts
Some character classes appear so frequently that they have shortcuts. For example, the character class for any digit, [0-9], may be abbreviated as \d. Thus, the pattern from the example about HAL could be written /HAL-\d+/ instead.
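To illustrate, this sketch runs both spellings of the HAL pattern against the same plain ASCII string. The variable names are chosen only for this example.

```perl
use strict;
use warnings;

my $text = "The HAL-9000 requires authorization to continue.";

# The explicit class and the \d shortcut, applied to the same string.
my $with_class    = $text =~ /HAL-[0-9]+/ ? 1 : 0;
my $with_shortcut = $text =~ /HAL-\d+/    ? 1 : 0;

# A string with no digits after the hyphen fails with the shortcut.
my $no_digits = "HAL-nine-thousand" =~ /HAL-\d+/ ? 1 : 0;

print "class=$with_class shortcut=$with_shortcut no_digits=$no_digits\n";
```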
The shortcut \w is a so-called "word" character: [A-Za-z0-9_]. If your "words" are made up of ordinary letters, digits, and underscores, you'll be happy with this. Most of the rest of us have words made up of ordinary letters, hyphens, and apostrophes,[175] and we'd like to change this. As of this writing, the Perl developers are working on it, but it's not available yet.[176] So use this one only when you want ordinary letters, digits, and underscores.
[175]At least, in usual English we do. In other languages, you may have different components of words. And when looking at ASCII-encoded English text, we have the problem that the single quote and the apostrophe are the same character, so it's not possible in isolation to tell whether cats' is a word with an apostrophe or a word at the end of a quotation. This is probably one reason that computers haven't been able to take over the world yet.
[176]Except to a limited (but nevertheless useful) extent in connection with locales; see the perllocale manpage.
Of course, \w doesn't match a "word"; it merely matches a single "word" character. To match an entire word, though, the plus modifier is handy. A pattern like /fred \w+ barney/ will match fred and a space, then a "word", then a space and barney. That is, it'll match if there's one word[177] between fred and barney, set off by single spaces.
[177]We're going to stop saying "word" in quotes so much; you know by now that these letter-digit-underscore words are the ones we mean.
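Here is a small sketch of that pattern applied to a few lines. The sample lines and the %found hash are invented for this illustration.

```perl
use strict;
use warnings;

my @lines = (
    "fred and barney",        # exactly one word between the names
    "fred x_9 barney",        # letters, digits, and underscore all count
    "fred  barney",           # two spaces, but no word in between
    "fred and then barney",   # two words in between
);

my %found;
for my $line (@lines) {
    $found{$line} = $line =~ /fred \w+ barney/ ? 1 : 0;
    print "'$line': $found{$line}\n";
}
```

Only the first two lines should match, because the pattern insists on one run of word characters with a single space on each side.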
As you may have noticed in that previous example, it might be handy to be able to match spaces more flexibly. The \s shortcut is good for whitespace; it's the same as [\f\t\n\r ]. That is, it's the same as a class containing the five whitespace characters form-feed, tab, newline, carriage return, and the space character itself. These are the characters that merely move the printing position around; they don't use any ink. Still, like the other shortcuts we've just seen, \s matches just a single character from the class, so it's usual to use either \s* for any amount of whitespace (including none at all), or \s+ for one or more whitespace characters. (In fact, it's rare to see \s without one of those quantifiers.) Since all of those whitespace characters look about the same to us humans, we can treat them all in the same way with this shortcut.
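The following sketch contrasts the literal-space pattern with versions that use \s+ and \s*. The test strings and variable names are assumptions made up for this example.

```perl
use strict;
use warnings;

# Mixed whitespace: a space, a tab, another space, then a newline.
my $messy = "fred \t and\nbarney";

# Literal single spaces are too strict for this string.
my $strict_spaces = $messy =~ /fred \w+ barney/ ? 1 : 0;

# \s+ accepts any run of one or more whitespace characters.
my $flexible = $messy =~ /fred\s+\w+\s+barney/ ? 1 : 0;

# \s* also accepts no whitespace at all; \s+ does not.
my $star_none = "fredbarney" =~ /fred\s*barney/ ? 1 : 0;
my $plus_none = "fredbarney" =~ /fred\s+barney/ ? 1 : 0;

print "strict=$strict_spaces flexible=$flexible star=$star_none plus=$plus_none\n";
```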
Negating the Shortcuts
Sometimes you may want the opposite of one of these three shortcuts. That is, you may want [^\d], [^\w], or [^\s], meaning a nondigit character, a nonword character, or a nonwhitespace character. That's easy enough to accomplish by using their uppercase counterparts: \D, \W, or \S. These match any character that their counterpart would not match.
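A brief sketch of the uppercase shortcuts follows. Each test asks whether the string contains at least one character of the negated kind. The sample strings and variable names are invented for this example.

```perl
use strict;
use warnings;

# "2001" is all digits, so there is no nondigit for \D to find.
my $nondigit_in_digits = "2001" =~ /\D/ ? 1 : 0;

# The hyphen in "HAL-9000" is not a word character, so \W finds it.
my $nonword_in_hal = "HAL-9000" =~ /\W/ ? 1 : 0;

# Letters, digits, and underscore are all word characters.
my $nonword_in_ident = "abc_123" =~ /\W/ ? 1 : 0;

# A string of only whitespace has nothing for \S to match.
my $nonspace_in_blank = " \t\n" =~ /\S/ ? 1 : 0;

print "$nondigit_in_digits $nonword_in_hal $nonword_in_ident $nonspace_in_blank\n";
```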
Any of these shortcuts will work either in place of a character class (standing on their own in a pattern), or inside the square brackets of a larger character class. That means that you could now use /[\dA-Fa-f]+/ to match hexadecimal (base 16) numbers, which use letters ABCDEF (or the same letters in lowercase) as additional digits.
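As a rough sketch, the hexadecimal class can be tried on a few strings. Because the pattern is not anchored, it succeeds whenever the string contains at least one hexadecimal digit anywhere, so a word like "cafe" matches too. The sample strings and the %has_hex hash are made up for this illustration.

```perl
use strict;
use warnings;

my @candidates = ("1F9C", "cafe", "Port 1F9C", "xyz!", "");

my %has_hex;
for my $string (@candidates) {
    # True if at least one hexadecimal digit appears somewhere in the string.
    $has_hex{$string} = $string =~ /[\dA-Fa-f]+/ ? 1 : 0;
    print "'$string': $has_hex{$string}\n";
}
```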
Another compound character class is [\d\D], which means any digit, or any non-digit. That is to say, any character at all! This is a common way to match any character, even a newline. (As opposed to ., which matches any character except a newline.) And then there's the totally useless [^\d\D], which matches anything that's not either a digit or a non-digit. Right -- nothing!
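The difference shows up most clearly on a string that holds nothing but a newline, as in this small sketch. The variable names are chosen only for this example.

```perl
use strict;
use warnings;

my $newline = "\n";

# [\d\D] accepts every character, including the newline.
my $any_char = $newline =~ /[\d\D]/ ? 1 : 0;

# The dot, with no modifiers, refuses to match a newline.
my $dot = $newline =~ /./ ? 1 : 0;

# [^\d\D] excludes digits and nondigits alike, so it can never match.
my $nothing = "any text 123\n" =~ /[^\d\D]/ ? 1 : 0;

print "any_char=$any_char dot=$dot nothing=$nothing\n";
```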