More Powerful Regular Expressions
After already reading three chapters about regular expressions, you know that they're a powerful feature in the core of Perl. But there are even more features that the Perl developers have added; we'll see some of the most important ones in this section. At the same time, you'll see a little more about the internal operation of the regular expression engine.
Non-greedy Quantifiers
The four quantifiers we've already seen (in "More About Regular Expressions") are all greedy. That means that they match as much as they can, only to reluctantly give some back if that's necessary to allow the overall pattern to succeed. Here's an example: Suppose you're using the pattern /fred.+barney/ on the string fred and barney went bowling last night. Of course, we know that the regular expression will match that string, but let's see how it goes about it.[364]
[364]The regular expression engine makes a few optimizations that make the true story different than we tell it here, and those optimizations change from one release of Perl to the next. You shouldn't be able to tell from the functionality that it's not doing as we say, though. If you want to know how it really works, you should read the latest source code. Be sure to submit patches for any bugs you find.
First, of course, the subpattern fred matches the identical literal string. The next part of the pattern is the +, which matches any character except newline, at least one time. But the plus quantifier is greedy; it prefers to match as much as possible. So it immediately matches all of the rest of the string, including the word night. (This may surprise you, but the story isn't over yet.)
Now the subpattern barney would like to match, but it can't -- we're at the end of the string. But since the + could still be successful even if it matched one fewer character, it reluctantly gives back the letter t at the end of the string. (It's greedy, but it wants the whole pattern to succeed even more than it wants to match everything all by itself.)
The subpattern barney tries again to match, and still can't. So the + gives back the letter h and lets it try again. One character after another, the + gives back what it matched until finally it gives up all of the letters of barney. Now, finally, the subpattern barney can match, and the overall match succeeds.
Regular expression engines do a lot of backtracking like that, trying every different way of fitting the pattern to the string until one of them succeeds, or until none of them has.[365] But as you could see from this example, that can involve a lot of backtracking, as the quantifier gobbles up too much of the string and has to be forced to return some of it.
[365]In fact, some regular expression engines try every different way, even continuing on after they find one that fits. But Perl's regular expression engine is primarily interested in whether the pattern can or cannot match, so finding even one match means that the engine's work is done. Again, see Jeffrey Friedl's Mastering Regular Expressions.
For each of the greedy quantifiers, though, there's also a non-greedy quantifier available. Instead of the plus (+), we can use the non-greedy quantifier +?, which matches one or more times (just as the plus does), except that it prefers to match as few times as possible, rather than as many as possible. Let's see how that new quantifier works when the pattern is rewritten as /fred.+?barney/.
Once again, fred matches right at the start. But this time the next part of the pattern is +?, which would prefer to match no more than one character, so it matches just the space after fred. The next subpattern is barney, but that can't match here (since the string at the current position begins with and barney...). So the +? reluctantly matches the a and lets the rest of the pattern try again. Once again, barney can't match, so the +? accepts the letter n and so on. Once the +? has matched five characters, barney can match, and the pattern is a success.
There was still some backtracking, but since the engine had to go back and try again just a few times, it should be a big improvement in speed. Well, it's an improvement if you'll generally find barney near fred. If your data often had fred near the start of the string and barney only at the end, the greedy quantifier might be a faster choice. In the end, the speed of the regular expression depends upon the data.
But the non-greedy quantifiers aren't just about efficiency. Although they'll always match (or fail to match) the same strings as their greedy counterparts, they may match different amounts of the strings. For example, suppose you had some HTML-like[366] text, and you want to remove all of the tags <BOLD> and </BOLD>, leaving their contents intact. Here's the text:
[366]Once again, we aren't using real HTML because you can't correctly parse HTML with simple regular expressions. If you really need to work with HTML or a similar markup language, use a module that's made to handle the complexities.
I'm talking about the cartoon with Fred and <BOLD>Wilma</BOLD>!
And here's a substitution to remove those tags. But what's wrong with it?
s#<BOLD>(.*)</BOLD>#$1#g;
The problem is that the star is greedy.[367] What if the text had said this instead?
[367]There's another possible problem: we should have used the
/smodifier as well, since the end tag may be on a different line than the start tag. It's a good thing that this is just an example; if we were writing something like this for real, we would have taken our own advice and used a well-written module.
I thought you said Fred and <BOLD>Velma</BOLD>, not <BOLD>Wilma</BOLD>
In that case, the pattern would match from the first <BOLD> to the last </BOLD>, leaving intact the ones in the middle of the line. Oops! Instead, we want a non-greedy quantifier. The non-greedy form of star is *?, so the substitution now looks like this:
s#<BOLD>(.*?)</BOLD>#$1#g;
And it does the right thing.
Since the non-greedy form of the plus was +? and the non-greedy form of the star was *?, you've probably realized that the other two quantifiers look similar. The non-greedy form of any curly-brace quantifier looks the same, but with a question mark after the closing brace, like {5,10}? or {8,}?.[368] And even the question-mark quantifier has a non-greedy form: ??. That matches either once or not at all, but it prefers not to match anything.
[368]In theory, there's also a non-greedy quantifier form that specifies an exact number, like
{3}?. But since that says to match exactly three of the preceding item, it has no flexibility to be either greedy or non-greedy.
Matching Multiple-line Text
Classic regular expressions were used to match just single lines of text. But since Perl can work with strings of any length, Perl's patterns can match multiple lines of text as easily as single lines. Of course, you have to include an expression that holds more than one line of text. Here's a string that's four lines long:
$_ = "I'm much better\nthan Barney is\nat bowling,\nWilma.\n";
Now, the anchors ^ and $ are normally anchors for the start and end of the whole string (see "Anchors" in "More About Regular Expressions"). But the /m regular expression option lets them match at internal newlines as well (think m for multiple lines). This makes them anchors for the start and end of each line, rather than the whole string. So this pattern can match:
print "Found 'wilma' at start of line\n" if /^wilma\b/im;
Similarly, you could do a substitution on each line in a multiline string. Here, we read an entire file into one variable,[369] then add the file's name as a prefix at the start of each line:
[369]Hope it's a small one. The file, that is, not the variable.
open FILE, $filename or die "Can't open '$filename': $!"; my $lines = join '', <FILE>; $lines =~ s/^/$filename: /gm;