Finding the Nth Occurrence of a Match

Problem

You want to find the Nth match in a string, not just the first one. For example, you'd like to find the word preceding the third occurrence of "fish":

One fish two fish red fish blue fish

Solution

Use the /g modifier in a while loop, keeping count of matches:

$WANT = 3;
$count = 0;
while (/(\w+)\s+fish\b/gi) {
    if (++$count == $WANT) {
        print "The third fish is a $1 one.\n";
        # Warning: don't `last' out of this loop
    }
}

The third fish is a red one.

Or use a repetition count and repeated pattern like this:

/(?:\w+\s+fish\s+){2}(\w+)\s+fish/i;
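
The repetition count need not be hard-coded. Here is a minimal sketch that interpolates it from a variable; the names $pond, $n, $skip, and $color are illustrative and not part of the recipe. It assumes each fish is preceded by exactly one word, with nothing else between fish:

```perl
use strict;
use warnings;

my $pond = 'One fish two fish red fish blue fish';
my $n    = 3;          # which fish we want (illustrative value)
my $skip = $n - 1;     # number of "word fish" pairs to step over first

# $skip is interpolated into the {COUNT} quantifier before the pattern compiles.
# In list context the match returns the captured word.
my ($color) = $pond =~ /(?:\w+\s+fish\s+){$skip}(\w+)\s+fish/i;

print "Fish number $n is a $color one.\n" if defined $color;
```

If the string holds fewer than $n fish, the match fails and $color stays undefined, so nothing is printed.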

As explained in the chapter introduction, using the /g modifier in scalar context creates something of a progressive match, useful in while loops. This is commonly used to count the number of times a pattern matches in a string:

# simple way with while loop
$count = 0;
while ($string =~ /PAT/g) {
    $count++;               # or whatever you'd like to do here
}

# same thing with trailing while
$count = 0;
$count++ while $string =~ /PAT/g;

# or with for loop
for ($count = 0; $string =~ /PAT/g; $count++) { }

# Similar, but this time count overlapping matches
$count++ while $string =~ /(?=PAT)/g;
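
To see how the plain and overlapping counts differ, here is a small sketch using a made-up string and pattern (both are assumptions chosen only for illustration):

```perl
use strict;
use warnings;

my $string = 'aaaa';    # illustrative data

# Each plain match consumes two characters, so matches start at offsets 0 and 2.
my $plain = 0;
$plain++ while $string =~ /aa/g;

# The lookahead consumes nothing, so a match is found at offsets 0, 1, and 2.
my $overlapping = 0;
$overlapping++ while $string =~ /(?=aa)/g;

print "plain: $plain, overlapping: $overlapping\n";
```

The second loop starts from the beginning of the string because a failed /g match in scalar context resets the search position.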

To find the Nth match, it's easiest to keep your own counter. When you reach the appropriate N, do whatever you care to. A similar technique could be used to find every Nth match by checking for multiples of N using the modulus operator. For example, (++$count % 3) == 0 would be every third match.
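
As a sketch of the modulus test in a counting loop, the following collects every third match. The longer pond string and the variable names are assumptions made up for this illustration:

```perl
use strict;
use warnings;

# Illustrative data: six fish, so that every third one yields two hits.
my $pond  = 'One fish two fish red fish blue fish old fish new fish';
my $n     = 3;
my $count = 0;
my @every_nth;

while ($pond =~ /(\w+)\s+fish\b/gi) {
    # keep the captured word only when the running count is a multiple of $n
    push @every_nth, $1 if ++$count % $n == 0;
}

print "Every third fish: @every_nth\n";
```

Because the loop runs to completion rather than exiting early, the search position is reset when the match finally fails.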

If this is too much bother, you can always extract all matches and then hunt for the ones you'd like.

$pond = 'One fish two fish red fish blue fish';

# using a temporary
@colors = ($pond =~ /(\w+)\s+fish\b/gi);        # get all matches
$color  = $colors[2];                           # then the one we want

# or without a temporary array
$color = ( $pond =~ /(\w+)\s+fish\b/gi )[2];    # just grab the third element

print "The third fish in the pond is $color.\n";

The third fish in the pond is red.

Or finding all even-numbered fish:

$count = 0;
$_ = 'One fish two fish red fish blue fish';
@evens = grep { $count++ % 2 == 1 } /(\w+)\s+fish\b/gi;
print "Even numbered fish are @evens.\n";

Even numbered fish are two blue.

For substitution, the replacement value should be a code expression that returns the proper string. Make sure to return the original as a replacement string for the cases you aren't interested in changing. Here we fish out the fourth specimen and turn it into a snack:

$count = 0;
s{
    \b              # makes next \w more efficient
    ( \w+ )         # this is what we'll be changing
    (
        \s+ fish \b
    )
}{
    if (++$count == 4) {
        "sushi" . $2;
    } else {
        $1 . $2;
    }
}gex;

One fish two fish red fish sushi fish

Picking out the last match instead of the first one is a fairly common task. The easiest way is to skip the beginning part greedily. After /.*\b(\w+)\s+fish\b/, for example, the $1 variable would have the last fish.
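
A short sketch of the greedy-skip idea follows. The variable names are illustrative, and the /s modifier is an added assumption so that .* can also cross newlines in multiline data:

```perl
use strict;
use warnings;

my $pond = 'One fish two fish red fish blue fish swim here.';

# The greedy .* first runs to the end of the string, then backs off just far
# enough for the rest of the pattern to match, which lands on the final fish.
my ($last) = $pond =~ /.*\b(\w+)\s+fish\b/is;

print "Last fish is $last.\n" if defined $last;
```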

Another way to get arbitrary counts is to make a global match in list context to produce all hits, then extract the desired element of that list:

$pond = 'One fish two fish red fish blue fish swim here.';
$color = ( $pond =~ /\b(\w+)\s+fish\b/gi )[-1];
print "Last fish is $color.\n";

Last fish is blue.

If you need to express this same notion of finding the last match in a single pattern without /g, you can do so with the negative lookahead assertion (?!THING). When you want the last match of arbitrary pattern A, you find an A that is not followed, anywhere later in the string, by another A. The general construct is A(?!.*A), which can be broken up for legibility:

m{
    A               # find some pattern A
    (?!             # mustn't be able to find
        .*          # something
        A           # and A
    )               # anywhere in the rest of the string
}x

That leaves us with this approach for selecting the last fish:

$pond = 'One fish two fish red fish blue fish swim here.';
if ($pond =~ m{
                \b ( \w+ ) \s+ fish \b
                (?! .* \b fish \b )
              }six )
{
    print "Last fish is $1.\n";
} else {
    print "Failed!\n";
}

Last fish is blue.

This approach has the advantage that it can fit in just one pattern, which makes it suitable for similar situations as shown in . It has its disadvantages, though. It's obviously much harder to read and understand, although once you learn the formula, it's not too bad. It also runs more slowly, taking around twice as long on the data set tested above.