in the
original string.
To escape the special meaning of C<.>, we use C<\Q>:
    $string = "Placido P. Octopus";
    $regex  = "P.";
    $string =~ s/\Q$regex/Polyp/;
    # $string is now "Placido Polyp Octopus"
The use of C<\Q> causes the C<.> in the regex to be treated as a
regular character, so that C<P.> matches a C<P> followed by a dot.
=head2 What is C</o> really for?
(contributed by brian d foy)
The C</o> option for regular expressions (documented in L<perlop> and
L<perlreref>) tells Perl to compile the regular expression only once.
This is only useful when the pattern contains a variable. Perls 5.6
and later handle this automatically if the pattern does not change.
Since the match operator C<m//>, the substitution operator C<s///>,
and the regular expression quoting operator C<qr//> are double-quotish
constructs, you can interpolate variables into the pattern. See the
answer to "How can I quote a variable to use in a regex?" for more
details.
This example takes a regular expression from the argument list and
prints the lines of input that match it:
    my $pattern = shift @ARGV;
    while( <> ) {
        print if m/$pattern/;
    }
Versions of Perl prior to 5.6 would recompile the regular expression
for each iteration, even if C<$pattern> had not changed. The C</o> option
would prevent this by telling Perl to compile the pattern the first
time, then reuse that for subsequent iterations:
    my $pattern = shift @ARGV;
    while( <> ) {
        print if m/$pattern/o; # useful for Perl < 5.6
    }
In versions 5.6 and later, Perl won't recompile the regular expression
if the variable hasn't changed, so you probably don't need the C</o>
option. It doesn't hurt, but it doesn't help either. If you want any
version of Perl to compile the regular expression only once even if
the variable changes (thus, only using its initial value), you still
need the C</o>.
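A minimal sketch of that last point, assuming a modern Perl. The sample
string and the array names C<@with_o> and C<@without_o> are only
illustrative. The match with C</o> keeps using the first value of
C<$pattern>, while the match without it follows the variable:

```perl
use strict;
use warnings;

my ( @with_o, @without_o );

# The pattern variable changes on the second pass through the loop.
for my $pattern ( 'a', 'z' ) {
    # Compiled on the first pass only, so this is always /a/.
    push @with_o, ( 'abc' =~ m/$pattern/o ? 1 : 0 );

    # Recompiled when $pattern changes: /a/, then /z/.
    push @without_o, ( 'abc' =~ m/$pattern/ ? 1 : 0 );
}

print "with /o:    @with_o\n";       # with /o:    1 1
print "without /o: @without_o\n";    # without /o: 1 0
```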
You can watch Perl's regular expression engine at work to verify for
yourself if Perl is recompiling a regular expression. The
C<use re 'debug'> pragma (documented in L<re>) shows the details.
=head2 What good is C<\G> in a regular expression?
X<\G>
You use the C<\G> anchor to start the next match on the same
string where the last match left off. The regular
expression engine cannot skip over any characters to find
the next match with this anchor, so C<\G> is similar to the
beginning of string anchor, C<^>. The C<\G> anchor is typically
used with the C<g> flag. It uses the value of C<pos()>
as the position to start the next match. As the match
operator makes successive matches, it updates C<pos()> with the
position of the next character past the last match (or the
first character of the next match, depending on how you like
to look at it). Each string has its own C<pos()> value.
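As a small illustration of C<pos()>, assuming two made-up strings
C<$first> and C<$second>, this sketch records the position after each
match and shows that matching one string does not disturb the position
stored for the other:

```perl
use strict;
use warnings;

my $first  = "aXbXc";
my $second = "XX";

# Record the offset just past each match in $first.
my @ends;
while ( $first =~ m/X/g ) {
    push @ends, pos($first);
}

# The failed match that ended the loop reset pos($first), so this
# scalar-context match finds the first X again.
my $ok_first  = $first  =~ m/X/g;    # pos($first)  is now 2
my $ok_second = $second =~ m/X/g;    # pos($second) is now 1

print "@ends\n";                               # 2 4
print pos($first), " ", pos($second), "\n";    # 2 1
```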
Suppose you want to match all of the consecutive pairs of digits
in a string like "1122a44" and stop matching when you
encounter non-digits. You want to match C<11> and C<22> but
the letter C<a> shows up between C<22> and C<44> and you want
to stop at C<a>. Simply matching pairs of digits skips over
the C<a> and still matches C<44>.
    $_ = "1122a44";
    my @pairs = m/(\d\d)/g;   # qw( 11 22 44 )
If you use the C<\G> anchor, you force the match after C<22> to
start with the C<a>. The regular expression cannot match
there since it does not find a digit, so the next match
fails and the match operator returns the pairs it already
found.
    $_ = "1122a44";
    my @pairs = m/\G(\d\d)/g; # qw( 11 22 )
You can also use the C<\G> anchor in scalar context. You
still need the C<g> flag.
    $_ = "1122a44";
    while( m/\G(\d\d)/g ) {
        print "Found $1\n";
    }
After the match fails at the letter C<a>, perl resets C<pos()>
and the next match on the same string starts at the beginning.
    $_ = "1122a44";
    while( m/\G(\d\d)/g ) {
        print "Found $1\n";
    }
    print "Found $1 after while" if m/(\d\d)/g; # finds "11"
You can disable C<pos()> resets on fail with the C<c> flag, documented
in L<perlop> and L<perlreref>. Subsequent matches start where the last
successful match ended (the value of C<pos()>) even if a match on the
same string has failed in the meantime. In this case, the match after
the C<while()> loop starts at the C<a> (where the last match stopped),
and since it does not use any anchor it can skip over the C<a> to find
C<44>.
    $_ = "1122a44";
    while( m/\G(\d\d)/gc ) {
        print "Found $1\n";
    }
    print "Found $1 after while" if m/(\d\d)/g; # finds "44"
Typically you use the C<\G> anchor with the C<c> flag
when you want to try a different match if one fails,
such as in a tokenizer. Jeffrey Friedl offers this example
which works in 5.004 or later.
    while (<>) {
        chomp;
        PARSER: {
            m/ \G( \d+\b )/gcx && do { print "number: $1\n"; redo; };
            m/ \G( \w+ )/gcx && do { print "word: $1\n"; redo; };
            m/ \G( \s+ )/gcx && do { print "space: $1\n"; redo; };
            m/ \G( [^\w\d]+ )/gcx && do { print "other: $1\n"; redo; };
        }
    }
For each line, the C<PARSER> loop first tries to match a series
of digits followed by a word boundary. This match has to
start at the place the last match left off (or the beginning
of the string on the first match). Since C<m/ \G( \d+\b )/gcx>
uses the C<c> flag, if the string does not match that
regular expression, perl does not reset C<pos()> and the next
match starts at the same position to try a different
pattern.
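The same idea can be sketched as a function that collects tokens from a
single string instead of printing them. This is not Friedl's code: the
name C<tokenize>, the C<if>/C<elsif> chain used in place of the C<redo>
block, and the sample input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a line into [ type, text ] pairs.
sub tokenize {
    my ($line) = @_;
    my @tokens;

    pos($line) = 0;
    while ( pos($line) < length $line ) {
        # Each pattern must match exactly where the last one stopped.
        # The c flag keeps pos() in place when a pattern fails, so the
        # next alternative starts from the same offset.
        if    ( $line =~ m/\G(\d+\b)/gc )    { push @tokens, [ number => $1 ] }
        elsif ( $line =~ m/\G(\w+)/gc )      { push @tokens, [ word   => $1 ] }
        elsif ( $line =~ m/\G(\s+)/gc )      { push @tokens, [ space  => $1 ] }
        elsif ( $line =~ m/\G([^\w\s]+)/gc ) { push @tokens, [ other  => $1 ] }
        else                                 { last }    # defensive only
    }

    return @tokens;
}

for my $token ( tokenize("abc 42, x9") ) {
    printf "%-6s '%s'\n", @$token;
}
```

For the sample input this yields a word, a space, a number, an "other"
token for the comma, another space, and a final word.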
=head2 Are Perl regexes DFAs or NFAs? Are they POSIX compliant?
While it's true that Perl's regular expressions resemble the DFAs
(deterministic finite automata) of the egrep(1) program, they are in
fact implemented as NFAs (non-deterministic finite automata) to allow
backtracking and backreferencing. And they aren't POSIX-style either,
because those guarantee worst-case behavior for all cases. (It seems
that some people prefer guarantees of consistency, even when what's
guaranteed is slowness.) See the book "Mastering Regular Expressions"
(from O'Reilly) by Jeffrey Friedl for all the details you could ever
hope to know on these matters (a full citation appears in
L<perlfaq2>).
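Two small matches make the difference visible. This is only a sketch
with made-up sample strings: the first pattern relies on a
backreference, and the second shows that Perl takes the first
alternative that succeeds rather than the longest possible match.

```perl
use strict;
use warnings;

# A backreference: \1 must repeat whatever the first group captured,
# which a pure DFA cannot express.
my $doubled     = "abcabc" =~ m/^(\w+)\1$/ ? 1 : 0;
my $not_doubled = "abcabd" =~ m/^(\w+)\1$/ ? 1 : 0;

# Alternation is tried left to right and the first success wins,
# where POSIX rules would ask for the longest match, "foobar".
my ($picked) = "foobar" =~ m/(foo|foobar)/;

print "$doubled $not_doubled $picked\n";    # 1 0 foo
```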
=head2 What's wrong with using grep in a void context?
The problem is that grep builds a return list, regardless of the context.
This means you're making Perl go to the trouble of building a list that
you then just throw away. If the list is large, you waste both time and space.
If your intent is to iterate over the list, then use a for loop for this
purpose.
In perls older than 5.8.1, map suffers from this problem as well.
But since 5.8.1, this has been fixed, and map is context aware - in void
context, no lists are constructed.
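As a sketch of the alternative, with a made-up word list and hash name,
use a C<for> statement modifier when you only want the side effect, and
keep C<grep> for the places where you actually use the list it returns:

```perl
use strict;
use warnings;

my @words = qw(apple banana apple cherry);
my %count;

# Side effects only, so iterate with a for statement modifier.
# The version to avoid would be:  grep { $count{$_}++ } @words;
$count{$_}++ for @words;

# grep is the right tool when its return list is actually used.
my @repeated = grep { $count{$_} > 1 } sort keys %count;

print "@repeated\n";    # apple
```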
=head2 How can I match strings with multibyte characters?
Starting from Perl 5.6 Perl has had some level of multibyte character
support. Perl 5.8 or later is recommended. Supported multibyte
character repertoires include Unicode, and legacy encodings
through the Encode module. See L<perluniintro>, L<perlunicode>,
and L<Encode>.
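A minimal sketch of what that support looks like with the Encode
module, assuming the input arrives as UTF-8 bytes. Once the bytes are
decoded into a character string, C<.> matches one character rather than
one byte:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "caf\xC3\xA9";                 # UTF-8 bytes, ending in e-acute
my $chars = decode( 'UTF-8', $bytes );     # character string

my $byte_length = length $bytes;
my $char_length = length $chars;

# The last "." is a single character, not the last byte.
my ($last) = $chars =~ m/(.)\z/;

printf "%d bytes, %d characters, last is U+%04X\n",
    $byte_length, $char_length, ord $last;
# 5 bytes, 4 characters, last is U+00E9
```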
If you are stuck with older Perls, you can do Unicode with the
L<Unicode::String> module, and character conversions using the
L<Unicode::Map8> and L<Unicode::Map> modules. If you are using
Japanese encodings, you might try using the jperl 5.005_03.
Finally, the following set of approaches was offered by Jeffrey
Friedl, whose article in issue #5 of The Perl Journal talks about
this very matter.
Let's suppose you have some weird Martian encoding where pairs of
ASCII uppercase letters encode single Martian letters (i.e. the two
bytes "CV" make a single Martian letter, as do the two bytes "SG",
"VS", "XX", etc.). Other bytes represent single characters, just like
ASCII.
So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.
Now, say you want to search for the single character C</GX/>. Perl
doesn't know about Martian, so it'll find the two bytes "GX" in the "I
am CVSGXX!" string, even though that character isn't there: it just
looks like it is because "SG" is next to "XX", but there's no real
"GX". This is a big problem.
Here are a few ways, all painful, to deal with it:
    # Make sure adjacent "martian" bytes are no longer adjacent.
    $martian =~ s/([A-Z][A-Z])/ $1 /g;
    print "found GX!\n" if $martian =~ /GX/;
Or like this:
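    my @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
    # above is conceptually similar to: my @chars = $text =~ m/(.)/g;
    #
    foreach my $char (@chars) {
        # print before leaving the loop; "print ..., last" would exit
        # the loop before print ever ran
        if ($char eq 'GX') {
            print "found GX!\n";
            last;
        }
    }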
Or like this:
    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) { # \G probably unneeded
        if ($1 eq 'GX') {
            print "found GX!\n";
            last;
        }
    }
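To see why the extra work matters, this sketch compares a plain byte
match with the character-splitting approach on the sample string from
above. The variable names C<$naive> and C<$aware> are only
illustrative.

```perl
use strict;
use warnings;

my $martian = "I am CVSGXX!";

# Byte-level match: finds "GX" straddling the characters "SG" and "XX".
my $naive = $martian =~ m/GX/ ? 1 : 0;

# Character-level match: split into Martian characters first, then
# compare whole characters.
my @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
my $aware = ( grep { $_ eq 'GX' } @chars ) ? 1 : 0;

print "naive: $naive, character-aware: $aware\n";
# naive: 1, character-aware: 0
```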
Here's another, slightly less painful, way to do it from Benjamin
Goldberg, who uses a zero-width negative look-behind assertion.
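    print "found GX!\n" if $martian =~ m/
        (?<![A-Z])
        (?:[A-Z][A-Z])*?
        GX
        /x;
This succeeds if the "martian" character GX is in the string, and fails
otherwise. If you don't like using (?<!), a zero-width negative
look-behind assertion, you can replace (?<![A-Z]) with (?:^|[^A-Z]).
=head2 How do I match a regular expression that's in a variable?
X<\Q, regex> X<\E, regex>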
(contributed by brian d foy)
We don't have to hard-code patterns into the match operator (or
anything else that works with regular expressions). We can put the
pattern in a variable for later use.
The match operator is a double quote context, so you can interpolate
your variable just like a double quoted string. In this case, you
read the regular expression as user input and store it in C<$regex>.
Once you have the pattern in C<$regex>, you use that variable in the
match operator.
    chomp( my $regex = <STDIN> );
    if( $string =~ m/$regex/ ) { ... }
Any regular expression special characters in C<$regex> are still
special, and the pattern still has to be valid or Perl will complain.
For instance, in this pattern there is an unpaired parenthesis.
    my $regex = "Unmatched ( paren";
    "Two parens to bind them all" =~ m/$regex/;
When Perl compiles the regular expression, it treats the parenthesis
as the start of a memory match. When it doesn't find the closing
parenthesis, it complains:
    Unmatched ( in regex; marked by <-- HERE in m/Unmatched ( <-- HERE paren/ at script line 3.
You can get around this in several ways depending on your situation.
First, if you don't want any of the characters in the string to be
special, you can escape them with C<quotemeta> before you use the string.
    chomp( my $regex = <STDIN> );
    $regex = quotemeta( $regex );
    if( $string =~ m/$regex/ ) { ... }
You can also do this directly in the match operator using the C<\Q>
and C<\E> sequences. The C<\Q> tells Perl where to start escaping
special characters, and the C<\E> tells it where to stop (see L<perlop>
for more details).
    chomp( my $regex = <STDIN> );
    if( $string =~ m/\Q$regex\E/ ) { ... }
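A short sketch of the difference, using a fixed string in place of user
input. The sample values are made up. Without quoting, the C<+> in the
input acts as a quantifier and the match fails; with C<\Q> and C<\E>,
or with C<quotemeta>, every character is matched literally:

```perl
use strict;
use warnings;

my $input  = "1+1=2";
my $string = "Is 1+1=2 true?";

my $as_regex   = $string =~ m/$input/     ? 1 : 0;    # + is a quantifier here
my $as_literal = $string =~ m/\Q$input\E/ ? 1 : 0;    # every character literal
my $escaped    = quotemeta $input;

print "$as_regex $as_literal $escaped\n";    # 0 1 1\+1\=2
```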
Alternately, you can use C<qr//>, the regular expression quote operator (see
L<perlop> for more details). It quotes and perhaps compiles the pattern,
and you can apply regular expression flags to the pattern.
    chomp( my $input = <STDIN> );
    my $regex = qr/$input/is;
    $string =~ m/$regex/;  # same as m/$input/is
You might also want to trap any errors by wrapping an C<eval> block
around the whole thing.
    chomp( my $input = <STDIN> );
    eval {
        if( $string =~ m/\Q$input\E/ ) { ... }
    };
    warn $@ if $@;
Or...
    my $regex = eval { qr/$input/is };
    if( defined $regex ) {
        $string =~ m/$regex/;
    }
    else {
        warn $@;
    }
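If you check patterns in more than one place, the same idea can be
wrapped in a small function. This is only a sketch: the name
C<compile_pattern> is hypothetical, and the sample patterns stand in
for user input.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a compiled regex, or undef if $input is
# not a valid pattern (the error message is left in $@).
sub compile_pattern {
    my ($input) = @_;
    my $regex = eval { qr/$input/ };
    return $regex;
}

for my $input ( 'par(en)s', 'Unmatched ( paren' ) {
    my $regex = compile_pattern($input);
    print "$input: ", ( defined $regex ? "ok" : "invalid" ), "\n";
}
```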
=head1 AUTHOR AND COPYRIGHT
Copyright (c) 1997-2010 Tom Christiansen, Nathan Torkington, and
other authors as noted. All rights reserved.
This documentation is free; you can redistribute it and/or modify it
under the same terms as Perl itself.
Irrespective of its distribution, all code examples in this file
are hereby placed into the public domain. You are permitted and
encouraged to use this code in your own programs for fun
or for profit as you see fit. A simple comment in the code giving
credit would be courteous but is not required.