=head2 Quoting metacharacters

Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>.  Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric.  So anything
that looks like \\, \(, \), \<, \>, \{, or \} is always
interpreted as a literal character, not a metacharacter.  This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern.  Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If C<use locale> is set, then this depends on the current locale.)
Today it is more common to use the quotemeta() function or the C<\Q>
metaquoting escape sequence to disable all metacharacters' special
meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside
interpolated variables) between C<\Q> and C<\E>, double-quotish
backslash interpolation may lead to confusing results.  If you
I<need> to use literal backslashes within C<\Q...\E>,
consult L<perlop/"Gory details of parsing quoted constructs">.

C<quotemeta()> and C<\Q> are fully described in L<perlfunc/quotemeta>.

=head2 Extended Patterns

Perl also defines a consistent extension syntax for features not
found in standard tools like B<awk> and B<lex>.  The syntax for most
of these is a pair of parentheses with a question mark as the first
thing within the parentheses.  The character after the question mark
indicates the extension.

The stability of these extensions varies widely.  Some have been
part of the core language for many years.  Others are experimental
and may change without warning or be completely removed.  Check
the documentation on an individual feature to verify its current
status.

A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
"question" exactly what is going on.  That's psychology....

=over 4

=item C<(?#text)>
X<(?#)>

A comment.  The text is ignored.
If the C</x> modifier enables whitespace formatting, a simple C<#> will
suffice.  Note that Perl closes the comment as soon as it sees a C<)>,
so there is no way to put a literal C<)> in the comment.

=item C<(?adlupimsx-imsx)>

=item C<(?^alupimsx)>
X<(?)> X<(?^)>

One or more embedded pattern-match modifiers, to be turned on (or
turned off, if preceded by C<->) for the remainder of the pattern or
the remainder of the enclosing pattern group (if any).

This is particularly useful for dynamic patterns, such as those read in
from a configuration file, taken from an argument, or specified in a
table somewhere.  Consider the case where some patterns want to be
case-sensitive and some do not:  The case-insensitive ones merely need
to include C<(?i)> at the front of the pattern.  For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { }

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

These modifiers are restored at the end of the enclosing group.  For
example,

    ( (?i) blah ) \s+ \g1

will match C<blah> in any case, some spaces, and an exact (I<including
the case>!) repetition of the previous word, assuming the C</x>
modifier, and no C</i> modifier outside this group.

These modifiers do not carry over into named subpatterns called in the
enclosing group.  In other words, a pattern such as C<((?i)(?&NAME))>
does not change the case-sensitivity of the "NAME" pattern.

Any of these modifiers can be set to apply globally to all regular
expressions compiled within the scope of a C<use re>.  See
L<re/"'/flags' mode">.

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>.  Flags (except
C<"d">) may follow the caret to override it.
But a minus sign is not legal with it.

Note that the C<a>, C<d>, C<l>, C<p>, and C<u> modifiers are special in
that they can only be enabled, not disabled, and the C<a>, C<d>, C<l>, and
C<u> modifiers are mutually exclusive: specifying one de-specifies the
others, and a maximum of one (or two C<a>'s) may appear in the construct.
Thus, for example, C<(?-p)> will warn when compiled under C<use warnings>;
C<(?-d:...)> and C<(?dl:...)> are fatal errors.

Note also that the C<p> modifier is special in that its presence
anywhere in a pattern has a global effect.

=item C<(?:pattern)>
X<(?:)>

=item C<(?adluimsx-imsx:pattern)>

=item C<(?^aluimsx:pattern)>
X<(?^:)>

This is for clustering, not capturing; it groups subexpressions like
"()", but doesn't make backreferences as "()" does.  So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields.  It's also cheaper not to capture
characters if you don't need to.

Any letters between C<?> and C<:> act as flags modifiers as with
C<(?adluimsx-imsx)>.  For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately
after the C<"?"> is a shorthand equivalent to C<d-imsx>.  Any positive
flags (except C<"d">) may follow the caret, so

    (?^x:foo)

is equivalent to

    (?x-ims:foo)

The caret tells Perl that this cluster doesn't inherit the flags of any
surrounding pattern, but uses the system defaults (C<d-imsx>),
modified by any flags specified.

The caret allows for simpler stringification of compiled regular
expressions.  These look like

    (?^:pattern)

with any non-default flags appearing between the caret and the colon.
A test that looks at such stringification thus doesn't need to have the
system default flags hard-coded in it, just the caret.  If new flags are
added to Perl, the meaning of the caret's expansion will change to include
the default for those flags, so the test will still work, unchanged.

Specifying a negative flag after the caret is an error, as the flag is
redundant.

Mnemonic for C<(?^...)>:  A fresh beginning since the usual use of a caret
is to match at the beginning.
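
The difference between clustering and capturing, and the scoping of
flags written inside a cluster, can be sketched as follows.  This is
only an illustration: the sample strings and the variable names
(C<$text>, C<@plain>, C<@extra>, C<$scoped>, C<$unscoped>, C<$shown>)
are made up for this example, and the stringified form in the last
step assumes Perl 5.14 or later with no Unicode-related features
enabled.

```perl
use strict;
use warnings;

my $text = "x a y b z";   # hypothetical sample input

# Clustering only: split returns just the pieces between separators.
my @plain = split /\s*\b(?:a|b|c)\b\s*/, $text;

# Capturing: the matched separators come back as extra fields.
my @extra = split /\s*\b(a|b|c)\b\s*/, $text;

print scalar(@plain), " fields: @plain\n";
print scalar(@extra), " fields: @extra\n";

# A flag inside the cluster applies to that cluster only.
my $scoped   = "MORE THAN a million" =~ /(?i:more than) a million/ ? 1 : 0;
my $unscoped = "MORE THAN A MILLION" =~ /(?i:more than) a million/ ? 1 : 0;
print "scoped=$scoped unscoped=$unscoped\n";

# Stringifying a compiled pattern shows the caret form
# (assumes Perl 5.14 or later).
my $shown = "" . qr/foo/i;
print "$shown\n";
```

Preferring C<(?:...)> when the text is not needed afterwards keeps the
numbering of the remaining capture groups stable.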

=item C<(?|pattern)>
X<(?|)> X<Branch reset>

This is the "branch reset" pattern, which has the special property
that the capture groups are numbered from the same starting point
in each alternation branch.  It is available starting from perl 5.10.0.

Capture groups are numbered from left to right, but inside this
construct the numbering is restarted for each branch.

The numbering within each branch will be as normal, and any groups
following this construct will be numbered as though the construct
contained only one branch, that being the one with the most capture
groups in it.

This construct is useful when you want to capture one of a
number of alternative matches.

Consider the following pattern.  The numbers underneath show in
which group the captured content will be stored.

    # before  ---------------branch-reset----------- after
    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
    # 1            2         2  3        2     3     4

Be careful when using the branch reset pattern in combination with
named captures.  Named captures are implemented as being aliases to
numbered groups holding the captures, and that interferes with the
implementation of the branch reset pattern.  If you are using named
captures in a branch reset pattern, it's best to use the same names,
in the same order, in each of the alternations:

    /(?|  (?<a> x ) (?<b> y )
       |  (?<a> z ) (?<b> w )) /x

Not doing so may lead to surprises:

    "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
    say $+ {a};   # Prints '12'
    say $+ {b};   # *Also* prints '12'.

The problem here is that both the group named C<< a >> and the group
named C<< b >> are aliases for the group belonging to C<< $1 >>.

=item Look-Around Assertions
X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>

Look-around assertions are zero-width patterns which match a specific
pattern without including it in C<$&>.  Positive assertions match when
their subpattern matches, negative assertions match when their subpattern
fails.  Look-behind matches text up to the current match position,
look-ahead matches text following the current match position.
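
As a rough illustration of the four kinds of look-around assertion,
the following sketch shows that the asserted text never becomes part
of C<$&>.  The sample strings and variable names (C<$line>,
C<$before_tab>, C<$after_tab>, C<@not_before_bar>, C<@not_after_bar>)
are invented for this example.

```perl
use strict;
use warnings;

my $line = "price\t42";   # hypothetical sample input

# Positive look-ahead: a tab must follow, but it is not part of $&.
$line =~ /\w+(?=\t)/ or die "no match";
my $before_tab = $&;

# Positive look-behind: a tab must precede, but it is not part of $&.
$line =~ /(?<=\t)\w+/ or die "no match";
my $after_tab = $&;

# Negative assertions succeed when their subpattern fails.
my @not_before_bar = grep { /foo(?!bar)/  } qw(foobar foobaz);
my @not_after_bar  = grep { /(?<!bar)foo/ } qw(barfoo bazfoo);

print "before tab: $before_tab, after tab: $after_tab\n";
print "foo not followed by bar: @not_before_bar\n";
print "foo not preceded by bar: @not_after_bar\n";
```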

=over 4

=item C<(?=pattern)>
X<(?=)> X<look-ahead, positive> X<lookahead, positive>

A zero-width positive look-ahead assertion.  For example, C</\w+(?=\t)/>
matches a word followed by a tab, without including the tab in C<$&>.

=item C<(?!pattern)>
X<(?!)> X<look-ahead, negative> X<lookahead, negative>

A zero-width negative look-ahead assertion.  For example C</foo(?!bar)/>
matches any occurrence of "foo" that isn't followed by "bar".  Note
however that look-ahead and look-behind are NOT the same thing.  You cannot
use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want.  That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
match.  Use look-behind instead (see below).

=item C<(?<=pattern)> C<\K>
X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>

A zero-width positive look-behind assertion.  For example, C</(?<=\t)\w+/>
matches a word that follows a tab, without including the tab in C<$&>.
Works only for fixed-width look-behind.

There is a special form of this construct, called C<\K>, which causes the
regex engine to "keep" everything it had matched prior to the C<\K> and
not include it in C<$&>.  This effectively provides variable-length
look-behind.  The use of C<\K> inside of another look-around assertion
is allowed, but the behaviour is currently not well defined.

For various reasons C<\K> may be significantly more efficient than the
equivalent C<< (?<=...) >> construct, and it is especially useful in
situations where you want to efficiently remove something following
something else in a string.  For instance

    s/(foo)bar/$1/g;

can be rewritten as the much more efficient

    s/foo\Kbar//g;

=item C<(?<!pattern)>
X<(?<!)> X<look-behind, negative> X<lookbehind, negative>

A zero-width negative look-behind assertion.  For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar".  Works
only for fixed-width look-behind.

=back

=item C<(?'NAME'pattern)>

=item C<< (?<NAME>pattern) >>
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>

A named capture group.
Identical in every respect to normal capturing parentheses C<()> but for
the additional fact that the group can be referred to by name in various
regular expression constructs (like C<\g{NAME}>) and can be accessed by
name after a successful match via C<%+> or C<%->.  See L<perlvar>
for more details on the C<%+> and C<%-> hashes.

If multiple distinct capture groups have the same name then
C<$+{NAME}> will refer to the leftmost defined group in the match.

The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.

B<NOTE:> While the notation of this construct is the same as the similar
function in .NET regexes, the behavior is not.  In Perl the groups are
numbered sequentially regardless of being named or not.  Thus in the
pattern

    /(x)(?<foo>y)(z)/

C<$+{foo}> will be the same as C<$2>, and C<$3> will contain 'z' instead of
the opposite which is what a .NET regex hacker might expect.

Currently NAME is restricted to simple identifiers only.
In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or
its Unicode extension (see L<utf8>),
though it isn't extended by the locale (see L<perllocale>).

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern
C<< (?PE<lt>NAMEE<gt>pattern) >> may be used instead of
C<< (?<NAME>pattern) >>; however this form does not support the use of
single quotes as a delimiter for the name.

=item C<< \k<NAME> >>

=item C<< \k'NAME' >>

Named backreference.  Similar to numeric backreferences, except that
the group is designated by name and not number.  If multiple groups
have the same name then it refers to the leftmost defined group in
the current match.

It is an error to refer to a name not defined by a C<< (?<NAME>) >>
earlier in the pattern.

Both forms are equivalent.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >>
may be used instead of C<< \k<NAME> >>.
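
A short sketch of named captures and named backreferences follows.
The date string, the group names (C<year>, C<month>, C<day>, C<foo>,
C<word>) and the result variables are all chosen for illustration
only.

```perl
use strict;
use warnings;

# Both delimiter styles name a group; %+ gives access by name.
my %date;
if ("2024-01-31" =~ /(?<year>\d{4})-(?<month>\d\d)-(?'day'\d\d)/) {
    %date = (year => $+{year}, month => $+{month}, day => $+{day});
}
print "year=$date{year} month=$date{month} day=$date{day}\n";

# Named groups are still numbered in sequence with unnamed ones.
my ($named, $second, $third);
if ("xyz" =~ /(x)(?<foo>y)(z)/) {
    ($named, $second, $third) = ($+{foo}, $2, $3);
}
print "named=$named second=$second third=$third\n";

# \k<NAME> refers back to whatever the named group matched.
my $doubled;
if ("the the cat" =~ /\b(?<word>\w+)\s+\k<word>\b/) {
    $doubled = $+{word};
}
print "doubled word: $doubled\n";
```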

=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>

B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice.  Code executed that
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.

This zero-width assertion evaluates any embedded Perl code.  It
always succeeds, and its C<code> is not interpolated.  Currently,
the rules to determine where the C<code> ends are somewhat convoluted.

This feature can be used together with the special variable C<$^N> to
capture the results of submatches in variables without having to keep
track of the number of nested parentheses.  For example:

    $_ = "The brown fox jumps over the lazy dog";
    /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
    print "color = $color, animal = $animal\n";

Inside the C<(?{...})> block, C<$_> refers to the string the regular
expression is matching against.  You can also use C<pos()> to know what is
the current position of matching within this string.

The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that

    $_ = 'a' x 8;
    m<
       (?{ $cnt = 0 })                   # Initialize $cnt.
       (
         a
         (?{
             local $cnt = $cnt + 1;      # Update $cnt,
                                         # backtracking-safe.
         })
       )*
       aaaa
       (?{ $res = $cnt })                # On success copy to
                                         # non-localized location.
     >x;

will set C<$res = 4>.  Note that after the match, C<$cnt> returns to the
globally introduced value, because the scopes that restrict C<local>
operators are unwound.

This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
switch.  If I<not> used in this way, the result of evaluation of
C<code> is put into the special variable C<$^R>.  This happens
immediately, so C<$^R> can be used from other C<(?{ code })> assertions
inside the same regular expression.

The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.
For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qr/STRINGE<sol>msixpodual">).

This restriction is due to the wide-spread and remarkably convenient
custom of using run-time determined strings as patterns.  For example:

    $re = <>;
    chomp $re;
    $string =~ /$re/;

Before Perl knew how to execute interpolated code within a pattern,
this operation was completely safe from a security point of view,
although it could raise an exception from an illegal pattern.  If
you turn on C<use re 'eval'>, though, it is no longer secure,
so you should only do so if you are also using taint checking.
Better yet, use the carefully constrained evaluation within a Safe
compartment.  See L<perlsec> for details about both these mechanisms.

B<WARNING>: Use of lexical (C<my>) variables in these blocks is
broken.  The result is unpredictable and will make perl unstable.  The
workaround is to use global (C<our>) variables.

B<WARNING>: In perl 5.12.x and earlier, the regex engine
was not re-entrant, so interpolated code could not
safely invoke the regex engine either directly with
C<m//> or C<s///>, or indirectly with functions such as
C<split>.  Invoking the regex engine in these blocks would make perl
unstable.

=item C<(??{ code })>
X<(??{})>
X<regex, postponed> X<regexp, postponed> X<regular expression, postponed>

B<WARNING>: This extended regular expression feature is considered
experimental, and may be changed without notice.  Code executed that
has side effects may not perform identically from version to version
due to the effect of future optimisations in the regex engine.

This is a "postponed" regular subexpression.  The C<code> is evaluated
at run time, at the moment this subexpression may match.  The result
of evaluation is considered a regular expression and matched as
if it were inserted instead of this construct.
Note that this means that the contents of capture groups defined inside an
eval'ed pattern are not available outside of the pattern, and vice versa,
there is no way for the inner pattern returned from the code block to refer
to a capture group defined outside.  (The code block itself can use C<$1>,
etc., to refer to the enclosing pattern's capture groups.)  Thus, although

    ('a' x 100)=~/(??{'(.)' x 100})/

B<will> match, it will B<not> set C<$1>.

The C<code> is not interpolated.  As before, the rules to determine
where the C<code> ends are currently somewhat convoluted.

The following pattern matches a parenthesized group:

    $re = qr{
               \(
               (?:
                  (?> [^()]+ )    # Non-parens without backtracking
                |
                  (??{ $re })     # Group with matching parens
               )*
               \)
            }x;

See also C<(?PARNO)> for a different, more efficient way to accomplish
the same task.

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qrE<sol>STRINGE<sol>msixpodual">).

In perl 5.12.x and earlier, because the regex engine was not re-entrant,
delayed code could not safely invoke the regex engine either directly with
C<m//> or C<s///>, or indirectly with functions such as C<split>.

Recursing deeper than 50 times without consuming any input string will
result in a fatal error.  The maximum depth is compiled into perl, so
changing it requires a custom build.

=item C<(?PARNO)> C<(?-PARNO)> C<(?+PARNO)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
X<regex, relative recursion>

Similar to C<(??{ code })> except it does not involve compiling any code,
instead it treats the contents of a capture group as an independent
pattern that must match at the current position.  Capture groups
contained by the pattern will have the value as determined by the
outermost recursion.

PARNO is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture group to recurse to.
C<(?R)> recurses to the beginning of the whole pattern.  C<(?0)> is an
alternate syntax for C<(?R)>.  If PARNO is preceded by a plus or minus
sign then it is assumed to be relative, with negative numbers indicating
preceding capture groups and positive ones following.  Thus C<(?-1)>
refers to the most recently declared group, and C<(?+1)> indicates the
next group to be declared.
Note that the counting for relative recursion differs from that of
relative backreferences, in that with recursion unclosed groups B<are>
included.

The following pattern matches a function foo() which may contain
balanced parentheses as the argument.

    $re = qr{ (                   # paren group 1 (full function)
                foo
                (                 # paren group 2 (parens)
                  \(
                    (             # paren group 3 (contents of parens)
                    (?:
                     (?> [^()]+ ) # Non-parens without backtracking
                    |
                     (?2)         # Recurse to start of paren group 2
                    )*
                    )
                  \)
                )
              )
            }x;

If the pattern was used as follows

    'foo(bar(baz)+baz(bop))'=~/$re/
        and print "\$1 = $1\n",
                  "\$2 = $2\n",
                  "\$3 = $3\n";

the output produced should be the following:

    $1 = foo(bar(baz)+baz(bop))
    $2 = (bar(baz)+baz(bop))
    $3 = bar(baz)+baz(bop)

If there is no corresponding capture group defined, then it is a
fatal error.  Recursing deeper than 50 times without consuming any input
string will also result in a fatal error.  The maximum depth is compiled
into perl, so changing it requires a custom build.

The following shows how using negative indexing can make it
easier to embed recursive patterns inside of a C<qr//> construct
for later use:

    my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
    if (/foo $parens \s+ \+ \s+ bar $parens/x) {
       # do something here...
    }

B<Note> that this pattern does not behave the same way as the equivalent
PCRE or Python construct of the same form.  In Perl you can backtrack into
a recursed group, in PCRE and Python the recursed into group is treated
as atomic.  Also, modifiers are resolved at compile time, so constructs
like (?i:(?1)) or (?:(?i)(?1)) do not affect how the sub-pattern will
be processed.
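
The absolute and relative forms of recursion can be compared in a
small sketch.  The pattern names (C<$balanced>, C<$parens>) and the
test strings are assumptions made for this example; the second
pattern is the C<qr//> construct shown above, used with a literal
plus sign between the two parenthesized parts.

```perl
use strict;
use warnings;

# Absolute recursion: (?1) re-enters capture group 1, so nested
# parentheses are matched to any depth.
my $balanced = qr/^(\((?:[^()]++|(?1))*+\))$/;

my @results = map { $_ =~ $balanced ? "yes" : "no" }
              "(a(b)c)", "((a)(b))", "(a(b)c";
print "balanced: @results\n";

# Relative recursion: (?-1) refers to the group it sits in, so the
# same qr// object can be interpolated more than once.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

my ($first, $second);
if ("foo (a(b)) + bar (c)" =~ /foo \s* $parens \s+ \+ \s+ bar \s* $parens/x) {
    ($first, $second) = ($1, $2);
}
print "first=$first second=$second\n";
```

With an absolute C<(?1)> inside C<$parens>, the second interpolation
would recurse into the first group rather than its own, which is the
reason the relative form is preferred for reusable pieces.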

=item C<(?&NAME)>
X<(?&NAME)>

Recurse to a named subpattern.  Identical to C<(?PARNO)> except that the
parenthesis to recurse to is determined by name.  If multiple parentheses
have the same name, then it recurses to the leftmost.

It is an error to refer to a name that is not declared somewhere in the
pattern.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
may be used instead of C<< (?&NAME) >>.

=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>

=item C<(?(condition)yes-pattern)>

Conditional expression.  Matches C<yes-pattern> if C<condition> yields
a true value, matches C<no-pattern> otherwise.  A missing pattern always
matches.

C<(condition)> should be either an integer in
parentheses (which is valid if the corresponding pair of parentheses
matched), a look-ahead/look-behind/evaluate zero-width assertion, a
name in angle brackets or single quotes (which is valid if a group
with the given name matched), or the special symbol (R) (true when
evaluated inside of recursion or eval).  Additionally the R may be
followed by a number (which will be true when evaluated when recursing
inside of the appropriate group), or by C<&NAME>, in which case it will
be true only when evaluated during recursion in the named group.

Here's a summary of the possible predicates:

=over 4

=item (1) (2) ...

Checks if the numbered capturing group has matched something.

=item (<NAME>) ('NAME')

Checks if a group with the given name has matched something.

=item (?=...) (?!...) (?<=...) (?<!...)

Checks whether the pattern matches (or does not match, for the '!'
variants).

=item (?{ CODE })

Treats the return value of the code block as the condition.

=item (R)

Checks if the expression has been evaluated inside of recursion.

=item (R1) (R2) ...

Checks if the expression has been evaluated while executing directly
inside of the n-th capture group.  This check is the regex equivalent of

    if ((caller(0))[3] eq 'subname') { ... }

In other words, it does not check the full recursion stack.

=item (R&NAME)

Similar to C<(R1)>, this predicate checks to see if we're executing
directly inside of the leftmost group with a given name (this is the same
logic used by C<(?&NAME)> to disambiguate).  It does not check the full
stack, but only the name of the innermost active recursion.

=item (DEFINE)

In this case, the yes-pattern is never directly executed, and no
no-pattern is allowed.  Similar in spirit to C<(?{0})> but more efficient.
See below for details.

=back

For example:

    m{ ( \( )?
       [^()]+
       (?(1) \) )
     }x

matches a chunk of non-parentheses, possibly included in parentheses
themselves.

A special form is the C<(DEFINE)> predicate, which never executes its
yes-pattern directly, and does not allow a no-pattern.  This allows one to
define subpatterns which will be executed only by the recursion mechanism.
This way, you can define a set of regular expression rules that can be
bundled into any pattern you choose.

It is recommended that for this usage you put the DEFINE block at the
end of the pattern, and that you name any subpatterns defined within it.

Also, it's worth noting that patterns defined this way probably will
not be as efficient, as the optimiser is not very clever about
handling them.

An example of how this might be used is as follows:

    /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
     (?(DEFINE)
       (?<NAME_PAT>....)
       (?<ADDRESS_PAT>....)
     )/x

Note that capture groups matched inside of recursion are not accessible
after the recursion returns, so the extra layer of capturing groups is
necessary.  Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.

Finally, keep in mind that subpatterns created inside a DEFINE block
count towards the absolute and relative number of captures, so this:

    my @captures = "a" =~ /(.)                  # First capture
                           (?(DEFINE)
                               (?<EXAMPLE> 1 )  # Second capture
                           )/x;
    say scalar @captures;

will output 2, not 1.  This is particularly important if you intend to
compile the definitions with the C<qr//> operator, and later
interpolate them in another pattern.

=item C<< (?>pattern) >>
X<backtrack> X<backtracking> X<atomic> X<possessive>

An "independent" subexpression, one which matches the substring
that a I<standalone> C<pattern> would match if anchored at the given
position, and it matches I<nothing other than this substring>.
C<(?R)> recurses to the beginning of the whole pattern. C<(?0)> is an alternate syntax for C<(?R)>. If PARNO is preceded by a plus or minus sign then it is assumed to be relative, with negative numbers indicating preceding capture groups and positive ones following. Thus C<(?-1)> refers to the most recently declared group, and C<(?+1)> indicates the next group to be declared. Note that the counting for relative recursion differs from that of relative backreferences, in that with recursion unclosed groups B included. The following pattern matches a function foo() which may contain balanced parentheses as the argument. $re = qr{ ( # paren group 1 (full function) foo ( # paren group 2 (parens) \( ( # paren group 3 (contents of parens) (?: (?> [^()]+ ) # Non-parens without backtracking | (?2) # Recurse to start of paren group 2 )* ) \) ) ) }x; If the pattern was used as follows 'foo(bar(baz)+baz(bop))'=~/$re/ and print "\$1 = $1\n", "\$2 = $2\n", "\$3 = $3\n"; the output produced should be the following: $1 = foo(bar(baz)+baz(bop)) $2 = (bar(baz)+baz(bop)) $3 = bar(baz)+baz(bop) If there is no corresponding capture group defined, then it is a fatal error. Recursing deeper than 50 times without consuming any input string will also result in a fatal error. The maximum depth is compiled into perl, so changing it requires a custom build. The following shows how using negative indexing can make it easier to embed recursive patterns inside of a C construct for later use: my $parens = qr/(\((?:[^()]++|(?-1))*+\))/; if (/foo $parens \s+ + \s+ bar $parens/x) { # do something here... } B that this pattern does not behave the same way as the equivalent PCRE or Python construct of the same form. In Perl you can backtrack into a recursed group, in PCRE and Python the recursed into group is treated as atomic. Also, modifiers are resolved at compile time, so constructs like (?i:(?1)) or (?:(?i)(?1)) do not affect how the sub-pattern will be processed. 
=item C<(?&NAME)> X<(?&NAME)> Recurse to a named subpattern. Identical to C<(?PARNO)> except that the parenthesis to recurse to is determined by name. If multiple parentheses have the same name, then it recurses to the leftmost. It is an error to refer to a name that is not declared somewhere in the pattern. B In order to make things easier for programmers with experience with the Python or PCRE regex engines the pattern C<< (?P>NAME) >> may be used instead of C<< (?&NAME) >>. =item C<(?(condition)yes-pattern|no-pattern)> X<(?()> =item C<(?(condition)yes-pattern)> Conditional expression. Matches C if C yields a true value, matches C otherwise. A missing pattern always matches. C<(condition)> should be either an integer in parentheses (which is valid if the corresponding pair of parentheses matched), a look-ahead/look-behind/evaluate zero-width assertion, a name in angle brackets or single quotes (which is valid if a group with the given name matched), or the special symbol (R) (true when evaluated inside of recursion or eval). Additionally the R may be followed by a number, (which will be true when evaluated when recursing inside of the appropriate group), or by C<&NAME>, in which case it will be true only when evaluated during recursion in the named group. Here's a summary of the possible predicates: =over 4 =item (1) (2) ... Checks if the numbered capturing group has matched something. =item () ('NAME') Checks if a group with the given name has matched something. =item (?=...) (?!...) (?<=...) (?, this predicate checks to see if we're executing directly inside of the leftmost group with a given name (this is the same logic used by C<(?&NAME)> to disambiguate). It does not check the full stack, but only the name of the innermost active recursion. =item (DEFINE) In this case, the yes-pattern is never directly executed, and no no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient. See below for details. =back For example: m{ ( \( )? 
[^()]+ (?(1) \) ) }x matches a chunk of non-parentheses, possibly included in parentheses themselves. A special form is the C<(DEFINE)> predicate, which never executes its yes-pattern directly, and does not allow a no-pattern. This allows one to define subpatterns which will be executed only by the recursion mechanism. This way, you can define a set of regular expression rules that can be bundled into any pattern you choose. It is recommended that for this usage you put the DEFINE block at the end of the pattern, and that you name any subpatterns defined within it. Also, it's worth noting that patterns defined this way probably will not be as efficient, as the optimiser is not very clever about handling them. An example of how this might be used is as follows: /(?(?&NAME_PAT))(?(?&ADDRESS_PAT)) (?(DEFINE) (?....) (?....) )/x Note that capture groups matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing groups is necessary. Thus C<$+{NAME_PAT}> would not be defined even though C<$+{NAME}> would be. Finally, keep in mind that subpatterns created inside a DEFINE block count towards the absolute and relative number of captures, so this: my @captures = "a" =~ /(.) # First capture (?(DEFINE) (? 1 ) # Second capture )/x; say scalar @captures; Will output 2, not 1. This is particularly important if you intend to compile the definitions with the C operator, and later interpolate them in another pattern. =item C<< (?>pattern) >> X X X X An "independent" subexpression, one which matches the substring that a I C would match if anchored at the given position, and it matches I
modifier is special in that its presence anywhere in a pattern has a global effect.

=item C<(?:pattern)>
X<(?:)>

=item C<(?adluimsx-imsx:pattern)>

=item C<(?^aluimsx:pattern)>
X<(?^:)>

This is for clustering, not capturing; it groups subexpressions like "()", but doesn't make backreferences as "()" does. So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. It's also cheaper not to capture characters if you don't need to.

Any letters between C<?> and C<:> act as flags modifiers as with C<(?adluimsx-imsx)>. For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i

Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately after the C<"?"> is a shorthand equivalent to C<d-imsx>. Any positive flags (except C<"d">) may follow the caret, so

    (?^x:foo)

is equivalent to

    (?x-ims:foo)

The caret tells Perl that this cluster doesn't inherit the flags of any surrounding pattern, but uses the system defaults (C<d-imsx>), modified by any flags specified.

The caret allows for simpler stringification of compiled regular expressions. These look like

    (?^:pattern)

with any non-default flags appearing between the caret and the colon. A test that looks at such stringification thus doesn't need to have the system default flags hard-coded in it, just the caret. If new flags are added to Perl, the meaning of the caret's expansion will change to include the default for those flags, so the test will still work, unchanged.

Specifying a negative flag after the caret is an error, as the flag is redundant.

Mnemonic for C<(?^...)>: A fresh beginning, since the usual use of a caret is to match at the beginning.

=item C<(?|pattern)>
X<(?|)>

This is the "branch reset" pattern, which has the special property that the capture groups are numbered from the same starting point in each alternation branch. It is available starting from perl 5.10.0.
Capture groups are numbered from left to right, but inside this construct the numbering is restarted for each branch.

The numbering within each branch will be as normal, and any groups following this construct will be numbered as though the construct contained only one branch, that being the one with the most capture groups in it.

This construct is useful when you want to capture one of a number of alternative matches.

Consider the following pattern. The numbers underneath show in which group the captured content will be stored.

    # before  ---------------branch-reset----------- after
    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
    # 1            2         2  3        2     3     4

Be careful when using the branch reset pattern in combination with named captures. Named captures are implemented as being aliases to numbered groups holding the captures, and that interferes with the implementation of the branch reset pattern. If you are using named captures in a branch reset pattern, it's best to use the same names, in the same order, in each of the alternations:

    /(?|  (?<a> x ) (?<b> y )
       |  (?<a> z ) (?<b> w )) /x

Not doing so may lead to surprises:

    "12" =~ /(?| (?<a> \d+ ) | (?<b> \D+))/x;
    say $+ {a};   # Prints '12'
    say $+ {b};   # *Also* prints '12'.

The problem here is that both the group named C<< a >> and the group named C<< b >> are aliases for the group belonging to C<< $1 >>.

=item Look-Around Assertions

Look-around assertions are zero-width patterns which match a specific pattern without including it in C<$&>. Positive assertions match when their subpattern matches, negative assertions match when their subpattern fails. Look-behind matches text up to the current match position, look-ahead matches text following the current match position.

=over 4

=item C<(?=pattern)>
X<(?=)>

A zero-width positive look-ahead assertion. For example, C</\w+(?=\t)/> matches a word followed by a tab, without including the tab in C<$&>.

=item C<(?!pattern)>
X<(?!)>

A zero-width negative look-ahead assertion.
For example C</foo(?!bar)/> matches any occurrence of "foo" that isn't followed by "bar". Note however that look-ahead and look-behind are NOT the same thing. You cannot use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/> will not do what you want. That's because the C<(?!foo)> is just saying that the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will match. Use look-behind instead (see below).

=item C<(?<=pattern)> C<\K>
X<(?<=)> X<\K>

A zero-width positive look-behind assertion. For example, C</(?<=\t)\w+/> matches a word that follows a tab, without including the tab in C<$&>. Works only for fixed-width look-behind.

There is a special form of this construct, called C<\K>, which causes the regex engine to "keep" everything it had matched prior to the C<\K> and not include it in C<$&>. This effectively provides variable-length look-behind. The use of C<\K> inside of another look-around assertion is allowed, but the behaviour is currently not well defined.

For various reasons C<\K> may be significantly more efficient than the equivalent C<< (?<=...) >> construct, and it is especially useful in situations where you want to efficiently remove something following something else in a string. For instance

    s/(foo)bar/$1/g;

can be rewritten as the much more efficient

    s/foo\Kbar//g;

=item C<(?<!pattern)>
X<(?<!)>

A zero-width negative look-behind assertion. For example C</(?<!bar)foo/> matches any occurrence of "foo" that does not follow "bar". Works only for fixed-width look-behind.

=back

=item C<(?'NAME'pattern)>

=item C<< (?<NAME>pattern) >>
X<< (?<NAME>) >> X<(?'NAME')>

A named capture group. Identical in every respect to normal capturing parentheses C<()> but for the additional fact that the group can be referred to by name in various regular expression constructs (like C<\g{NAME}>) and can be accessed by name after a successful match via C<%+> or C<%->. See L<perlvar> for more details on the C<%+> and C<%-> hashes.
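
As a small illustration of access by name, here is a sketch only: the sample word and the group name C<letter> are invented for this example and are not part of the construct. It refers to the group by name both inside the pattern, with C<\g{NAME}>, and after the match, through C<%+>:

```perl
use strict;
use warnings;

# Hypothetical sample data; 'letter' is an illustrative group name.
my $word = "bookkeeper";
my ($letter, $pair);

# \g{letter} must match the same text that the group 'letter' captured,
# so the pattern finds the first doubled character.
if ($word =~ /(?<letter>\w)\g{letter}/) {
    $letter = $+{letter};   # captured text, looked up by name
    $pair   = $&;           # the whole match
    print "first doubled letter: $letter (matched '$pair')\n";
}
```

The same group is still reachable by number, so C<$1> holds the same text as C<$+{letter}> here.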
If multiple distinct capture groups have the same name then C<$+{NAME}> will refer to the leftmost defined group in the match.

The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.

B<NOTE:> While the notation of this construct is the same as the similar function in .NET regexes, the behavior is not. In Perl the groups are numbered sequentially regardless of being named or not. Thus in the pattern

    /(x)(?<foo>y)(z)/

C<$+{foo}> will be the same as C<$2>, and C<$3> will contain 'z' instead of the opposite which is what a .NET regex hacker might expect.

Currently NAME is restricted to simple identifiers only. In other words, it must match C</^[_A-Za-z][_A-Za-z0-9]*\z/> or its Unicode extension (see L<utf8>), though it isn't extended by the locale (see L<perllocale>).

B<NOTE:> In order to make things easier for programmers with experience with the Python or PCRE regex engines, the pattern C<< (?PE<lt>NAMEE<gt>pattern) >> may be used instead of C<< (?<NAME>pattern) >>; however this form does not support the use of single quotes as a delimiter for the name.

=item C<< \k<NAME> >>

=item C<< \k'NAME' >>

Named backreference. Similar to numeric backreferences, except that the group is designated by name and not number. If multiple groups have the same name then it refers to the leftmost defined group in the current match.

It is an error to refer to a name not defined by a C<< (?<NAME>) >> earlier in the pattern.

Both forms are equivalent.

B<NOTE:> In order to make things easier for programmers with experience with the Python or PCRE regex engines, the pattern C<< (?P=NAME) >> may be used instead of C<< \k<NAME> >>.

=item C<(?{ code })>
X<(?{})>

B<WARNING>: This extended regular expression feature is considered experimental, and may be changed without notice. Code executed that has side effects may not perform identically from version to version due to the effect of future optimisations in the regex engine.

This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its C<code> is not interpolated.
Currently, the rules to determine where the C<code> ends are somewhat convoluted.

This feature can be used together with the special variable C<$^N> to capture the results of submatches in variables without having to keep track of the number of nested parentheses. For example:

    $_ = "The brown fox jumps over the lazy dog";
    /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
    print "color = $color, animal = $animal\n";

Inside the C<(?{...})> block, C<$_> refers to the string the regular expression is matching against. You can also use C<pos()> to find the current matching position within this string.

The C<code> is properly scoped in the following sense: If the assertion is backtracked (compare L<"Backtracking">), all changes introduced after C<local>ization are undone, so that

    $_ = 'a' x 8;
    m<
       (?{ $cnt = 0 })                   # Initialize $cnt.
       (
         a
         (?{
             local $cnt = $cnt + 1;      # Update $cnt,
                                         # backtracking-safe.
         })
       )*
       aaaa
       (?{ $res = $cnt })                # On success copy to
                                         # non-localized location.
     >x;

will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally introduced value, because the scopes that restrict C<local> operators are unwound.

This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)> switch. If I<not> used in this way, the result of evaluation of C<code> is put into the special variable C<$^R>. This happens immediately, so C<$^R> can be used from other C<(?{ code })> assertions inside the same regular expression.

The assignment to C<$^R> above is properly localized, so the old value of C<$^R> is restored if the assertion is backtracked; compare L<"Backtracking">.

For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C<use re 'eval'> pragma has been used (see L<re>), or the variables contain results of the C<qr//> operator (see L<perlop/"qr/STRINGE<sol>msixpodual">).

This restriction is due to the wide-spread and remarkably convenient custom of using run-time determined strings as patterns.
For example:

    $re = <>;
    chomp $re;
    $string =~ /$re/;

Before Perl knew how to execute interpolated code within a pattern, this operation was completely safe from a security point of view, although it could raise an exception from an illegal pattern. If you turn on C<use re 'eval'>, though, it is no longer secure, so you should only do so if you are also using taint checking. Better yet, use the carefully constrained evaluation within a Safe compartment. See L<perlsec> for details about both these mechanisms.

B<WARNING>: Use of lexical (C<my>) variables in these blocks is broken. The result is unpredictable and will make perl unstable. The workaround is to use global (C<our>) variables.

B<WARNING>: In perl 5.12.x and earlier, the regex engine was not re-entrant, so interpolated code could not safely invoke the regex engine either directly with C<m//> or C<s///>, or indirectly with functions such as C<split>. Invoking the regex engine in these blocks would make perl unstable.

=item C<(??{ code })>
X<(??{})>

B<WARNING>: This extended regular expression feature is considered experimental, and may be changed without notice. Code executed that has side effects may not perform identically from version to version due to the effect of future optimisations in the regex engine.

This is a "postponed" regular subexpression. The C<code> is evaluated at run time, at the moment this subexpression may match. The result of evaluation is considered a regular expression and matched as if it were inserted instead of this construct. Note that this means that the contents of capture groups defined inside an eval'ed pattern are not available outside of the pattern, and vice versa, there is no way for the inner pattern returned from the code block to refer to a capture group defined outside. (The code block itself can use C<$1>, etc., to refer to the enclosing pattern's capture groups.) Thus,

    ('a' x 100)=~/(??{'(.)' x 100})/

B<will> match, it will B<not> set $1.

The C<code> is not interpolated. As before, the rules to determine where the C<code> ends are currently somewhat convoluted.
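
The two points above (the code block may read the enclosing pattern's C<$1>, while groups inside the returned pattern stay invisible outside) can be sketched as follows. This is a minimal illustration under assumed data: the input string and the variable names C<$input>, C<$digits> and C<$inner_visible> are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input: a one-digit length, a colon, then digits.
my $input = "3:123456";
my $digits;

# The code block builds the string '\d\d\d' from the outer $1; that
# string is then compiled and matched in place of the construct.
if ($input =~ /^(\d):((??{ '\d' x $1 }))/) {
    $digits = $2;    # captured by the *outer* group 2
}

# A group inside the returned pattern does not become the outer $1:
# the enclosing pattern here has no capture groups of its own.
my $inner_visible = ("ab" =~ /(??{ '(a)' })b/ && defined $1) ? 1 : 0;

print "digits = $digits, inner group visible = $inner_visible\n";
```

Wrapping the construct in an ordinary capture group, as done for C<$2> here, is one way to get at the text the postponed subexpression matched.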
The following pattern matches a parenthesized group:

    $re = qr{
               \(
               (?:
                  (?> [^()]+ )    # Non-parens without backtracking
                |
                  (??{ $re })     # Group with matching parens
               )*
               \)
            }x;

See also C<(?PARNO)> for a different, more efficient way to accomplish the same task.

For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C<use re 'eval'> pragma has been used (see L<re>), or the variables contain results of the C<qr//> operator (see L<perlop/"qr/STRINGE<sol>msixpodual">).

In perl 5.12.x and earlier, because the regex engine was not re-entrant, delayed code could not safely invoke the regex engine either directly with C<m//> or C<s///>, or indirectly with functions such as C<split>.

Recursing deeper than 50 times without consuming any input string will result in a fatal error. The maximum depth is compiled into perl, so changing it requires a custom build.

=item C<(?PARNO)> C<(?-PARNO)> C<(?+PARNO)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>

Similar to C<(??{ code })> except that it does not involve compiling any code; instead it treats the contents of a capture group as an independent pattern that must match at the current position. Capture groups contained by the pattern will have the value as determined by the outermost recursion.

PARNO is a sequence of digits (not starting with 0) whose value reflects the paren-number of the capture group to recurse to. C<(?R)> recurses to the beginning of the whole pattern. C<(?0)> is an alternate syntax for C<(?R)>. If PARNO is preceded by a plus or minus sign then it is assumed to be relative, with negative numbers indicating preceding capture groups and positive ones following. Thus C<(?-1)> refers to the most recently declared group, and C<(?+1)> indicates the next group to be declared. Note that the counting for relative recursion differs from that of relative backreferences, in that with recursion unclosed groups B<are> included.
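
To make the relative forms concrete, here is a short sketch; the sample line and the variable names C<$line>, C<$key> and C<$value> are assumptions made up for the illustration. In both patterns the recursion reuses the subpattern C<\w+> of the only capture group, but the text matched by the recursion itself is not what ends up in C<$1>:

```perl
use strict;
use warnings;

# Hypothetical sample line.
my $line = "width = 42";

# (?-1) reuses the pattern of the closest group declared before it.
# Group 1 captures the left-hand word; the recursion matches "42".
my ($key) = $line =~ /(\w+)\s*=\s*(?-1)/;

# (?+1) reuses the pattern of the next group declared after it.
# The recursion matches "width"; group 1 itself captures "42".
my ($value) = $line =~ /(?+1)\s*=\s*(\w+)/;

print "key = $key, value = $value\n";
```

Because the group is addressed relative to the recursion point, either pattern can be embedded in a larger one without renumbering.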
The following pattern matches a function foo() which may contain balanced parentheses as the argument.

    $re = qr{ (                   # paren group 1 (full function)
                foo
                (                 # paren group 2 (parens)
                  \(
                    (             # paren group 3 (contents of parens)
                    (?:
                     (?> [^()]+ ) # Non-parens without backtracking
                    |
                     (?2)         # Recurse to start of paren group 2
                    )*
                    )
                  \)
                )
              )
            }x;

If the pattern was used as follows

    'foo(bar(baz)+baz(bop))'=~/$re/
        and print "\$1 = $1\n",
                  "\$2 = $2\n",
                  "\$3 = $3\n";

the output produced should be the following:

    $1 = foo(bar(baz)+baz(bop))
    $2 = (bar(baz)+baz(bop))
    $3 = bar(baz)+baz(bop)

If there is no corresponding capture group defined, then it is a fatal error. Recursing deeper than 50 times without consuming any input string will also result in a fatal error. The maximum depth is compiled into perl, so changing it requires a custom build.

The following shows how using negative indexing can make it easier to embed recursive patterns inside of a C<qr//> construct for later use:

    my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
    if (/foo $parens \s+ \+ \s+ bar $parens/x) {
       # do something here...
    }

B<Note> that this pattern does not behave the same way as the equivalent PCRE or Python construct of the same form. In Perl you can backtrack into a recursed group; in PCRE and Python the recursed-into group is treated as atomic. Also, modifiers are resolved at compile time, so constructs like C<(?i:(?1))> or C<(?:(?i)(?1))> do not affect how the sub-pattern will be processed.

=item C<(?&NAME)>
X<(?&NAME)>

Recurse to a named subpattern. Identical to C<(?PARNO)> except that the parenthesis to recurse to is determined by name. If multiple parentheses have the same name, then it recurses to the leftmost.

It is an error to refer to a name that is not declared somewhere in the pattern.

B<NOTE:> In order to make things easier for programmers with experience with the Python or PCRE regex engines, the pattern C<< (?P>NAME) >> may be used instead of C<< (?&NAME) >>.
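
A minimal sketch of recursion by name follows. The group name C<group>, the sample expression and the variable names are hypothetical, chosen only for this illustration:

```perl
use strict;
use warnings;

# 'group' is an illustrative name: an opening paren, then any mix of
# non-paren runs and nested groups, then a closing paren.
my $nested = qr/(?<group> \( (?: [^()]++ | (?&group) )* \) )/x;

# Hypothetical sample expression.
my $expr = "sum(1,(2+3)) + 4";
my $found;

if ($expr =~ $nested) {
    $found = $+{group};    # outermost balanced chunk
    print "balanced chunk: $found\n";
}
```

Referring to the group by name keeps the pattern valid if capture groups are later added in front of it, which would shift an absolute C<(?PARNO)> reference.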
=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>

=item C<(?(condition)yes-pattern)>

Conditional expression. Matches C<yes-pattern> if C<condition> yields a true value, matches C<no-pattern> otherwise. A missing pattern always matches.

C<(condition)> should be either an integer in parentheses (which is valid if the corresponding pair of parentheses matched), a look-ahead/look-behind/evaluate zero-width assertion, a name in angle brackets or single quotes (which is valid if a group with the given name matched), or the special symbol (R) (true when evaluated inside of recursion or eval). Additionally the R may be followed by a number (which will be true when evaluated while recursing inside of the appropriate group), or by C<&NAME>, in which case it will be true only when evaluated during recursion in the named group.

Here's a summary of the possible predicates:

=over 4

=item (1) (2) ...

Checks if the numbered capturing group has matched something.

=item (<NAME>) ('NAME')

Checks if a group with the given name has matched something.

=item (?=...) (?!...) (?<=...) (?<!...)

Checks whether the pattern matches (or does not match, for the '!' variants).

=item (?{ CODE })

Treats the return value of the code block as the condition.

=item (R)

Checks if the expression has been evaluated inside of recursion.

=item (R1) (R2) ...

Checks if the expression has been evaluated while executing directly inside of the n-th capture group. This check is the regex equivalent of

    if ((caller(0))[3] eq 'subname') { ... }

In other words, it does not check the full recursion stack.

=item (R&NAME)

Similar to C<(R1)>, this predicate checks to see if we're executing directly inside of the leftmost group with a given name (this is the same logic used by C<(?&NAME)> to disambiguate). It does not check the full stack, but only the name of the innermost active recursion.

=item (DEFINE)

In this case, the yes-pattern is never directly executed, and no no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient. See below for details.

=back

For example:

    m{ ( \( )?
       [^()]+
       (?(1) \) )
     }x

matches a chunk of non-parentheses, possibly included in parentheses themselves.

A special form is the C<(DEFINE)> predicate, which never executes its yes-pattern directly, and does not allow a no-pattern. This allows one to define subpatterns which will be executed only by the recursion mechanism. This way, you can define a set of regular expression rules that can be bundled into any pattern you choose.
It is recommended that for this usage you put the DEFINE block at the end of the pattern, and that you name any subpatterns defined within it.

Also, it's worth noting that patterns defined this way probably will not be as efficient, as the optimiser is not very clever about handling them.

An example of how this might be used is as follows:

    /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
     (?(DEFINE)
       (?<NAME_PAT>....)
       (?<ADDRESS_PAT>....)
     )/x

Note that capture groups matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing groups is necessary. Thus C<$+{NAME_PAT}> would not be defined even though C<$+{NAME}> would be.

Finally, keep in mind that subpatterns created inside a DEFINE block count towards the absolute and relative number of captures, so this:

    my @captures = "a" =~ /(.)                  # First capture
                           (?(DEFINE)
                               (?<EXAMPLE> 1 )  # Second capture
                           )/x;
    say scalar @captures;

will output 2, not 1. This is particularly important if you intend to compile the definitions with the C<qr//> operator, and later interpolate them in another pattern.

=item C<< (?>pattern) >>

An "independent" subexpression, one which matches the substring that a I<standalone> C<pattern> would match if anchored at the given position, and it matches I
is not interpolated. Currently, the rules to determine where the C ends are somewhat convoluted. This feature can be used together with the special variable C<$^N> to capture the results of submatches in variables without having to keep track of the number of nested parentheses. For example: $_ = "The brown fox jumps over the lazy dog"; /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i; print "color = $color, animal = $animal\n"; Inside the C<(?{...})> block, C<$_> refers to the string the regular expression is matching against. You can also use C to know what is the current position of matching within this string. The C is properly scoped in the following sense: If the assertion is backtracked (compare L<"Backtracking">), all changes introduced after Cization are undone, so that $_ = 'a' x 8; m< (?{ $cnt = 0 }) # Initialize $cnt. ( a (?{ local $cnt = $cnt + 1; # Update $cnt, # backtracking-safe. }) )* aaaa (?{ $res = $cnt }) # On success copy to # non-localized location. >x; will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally introduced value, because the scopes that restrict C operators are unwound. This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)> switch. If I used in this way, the result of evaluation of C is put into the special variable C<$^R>. This happens immediately, so C<$^R> can be used from other C<(?{ code })> assertions inside the same regular expression. The assignment to C<$^R> above is properly localized, so the old value of C<$^R> is restored if the assertion is backtracked; compare L<"Backtracking">. For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C pragma has been used (see L), or the variables contain results of the C operator (see Lmsixpodual">). This restriction is due to the wide-spread and remarkably convenient custom of using run-time determined strings as patterns. 
For example: $re = <>; chomp $re; $string =~ /$re/; Before Perl knew how to execute interpolated code within a pattern, this operation was completely safe from a security point of view, although it could raise an exception from an illegal pattern. If you turn on the C, though, it is no longer secure, so you should only do so if you are also using taint checking. Better yet, use the carefully constrained evaluation within a Safe compartment. See L for details about both these mechanisms. B: Use of lexical (C) variables in these blocks is broken. The result is unpredictable and will make perl unstable. The workaround is to use global (C) variables. B: In perl 5.12.x and earlier, the regex engine was not re-entrant, so interpolated code could not safely invoke the regex engine either directly with C or C), or indirectly with functions such as C. Invoking the regex engine in these blocks would make perl unstable. =item C<(??{ code })> X<(??{})> X X X B: This extended regular expression feature is considered experimental, and may be changed without notice. Code executed that has side effects may not perform identically from version to version due to the effect of future optimisations in the regex engine. This is a "postponed" regular subexpression. The C is evaluated at run time, at the moment this subexpression may match. The result of evaluation is considered a regular expression and matched as if it were inserted instead of this construct. Note that this means that the contents of capture groups defined inside an eval'ed pattern are not available outside of the pattern, and vice versa, there is no way for the inner pattern returned from the code block to refer to a capture group defined outside. (The code block itself can use C<$1>, etc., to refer to the enclosing pattern's capture groups.) Thus, ('a' x 100)=~/(??{'(.)' x 100})/ B match, it will B set $1. The C is not interpolated. As before, the rules to determine where the C ends are currently somewhat convoluted. 
The following pattern matches a parenthesized group:

    $re = qr{
              \(
              (?:
                 (?> [^()]+ )  # Non-parens without backtracking
               |
                 (??{ $re })   # Group with matching parens
              )*
              \)
            }x;

See also C<(?PARNO)> for a different, more efficient way to accomplish
the same task.

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qrE<sol>STRINGE<sol>msixpodual">).

In perl 5.12.x and earlier, because the regex engine was not
re-entrant, delayed code could not safely invoke the regex engine
either directly with C<m//> or C<s///>, or indirectly with functions
such as C<split>.

Recursing deeper than 50 times without consuming any input string will
result in a fatal error. The maximum depth is compiled into perl, so
changing it requires a custom build.

=item C<(?PARNO)> C<(?-PARNO)> C<(?+PARNO)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>

Similar to C<(??{ code })> except that it does not involve compiling
any code. Instead it treats the contents of a capture group as an
independent pattern that must match at the current position. Capture
groups contained by the pattern will have the value as determined by
the outermost recursion.

PARNO is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture group to recurse to. C<(?R)> recurses to
the beginning of the whole pattern. C<(?0)> is an alternate syntax for
C<(?R)>. If PARNO is preceded by a plus or minus sign then it is assumed
to be relative, with negative numbers indicating preceding capture groups
and positive ones following. Thus C<(?-1)> refers to the most recently
declared group, and C<(?+1)> indicates the next group to be declared.
Note that the counting for relative recursion differs from that of
relative backreferences, in that with recursion unclosed groups B<are>
included.

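A minimal sketch of absolute and relative recursion, assuming nothing beyond the rules just described: C<$absolute> recurses into group 1 by number, while C<$fragment> uses C<(?-1)> so that it keeps pointing at its own group after being interpolated behind another capture group. All variable names and test strings here are illustrative.

```perl
use strict;
use warnings;

# Group 1 is a balanced parenthesised chunk; (?1) re-enters group 1.
my $absolute = qr/^(\((?:[^()]++|(?1))*\))$/;

# The same idea with a relative reference. (?-1) counts the still-open
# group around it, so it refers to "this" group wherever it ends up.
my $fragment = qr/(\((?:[^()]++|(?-1))*\))/;
my $embedded = qr/^(\w+)$fragment$/;   # the fragment is now group 2

my %result;
for my $text ('(a(b)c)', '(a(b)c', 'f(x(y))') {
    $result{$text} = [
        ($text =~ $absolute ? 1 : 0),
        ($text =~ $embedded ? 1 : 0),
    ];
    print "$text: absolute=$result{$text}[0] embedded=$result{$text}[1]\n";
}
```

Had C<$fragment> used C<(?1)>, interpolating it into C<$embedded> would have made it recurse into C<(\w+)> instead of into the parenthesis group.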
The following pattern matches a function C<foo()> which may contain
balanced parentheses as the argument.

    $re = qr{ (                   # paren group 1 (full function)
                foo
                (                 # paren group 2 (parens)
                  \(
                    (             # paren group 3 (contents of parens)
                    (?:
                     (?> [^()]+ ) # Non-parens without backtracking
                    |
                     (?2)         # Recurse to start of paren group 2
                    )*
                    )
                  \)
                )
              )
            }x;

If the pattern was used as follows

    'foo(bar(baz)+baz(bop))'=~/$re/
        and print "\$1 = $1\n",
                  "\$2 = $2\n",
                  "\$3 = $3\n";

the output produced should be the following:

    $1 = foo(bar(baz)+baz(bop))
    $2 = (bar(baz)+baz(bop))
    $3 = bar(baz)+baz(bop)

If there is no corresponding capture group defined, then it is a
fatal error. Recursing deeper than 50 times without consuming any input
string will also result in a fatal error. The maximum depth is compiled
into perl, so changing it requires a custom build.

The following shows how using negative indexing can make it
easier to embed recursive patterns inside of a C<qr//> construct
for later use:

    my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
    if (/foo $parens \s+ \+ \s+ bar $parens/x) {
       # do something here...
    }

B<Note> that this pattern does not behave the same way as the equivalent
PCRE or Python construct of the same form. In Perl you can backtrack into
a recursed group, whereas in PCRE and Python the recursed-into group is
treated as atomic. Also, modifiers are resolved at compile time, so
constructs like C<(?i:(?1))> or C<(?:(?i)(?1))> do not affect how the
sub-pattern will be processed.

=item C<(?&NAME)>
X<(?&NAME)>

Recurse to a named subpattern. Identical to C<(?PARNO)> except that the
parenthesis to recurse to is determined by name. If multiple parentheses
have the same name, then it recurses to the leftmost.

It is an error to refer to a name that is not declared somewhere in the
pattern.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P>NAME) >>
may be used instead of C<< (?&NAME) >>.

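The named form can be sketched in the same way as the numbered one. In this illustrative example the group name C<list>, the variable names, and the bracketed test strings are all invented; the two patterns differ only in spelling the recursion as C<(?&list)> or C<< (?P>list) >>.

```perl
use strict;
use warnings;

# "list" names the bracketed group, so the recursion does not depend on
# where the group sits among the capture groups.
my $nested   = qr/^(?<list>\[(?:[^\[\]]++|(?&list))*\])$/;

# The PCRE/Python-style spelling of the same recursion.
my $nested_p = qr/^(?<list>\[(?:[^\[\]]++|(?P>list))*\])$/;

my %result;
for my $text ('[1,[2,3],[4,[5]]]', '[1,[2,3]', ']') {
    my $amp = $text =~ $nested   ? 'match' : 'no match';
    my $p   = $text =~ $nested_p ? 'match' : 'no match';
    $result{$text} = "$amp / $p";
    print "$text => $result{$text}\n";
}
```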
=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>

=item C<(?(condition)yes-pattern)>

Conditional expression. Matches C<yes-pattern> if C<condition> yields
a true value, matches C<no-pattern> otherwise. A missing pattern always
matches.

C<(condition)> should be either an integer in parentheses (which is
valid if the corresponding pair of parentheses matched), a
look-ahead/look-behind/evaluate zero-width assertion, a name in angle
brackets or single quotes (which is valid if a group with the given
name matched), or the special symbol C<(R)> (true when evaluated inside
of recursion or eval). Additionally the C<R> may be followed by a number
(which will be true when evaluated when recursing inside of the
appropriate group), or by C<&NAME>, in which case it will be true only
when evaluated during recursion in the named group.

Here's a summary of the possible predicates:

=over 4

=item (1) (2) ...

Checks if the numbered capturing group has matched something.

=item (<NAME>) ('NAME')

Checks if a group with the given name has matched something.

=item (?=...) (?!...) (?<=...) (?<!...)

Checks whether the look-ahead or look-behind assertion succeeds.

=item (?{ CODE })

Uses the result of the evaluate zero-width assertion as the condition.

=item (R)

True when evaluated inside of recursion or eval.

=item (R1) (R2) ...

True when evaluated while recursing inside of the numbered group.

=item (R&NAME)

Similar to C<(R1)>, this predicate checks to see if we're executing
directly inside of the leftmost group with a given name (this is the same
logic used by C<(?&NAME)> to disambiguate). It does not check the full
stack, but only the name of the innermost active recursion.

=item (DEFINE)

In this case, the yes-pattern is never directly executed, and no
no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
See below for details.

=back

For example:

    m{ ( \( )?
       [^()]+
       (?(1) \) )
     }x

matches a chunk of non-parentheses, possibly included in parentheses
themselves.

A special form is the C<(DEFINE)> predicate, which never executes its
yes-pattern directly, and does not allow a no-pattern. This allows one to
define subpatterns which will be executed only by the recursion mechanism.
This way, you can define a set of regular expression rules that can be
bundled into any pattern you choose.

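To make the C<(DEFINE)> idea concrete, here is a small sketch under invented names: C<WORD> and C<NUMBER> are rules that exist only inside the DEFINE block, and C<KEY> and C<VALUE> are ordinary named captures that call them. The pattern, the names, and the input string are assumptions for illustration only.

```perl
use strict;
use warnings;

my $pair = qr{
    \A (?<KEY>(?&WORD)) = (?<VALUE>(?&NUMBER)) \z
    (?(DEFINE)
        (?<WORD>   [a-z]+ )   # reached only through recursion
        (?<NUMBER> \d+    )   # reached only through recursion
    )
}x;

my %seen;
if ('width=42' =~ $pair) {
    %seen = %+;               # copy the named captures that are set
}
print join(', ', map { "$_=$seen{$_}" } sort keys %seen), "\n";

# Groups declared inside DEFINE still occupy capture slots.
my @captures = 'width=42' =~ $pair;
print scalar(@captures), " capture slots\n";
```

The wrapper groups C<KEY> and C<VALUE> carry the results out of the match; the rule groups themselves are not expected to hold values once the recursion has returned.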
It is recommended that for this usage you put the DEFINE block at the
end of the pattern, and that you name any subpatterns defined within it.

Also, it's worth noting that patterns defined this way probably will
not be as efficient, as the optimiser is not very clever about
handling them.

An example of how this might be used is as follows:

    /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
     (?(DEFINE)
       (?<NAME_PAT>....)
       (?<ADDRESS_PAT>....)
     )/x

Note that capture groups matched inside of recursion are not accessible
after the recursion returns, so the extra layer of capturing groups is
necessary. Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.

Finally, keep in mind that subpatterns created inside a DEFINE block
count towards the absolute and relative number of captures, so this:

    my @captures = "a" =~ /(.)                  # First capture
                           (?(DEFINE)
                               (?<EXAMPLE> 1 )  # Second capture
                           )/x;
    say scalar @captures;

will output 2, not 1. This is particularly important if you intend to
compile the definitions with the C<qr//> operator, and later
interpolate them in another pattern.

=item C<< (?>pattern) >>

An "independent" subexpression, one which matches the substring
that a I<standalone> C<pattern> would match if anchored at the given
position, and it matches I<nothing other than this substring>.

The rules to determine where the C<code> in a C<(?{ code })> block
ends are somewhat convoluted.

This feature can be used together with the special variable C<$^N> to
capture the results of submatches in variables without having to keep
track of the number of nested parentheses. For example:

    $_ = "The brown fox jumps over the lazy dog";
    /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
    print "color = $color, animal = $animal\n";

Inside the C<(?{...})> block, C<$_> refers to the string the regular
expression is matching against. You can also use C<pos()> to find the
current position of matching within this string.

The C<code> is properly scoped in the following sense: If the assertion
is backtracked (compare L<"Backtracking">), all changes introduced after
C<local>ization are undone, so that

    $_ = 'a' x 8;
    m<
       (?{ $cnt = 0 })                   # Initialize $cnt.
       (
         a
         (?{
             local $cnt = $cnt + 1;      # Update $cnt,
                                         # backtracking-safe.
         })
       )*
       aaaa
       (?{ $res = $cnt })                # On success copy to
                                         # non-localized location.
     >x;

will set C<$res = 4>. Note that after the match, C<$cnt> returns to the
globally introduced value, because the scopes that restrict C<local>
operators are unwound.

This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)>
switch. If I<not> used in this way, the result of evaluation of
C<code> is put into the special variable C<$^R>. This happens
immediately, so C<$^R> can be used from other C<(?{ code })> assertions
inside the same regular expression.

The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L<"Backtracking">.

For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
variables contain results of the C<qr//> operator (see
L<perlop/"qrE<sol>STRINGE<sol>msixpodual">).

This restriction is due to the wide-spread and remarkably convenient
custom of using run-time determined strings as patterns.

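The restriction can be observed with a short sketch. It assumes a pattern held in a plain string (standing in for untrusted input) and a second pattern compiled with C<qr//>; the variable names and the flag C<$ran> are invented for the example, and the check on C<$@> relies on the wording of perl's "Eval-group not allowed at runtime" diagnostic.

```perl
use strict;
use warnings;

# A pattern that arrives as a plain string, as if read from user input.
my $untrusted = '(?{ print "side effect\n" })x';

# Interpolating it at run time is refused before any code can run.
my $ok      = eval { my $hit = 'x' =~ /$untrusted/; 1 };
my $refused = !$ok && $@ =~ /Eval-group not allowed/ ? 1 : 0;
print "string with a code block refused: $refused\n";

# A qr// object was compiled from source the programmer wrote, so it may
# carry its code block into another pattern.
our $ran = 0;
my $trusted = qr/(?{ $ran = 1 })x/;
my $matched = 'x' =~ /^$trusted$/ ? 1 : 0;
print "qr// with a code block matched: $matched (code ran: $ran)\n";
```

Wrapping the first match in C<eval { }> keeps the fatal error from ending the script, which is also how a program accepting patterns from users would typically guard against malformed input.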
is properly scoped in the following sense: If the assertion is backtracked (compare L<"Backtracking">), all changes introduced after Cization are undone, so that $_ = 'a' x 8; m< (?{ $cnt = 0 }) # Initialize $cnt. ( a (?{ local $cnt = $cnt + 1; # Update $cnt, # backtracking-safe. }) )* aaaa (?{ $res = $cnt }) # On success copy to # non-localized location. >x; will set C<$res = 4>. Note that after the match, C<$cnt> returns to the globally introduced value, because the scopes that restrict C operators are unwound. This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)> switch. If I used in this way, the result of evaluation of C is put into the special variable C<$^R>. This happens immediately, so C<$^R> can be used from other C<(?{ code })> assertions inside the same regular expression. The assignment to C<$^R> above is properly localized, so the old value of C<$^R> is restored if the assertion is backtracked; compare L<"Backtracking">. For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C pragma has been used (see L), or the variables contain results of the C operator (see Lmsixpodual">). This restriction is due to the wide-spread and remarkably convenient custom of using run-time determined strings as patterns. For example: $re = <>; chomp $re; $string =~ /$re/; Before Perl knew how to execute interpolated code within a pattern, this operation was completely safe from a security point of view, although it could raise an exception from an illegal pattern. If you turn on the C, though, it is no longer secure, so you should only do so if you are also using taint checking. Better yet, use the carefully constrained evaluation within a Safe compartment. See L for details about both these mechanisms. B: Use of lexical (C) variables in these blocks is broken. The result is unpredictable and will make perl unstable. The workaround is to use global (C) variables. 
B: In perl 5.12.x and earlier, the regex engine was not re-entrant, so interpolated code could not safely invoke the regex engine either directly with C or C), or indirectly with functions such as C. Invoking the regex engine in these blocks would make perl unstable. =item C<(??{ code })> X<(??{})> X X X B: This extended regular expression feature is considered experimental, and may be changed without notice. Code executed that has side effects may not perform identically from version to version due to the effect of future optimisations in the regex engine. This is a "postponed" regular subexpression. The C is evaluated at run time, at the moment this subexpression may match. The result of evaluation is considered a regular expression and matched as if it were inserted instead of this construct. Note that this means that the contents of capture groups defined inside an eval'ed pattern are not available outside of the pattern, and vice versa, there is no way for the inner pattern returned from the code block to refer to a capture group defined outside. (The code block itself can use C<$1>, etc., to refer to the enclosing pattern's capture groups.) Thus, ('a' x 100)=~/(??{'(.)' x 100})/ B match, it will B set $1. The C is not interpolated. As before, the rules to determine where the C ends are currently somewhat convoluted. The following pattern matches a parenthesized group: $re = qr{ \( (?: (?> [^()]+ ) # Non-parens without backtracking | (??{ $re }) # Group with matching parens )* \) }x; See also C<(?PARNO)> for a different, more efficient way to accomplish the same task. For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C pragma has been used (see L), or the variables contain results of the C operator (see LSTRINGEmsixpodual">). 
In perl 5.12.x and earlier, because the regex engine was not re-entrant, delayed code could not safely invoke the regex engine either directly with C or C), or indirectly with functions such as C. Recursing deeper than 50 times without consuming any input string will result in a fatal error. The maximum depth is compiled into perl, so changing it requires a custom build. =item C<(?PARNO)> C<(?-PARNO)> C<(?+PARNO)> C<(?R)> C<(?0)> X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)> X X X X Similar to C<(??{ code })> except it does not involve compiling any code, instead it treats the contents of a capture group as an independent pattern that must match at the current position. Capture groups contained by the pattern will have the value as determined by the outermost recursion. PARNO is a sequence of digits (not starting with 0) whose value reflects the paren-number of the capture group to recurse to. C<(?R)> recurses to the beginning of the whole pattern. C<(?0)> is an alternate syntax for C<(?R)>. If PARNO is preceded by a plus or minus sign then it is assumed to be relative, with negative numbers indicating preceding capture groups and positive ones following. Thus C<(?-1)> refers to the most recently declared group, and C<(?+1)> indicates the next group to be declared. Note that the counting for relative recursion differs from that of relative backreferences, in that with recursion unclosed groups B included. The following pattern matches a function foo() which may contain balanced parentheses as the argument. 
$re = qr{ ( # paren group 1 (full function) foo ( # paren group 2 (parens) \( ( # paren group 3 (contents of parens) (?: (?> [^()]+ ) # Non-parens without backtracking | (?2) # Recurse to start of paren group 2 )* ) \) ) ) }x; If the pattern was used as follows 'foo(bar(baz)+baz(bop))'=~/$re/ and print "\$1 = $1\n", "\$2 = $2\n", "\$3 = $3\n"; the output produced should be the following: $1 = foo(bar(baz)+baz(bop)) $2 = (bar(baz)+baz(bop)) $3 = bar(baz)+baz(bop) If there is no corresponding capture group defined, then it is a fatal error. Recursing deeper than 50 times without consuming any input string will also result in a fatal error. The maximum depth is compiled into perl, so changing it requires a custom build. The following shows how using negative indexing can make it easier to embed recursive patterns inside of a C construct for later use: my $parens = qr/(\((?:[^()]++|(?-1))*+\))/; if (/foo $parens \s+ + \s+ bar $parens/x) { # do something here... } B that this pattern does not behave the same way as the equivalent PCRE or Python construct of the same form. In Perl you can backtrack into a recursed group, in PCRE and Python the recursed into group is treated as atomic. Also, modifiers are resolved at compile time, so constructs like (?i:(?1)) or (?:(?i)(?1)) do not affect how the sub-pattern will be processed. =item C<(?&NAME)> X<(?&NAME)> Recurse to a named subpattern. Identical to C<(?PARNO)> except that the parenthesis to recurse to is determined by name. If multiple parentheses have the same name, then it recurses to the leftmost. It is an error to refer to a name that is not declared somewhere in the pattern. B In order to make things easier for programmers with experience with the Python or PCRE regex engines the pattern C<< (?P>NAME) >> may be used instead of C<< (?&NAME) >>. =item C<(?(condition)yes-pattern|no-pattern)> X<(?()> =item C<(?(condition)yes-pattern)> Conditional expression. Matches C if C yields a true value, matches C otherwise. 
A missing pattern always matches. C<(condition)> should be either an integer in parentheses (which is valid if the corresponding pair of parentheses matched), a look-ahead/look-behind/evaluate zero-width assertion, a name in angle brackets or single quotes (which is valid if a group with the given name matched), or the special symbol (R) (true when evaluated inside of recursion or eval). Additionally the R may be followed by a number, (which will be true when evaluated when recursing inside of the appropriate group), or by C<&NAME>, in which case it will be true only when evaluated during recursion in the named group. Here's a summary of the possible predicates: =over 4 =item (1) (2) ... Checks if the numbered capturing group has matched something. =item () ('NAME') Checks if a group with the given name has matched something. =item (?=...) (?!...) (?<=...) (?, this predicate checks to see if we're executing directly inside of the leftmost group with a given name (this is the same logic used by C<(?&NAME)> to disambiguate). It does not check the full stack, but only the name of the innermost active recursion. =item (DEFINE) In this case, the yes-pattern is never directly executed, and no no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient. See below for details. =back For example: m{ ( \( )? [^()]+ (?(1) \) ) }x matches a chunk of non-parentheses, possibly included in parentheses themselves. A special form is the C<(DEFINE)> predicate, which never executes its yes-pattern directly, and does not allow a no-pattern. This allows one to define subpatterns which will be executed only by the recursion mechanism. This way, you can define a set of regular expression rules that can be bundled into any pattern you choose. It is recommended that for this usage you put the DEFINE block at the end of the pattern, and that you name any subpatterns defined within it. 
Also, it's worth noting that patterns defined this way probably will not be as efficient, as the optimiser is not very clever about handling them. An example of how this might be used is as follows: /(?(?&NAME_PAT))(?(?&ADDRESS_PAT)) (?(DEFINE) (?....) (?....) )/x Note that capture groups matched inside of recursion are not accessible after the recursion returns, so the extra layer of capturing groups is necessary. Thus C<$+{NAME_PAT}> would not be defined even though C<$+{NAME}> would be. Finally, keep in mind that subpatterns created inside a DEFINE block count towards the absolute and relative number of captures, so this: my @captures = "a" =~ /(.) # First capture (?(DEFINE) (? 1 ) # Second capture )/x; say scalar @captures; Will output 2, not 1. This is particularly important if you intend to compile the definitions with the C operator, and later interpolate them in another pattern. =item C<< (?>pattern) >> X X X X An "independent" subexpression, one which matches the substring that a I C would match if anchored at the given position, and it matches I
is put into the special variable C<$^R>. This happens immediately, so C<$^R> can be used from other C<(?{ code })> assertions inside the same regular expression. The assignment to C<$^R> above is properly localized, so the old value of C<$^R> is restored if the assertion is backtracked; compare L<"Backtracking">. For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C pragma has been used (see L), or the variables contain results of the C operator (see Lmsixpodual">). This restriction is due to the wide-spread and remarkably convenient custom of using run-time determined strings as patterns. For example: $re = <>; chomp $re; $string =~ /$re/; Before Perl knew how to execute interpolated code within a pattern, this operation was completely safe from a security point of view, although it could raise an exception from an illegal pattern. If you turn on the C, though, it is no longer secure, so you should only do so if you are also using taint checking. Better yet, use the carefully constrained evaluation within a Safe compartment. See L for details about both these mechanisms. B: Use of lexical (C) variables in these blocks is broken. The result is unpredictable and will make perl unstable. The workaround is to use global (C) variables. B: In perl 5.12.x and earlier, the regex engine was not re-entrant, so interpolated code could not safely invoke the regex engine either directly with C or C), or indirectly with functions such as C. Invoking the regex engine in these blocks would make perl unstable. =item C<(??{ code })> X<(??{})> X X X B: This extended regular expression feature is considered experimental, and may be changed without notice. Code executed that has side effects may not perform identically from version to version due to the effect of future optimisations in the regex engine. This is a "postponed" regular subexpression. 
The C is evaluated at run time, at the moment this subexpression may match. The result of evaluation is considered a regular expression and matched as if it were inserted instead of this construct. Note that this means that the contents of capture groups defined inside an eval'ed pattern are not available outside of the pattern, and vice versa, there is no way for the inner pattern returned from the code block to refer to a capture group defined outside. (The code block itself can use C<$1>, etc., to refer to the enclosing pattern's capture groups.) Thus, ('a' x 100)=~/(??{'(.)' x 100})/ B match, it will B set $1. The C is not interpolated. As before, the rules to determine where the C ends are currently somewhat convoluted. The following pattern matches a parenthesized group: $re = qr{ \( (?: (?> [^()]+ ) # Non-parens without backtracking | (??{ $re }) # Group with matching parens )* \) }x; See also C<(?PARNO)> for a different, more efficient way to accomplish the same task. For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous C pragma has been used (see L), or the variables contain results of the C operator (see LSTRINGEmsixpodual">). In perl 5.12.x and earlier, because the regex engine was not re-entrant, delayed code could not safely invoke the regex engine either directly with C or C), or indirectly with functions such as C. Recursing deeper than 50 times without consuming any input string will result in a fatal error. The maximum depth is compiled into perl, so changing it requires a custom build. =item C<(?PARNO)> C<(?-PARNO)> C<(?+PARNO)> C<(?R)> C<(?0)> X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)> X X X X Similar to C<(??{ code })> except it does not involve compiling any code, instead it treats the contents of a capture group as an independent pattern that must match at the current position. 
Capture groups contained by the pattern will have the value as
determined by the outermost recursion.

PARNO is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture group to recurse to.  C<(?R)> recurses to
the beginning of the whole pattern.  C<(?0)> is an alternate syntax for
C<(?R)>.  If PARNO is preceded by a plus or minus sign then it is assumed
to be relative, with negative numbers indicating preceding capture groups
and positive ones following.  Thus C<(?-1)> refers to the most recently
declared group, and C<(?+1)> indicates the next group to be declared.
Note that the counting for relative recursion differs from that of
relative backreferences, in that with recursion unclosed groups B<are>
included.

The following pattern matches a function foo() which may contain
balanced parentheses as the argument.

    $re = qr{ (                   # paren group 1 (full function)
                foo
                (                 # paren group 2 (parens)
                  \(
                    (             # paren group 3 (contents of parens)
                    (?:
                     (?> [^()]+ ) # Non-parens without backtracking
                    |
                     (?2)         # Recurse to start of paren group 2
                    )*
                    )
                  \)
                )
              )
            }x;

If the pattern was used as follows

    'foo(bar(baz)+baz(bop))'=~/$re/
        and print "\$1 = $1\n",
                  "\$2 = $2\n",
                  "\$3 = $3\n";

the output produced should be the following:

    $1 = foo(bar(baz)+baz(bop))
    $2 = (bar(baz)+baz(bop))
    $3 = bar(baz)+baz(bop)

If there is no corresponding capture group defined, then it is a
fatal error.  Recursing deeper than 50 times without consuming any input
string will also result in a fatal error.  The maximum depth is compiled
into perl, so changing it requires a custom build.

The following shows how using negative indexing can make it
easier to embed recursive patterns inside of a C<qr//> construct
for later use:

    my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
    if (/foo $parens \s+ \+ \s+ bar $parens/x) {
       # do something here...
    }

B<Note> that this pattern does not behave the same way as the equivalent
PCRE or Python construct of the same form.
In Perl you can backtrack into
a recursed group; in PCRE and Python the recursed-into group is treated
as atomic.  Also, modifiers are resolved at compile time, so constructs
like C<(?i:(?1))> or C<(?:(?i)(?1))> do not affect how the sub-pattern
will be processed.

=item C<(?&NAME)>
X<(?&NAME)>

Recurse to a named subpattern.  Identical to C<(?PARNO)> except that the
parenthesis to recurse to is determined by name.  If multiple parentheses
have the same name, then it recurses to the leftmost.

It is an error to refer to a name that is not declared somewhere in the
pattern.

B<NOTE:> In order to make things easier for programmers with experience
with the Python or PCRE regex engines, the pattern C<< (?P>NAME) >>
may be used instead of C<< (?&NAME) >>.

=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>

=item C<(?(condition)yes-pattern)>

Conditional expression.  Matches C<yes-pattern> if C<condition> yields
a true value, matches C<no-pattern> otherwise.  A missing pattern always
matches.

C<(condition)> should be either an integer in
parentheses (which is valid if the corresponding pair of parentheses
matched), a look-ahead/look-behind/evaluate zero-width assertion, a
name in angle brackets or single quotes (which is valid if a group
with the given name matched), or the special symbol (R) (true when
evaluated inside of recursion or eval).  Additionally the R may be
followed by a number (which will be true when evaluated when recursing
inside of the appropriate group), or by C<&NAME>, in which case it will
be true only when evaluated during recursion in the named group.

Here's a summary of the possible predicates:

=over 4

=item (1) (2) ...

Checks if the numbered capturing group has matched something.

=item (<NAME>) ('NAME')

Checks if a group with the given name has matched something.

=item (?=...) (?!...) (?<=...) (?<!...)

Checks whether the pattern matches (or does not match, for the '!'
variants).

=item (?{ CODE })

Treats the return value of the code block as the condition.

=item (R)

Checks if the expression has been evaluated inside of recursion.

=item (R1) (R2) ...

Checks if the expression has been evaluated while executing directly
inside of the n-th capture group.

=item (R&NAME)

Similar to C<(R1)>, this predicate checks to see if we're executing
directly inside of the leftmost group with a given name (this is the same
logic used by C<(?&NAME)> to disambiguate).
It does not check the full
stack, but only the name of the innermost active recursion.

=item (DEFINE)

In this case, the yes-pattern is never directly executed, and no
no-pattern is allowed.  Similar in spirit to C<(?{0})> but more efficient.
See below for details.

=back

For example:

    m{ ( \( )?
       [^()]+
       (?(1) \) )
     }x

matches a chunk of non-parentheses, possibly included in parentheses
themselves.

A special form is the C<(DEFINE)> predicate, which never executes its
yes-pattern directly, and does not allow a no-pattern.  This allows one to
define subpatterns which will be executed only by the recursion mechanism.
This way, you can define a set of regular expression rules that can be
bundled into any pattern you choose.

It is recommended that for this usage you put the DEFINE block at the
end of the pattern, and that you name any subpatterns defined within it.

Also, it's worth noting that patterns defined this way probably will
not be as efficient, as the optimiser is not very clever about
handling them.

An example of how this might be used is as follows:

    /(?<NAME>(?&NAME_PAT))(?<ADDR>(?&ADDRESS_PAT))
     (?(DEFINE)
       (?<NAME_PAT>....)
       (?<ADDRESS_PAT>....)
     )/x

Note that capture groups matched inside of recursion are not accessible
after the recursion returns, so the extra layer of capturing groups is
necessary.  Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.

Finally, keep in mind that subpatterns created inside a DEFINE block
count towards the absolute and relative number of captures, so this:

    my @captures = "a" =~ /(.)                  # First capture
                           (?(DEFINE)
                               (?<EXAMPLE> 1 )  # Second capture
                           )/x;
    say scalar @captures;

will output 2, not 1.  This is particularly important if you intend to
compile the definitions with the C<qr//> operator, and later
interpolate them in another pattern.

=item C<< (?>pattern) >>

An "independent" subexpression, one which matches the substring
that a I<standalone> C<pattern> would match if anchored at the given
position, and it matches I