
libvmod-pcre2

Varnish Module (VMOD) to access the PCRE2 regular expression library.


vmod_pcre2

access the pcre2 regular expression library

Manual section: 3

SYNOPSIS

import pcre2 [from "path"] ;

# object interface
new OBJECT = pcre2.regex(STRING pattern [, compile options])
BOOL <OBJ>.match(STRING subject [, match options])
STRING <OBJ>.backref(INT ref)
STRING <OBJ>.namedref(STRING name)
STRING <OBJ>.sub(STRING subject, STRING replacement [, match options]
                 [, substitution options])
BOOL <OBJ>.info_bool(ENUM)
INT <OBJ>.info_int(ENUM)
STRING <OBJ>.info_str(ENUM)

# function interface
BOOL pcre2.match(STRING pattern, STRING subject [, compile options]
                 [, match options])
STRING pcre2.backref(INT ref)
STRING pcre2.namedref(STRING name)
STRING pcre2.sub(STRING pattern, STRING subject, STRING replacement
                 [, compile options] [, match options]
                 [, substitution options])

# library configuration
BOOL pcre2.config_bool(ENUM)
INT pcre2.config_int(ENUM)
STRING pcre2.config_str(ENUM)

DESCRIPTION

This Varnish Module (VMOD) provides access to the PCRE2 regular expression library. PCRE2 is the Perl-compatible regular expression library with a revised API, the successor to the PCRE library that implements native regexen in Varnish VCL. See pcre2(3) and the manuals that it references for details about the PCRE2 library.

PCRE2, by itself, does not change regular expressions from the perspective of the end user -- the syntax and semantics of patterns and pattern matching remained largely the same at the time PCRE2 was introduced. The new library is a refactoring of the internal API, which is transparent to the user, and the VMOD endeavors to make use of the new internal features advantageously for VCL.

Some of the differences between the VMOD and native VCL regexen are:

  • The VMOD provides methods and functions to retrieve back references after a match that are easier to use than the idiom with the regsub function that is necessary in native VCL. It also provides the means to retrieve references to named capturing groups.
  • The functional interface makes it possible to use patterns that are not known until runtime.
  • PCRE2 introduces a new native substitution function, similar to the regsub and regsuball functions in VCL, except that the substitution syntax is different and provides more features.
  • Parameters that limit the depth of recursion and backtracking in match operations, which are set globally in Varnish, can be set for individual matches in the VMOD.
  • The VMOD can support matching against UTF-8 strings, if it is running against a PCRE2 library that was built to support Unicode.
  • The VMOD exposes considerably more functionality of the underlying library. VCL provides a general-purpose regular expression facility -- PCRE could be easily replaced as its regex engine. The VMOD is meant to be specific to PCRE2, and makes a full range of its features available in VCL.
  • The VMOD provides methods and functions that allow you to inspect properties of patterns and of the library. These are not likely to be useful on the fast path of production deployments, and are not optimized for that. But they may be useful during development to debug and optimize regex matching.

Since the introduction of PCRE2, the original PCRE library has been maintained for bugfixes only, while new features and optimizations are developed only for PCRE2. The VMOD therefore makes it possible to take advantage of improvements in the library as they are released.

Here are some simple usage examples:

# regex objects are created in vcl_init, and the regular expressions
# are compiled when VCL is loaded.
sub vcl_init {
    # A regex to match the "foo" cookie, and capture its value.
    new foo = pcre2.regex("\bfoo=([^;,\s]+\b)");

    # A regex to match a URL beginning with the prefix "/bar/", and
    # capture its suffix.
    new bar = pcre2.regex("^/bar/(.+)");
}

sub vcl_recv {
    # If the cookie header contains "foo", then assign its value
    # to another header.
    if (foo.match(req.http.Cookie)) {
        set req.http.X-Foo-Value = foo.backref(1);
    }

    # If the URL begins with "/bar/", then replace the prefix with
    # "/baz/quux/".
    if (bar.match(req.url)) {
        set req.url = "/baz/quux/" + bar.backref(1);
    }
}

Object and functional interfaces

The VMOD provides regular expression operations by way of the regex object interface and a functional interface. For regex objects, the pattern is compiled at VCL initialization time, and the compiled pattern is re-used for each invocation of its methods. Compilation failures (due to errors in the pattern) cause failure at initialization time, and the VCL fails to load. The .backref() and .namedref() methods refer back to the last invocation of the .match() method for the same object. The .sub() method also re-uses an object's compiled pattern.

The functional interface provides the same set of operations, but the pattern is compiled at runtime on each invocation of the match() and sub() functions (and then discarded). Compilation failures are reported as errors in the Varnish log. The backref() and namedref() functions refer back to the last invocation of the match() function, for any pattern.

Compiling a pattern at runtime on each invocation is considerably more costly than re-using a compiled pattern. So for patterns that are fixed and known at VCL initialization, the object interface should be used. The functional interface should only be used for patterns whose contents are not known until runtime.
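
As a minimal sketch of the functional interface, the following matches the request URL against a pattern that is only available at runtime. The header name X-Strip-Pattern is hypothetical, chosen here for illustration, and the example assumes that backref 0 yields the entire matched string for the function just as it does for the method:

import pcre2;

sub vcl_backend_response {
    # X-Strip-Pattern is a hypothetical response header; its value is
    # not known when the VCL is loaded, so a regex object cannot be used.
    if (beresp.http.X-Strip-Pattern) {
        # The pattern is compiled here, on every invocation.
        if (pcre2.match(beresp.http.X-Strip-Pattern, bereq.url)) {
            # Refers back to the pcre2.match() call above.
            set beresp.http.X-Matched = pcre2.backref(0);
        }
    }
}

If the pattern could have been written as a literal string, a regex object declared in vcl_init would avoid the repeated compilation.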

Compile, match and substitution options

The VMOD has unusually long lists of parameters for its methods and functions -- over 40 for the sub() function, for example. But nearly all of these have default values, and it is only necessary to specify options in VCL that differ from the defaults.

The optional parameters affect the interpretations of patterns and the operation of matches and substitutions, and come in three groups:

  • Compile options, used wherever a pattern is compiled: in the regex object constructor, and the match() and sub() functions.
  • Match options, used wherever a match is performed: in the match and sub methods and functions.
  • Substitution options, used in the sub method and function.

The options have call scope, meaning that they are evaluated only once for each invocation of a function or method at its particular location in the VCL source, on the first invocation after the VCL instance is loaded. The options are then cached and re-used for all subsequent invocations, and cannot be changed (until a new VCL instance is loaded).
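
The following sketch illustrates call scope with two calls to the same object at different locations in the source. The object name, header name and response status are chosen for illustration only:

import pcre2;

sub vcl_init {
    new admin = pcre2.regex("/admin/");
}

sub vcl_recv {
    # Call site 1: anchored=true is evaluated on the first invocation
    # at this location, then cached and re-used for this call site.
    if (admin.match(req.url, anchored=true)) {
        return (synth(403, "Forbidden"));
    }

    # Call site 2: a different location in the source, with its own
    # (default) options, so the match may occur anywhere in the URL.
    if (admin.match(req.url)) {
        set req.http.X-Admin-In-Path = "true";
    }
}

Because the options are cached per call site, they are best written as constants, as shown here.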

Compile options

Compile options define properties of patterns. See pcre2pattern(3) for details of PCRE2 pattern syntax, and pcre2syntax(3) for a quick reference.

The default value of all of the BOOL options is false.

See also JIT compilation and matching below.

allow_empty_class
If true, then a pattern may include [] to denote an empty character class. This, in part, supports compatibility with regexen in ECMAscript (also known as Javascript). By default, a closing square bracket after an opening one is interpreted as a character in the class (and ] must appear later in the pattern).
alt_bsux

(Referring to "backslash-u" and "backslash-x".) If true, then three escape sequences are interpreted differently (for compatibility with ECMAscript):

  • \U matches an upper case U character. By default, \U causes a compile error.
  • \u matches a lower case u, unless it is followed by four hexadecimal digits, in which case the hex number identifies the code point to be matched. By default, \u causes a compile error.
  • \x matches a lower case x, unless it is followed by two hex digits, in which case the hex number identifies the code point to match. By default, \x must always be followed by zero to two hex digits to identify a one-byte character (for example, \xz matches binary zero followed by z).
alt_circumflex
If true, and if multiline is also true, then the ^ meta-character matches after a newline appearing as the last character in a string. By default, ^ does not match after a terminating newline.
alt_verbnames
If true, then backslash processing may be applied to verb names in verb sequences such as (*MARK:NAME), so that the name can, for example, include a closing parenthesis as \) or between \Q and \E. By default, no processing is applied to verb names, and they end at the first closing parenthesis (regardless of any backslash).
anchored
If true, then the pattern is anchored, meaning that it is constrained to match at the starting point of a string. This may also be achieved with constructs in the pattern.
bsr
(For "backslash-R".) If this ENUM value is set, then it determines which sequences are matched by \R. If set to UNICODE, then \R matches any Unicode newline sequence. If set to ANYCRLF, then it matches CR (carriage return, or \r), LF (linefeed, or \n), or CR followed by LF. By default, \R matches the sequence chosen when the PCRE2 library was built, which can be determined from config_str(BSR) (if no choice was made at build time, the library's own default is Unicode). See pcre2pattern(3) for details about \R.
caseless
If true, then matches for this pattern are case-insensitive. This may also be achieved with (?i) in the pattern.
dollar_endonly
If true, then the $ metacharacter matches only at the very end of a string. By default, $ also matches immediately before a newline that is the last character of the string (but not before any other newline). dollar_endonly is ignored when multiline is true.
dotall
If true, then the . metacharacter matches any character, including newlines. But it only ever matches one character, even if newlines are coded as CRLF. By default, dots do not match newlines. The effect of dotall can also be achieved with (?s) in the pattern.
dupnames
If true, then the names used for named capturing groups are not required to be unique. By default, names for capturing groups may only be used once.
extended

If true, then pattern syntax is permitted to contain constructs that serve as self-documentation:

  • Most whitespace is ignored, except when escaped or inside a character class (and a few other exceptions detailed in pcre2api(3)).

  • All characters between an unescaped # and the next newline are ignored, and can be used as comments.

    For example, this is a self-documenting declaration of a pattern that matches IPv6 addresses:

    new ipv6 = pcre2.regex(extended=true, caseless=true, pattern=
    {"^(?!:)                 # colon disallowed at start
      (?:                    # start of item
        (?: [0-9a-f]{1,4} |  # 1-4 hex digits or
        (?(1)0 | () ) )      # fail if null previously matched
        :                    # followed by colon
      ){1,7}                 # end item; 1-7 of them required
      [0-9a-f]{1,4} $        # final hex number at end of string
      (?(1)|.)               # there was an empty component
    "});
    

The effect of extended can also be achieved with the (?x) option in a pattern.

firstline
If true, an unanchored pattern must match before or at the first newline in the subject string (though the matched text may continue over a newline). If the offset_limit option is also set for a match, then the match must occur within the offset limit and in the first line.
locale

If locale is set to a string matching a locale that is available on the system on which Varnish is running, then that locale is used for the pattern to determine which characters are letters, digits, upper and lower case, and so forth. Hence this option affects the interpretation of constructs such as \w and \d, the caseless option, and so on. This only applies to single-byte characters.

If locale is set to a string that is not recognized as a locale, then compilation fails.

By default, PCRE2 uses tables established when the library is built to recognize character properties; normally, these only recognize ASCII characters.

Quoting pcre2api(3):

The use of locales with Unicode is discouraged. If you are handling characters with code points greater than 128, you should either use Unicode support, or use locales, but not try to mix the two.
match_unset_backref
If true, then a back reference to an unset capturing group matches an empty string; thus (\1)(a) successfully matches a. This makes the pattern similar to an ECMAscript pattern. By default, an unset backref causes the matcher to backtrack, and possibly fail.
max_pattern_len
If this INT value is greater than 0, then it sets a maximum length for the pattern string to be compiled. If the pattern is longer, then compilation fails.
multiline
If true, then the ^ and $ meta-characters match immediately after and before internal newlines in the subject string, respectively, in addition to matching at the start and end of the string. By default, the start and end anchors only match at the beginning and end of the string, regardless of internal newlines. The effect of multiline can also be achieved with (?m) in the pattern.
never_backslash_c
If true, then \C may not be used in a pattern, and causes compile failure. \C always matches exactly one byte, even in UTF mode, and may lead to unpredictable effects if it matches in the middle of a multibyte UTF-8 character. \C may have been prohibited by a build-time option in the library, which can be discovered by calling config_bool(NEVER_BACKSLASH_C).
never_ucp
If true, then Unicode properties are not used to interpret \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes in the pattern. It is then impossible to activate this facility by including (*UCP) at the start of the pattern. If never_ucp and ucp are both set to true, then the compile fails.
newline

If this ENUM value is set, it determines which characters are to be matched as newlines in the pattern. It can be set to:

  • CR (carriage return)
  • LF (linefeed)
  • CRLF (CR followed by LF)
  • ANYCRLF (CR, LF or CRLF)
  • ANY (any Unicode line-ending sequence)

By default, the newline sequence chosen for the PCRE2 library when it was built is used, which can be determined from config_str(NEWLINE).

no_auto_capture
If true, then numbered capturing groups are disabled in the pattern. Any opening parenthesis not followed by ? is then interpreted as if it were followed by ?: (that is, it forms a non-capturing group). Named capturing groups can still be used, and these also acquire a capturing group number, so namedref and backref can still be used (but only for the named groups).
no_auto_possess
If true, then the "auto-possessification" optimization is disabled for the pattern. This optimization, for example, interprets a+b as a++b, using the "possessive quantifier", to prevent backtracks into a+ that can never be successful. When the option is true, the full unoptimized search is run.
no_start_optimize
If true, then some optimizations for the start of the match are disabled. This has the effect that certain constructs in the pattern, such as (*COMMIT) or (*MARK), are evaluated at every possible starting position in the string, while they may have been skipped when the optimizations are applied. Thus this option may change the result of match calls in patterns that include such constructs. See pcre2api(3) for details.
no_utf_check
If this option and utf are both true, then validity checks to determine if the pattern is a valid UTF string are disabled. This may save CPU usage and time for the match() and sub() functions, which compile patterns on every invocation, and check UTF strings for validity by default. But you should only do so if you are sure that the inputs are valid, because running matches in UTF mode against invalid strings is undefined, and may cause Varnish to crash or loop. By default, invalid UTF strings in the pattern cause the compile to fail in UTF mode. See pcre2unicode(3) for details.
parens_nest_limit
If this INT value is greater than 0, it sets the maximum depth of parenthesis nesting in a pattern. It applies to all kinds of parentheses, not just capturing groups. The limit prevents patterns from using too much of the stack when compiled, and may be useful for the functional interface, for which patterns are compiled at runtime. By default, the nesting limit set for the PCRE2 library at build time is imposed, which is returned by config_int(PARENSLIMIT).
ucp
If this option and utf are both true, then Unicode properties are used to interpret \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes in the pattern. The same effect can be achieved by including (*UCP) at the start of the pattern. By default, only ASCII characters are considered for these constructs, which is faster than considering Unicode properties. If Unicode was disabled at build time for the PCRE2 library, which can be discovered by calling config_bool(UNICODE), then the compile fails when this option is true. Compiles also fail if this option and never_ucp are both true. See pcre2unicode(3) for details about Unicode character properties.
ungreedy
If true, then the "greediness" of quantifiers in the pattern is inverted, so that they are not greedy by default, but become greedy when followed by ?. The same effect can be achieved by including (?U) in the pattern.
use_offset_limit
This option must be set to true for a pattern if you intend to use the offset_limit parameter in match and substitution operations to limit how far a string is searched for an unanchored match. If an offset_limit is set for an invocation of the match or sub methods or functions, but this option was not set to true for the pattern, then the match fails.
utf
If true, then both the pattern and the strings against which it is matched are processed as UTF-8 strings. If Unicode support was disabled when the PCRE2 library was built, which can be determined from config_bool(UNICODE), then the compile fails when utf is true. See pcre2unicode(3) for details about Unicode support in PCRE2.
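
A short sketch of how a few of the compile options above might be combined in a constructor. The object names and patterns are illustrative and not taken from the VMOD's own documentation:

import pcre2;

sub vcl_init {
    # caseless: ".JPG" matches as well as ".jpg".
    # no_auto_capture: the parentheses only group, nothing is captured.
    new image = pcre2.regex("\.(jpe?g|png|gif)$", caseless=true,
                            no_auto_capture=true);

    # ungreedy: .+ stops at the first ">" rather than the last one.
    new tag = pcre2.regex("<(.+)>", ungreedy=true);

    # anchored: the match must begin at the start of the string, much
    # as if the pattern had been written with a leading ^.
    new assets = pcre2.regex("/static/", anchored=true);
}

Options that are not named keep their defaults, so only the deviations need to appear in VCL.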

Match options

Match options affect the operation of matching in the match and sub methods and functions. By default, all of the BOOL options are false. The INT options are 0 by default (meaning that they are ignored, and the global defaults hold). The INT options MAY NOT be less than 0; if they are, then the match fails.

anchored
If true, then the match is constrained to match at the start of the string, regardless of whether the pattern is anchored. By default, a match is searched for anywhere in the string if the pattern is not anchored.
len
If this INT value is greater than 0, it sets the length of the subject string to be matched. By default, the full string is matched.
match_limit
If this INT value is greater than 0, it sets a limit to the effort used by the PCRE2 matching function to find a match. This can prevent excessive backtracking when there is a very large search space but a match is never found. It is equivalent to the varnishd parameter pcre_match_limit, except that it applies only to the match operation in which it was set, not globally. The varnishd parameters for PCRE have no effect on this VMOD. By default, the match limit that was set for the PCRE2 library at build time is imposed, which can be discovered from config_int(MATCHLIMIT).
not_bol
If true, the first character of the subject string is not considered to be the beginning of a line, so the ^ metacharacter does not match before it. If the compile option multiline was not set to true for the pattern, then ^ never matches. This option only affects the circumflex metacharacter.
not_eol
If true, the end of the subject string is not considered to be the end of a line, so the $ metacharacter does not match after it. If multiline was not set to true for the pattern, then $ never matches. This option only affects the dollar metacharacter.
not_empty
If true, then the empty string is not a valid match. If the matcher finds an empty match, then it considers other alternatives, and if no other valid matches are found, then the match fails.
not_empty_atstart
If true, then the empty string is not a valid match at the start of the subject string. An empty string match later in the subject is permitted.
no_jit
If true, then the just-in-time matcher is not used, even when the pattern was compiled for JIT. In that case, PCRE2's "traditional" interpretive matcher is used (as is always the case if JIT is not available, or if the pattern was not JIT-compiled). If no_jit is true for an invocation of the match() or sub() functions, which compile a pattern on every call, then the pattern is also not JIT-compiled. See JIT compilation and matching below.
no_utf_check
If true, then the subject is not checked for validity as a UTF-8 string when matched against a pattern for which utf was set to true. This may speed up matching, but should only be done if you are sure that the inputs are valid UTF-8. By default, UTF validity is checked for matches against patterns that were compiled with utf.
offset_limit
If this INT value is greater than 0, it limits how far an unanchored search can advance in the subject string. For example, if the pattern abc is matched against the string "123abc" and the offset limit is less than 3, the match fails. To use this parameter, the compile option use_offset_limit must have been set to true for the pattern at compile time; otherwise the match fails. By default, unanchored matches are searched for until the end of the string.
recursion_limit
If this INT value is greater than 0, then it limits the depth of recursion for matches using the interpretive matcher. It is equivalent to the varnishd parameter pcre_match_limit_recursion, but only applies to the individual match. This limits the depth of recursion and use of the stack for matches that may cause excessive recursion and stack overflow (which usually causes Varnish to crash). The limit is not relevant to the JIT matcher, and is ignored for JIT matching. By default, the recursion limit set for the PCRE2 library at build time applies, which can be determined from config_int(RECURSIONLIMIT).
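
The following sketch shows match options used on individual calls. The header name X-Input and the limit values are arbitrary, chosen only for illustration:

import pcre2;

sub vcl_init {
    # Nested quantifiers can backtrack heavily on non-matching input.
    new nested = pcre2.regex("^(a+)+$");

    # use_offset_limit must be true at compile time for offset_limit
    # to be usable in matches against this pattern.
    new early = pcre2.regex("abc", use_offset_limit=true);
}

sub vcl_recv {
    # Bound the matching effort for this call only. If the limit is
    # reached, the match fails and .match() returns false.
    if (nested.match(req.http.X-Input, match_limit=10000)) {
        set req.http.X-Nested = "matched";
    }

    # Do not let the unanchored search advance past offset 8.
    if (early.match(req.url, offset_limit=8)) {
        set req.http.X-Early = "matched";
    }
}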

Substitution options

The sub method and function use all of the match options (since they run a match), and the following additional options. (The sub function also uses the compile options, since it compiles a pattern.)

suball
If true, then the substitution iterates over the subject string and replaces every matching substring, making the substitution similar to the native VCL regsuball function. By default, only the first matching substring is replaced, making the substitution similar to VCL's regsub function.
sub_extended
If true, then an extended syntax is enabled for the replacement string. Details of the replacement syntax are documented for the .sub() method below.
unknown_unset
If true, then references to capturing groups in the replacement string that do not appear in the pattern are treated as unset groups. By default, unknown references cause the substitution to fail. Use this option with care, because it causes misspelled group names or numbers to be silently ignored.
unset_empty
If true, then unset capturing groups (including unknown groups when unknown_unset is also true) are replaced as empty strings. By default, an attempt to insert an unset group causes the substitution to fail.
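
As an illustration of suball, the sketch below collapses runs of slashes in the URL. The object and header names are made up for this example:

import pcre2;

sub vcl_init {
    # Two or more consecutive slashes.
    new slashes = pcre2.regex("/{2,}");
}

sub vcl_recv {
    # Default: only the first run of slashes is replaced.
    set req.http.X-First-Only = slashes.sub(req.url, "/");

    # suball=true: every run of slashes is replaced.
    set req.url = slashes.sub(req.url, "/", suball=true);
}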

JIT compilation and matching

PCRE2 supports just-in-time compilation for patterns, and a matcher to go with it. JIT is a heavyweight optimization that may greatly speed up matching, but requires extra processing at pattern compilation time. The VMOD supports JIT if it was enabled for the PCRE2 library when it was built, which can be determined from config_bool(JIT).

If JIT is available, then it is always applied to the compilation of patterns in the regex object constructor. By default it is also applied when patterns are compiled at runtime in the match() and sub() functions, unless the no_jit option is true. For patterns compiled at runtime, it may be worthwhile to turn off JIT if the overhead of JIT compilation outweighs the advantage of JIT matching.

If JIT is not available, then PCRE2 always uses the interpretive matcher.
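
A sketch of both aspects, assuming a hypothetical X-Pattern response header that carries a pattern used only once per response. The response header in vcl_deliver is meant as a debugging aid rather than for production use:

import pcre2;

sub vcl_backend_response {
    # The pattern is compiled for this one match and then discarded,
    # so the cost of JIT compilation may not pay off.
    if (beresp.http.X-Pattern) {
        if (pcre2.match(beresp.http.X-Pattern, bereq.url, no_jit=true)) {
            set beresp.http.X-Pattern-Matched = "true";
        }
    }
}

sub vcl_deliver {
    # Report whether the PCRE2 library was built with JIT support.
    if (pcre2.config_bool(JIT)) {
        set resp.http.X-PCRE2-JIT = pcre2.config_str(JITTARGET);
    } else {
        set resp.http.X-PCRE2-JIT = "unavailable";
    }
}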

Unicode

The VMOD only links to the 8-bit version of PCRE2, and hence can support UTF-8 if Unicode was enabled when the library was built. The VMOD does not support UTF-16 or UTF-32. Thus the term "code unit", as used for Unicode and in the PCRE2 documentation, always refers to one byte.

In UTF mode, characters in patterns and the strings to be matched are interpreted as UTF-8 code points, and hence may correspond to one to four bytes. When UTF is not enabled, characters in patterns and strings are represented by exactly one byte.

See pcre2unicode(3) for the details of PCRE2 Unicode support.
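
A minimal sketch of a pattern compiled for UTF-8, assuming that the PCRE2 library was built with Unicode support (otherwise the compile fails and the VCL does not load). The header name X-Display-Name is hypothetical:

import pcre2;

sub vcl_init {
    # utf: the pattern and the subject are processed as UTF-8.
    # ucp: \w is interpreted with Unicode properties, so it also
    # matches letters outside of ASCII.
    new word = pcre2.regex("^\w+$", utf=true, ucp=true);
}

sub vcl_recv {
    if (word.match(req.http.X-Display-Name)) {
        set req.http.X-Name-Is-Word = "true";
    }
}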

CONTENTS

  • regex(STRING, BOOL, BOOL, ENUM {ANYCRLF,UNICODE}, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, STRING, BOOL, INT, BOOL, BOOL, BOOL, BOOL, ENUM {CR,LF,CRLF,ANYCRLF,ANY}, BOOL, BOOL, BOOL, BOOL, BOOL, INT, BOOL, BOOL, BOOL, BOOL)
  • BOOL match(PRIV_CALL, PRIV_TASK, STRING, STRING, BOOL, BOOL, ENUM {ANYCRLF,UNICODE}, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, STRING, BOOL, INT, BOOL, BOOL, BOOL, BOOL, ENUM {CR,LF,CRLF,ANYCRLF,ANY}, BOOL, BOOL, BOOL, BOOL, BOOL, INT, BOOL, BOOL, BOOL, BOOL, INT, INT, INT, BOOL, BOOL, BOOL, BOOL, BOOL, INT)
  • STRING backref(PRIV_TASK, INT, STRING)
  • STRING namedref(PRIV_TASK, STRING, STRING)
  • STRING sub(PRIV_CALL, PRIV_TASK, STRING, STRING, STRING, BOOL, BOOL, ENUM {ANYCRLF,UNICODE}, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, STRING, BOOL, INT, BOOL, BOOL, BOOL, BOOL, ENUM {CR,LF,CRLF,ANYCRLF,ANY}, BOOL, BOOL, BOOL, BOOL, BOOL, INT, BOOL, BOOL, BOOL, BOOL, INT, INT, INT, BOOL, BOOL, BOOL, BOOL, BOOL, INT, BOOL, BOOL, BOOL, BOOL)
  • BOOL config_bool(ENUM {JIT,STACKRECURSE,UNICODE})
  • STRING config_str(ENUM {BSR,JITTARGET,NEWLINE,UNICODE_VERSION,VERSION})
  • INT config_int(ENUM {LINKSIZE,MATCHLIMIT,PARENSLIMIT,RECURSIONLIMIT})
  • STRING version()

regex

new OBJ = regex(STRING pattern, BOOL allow_empty_class=0, BOOL anchored=0, ENUM {ANYCRLF,UNICODE} bsr=0, BOOL alt_bsux=0, BOOL alt_circumflex=0, BOOL alt_verbnames=0, BOOL caseless=0, BOOL dollar_endonly=0, BOOL dotall=0, BOOL dupnames=0, BOOL extended=0, BOOL firstline=0, STRING locale=0, BOOL match_unset_backref=0, INT max_pattern_len=0, BOOL multiline=0, BOOL never_backslash_c=0, BOOL never_ucp=0, BOOL never_utf=0, ENUM {CR,LF,CRLF,ANYCRLF,ANY} newline=0, BOOL no_auto_capture=0, BOOL no_auto_possess=0, BOOL no_dotstar_anchor=0, BOOL no_start_optimize=0, BOOL no_utf_check=0, INT parens_nest_limit=0, BOOL ucp=0, BOOL ungreedy=0, BOOL use_offset_limit=0, BOOL utf=0)

Create a regex object from pattern according to the given compile options (or option defaults). If the pattern is invalid, then the VCL will fail to load, and the VCC compiler will emit an error message.

Examples:

sub vcl_init {

    # Match this pattern against the Host header (hence
    # case-insensitively), and capture part of the domain name.
    new domain = pcre2.regex("^www\.([^.]+)\.com$", caseless=true);

    # Match a max-age tag and capture the number.
    new maxage = pcre2.regex("max-age\s*=\s*(\d+)");

    # Group possible subdomains without capturing
    new submatcher = pcre2.regex("^www\.(domain1|domain2)\.com$",
                                 no_auto_capture=true, caseless=true);
}

regex.match

BOOL regex.match(PRIV_CALL, PRIV_TASK, STRING subject, INT len=0, BOOL anchored=0, INT match_limit=0, INT offset_limit=0, BOOL notbol=0, BOOL noteol=0, BOOL notempty=0, BOOL notempty_atstart=0, BOOL no_jit=0, BOOL no_utf_check=0, INT recursion_limit=0)

Return true if the compiled regex matches the subject string, as constrained by the given match options or option defaults.

The match may fail if any of the options are illegal for one of the reasons given above, or if a limit such as the match or recursion limit is reached. In that case, an error message is written to the Varnish log using the VCL_Error tag, and the method returns false.

If subject is undefined, for example if it is set from an unset header variable, then it is assumed to be the empty string. This follows VCL's handling of regex matching when the string to be matched is unset.

Example:

if (domain.match(req.http.Host)) {
   call do_on_match;
}

regex.backref

STRING regex.backref(INT ref, STRING fallback="**BACKREF METHOD FAILED**")

Returns the nth captured subexpression from the most recent successful call of the .match() method for this object in the same client or backend context, or a fallback string in case the capture fails. Backref 0 indicates the entire matched string. Thus this function behaves like the \n in the native VCL functions regsub and regsuball, and the $1, $2 ... variables in Perl. Unlike the regsubs, which limit the backref number to 0 through 9, backref permits any number that identifies a capturing group in the pattern.

Since Varnish client and backend operations run in different threads, .backref() can only refer back to a .match() call in the same thread. Thus a .backref() call in any of the vcl_backend_* subroutines -- the backend context -- refers back to a previous .match() in any of those same subroutines; and a call in any of the other VCL subroutines -- the client context -- refers back to a .match() in the same client context.

After unsuccessful matches, the fallback string is returned for any call to .backref(). The default value of fallback is "**BACKREF METHOD FAILED**". .backref() always fails after a failed match, even if .match() had been called successfully before the failure.

.backref() may also return fallback after a successful match, if no captured group in the matching string corresponds to the backref number. For example, when the pattern (a|(b))c matches the string ac, there is no backref 2, since nothing matches b in the string.

The VCL infix operators ~ and !~ do not affect this method, nor do the functions regsub or regsuball. Nor is it affected by the matches performed by any other method or function in this VMOD (the match() function, or the sub method or function).

.backref() fails, returning fallback and writing an error message to the Varnish log with the VCL_Error tag, under the following conditions (even if a previous match was successful and a substring could have been captured):

  • Any of the match options are illegal (for example, if one of the numeric limits was set to less than 0).
  • The fallback string is undefined.
  • ref (the backref number) is out of range -- if it is less than 0 or larger than the highest number for a capturing group in the pattern.
  • .match() was never called for this object in the task scope prior to calling .backref().

Example:

if (domain.match(req.http.Host)) {
   set req.http.X-Domain = domain.backref(1);
}
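
The fallback parameter can be used to supply a value for groups that did not take part in a match. This sketch uses the pattern (a|(b))c discussed above; the object and header names are illustrative:

import pcre2;

sub vcl_init {
    new ab = pcre2.regex("(a|(b))c");
}

sub vcl_recv {
    if (ab.match(req.url)) {
        # Group 1 is set whenever the pattern matches.
        set req.http.X-Group-1 = ab.backref(1);

        # Group 2 is only set if "bc" was matched; if "ac" was
        # matched, the fallback "none" is returned instead.
        set req.http.X-Group-2 = ab.backref(2, fallback="none");
    }
}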

regex.namedref

STRING regex.namedref(STRING name, STRING fallback="**NAMEDREF METHOD FAILED**")

Returns the captured subexpression designated by name from the most recent successful call to .match() in the current context (client or backend), or fallback in case of failure. See pcre2pattern(3) for details about the use of named subpatterns in PCRE2 regexen.

Note that a named capturing group can also be referenced as a numbered group -- the named groups are numbered exactly as if the names were not present. So an expression returned by .namedref() will also be returned by .backref() with the appropriate number.

fallback is returned when .namedref() is called after an unsuccessful match. The default fallback is "**NAMEDREF METHOD FAILED**".

Like .backref(), .namedref() is not affected by native VCL regex operations, nor by any other matches performed by methods or functions of the VMOD, except for a prior .match() for the same object.

.namedref() fails, returning fallback and logging a VCL_Error message, if:

  • The fallback string is undefined.
  • name is undefined.
  • There is no such named group.
  • .match() was not called for this object.

Example:

sub vcl_init {
      new domain = pcre2.regex("^www\.(?<domain>[^.]+)\.com$");
}

sub vcl_recv {
      if (domain.match(req.http.Host)) {
         set req.http.X-Domain = domain.namedref("domain");
      }
}
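
As a sketch of the points above, and assuming the domain object from the preceding example, the following retrieves the same capture by name and by number, with an explicit fallback in place of the default. The header names and the fallback value "unknown" are illustrative only:

sub vcl_recv {
      if (domain.match(req.http.Host)) {
         # The named group "domain" is also capturing group 1, so
         # both calls are expected to return the same string.
         set req.http.X-Domain = domain.namedref("domain",
                                                 fallback="unknown");
         set req.http.X-Domain-By-Number = domain.backref(1,
                                                 fallback="unknown");
      }
}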

regex.sub

STRING regex.sub(PRIV_CALL, PRIV_TASK, STRING subject, STRING replacement, INT len=0, BOOL anchored=0, INT match_limit=0, INT offset_limit=0, BOOL notbol=0, BOOL noteol=0, BOOL notempty=0, BOOL notempty_atstart=0, BOOL no_jit=0, BOOL no_utf_check=0, INT recursion_limit=0, BOOL suball=0, BOOL sub_extended=0, BOOL unknown_unset=0, BOOL unset_empty=0)

If the pattern represented by this object matches subject, then return a string formed by replacing the matched part with replacement. If the pattern does not match, then return the subject string unchanged. The match and substitution options affect these operations as described above.

This method is similar to the native VCL regsub function, or regsuball when the suball option is true, but the syntax of the replacement string is different. In the replacement string, these sequences can be used to insert strings:

$$
Inserts a dollar character.
$<n> or ${<n>}
Inserts the contents of group <n> captured during the match, where <n> can be a number or a name. The number can be 0 to include the entire matched string. Braces are only required if the following character would be interpreted as part of the number or name.
$*MARK or ${*MARK}
Inserts the name of the last (*MARK) encountered in the match.

For example, to rewrite URLs with prefixes of the form "/~<user>" so that their prefix is "/u/<user>" (and leave other URLs unchanged):

sub vcl_init {
    new user = pcre2.regex("/~([^/]+)(.*)", anchored=true);
}

sub vcl_recv {
    set req.url = user.sub(req.url, "/u/${1}${2}");
}
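
Since <n> may also be a group name, the same rewrite can be sketched with named groups. The object name user_named and the group names user and rest are chosen here for illustration:

sub vcl_init {
    new user_named = pcre2.regex("/~(?<user>[^/]+)(?<rest>.*)",
                                 anchored=true);
}

sub vcl_recv {
    # Braces separate each group name from the text that follows it.
    set req.url = user_named.sub(req.url, "/u/${user}${rest}");
}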

When the sub_extended option is false, only the dollar character is special in the replacement string. When sub_extended is true, the replacement syntax also has these capabilities:

  • Backslashes in the replacement string are interpreted as escapes, and special backslash sequences are interpreted as for PCRE2 patterns. For example, \n denotes newline, and \x{ddd}, where each d is a digit, specifies a character code. A backslash followed by a non-alphanumeric character quotes the character, and \Q and \E can be used to quote a longer sequence.

  • Four additional escape sequences can be used to force the case of inserted letters:

    • \U forces upper case for all of the following text until \E, or to the end of the string if there is no \E.
    • \L through \E or end of string forces lower case.
    • \u and \l force the next character, if it is a letter, to upper and lower case, respectively.

    Case forcing applies to all inserted characters, including those from captured groups and in sequences quoted by \Q through \E.

    Sequences ending in \E do not nest. So for example, "\Uaa\LBB\Ecc\E" results in "AAbbcc", and the final \E has no effect.

  • The "dollar" replacement expressions have an additional capability inspired by Bash to handle unset capturing groups:

    ${<n>:-<string>}

    As with ${<n>}, <n> can be a number or name. If group <n> is set, then its contents are inserted, otherwise <string> is expanded and inserted. <string> may, in turn, include elements of the replacement syntax that are interpreted accordingly.

    ${<n>:+<string1>:<string2>}

    If group <n> is set, insert the result of expanding <string1>, otherwise insert the result of expanding <string2>.

    Colons and escapes in the replacement strings can be escaped with backslashes.

For example, to rewrite Host headers of the form www.<sub1>.<sub2>.<tld> to <sub2>.<tld>, and of the form www.<sub>.<tld> to <sub>.<tld>, while also normalizing the header to lower-case, and leaving other Host headers unchanged:

sub vcl_init {
    new hostsub = pcre2.regex(extended=true, pattern={"
                  ^www\.              # www. prefix
                  ([^.]+)             # group 1, "<sub1>"
                  (?:                 # non-capturing parentheses
                    \.([^.]+)         # dot, then group 2, "<sub2>"
                  )?                  # 0 or 1 of group 2
                  \.([^.]+)$          # dot, then group 3, "<tld>"
                  "});
}

sub vcl_recv {
    set req.http.Host = hostsub.sub(req.http.Host, sub_extended=true,
                                    replacement="\L${2:+$2:$1}.$3");
}

.sub() fails, returning NULL while logging a VCL_Error message, if replacement is undefined.
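
The ${<n>:-<string>} form can be sketched in a similar way. The following assumes a hypothetical URL layout with an optional two-letter language prefix, and sets an illustrative header X-Lang to that prefix, or to "en" if group 1 is unset. The object name, header name and default value are assumptions made for this example:

sub vcl_init {
    # Group 1 is set only if the URL begins with a two-letter path
    # segment such as /de/.
    new lang = pcre2.regex("^/(?:([a-z]{2})/)?.*$");
}

sub vcl_recv {
    # The pattern spans the whole URL, so on a match the result is
    # just the expansion of the replacement.
    set req.http.X-Lang = lang.sub(req.url, sub_extended=true,
                                   replacement="${1:-en}");
}

If the pattern does not match at all (here, if the URL does not begin with a slash), the subject is returned unchanged, as described above.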

regex.info_bool

BOOL regex.info_bool(ENUM {ALLOW_EMPTY_CLASS,ANCHORED,ALT_BSUX,ALT_CIRCUMFLEX,ALT_VERBNAMES,CASELESS,DOLLAR_ENDONLY,DOTALL,DUPNAMES,EXTENDED,FIRSTLINE,MATCH_UNSET_BACKREF,MULTILINE,NEVER_BACKSLASH_C,NEVER_UCP,NEVER_UTF,NO_AUTO_CAPTURE,NO_AUTO_POSSESS,NO_DOTSTAR_ANCHOR,NO_START_OPTIMIZE,NO_UTF_CHECK,UCP,UNGREEDY,USE_OFFSET_LIMIT,UTF,HAS_FIRSTCODEUNIT,MATCH_ATSTART,HAS_LASTCODEUNIT,HAS_BACKSLASHC,HAS_CRORLF,JCHANGED,MATCH_EMPTY}, BOOL compiled=1)

Return true or false about a property of the regex that the object represents. This method and the other .info_* methods may be helpful for debugging and optimizing regular expression matching, for example by determining whether PCRE2 could enable certain optimizations for the pattern.

The ENUM determines which property is to be inspected. If the ENUM is any one of:

ALLOW_EMPTY_CLASS, ANCHORED, ALT_BSUX, ALT_CIRCUMFLEX,
ALT_VERBNAMES, CASELESS, DOLLAR_ENDONLY, DOTALL, DUPNAMES, EXTENDED,
FIRSTLINE, MATCH_UNSET_BACKREF, MULTILINE, NEVER_BACKSLASH_C,
NEVER_UCP, NEVER_UTF, NO_AUTO_CAPTURE, NO_AUTO_POSSESS,
NO_DOTSTAR_ANCHOR, NO_START_OPTIMIZE, NO_UTF_CHECK, UCP, UNGREEDY,
USE_OFFSET_LIMIT, UTF

then the return value of info_bool() indicates whether the corresponding compile option is true for the pattern. If compiled is true, then the return indicates whether the option was set to true after the pattern was compiled, even if it was specified differently (or left to the default) in the object constructor. If compiled is false, then the method returns the value of the option as it was provided in the constructor.

For example, if the compile option anchored was set to false in the constructor (or left to the default), PCRE2 may nevertheless determine that the pattern is anchored if certain conditions are satisfied (which are described in detail in pcre2api(3)). In that case, info_bool() will return true if compiled is true, and false if compiled is false.

compiled is true by default, and is ignored for the other ENUM values.
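
The effect of compiled can be sketched as follows, assuming a hypothetical object whose pattern begins with ^ and whose anchored option is left at the default. Whether PCRE2 actually treats a given pattern as anchored depends on the conditions in pcre2api(3), so the log message is written only if the two calls differ. The sketch assumes that the std VMOD has been imported:

sub vcl_init {
    # anchored is not set in the constructor.
    new starts = pcre2.regex("^/api/");
}

sub vcl_recv {
    if (starts.info_bool(ANCHORED)
        && !starts.info_bool(ANCHORED, compiled=false)) {
        std.log("PCRE2 anchored the pattern, anchored was not set");
    }
}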

The other ENUMs are interpreted as follows:

HAS_FIRSTCODEUNIT
If the pattern is unanchored, PCRE2 may determine that there is a unique code unit (a byte) that must appear at the start of the matching part of a string. For example, the part of a string that matches (cat|cow|coyote) must begin with a c. info_bool(HAS_FIRSTCODEUNIT) returns true if there is such a code unit, and false if the pattern is anchored or if no unique first code unit could be determined. If there is such a first code unit, it is returned by info_str(FIRSTCODEUNIT). Note that in non-UTF mode, the first code unit is the same as the first character, but for UTF-8 patterns, it may be the first byte in a multibyte character.
MATCH_ATSTART
If the pattern is unanchored and no unique first code unit in the matching part of the string is known, PCRE2 may determine that the pattern is constrained to match at the start of the subject string, or following a newline in the subject. In that case, info_bool(MATCH_ATSTART) returns true; it returns false if the pattern is anchored, if a unique first code unit could be found, or if the pattern could not be determined to match at the start.
HAS_LASTCODEUNIT
Under certain circumstances, PCRE2 may determine a rightmost literal code unit that must exist in a matching string, other than at the start. This is not necessarily the last byte in the matching part of a string, but rather the last literal code unit known to be required. For example, the b is recorded for this purpose for the pattern ab\d+, although the b must be followed by digits. If there is such a last code unit, info_bool(HAS_LASTCODEUNIT) returns true, and that value can be retrieved from info_str(LASTCODEUNIT). For anchored patterns, PCRE2 records a possible last literal code unit only if a part of the pattern that comes before it has variable length. For example, z is recorded for ^a\d+z\d+ (because one or more digits must come before it), but none is recorded for ^a\dz\d (because matching strings have a fixed length). As with the first code unit, the last code unit may be a byte in a multibyte UTF-8 character, if UTF is enabled for the pattern.
HAS_BACKSLASHC
Return true if and only if \C appears in the pattern.
HAS_CRORLF
Return true if and only if the pattern contains explicit matches for CR or LF characters. These can be literal carriage returns or linefeeds in the pattern, or the escape sequences \r or \n.
JCHANGED
Return true if and only if the pattern contains (?J) or (?-J), which set and unset the dupnames option (duplicate group names) within the pattern.
MATCH_EMPTY
Return true if and only if PCRE2 determines that the pattern might match the empty string. For certain complex patterns (with recursive subroutines), it may not be possible to determine; in that case, PCRE2 cautiously returns true.

Example:

# To determine if the FIRSTCODEUNIT optimization could be applied.
if (myregex.info_bool(HAS_FIRSTCODEUNIT)) {
    std.log("First matching char in the pattern = "
            + myregex.info_str(FIRSTCODEUNIT));
}

regex.info_int

INT regex.info_int(ENUM {BACKREFMAX,CAPTURECOUNT,JITSIZE,MATCHLIMIT,MAXLOOKBEHIND,MINLENGTH,RECURSIONLIMIT,SIZE})

Return an integer that describes a property of the pattern that the object represents, as determined by the ENUM.

BACKREFMAX
Return the highest back reference within the pattern. Remember that named groups also acquire group numbers, and thus count towards the highest backref. A conditional subpattern such as (?(3)a|b), which checks if a capturing group is set, also counts as a backref. If there are no backrefs, return 0.
CAPTURECOUNT
Return the highest capturing group number in the pattern. If the (?| construct (which allows duplicate group numbers, see pcre2pattern(3)) is not used in the pattern, then the value returned is also the total number of capturing groups.
JITSIZE
Return the size of JIT-compiled code for the pattern. Returns 0 if the pattern was not JIT-compiled.
MATCHLIMIT
If the pattern contains the construct (*LIMIT_MATCH=nnnn) to set the match limit (see the match option match_limit above), then return the limit that it sets. Returns -1 if no such value has been set.
MAXLOOKBEHIND
Return the number of characters in the longest lookbehind assertion in the pattern. Returns 0 if there are no lookbehinds.
MINLENGTH
If PCRE2 has determined that there is a lower bound for the length of a string that may match the pattern, then return that value. Returns 0 if no lower bound is known. This is not necessarily the same as the shortest string that may possibly match; but any string that does match must be at least that long.
RECURSIONLIMIT
If the pattern contains the construct (*LIMIT_RECURSION=nnnn) (see the match option recursion_limit above), then return the value that was set. Returns -1 if no such value has been set.
SIZE
Return the size of the compiled pattern, as used for the interpretive matcher, in bytes. This is independent of the value returned by info_int(JITSIZE).

Example:

# To determine if a lower bound on the length of matching strings
# could be found.
if (myregex.info_int(MINLENGTH) != 0) {
    std.log("Lower bound on matching string length = "
            + myregex.info_int(MINLENGTH));
}
else {
    std.log("No lower bound for matching string lengths found");
}

regex.info_str

STRING regex.info_str(ENUM {BSR,FIRSTCODEUNIT,FIRSTCODEUNITS,LASTCODEUNIT,NEWLINE}, STRING sep=" ")

Return a string that describes a property of the pattern represented by the object, as determined by the ENUM. The sep parameter is only relevant when the ENUM FIRSTCODEUNITS is used, as described below.

BSR
Return "UNICODE", meaning that \R in the pattern matches any Unicode line ending sequence, or "ANYCRLF", meaning that it matches only CR, LF or CRLF.
FIRSTCODEUNIT
If PCRE2 determines that there is a unique first code unit that must begin the matching part of a string (as described above for info_bool(HAS_FIRSTCODEUNIT)), then return that code unit in a string. Returns the empty string if no such code unit was determined; this is also the case if the pattern is anchored. Recall that a code unit corresponds to a character in non-UTF mode, but may be a byte in a multibyte character when UTF-8 is enabled. The code unit is not escaped in the return string.
FIRSTCODEUNITS
(Note the difference between FIRSTCODEUNIT, singular, and FIRSTCODEUNITS, plural.) For an unanchored pattern, if PCRE2 cannot determine a unique code unit that must appear at the start of the matching part of a string, it may be able to determine a set of such code units. For example, if the pattern starts with [abc], then the matching part must begin with a, b or c. In that case, info_str(FIRSTCODEUNITS) returns those code units in a string, separated by the string given as sep. The default value of sep is " " (the string containing one space). If the pattern is anchored, or if a unique first code unit could be found, or if no set of first code units could be found, then return the empty string.
LASTCODEUNIT
If PCRE2 has recorded a rightmost literal code unit that must exist in a matching string, as described for info_bool(HAS_LASTCODEUNIT) above, then return that code unit in a string. Returns the empty string if no such code unit was recorded.
NEWLINE

Return a string describing the default sequence recognized as a "newline" for the pattern:

  • "CR" (carriage return)
  • "LF" (linefeed)
  • "CRLF" (CR followed by LF)
  • "ANYCRLF" (CR, LF or CRLF)
  • "UNICODE" (any Unicode line-ending sequence)

Example:

# Determine if a set of first matching characters could be found.
std.log("First matching chars: " + myregex.info_str(FIRSTCODEUNITS));
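
A sketch that uses the sep parameter, assuming the object myregex from the example above was created from a pattern for which PCRE2 found a set of first code units rather than a unique one. The separator "," is an arbitrary choice:

# Log the set of possible first code units, separated by commas,
# if a set was found.
if (myregex.info_str(FIRSTCODEUNITS, sep=",") != "") {
    std.log("First matching chars: "
            + myregex.info_str(FIRSTCODEUNITS, sep=","));
}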

Regex functional interface

match

BOOL match(PRIV_CALL, PRIV_TASK, STRING pattern, STRING subject, BOOL allow_empty_class=0, BOOL anchored=0, ENUM {ANYCRLF,UNICODE} bsr=0, BOOL alt_bsux=0, BOOL alt_circumflex=0, BOOL alt_verbnames=0, BOOL caseless=0, BOOL dollar_endonly=0, BOOL dotall=0, BOOL dupnames=0, BOOL extended=0, BOOL firstline=0, STRING locale=0, BOOL match_unset_backref=0, INT max_pattern_len=0, BOOL multiline=0, BOOL never_backslash_c=0, BOOL never_ucp=0, BOOL never_utf=0, ENUM {CR,LF,CRLF,ANYCRLF,ANY} newline=0, BOOL no_auto_capture=0, BOOL no_auto_possess=0, BOOL no_dotstar_anchor=0, BOOL no_start_optimize=0, BOOL no_utf_check=0, INT parens_nest_limit=0, BOOL ucp=0, BOOL ungreedy=0, BOOL use_offset_limit=0, BOOL utf=0, INT len=0, INT match_limit=0, INT offset_limit=0, BOOL notbol=0, BOOL noteol=0, BOOL notempty=0, BOOL notempty_atstart=0, BOOL no_jit=0, INT recursion_limit=0)

Compile the pattern and return true if it matches subject. Compilation and matching are subject to the given options, or default options. The compiled pattern is discarded after use, and pattern is compiled on every invocation.

The call fails, logging a VCL_Error message and returning false, if:

  • pattern is undefined.
  • The compile fails (for example due to a syntax error).
  • Any compile or match option is illegal as described above.

As with the .match() method, if subject is undefined, then it is assumed to be the empty string.

Example:

# Match a request header against a pattern provided in a response
# header.
if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject)) {
    call do_on_match;
}
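
Compile and match options are passed as named arguments in the same call. A minimal sketch, in which the pattern and the do_on_admin subroutine are hypothetical:

# Case-insensitive match of the URL against a fixed pattern.
if (pcre2.match("^/admin(/|$)", req.url, caseless=true)) {
    call do_on_admin;
}

Because the function compiles the pattern on every invocation, a pattern that is fixed at VCL load time, like this one, may be better served by an object created in vcl_init. The function interface is more useful when the pattern is only known at runtime, as in the example above.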

backref

STRING backref(PRIV_TASK, INT ref, STRING fallback="**BACKREF FUNCTION FAILED**")

Return the nth captured subexpression from the most recent successful call of the match() function in the current client or backend context, or a fallback string if the capture fails. The default fallback is "**BACKREF FUNCTION FAILED**".

As with the regex.backref() method, fallback is returned after any failed invocation of the match() function, or if there is no captured group corresponding to the backref number. The function is not affected by native VCL regex operations, or any other method or function of the VMOD except for the match() function.

The function fails, returning fallback and logging a VCL_Error message, under the same conditions as the corresponding method:

  • fallback is undefined.
  • ref is out of range.
  • The match() function was never called in this context.
  • The pattern failed to compile for the previous match() call.

Example:

# Match against a pattern provided in a response header, and capture
# subexpression 1.
if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject)) {
   set resp.http.X-Group-1 = pcre2.backref(1);
}

namedref

STRING namedref(PRIV_TASK, STRING name, STRING fallback="**NAMEDREF FUNCTION FAILED**")

Return the captured subexpression designated by name from the most recent successful call of the match() function in the current context, or fallback in case of failure. The default fallback is "**NAMEDREF FUNCTION FAILED**".

The function returns fallback when the previous invocation of the match() function failed, and is only affected by use of the match() function. The function fails, returning fallback and logging a VCL_Error message, under the same conditions as the corresponding method:

  • fallback is undefined.
  • name is undefined or the empty string.
  • There is no such named group.
  • match() was not called in this context.
  • The pattern failed to compile for the previous match() call.

Example:

if (pcre2.match(resp.http.X-Pattern-With-Names, req.http.X-Subject)) {
   set resp.http.X-Group-Foo = pcre2.namedref("foo");
}

sub

STRING sub(PRIV_CALL, PRIV_TASK, STRING pattern, STRING subject, STRING replacement, BOOL allow_empty_class=0, BOOL anchored=0, ENUM {ANYCRLF,UNICODE} bsr=0, BOOL alt_bsux=0, BOOL alt_circumflex=0, BOOL alt_verbnames=0, BOOL caseless=0, BOOL dollar_endonly=0, BOOL dotall=0, BOOL dupnames=0, BOOL extended=0, BOOL firstline=0, STRING locale=0, BOOL match_unset_backref=0, INT max_pattern_len=0, BOOL multiline=0, BOOL never_backslash_c=0, BOOL never_ucp=0, BOOL never_utf=0, ENUM {CR,LF,CRLF,ANYCRLF,ANY} newline=0, BOOL no_auto_capture=0, BOOL no_auto_possess=0, BOOL no_dotstar_anchor=0, BOOL no_start_optimize=0, BOOL no_utf_check=0, INT parens_nest_limit=0, BOOL ucp=0, BOOL ungreedy=0, BOOL use_offset_limit=0, BOOL utf=0, INT len=0, INT match_limit=0, INT offset_limit=0, BOOL notbol=0, BOOL noteol=0, BOOL notempty=0, BOOL notempty_atstart=0, BOOL no_jit=0, INT recursion_limit=0, BOOL suball=0, BOOL sub_extended=0, BOOL unknown_unset=0, BOOL unset_empty=0)

Compile pattern, and if it matches subject, then return a string formed by replacing the part that matched with replacement. If the pattern does not match, return subject unchanged. The compile, match and substitution options affect all of these operations, as described above.

The syntax of the replacement string, as modified if the sub_extended option is true, is the same as documented above for the .sub() method.

sub() fails, returning NULL and logging a VCL_Error message, if:

  • Either of pattern or replacement is undefined.
  • pattern cannot be compiled.

Example:

# If the beresp header X-Sub-Letters contains "b+", and Host contains
# "www.yabba.dabba.doo.com", then set X-Yada to
# "www.yada.dabba.doo.com".
set beresp.http.X-Yada = pcre2.sub(beresp.http.X-Sub-Letters,
                                   bereq.http.Host, "d");
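
The substitution options are passed as named arguments in the same call. As a sketch, the following collapses every run of consecutive slashes in the URL to a single slash; without suball, only the first such run would be replaced:

set req.url = pcre2.sub("/{2,}", req.url, "/", suball=true);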

Library configuration

config_bool

BOOL config_bool(ENUM {JIT,STACKRECURSE,UNICODE})

Return true or false about a property of the PCRE2 library to which the VMOD is linked, identified by the ENUM. The config_* functions make it possible to discover features of the library that were chosen when it was built.

JIT
Return true if the library supports just-in-time compilation and matching.
STACKRECURSE
Return true if internal recursion for the PCRE2 matcher uses the system stack to maintain its state, which is the usual way the library is built. If false is returned, PCRE2 uses blocks of data on the heap rather than recursive function calls.
UNICODE
Return true if Unicode support is available. If so, then the compile option utf can be used to define a pattern and the strings against which it is matched as UTF-8 strings.

Example:

if (pcre2.config_bool(JIT)) {
    std.log("JIT supported for PCRE2");
}
else {
    std.log("JIT not supported for PCRE2");
}
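
A further sketch, assuming that the config_* functions may be called in vcl_init (this manual does not say so explicitly) and that the VCL depends on patterns compiled with the utf option. It makes loading the VCL fail if the library lacks Unicode support:

sub vcl_init {
    if (!pcre2.config_bool(UNICODE)) {
        # Refuse to load a VCL that needs UTF-8 patterns.
        return (fail);
    }
}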

config_str

STRING config_str(ENUM {BSR,JITTARGET,NEWLINE,UNICODE_VERSION,VERSION})

Return a string describing a property of the PCRE2 library.

BSR
Return a string indicating what the \R escape sequence matches by default: UNICODE for Unicode line-ending sequences, or ANYCRLF for only CR, LF and CRLF. This is the default that holds if no value is given for the compile option bsr.
JITTARGET
Return a string identifying the architecture for which the JIT compiler is configured. If JIT is not enabled, the returned string contains the phrase "JIT not supported".
NEWLINE

Return a string identifying the character sequence that is recognized by default as a newline:

  • "CR" (carriage return)
  • "LF" (linefeed)
  • "CRLF" (CR followed by LF)
  • "ANY" (any Unicode line ending)
  • "ANYCRLF" (any of CR, LF or CRLF)

This is the default if no value is given for the compile option newline.

UNICODE_VERSION
If Unicode is supported by the library, return the Unicode version string. If not, return "Unicode not supported".
VERSION
Return the PCRE2 version string.

Example:

std.log("Linked to PCRE2 version " + pcre2.config_str(VERSION));

config_int

INT config_int(ENUM {LINKSIZE,MATCHLIMIT,PARENSLIMIT,RECURSIONLIMIT})

Return an integer describing a property of the PCRE2 library.

LINKSIZE
Return the number of bytes used for internal linkage (offsets) in compiled regular expressions. This determines the size of the largest possible pattern; the default link size of 2 allows for patterns of up to 64K bytes.
MATCHLIMIT
Return the default value for the match_limit match option, which limits the effort of the matcher when no match is found.
PARENSLIMIT
Return the default value of the parens_nest_limit compile option, which limits the depth of parenthesis nesting in patterns, and hence the use of the stack during compilation.
RECURSIONLIMIT
Return the default value of the recursion_limit match option, which limits the depth of recursion, and hence stack usage, for the interpretive (non-JIT) matcher.

Example:

std.log("Default PCRE2 match limit = " + pcre2.config_int(MATCHLIMIT));

version

STRING version()

Returns the version string for this VMOD.

Example:

std.log("Using VMOD pcre2 version " + pcre2.version());

REQUIREMENTS

This VMOD has been tested with Varnish version 5.1.2 and PCRE2 version 10.23.

INSTALLATION

See INSTALL.rst in the source repository.

LIMITATIONS

The VMOD allocates Varnish workspace for a variety of purposes:

  • The string returned by the sub method and function.
  • Buffers for temporary data structures needed by the PCRE2 library, for example to save information about a match for use by the backref and namedref methods and functions.
  • A copy of the subject string for the match method and function, if it is not already in workspace, so that it can be safely accessed by backref and namedref.
  • Return strings for some uses of info_str and config_str.
  • Temporary buffers for error message strings from the PCRE2 library.

If VMOD operations fail with the "out of space" error message in the Varnish log (with the VCL_Error tag), increase the varnishd runtime parameters workspace_client and/or workspace_backend.

The PCRE2 interpretive and JIT matchers are backtracking matchers, and the interpretive matcher is recursive, using part of the stack on each recursive call (in the default library configuration). For patterns with large search spaces, this can lead to slow matches, high CPU usage, and stack overflow due to deep recursion, which typically causes Varnish to segfault. This has occasionally been the subject of issues reported to the Varnish project.

For most common uses of regular expressions in VCL, PCRE2 is very fast and has minimal resource consumption. This depends strongly on how the regex is written -- a well-crafted pattern helps the matcher limit backtracking, fail early on non-matches, and make use of some of the optimizations that PCRE2 can apply. Some of the compile and match options also help to optimize the match operation. Which of these measures is possible depends, of course, on what you want the regex to do.

Writing optimized regexen is a very broad subject, beyond the scope of this manual. There is some advice in pcre2perform(3), and in many other sources.

If your use case requires patterns and subject strings that can lead to very large search spaces, consider using some of the options available in the VMOD that limit excessive effort for unsuccessful matches. In particular, consider lowering the match options match_limit and recursion_limit. You can also use offset_limit to set a maximum length to search for a match in the subject string (for which you will have to set the compile option use_offset_limit). These may cause the matcher to halt before it has exhausted all possibilities for a match (but it appears to be common that, if the matcher has to search for a long time, then there was never any match to be found).
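
A sketch of these options, with limit values that are arbitrary and would have to be chosen for the actual workload. The object name, pattern and header are hypothetical:

sub vcl_init {
    # use_offset_limit must be set at compile time so that
    # offset_limit can be used for matches.
    new sess = pcre2.regex("session=([0-9a-f]+)",
                           use_offset_limit=true);
}

sub vcl_recv {
    # Bound the effort of the matcher, and how far into the subject
    # it searches.
    if (sess.match(req.url, match_limit=10000, recursion_limit=1000,
                   offset_limit=256)) {
        set req.http.X-Session = sess.backref(1);
    }
}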

If you encounter stack overflow, it may help to increase the stack size (by changing limits.conf or calling ulimit -s before starting Varnish). Since Varnish 4.1, you can also increase the varnishd parameter thread_pool_stack. Bear in mind that this increases the total RAM usage of Varnish.

ACKNOWLEDGEMENTS

A tip of the hat to Philip Hazel, who released the first version of PCRE twenty years before this VMOD was developed.

A few sentences in this manual are identical to or very closely track phrasings in the PCRE2 documentation, if there was simply no better way to say what needs to be said.

SEE ALSO

COPYRIGHT

This document is licensed under the same conditions
as the libvmod-pcre2 project. See LICENSE for details.

Author: Geoffrey Simmons <geoffrey.simmons@uplex.de>