Commit ed328f11 authored by Geoff Simmons

Add documentation.

parent de676c22
Pipeline #267 skipped
......@@ -26,13 +26,602 @@ import pcre2 [from "path"] ;
::
new OBJECT = ...
# object interface
new OBJECT = pcre2.regex(STRING pattern [, compile options])
BOOL <OBJ>.match(STRING subject [, match options])
STRING <OBJ>.backref(INT ref)
STRING <OBJ>.namedref(STRING name)
STRING <OBJ>.sub(STRING subject, STRING replacement [, match options]
[, substitution options])
BOOL <OBJ>.info_bool(ENUM)
INT <OBJ>.info_int(ENUM)
STRING <OBJ>.info_str(ENUM)
# function interface
BOOL pcre2.match(STRING pattern, STRING subject [, compile options]
[, match options])
STRING pcre2.backref(INT ref)
STRING pcre2.namedref(STRING name)
STRING pcre2.sub(STRING pattern, STRING subject, STRING replacement
[, compile options] [, match options]
[, substitution options])
# library configuration
BOOL pcre2.config_bool(ENUM)
INT pcre2.config_int(ENUM)
STRING pcre2.config_str(ENUM)
DESCRIPTION
===========
This Varnish Module (VMOD) provides access to the PCRE2 regular
expression library. PCRE2 is the Perl-compatible regular expression
library with a revised API, the successor to the PCRE library that
implements native regexen in Varnish VCL. See `pcre2(3)`_ and the
manuals that it references for details about the PCRE2 library.
PCRE2, by itself, does not change regular expressions from the
perspective of the end user -- the syntax and semantics of patterns
and pattern matching remained largely the same at the time PCRE2 was
introduced. The new library is a refactoring of the internal API,
which is transparent to the user, and the VMOD endeavors to make use
of the new internal features advantageously for VCL.
Some of the differences between the VMOD and native VCL regexen are:
* The VMOD provides methods and functions to retrieve back references
after a match that are easier to use than the idiom with the
``regsub`` function that is necessary in native VCL. It also
provides the means to retrieve references to named capturing groups.
* The functional interface makes it possible to use patterns that are
not known until runtime.
* PCRE2 introduces a new native substitution function, similar to the
``regsub`` and ``regsuball`` functions in VCL, except that the
substitution syntax is different and provides more features.
* Parameters that limit the depth of recursion and backtracking in
match operations, which are set globally in Varnish, can be set for
individual matches in the VMOD.
* The VMOD can support matching against UTF-8 strings, if it is
running against a PCRE2 library that was built to support Unicode.
* The VMOD exposes considerably more functionality of the underlying
library. VCL provides a general-purpose regular expression facility
-- PCRE could be easily replaced as its regex engine. The VMOD is
meant to be specific to PCRE2, and makes a full range of its
features available in VCL.
* The VMOD provides methods and functions that allow you to inspect
properties of patterns and of the library. These are not likely to
be useful on the fast path of production deployments, and are not
optimized for that. But they may be useful during development to
debug and optimize regex matching.
Since the introduction of PCRE2, the original PCRE library is being
maintained for bugfixes, but new features and optimizations are only
being developed for PCRE2. So the VMOD will make it
possible to take advantage of improvements in the library as they are
released.
Here are some simple usage examples::
# regex objects are created in vcl_init, and the regular expressions
# are compiled when VCL is loaded.
sub vcl_init {
# A regex to match the "foo" cookie, and capture its value.
new foo = pcre2.regex("\bfoo=([^;,\s]+\b)");
# A regex to match a URL beginning with the prefix "/bar/", and
# capture its suffix.
new bar = pcre2.regex("^/bar/(.+)");
}
sub vcl_recv {
# If the cookie header contains "foo", then assign its value
# to another header.
if (foo.match(req.http.Cookie)) {
set req.http.X-Foo-Value = foo.backref(1);
}
# If the URL begins with "/bar/", then replace the prefix with
# "/baz/quux/".
if (bar.match(req.url)) {
set req.url = "/baz/quux/" + bar.backref(1);
}
}
Object and functional interfaces
--------------------------------
The VMOD provides regular expression operations by way of the
``regex`` object interface and a functional interface. For ``regex``
objects, the pattern is compiled at VCL initialization time, and the
compiled pattern is re-used for each invocation of its
methods. Compilation failures (due to errors in the pattern) cause
failure at initialization time, and the VCL fails to load. The
``.backref()`` and ``.namedref()`` methods refer back to the last
invocation of the ``.match()`` method for the same object. The
``.sub()`` method also re-uses an object's compiled pattern.
The functional interface provides the same set of operations, but the
pattern is compiled at runtime on each invocation of the ``match()``
and ``sub()`` functions (and then discarded). Compilation failures are
reported as errors in the Varnish log. The ``backref()`` and
``namedref()`` functions refer back to the last invocation of the
``match()`` function, for any pattern.
Compiling a pattern at runtime on each invocation is considerably more
costly than re-using a compiled pattern. So for patterns that are
fixed and known at VCL initialization, the object interface should be
used. The functional interface should only be used for patterns whose
contents are not known until runtime.
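As a minimal sketch of the functional interface, the following matches
the URL against a pattern that is only known at runtime. The header
names ``X-Url-Pattern`` and ``X-Matched`` are hypothetical, and
``backref(0)`` is assumed to return the entire matched string, as it
does for the ``.backref()`` method::

  import pcre2;

  sub vcl_recv {
    # X-Url-Pattern is a hypothetical header, assumed to be set by a
    # trusted component in front of Varnish. The pattern is compiled
    # on every invocation and then discarded.
    if (pcre2.match(req.http.X-Url-Pattern, req.url)) {
      # Refers back to the most recent match() function call.
      set req.http.X-Matched = pcre2.backref(0);
    }
  }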
Compile, match and substitution options
---------------------------------------
The VMOD has unusually long lists of parameters for its methods and
functions -- over 40 for the ``sub()`` function, for example. But
nearly all of these have default values, and it is only necessary to
specify options in VCL that differ from the defaults.
The optional parameters affect the interpretations of patterns and the
operation of matches and substitutions, and come in three groups:
* *Compile* options, used wherever a pattern is compiled: in the
``regex`` object constructor, and the ``match()`` and ``sub()``
functions.
* *Match* options, used wherever a match is performed: in the
``match`` and ``sub`` methods and functions.
* *Substitution* options, used in the ``sub`` method and function.
The options have call scope, meaning that they are evaluated only once
for each invocation of a function or method at its particular location
in the VCL source, on the first invocation after the VCL instance is
loaded. The options are then cached and re-used for all subsequent
invocations, and cannot be changed (until a new VCL instance is
loaded).
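To illustrate call scope, here is a sketch with two invocations of the
same ``.match()`` method at different locations in the VCL source. Each
location evaluates and caches its own options. The object name and the
``X-Bar-*`` header names are chosen for illustration only::

  import pcre2;

  sub vcl_init {
    new bar = pcre2.regex("/bar/(.+)");
  }

  sub vcl_recv {
    # First call site: anchored=true is evaluated on the first
    # invocation here, and then re-used for every later invocation
    # at this location.
    if (bar.match(req.url, anchored=true)) {
      set req.http.X-Bar-Url = bar.backref(1);
    }

    # Second call site: no options are given, so the defaults are
    # cached for this location, independently of the call above.
    if (bar.match(req.http.Referer)) {
      set req.http.X-Bar-Referer = bar.backref(1);
    }
  }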
Compile options
~~~~~~~~~~~~~~~
Compile options define properties of patterns. See `pcre2pattern(3)`_
for details of PCRE2 pattern syntax, and `pcre2syntax(3)`_ for a quick
reference.
The default value of all of the BOOL options is **false**.
See also `JIT compilation and matching`_ below.
``allow_empty_class``
If true, then a pattern may include ``[]`` to denote an empty
character class. This, in part, supports compatibility with regexen
in ECMAScript (also known as JavaScript). By default, a closing
square bracket after an opening one is interpreted as a character in
the class (and ``]`` must appear later in the pattern).
``alt_bsux``
(Referring to "backslash-u" and "backslash-x".) If true, then three
escape sequences are interpreted differently (for compatibility with
ECMAScript):
* ``\U`` matches an upper case ``U`` character. By default, ``\U``
causes a compile error.
* ``\u`` matches a lower case ``u``, unless it is followed by four
hexadecimal digits, in which case the hex number identifies the
code point to be matched. By default, ``\u`` causes a compile
error.
* ``\x`` matches a lower case ``x``, unless it is followed by two
hex digits, in which case it identifies the code point to match.
By default, ``\x`` must always be followed by zero to two hex
digits to identify a one-byte character (for example, ``\xz``
matches binary zero followed by ``z``).
``alt_circumflex``
If true, and if ``multiline`` is also true, then the ``^``
meta-character matches after a newline appearing as the last
character in a string. By default, ``^`` does not match after
a terminating newline.
``alt_verbnames``
If true, then backslash processing may be applied to verb names in
verb sequences such as ``(*MARK:NAME)``, so that the name can, for
example, include a closing parenthesis as ``\)`` or between ``\Q``
and ``\E``. By default, no processing is applied to verb names, and
they end at the first closing parenthesis (regardless of any
backslash).
``anchored``
If true, then the pattern is anchored, meaning that it is
constrained to match at the starting point of a string. This may
also be achieved with constructs in the pattern.
``bsr``
(For "backslash-R".) If this ENUM value is set, then it determines
which sequences are matched by ``\R``. If set to ``UNICODE``, then
``\R`` matches any UTF-8 newline sequence. If set to ``ANYCRLF``,
then it matches CR (carriage return, or ``\r``), LF (linefeed, or
``\n``), or CR followed by LF. By default, ``\R`` matches the
sequence chosen when the PCRE2 library was built, which can be
determined from ``config_str(BSR)`` (the library's build default is
Unicode). See `pcre2pattern(3)`_ for details about ``\R``.
``caseless``
If true, then matches for this pattern are case-insensitive. This
may also be achieved with ``(?i)`` in the pattern.
``dollar_endonly``
If true, then the ``$`` metacharacter matches only at the end of a
string. By default, ``$`` also matches immediately before a newline
at the very end of the string (but not before any other newline).
``dollar_endonly`` is ignored when ``multiline`` is true.
``dotall``
If true, then the ``.`` metacharacter matches any character,
including newlines. But it only ever matches one character, even if
newlines are coded as CRLF. By default, dots do not match
newlines. The effect of ``dotall`` can also be achieved with
``(?s)`` in the pattern.
``dupnames``
If true, then the names used for named capturing groups are not
required to be unique. By default, names for capturing groups may
only be used once.
``extended``
If true, then pattern syntax is permitted to contain constructs that
serve as self-documentation:
* Most whitespace is ignored, except when escaped or inside a
character class (and a few other exceptions detailed in
`pcre2api(3)`_).
* All characters between an unescaped ``#`` and the next newline are
ignored, and can be used as comments.
For example, this is a self-documenting declaration of a pattern
that matches IPv6 addresses::
new ipv6 = pcre2.regex(extended=true, caseless=true, pattern=
{"^(?!:) # colon disallowed at start
(?: # start of item
(?: [0-9a-f]{1,4} | # 1-4 hex digits or
(?(1)0 | () ) ) # fail if null previously matched
: # followed by colon
){1,7} # end item; 1-7 of them required
[0-9a-f]{1,4} $ # final hex number at end of string
(?(1)|.) # there was an empty component
"});
The effect of ``extended`` can also be achieved with the ``(?x)``
option in a pattern.
``firstline``
If true, an unanchored pattern must match before or at the first
newline in the subject string (though the matched text may continue
over a newline). If the ``offset_limit`` option is also set for a
match, then the match must occur within the offset limit and in the
first line.
``locale``
If ``locale`` is set to a string matching a locale that is available
on the system on which Varnish is running, then that locale is used
for the pattern to determine which characters are letters, digits,
upper and lower case, and so forth. Hence this option affects the
interpretation of constructs such as ``\w`` and ``\d``, the
``caseless`` option, and so on. This only applies to single-byte
characters.
If ``locale`` is set to a string that is not recognized as a locale,
then compilation fails.
By default, PCRE2 uses tables established when the library is built
to recognize character properties; normally, these only recognize
ASCII characters.
Quoting `pcre2api(3)`_:
The use of locales with Unicode is discouraged. If you are
handling characters with code points greater than 128, you should
either use Unicode support, or use locales, but not try to mix the
two.
``match_unset_backref``
If true, then a back reference to an unset capturing group matches
an empty string; thus ``(\1)(a)`` successfully matches ``a``. This
makes the pattern similar to an ECMAScript pattern. By default, an
unset backref causes the matcher to backtrack, and possibly fail.
``max_pattern_len``
If this INT value is greater than 0, then it sets a maximum length
for the pattern string to be compiled. If the pattern is longer, then
compilation fails.
``multiline``
If true, then the ``^`` and ``$`` meta-characters match immediately
after and before internal newlines in the subject string, respectively,
in addition to matching at the start and end of the string. By default,
the start and end anchors only match at the beginning and end of the
string, regardless of internal newlines. The effect of ``multiline``
can also be achieved with ``(?m)`` in the pattern.
``never_backslash_c``
If true, then ``\C`` may not be used in a pattern, and causes
compile failure. ``\C`` always matches exactly one byte, even in UTF
mode, and may lead to unpredictable effects if it matches in the
middle of a multibyte UTF-8 character. ``\C`` may have been
prohibited by a build-time option in the library, which can be
discovered by calling ``config_bool(NEVER_BACKSLASH_C)``.
``never_ucp``
If true, then Unicode properties are not used to interpret ``\B``,
``\b``, ``\D``, ``\d``, ``\S``, ``\s``, ``\W``, ``\w``, and some of
the POSIX character classes in the pattern. It is then impossible to
activate this facility by including ``(*UCP)`` at the start of the
pattern. If ``never_ucp`` and ``ucp`` are both set to true, then
the compile fails.
``newline``
If this ENUM value is set, it determines which characters are to be
matched as newlines in the pattern. It can be set to:
* ``CR`` (carriage return)
* ``LF`` (linefeed)
* ``CRLF`` (CR followed by LF)
* ``ANYCRLF`` (CR, LF or CRLF)
* ``UNICODE`` (any Unicode line-ending sequence)
By default, the newline sequence chosen for the PCRE2 library when
it was built is used, which can be determined from
``config_str(NEWLINE)``.
``no_auto_capture``
If true, then numbered capturing groups are disabled in the pattern.
Any opening parenthesis not followed by ``?`` is then interpreted as
if it were followed by ``?:`` (that is, it forms a non-capturing
group). Named capturing groups can still be used, and these also
acquire a capturing group number, so ``namedref`` and ``backref``
can still be used (but only for the named groups).
``no_auto_possess``
If true, then the "auto-possessification" optimization is disabled
for the pattern. This optimization interprets ``a+b`` as ``a++b``,
for example, using the "possessive quantifier" to prevent backtracks
into ``a+`` that can never be successful. If the option is true, then
the full unoptimized search is run.
``no_start_optimize``
If true, then some optimizations for the start of the match are
disabled. This has the effect that certain constructs in the
pattern, such as ``(*COMMIT)`` or ``(*MARK)``, are evaluated at
every possible starting position in the string, while they may have
been skipped when the optimizations are applied. Thus this option
may change the result of ``match`` calls in patterns that include
such constructs. See `pcre2api(3)`_ for details.
``no_utf_check``
If this option and ``utf`` are both true, then validity checks to
determine if the pattern is a valid UTF string are disabled. This
may save CPU usage and time for the ``match()`` and ``sub()``
functions, which compile patterns on every invocation, and check UTF
strings for validity by default. But you should only do so if you
are sure that the inputs are valid, because running matches in UTF
mode against invalid strings is undefined, and may cause Varnish to
crash or loop. By default, invalid UTF strings in the pattern cause
the compile to fail in UTF mode. See `pcre2unicode(3)`_ for details.
``parens_nest_limit``
If this INT value is greater than 0, it sets the maximum depth of
parenthesis nesting in a pattern. It applies to all kinds of
parentheses, not just capturing groups. The limit prevents patterns
from using too much of the stack when compiled, and may be useful
for the functional interface, for which patterns are compiled at
runtime. By default, the nesting limit set for the PCRE2 library at
build time is imposed, which is returned by
``config_int(PARENSLIMIT)``.
``ucp``
If this option and ``utf`` are both true, then Unicode properties
are used to interpret ``\B``, ``\b``, ``\D``, ``\d``, ``\S``,
``\s``, ``\W``, ``\w``, and some of the POSIX character classes in
the pattern. The same effect can be achieved by including ``(*UCP)``
at the start of the pattern. By default, only ASCII characters are
considered for these constructs, which is faster than considering
Unicode properties. If Unicode was disabled at build time for the
PCRE2 library, which can be discovered by calling
``config_bool(UNICODE)``, then the compile fails when this option is
true. Compiles also fail if this option and ``never_ucp`` are both
true. See `pcre2unicode(3)`_ for details about Unicode character
properties.
``ungreedy``
If true, then the "greediness" of quantifiers in the pattern is
inverted, so that they are not greedy by default, but become
greedy when followed by ``?``. The same effect can be achieved
by including ``(?U)`` in the pattern.
``use_offset_limit``
This option must be set to true for a pattern if you intend to use
the ``offset_limit`` parameter in match and substitution operations
to limit how far a string is searched for an unanchored match. If an
``offset_limit`` is set for an invocation of the ``match`` or
``sub`` methods or functions, but this option was not set to true
for the pattern, then the match fails.
``utf``
If true, then both the pattern and the strings against which it is
matched are processed as UTF-8 strings. If Unicode support was
disabled when the PCRE2 library was built, which can be determined
from ``config_bool(UNICODE)``, then the compile fails when ``utf``
is true. See `pcre2unicode(3)`_ for details about Unicode support in
PCRE2.
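The following sketch combines a few of the compile options in the
``regex`` constructor. The object names, patterns and ``X-*`` header
names are assumptions made for illustration::

  import pcre2;

  sub vcl_init {
    # Since no_auto_capture is true, the unnamed parentheses only
    # group. The named group is the only capturing group, and hence
    # acquires group number 1.
    new site = pcre2.regex("^(www|static)\.(?<name>[^.]+)\.com$",
                           no_auto_capture=true, caseless=true);

    # With ungreedy, (.+) matches as little as possible.
    new first = pcre2.regex("^/(.+)/", ungreedy=true);
  }

  sub vcl_recv {
    if (site.match(req.http.Host)) {
      set req.http.X-Site = site.namedref("name");
    }
    if (first.match(req.url)) {
      # For the URL "/a/b/c/x", group 1 is "a". With the default
      # greedy quantifier, it would be "a/b/c".
      set req.http.X-First-Segment = first.backref(1);
    }
  }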
Match options
~~~~~~~~~~~~~
Match options affect the operation of matching in the ``match`` and
``sub`` methods and functions. By default, all of the BOOL options
are **false**. The INT options are 0 by default (meaning that they
are ignored, and the global defaults hold). The INT options MAY NOT
be less than 0; if they are, then the match fails.
``anchored``
If true, then the match is constrained to match at the start of the
string, regardless of whether the pattern is anchored. By default, a
match is searched for anywhere in the string if the pattern is not
anchored.
``len``
If this INT value is greater than 0, it sets the length of the
subject string to be matched. By default, the full string is matched.
``match_limit``
If this INT value is greater than 0, it sets a limit to the effort
used by the PCRE2 matching function to find a match. This can
prevent matches from excessive backtracking, if there is a very
large search space but a match is never found. It is equivalent to
the varnishd parameter ``pcre_match_limit``, except that it applies
only to the match operation in which it was set, not globally. The
varnishd parameters for PCRE have no effect on this VMOD. By
default, the match limit that was set for the PCRE2 library at build
time is imposed, which can be discovered from
``config_int(MATCHLIMIT)``.
``notbol``
If true, the first character of the subject string is not
considered to be the beginning of a line, so the ``^`` metacharacter
does not match before it. If the compile option ``multiline`` was
not set to true for the pattern, then ``^`` never matches. This
option only affects the circumflex metacharacter.
``noteol``
If true, the end of the subject string is not considered to be the
end of a line, so the ``$`` metacharacter does not match after it.
If ``multiline`` was not set to true for the pattern, then ``$``
never matches. This option only affects the dollar metacharacter.
``notempty``
If true, then the empty string is not a valid match. If the matcher
finds an empty match, then it considers other alternatives, and if
no other valid matches are found, then the match fails.
``notempty_atstart``
If true, then the empty string is not a valid match at the start of
the subject string. An empty string match later in the subject is
permitted.
``no_jit``
If true, then the just-in-time matcher is not used, even when the
pattern was compiled for JIT. In that case, PCRE2's "traditional"
interpretive matcher is used (as is always the case if JIT is not
available, or if the pattern was not JIT-compiled). If ``no_jit``
is true for an invocation of the ``match()`` or ``sub()`` functions,
which compile a pattern on every call, then the pattern is also not
JIT-compiled. See `JIT compilation and matching`_ below.
``no_utf_check``
If true, then the subject is not checked for validity as a UTF-8
string when matched against a pattern for which ``utf`` was set to
true. This may speed up matching, but should only be done if you
are sure that the inputs are valid UTF-8. By default, UTF validity
is checked for matches against patterns that were compiled with
``utf``.
``offset_limit``
If this INT value is greater than 0, it limits how far an unanchored
search can advance in the subject string. For example, if the
pattern ``abc`` is matched against the string ``"123abc"`` and the
offset limit is less than 3, the match fails. To use this parameter,
the compile option ``use_offset_limit`` must have been set to true
for the pattern at compile time; otherwise the match fails. By
default, unanchored matches are searched for until the end of the
string.
``recursion_limit``
If this INT value is greater than 0, then it limits the depth of
recursion for matches using the interpretive matcher. It is
equivalent to the varnishd parameter ``pcre_match_limit_recursion``,
but only applies to the individual match. This limits the depth of
recursion and use of the stack for matches that may cause excessive
recursion and stack overflow (which usually causes Varnish to
crash). The limit is not relevant to the JIT matcher, and is ignored
for JIT matching. By default, the recursion limit set for the PCRE2
library at build time applies, which can be determined from
``config_int(RECURSIONLIMIT)``.
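As a sketch of how match options are passed, the following sets an
offset limit for one match, and match and recursion limits for
another. The patterns, the numeric values and the ``X-*`` header names
are arbitrary illustrations, not recommendations::

  import pcre2;

  sub vcl_init {
    # use_offset_limit must be true at compile time, otherwise a
    # match with offset_limit fails.
    new admin = pcre2.regex("/admin\b", use_offset_limit=true);

    new foo = pcre2.regex("\bfoo=([^;,\s]+\b)");
  }

  sub vcl_recv {
    # The unanchored search may not advance beyond offset 8 in the
    # URL.
    if (admin.match(req.url, offset_limit=8)) {
      set req.http.X-Admin = "true";
    }

    # These limits apply to this match only. If a limit is reached,
    # the match fails and .match() returns false.
    if (foo.match(req.http.Cookie, match_limit=10000,
                  recursion_limit=1000)) {
      set req.http.X-Foo-Value = foo.backref(1);
    }
  }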
Substitution options
~~~~~~~~~~~~~~~~~~~~
The ``sub`` method and function use all of the match options (since
they run a match), and the following additional options. (The ``sub``
function also uses the compile options, since it compiles a pattern.)
``suball``
If true, then the substitution iterates over the subject string and
replaces every matching substring, making the substitution similar
to the native VCL ``regsuball`` function. By default, only the first
matching substring is replaced, making the substitution similar to
VCL's ``regsub`` function.
``sub_extended``
If true, then an extended syntax is enabled for the replacement
string. Details of the replacement syntax are documented for the
``.sub()`` method below.
``unknown_unset``
If true, then references to capturing groups in the replacement
string that do not appear in the pattern are treated as unset
groups. By default, unknown references cause the substitution to
fail. Use this option with care, because it causes misspelled group
names or numbers to be silently ignored.
``unset_empty``
If true, then unset capturing groups (including unknown groups when
``unknown_unset`` is also true) are replaced as empty strings. By
default, an attempt to insert an unset group causes the substitution
to fail.
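For example, a sketch that uses ``suball`` to collapse every run of
consecutive slashes in the URL into a single slash. The object name
and pattern are assumptions for illustration::

  import pcre2;

  sub vcl_init {
    # Two or more consecutive slashes.
    new slashes = pcre2.regex("/{2,}");
  }

  sub vcl_recv {
    # Without suball=true, only the first run of slashes would be
    # replaced.
    set req.url = slashes.sub(req.url, "/", suball=true);
  }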
JIT compilation and matching
----------------------------
PCRE2 supports just-in-time compilation for patterns, and a matcher to
go with it. JIT is a heavyweight optimization that may greatly speed
up matching, but requires extra processing at pattern compilation
time. The VMOD supports JIT if it was enabled for the PCRE2 library
when it was built, which can be determined from ``config_bool(JIT)``.
If JIT is available, then it is always applied to the compilation of
patterns in the ``regex`` object constructor. By default it is also
applied when patterns are compiled at runtime in the ``match()`` and
``sub()`` functions, unless the ``no_jit`` option is true.
For patterns compiled at runtime, it may be worth it to turn off JIT,
if the overhead for JIT-compiles outweighs the advantage of JIT
matching.
If JIT is not available, then PCRE2 always uses the interpretive
matcher.
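A minimal sketch of turning off JIT for a pattern that is compiled at
runtime follows. The ``X-Url-Pattern`` and ``X-Pattern-Matched`` header
names are hypothetical; whether disabling JIT pays off depends on the
pattern and the subject, and is best decided by measurement::

  import pcre2;

  sub vcl_recv {
    # The pattern is compiled on this call and then discarded. With
    # no_jit=true, it is neither JIT-compiled nor matched by the JIT
    # matcher.
    if (pcre2.match(req.http.X-Url-Pattern, req.url, no_jit=true)) {
      set req.http.X-Pattern-Matched = "true";
    }
  }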
Unicode
-------
The VMOD only links to the 8-bit version of PCRE2, and hence can
support UTF-8 if Unicode was enabled when the library was built. The
VMOD does not support UTF-16 or UTF-32. Thus the term "code unit", as
used for Unicode and in the PCRE2 documentation, always refers to one
byte.
In UTF mode, characters in patterns and the strings to be matched are
interpreted as UTF-8 code points, and hence may correspond to one to
four bytes. When UTF is not enabled, characters in patterns and
strings are represented by exactly one byte.
See `pcre2unicode(3)`_ for the details of PCRE2 Unicode support.
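As an illustration of UTF mode, this sketch matches a header value that
consists only of letters, including letters encoded as more than one
byte in UTF-8. The object name and the ``X-Name`` and ``X-Name-Valid``
headers are assumptions for illustration::

  import pcre2;

  sub vcl_init {
    # \p{L} matches any Unicode letter. If the PCRE2 library was built
    # without Unicode support, then the compile fails with utf=true,
    # and the VCL does not load.
    new letters = pcre2.regex("^\p{L}+$", utf=true);
  }

  sub vcl_recv {
    # The header value is checked for validity as UTF-8 by default,
    # since no_utf_check is not set.
    if (letters.match(req.http.X-Name)) {
      set req.http.X-Name-Valid = "true";
    }
  }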
CONTENTS
========
......@@ -56,8 +645,27 @@ regex
new OBJ = regex(STRING pattern, BOOL allow_empty_class=0, BOOL anchored=0, ENUM {ANYCRLF,UNICODE} bsr=0, BOOL alt_bsux=0, BOOL alt_circumflex=0, BOOL alt_verbnames=0, BOOL caseless=0, BOOL dollar_endonly=0, BOOL dotall=0, BOOL dupnames=0, BOOL extended=0, BOOL firstline=0, STRING locale=0, BOOL match_unset_backref=0, INT max_pattern_len=0, BOOL multiline=0, BOOL never_backslash_c=0, BOOL never_ucp=0, BOOL never_utf=0, ENUM {CR,LF,CRLF,ANYCRLF,ANY} newline=0, BOOL no_auto_capture=0, BOOL no_auto_possess=0, BOOL no_dotstar_anchor=0, BOOL no_start_optimize=0, BOOL no_utf_check=0, INT parens_nest_limit=0, BOOL ucp=0, BOOL ungreedy=0, BOOL use_offset_limit=0, BOOL utf=0)
# XXX options for dfa_match, jit fast path, start_offset
# XXX option to make saving the match ctx with PRIV_CALL optional
Create a ``regex`` object from ``pattern`` according to the given
compile options (or option defaults). If the pattern is invalid, then
the VCL will fail to load, and the VCC compiler will emit an error
message.
Examples::
sub vcl_init {
# Match this pattern against the Host header (hence
# case-insensitively), and capture part of the domain name.
new domain = pcre2.regex("^www\.([^.]+)\.com$", caseless=true);
# Match a max-age tag and capture the number.
new maxage = pcre2.regex("max-age\s*=\s*(\d+)");
# Group possible subdomains without capturing
new submatcher = pcre2.regex("^www\.(domain1|domain2)\.com$",
no_auto_capture=true, caseless=true);
}
.. _func_regex.match:
regex.match
......@@ -67,6 +675,26 @@ regex.match
BOOL regex.match(PRIV_CALL, PRIV_TASK, STRING subject, INT len=0, BOOL anchored=0, INT match_limit=0, INT offset_limit=0, BOOL notbol=0, BOOL noteol=0, BOOL notempty=0, BOOL notempty_atstart=0, BOOL no_jit=0, BOOL no_utf_check=0, INT recursion_limit=0)
Return ``true`` if the compiled regex matches the ``subject`` string,
as constrained by the given match options or option defaults.
The match may fail if any of the options are illegal for one of the
reasons given above, or if a limit such as the match or recursion
limit is reached. In that case, an error message is written to the
Varnish log using the ``VCL_Error`` tag, and the method returns
``false``.
If ``subject`` is undefined, for example if it is set from an unset
header variable, then it is assumed to be the empty string. This
follows VCL's handling of regex matching when the string to be matched
is unset.
Example::
if (domain.match(req.http.Host)) {
call do_on_match;
}
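Since an undefined subject is treated as the empty string, a pattern
that matches the empty string also matches an unset header. A small
sketch, with an object name and ``X-No-Cookie`` header chosen for
illustration::

  import pcre2;

  sub vcl_init {
    new empty = pcre2.regex("^$");
  }

  sub vcl_recv {
    # True when the Cookie header is empty, and also when it is unset.
    if (empty.match(req.http.Cookie)) {
      set req.http.X-No-Cookie = "true";
    }
  }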
.. _func_regex.backref:
regex.backref
......@@ -76,6 +704,64 @@ regex.backref
STRING regex.backref(INT ref, STRING fallback="**BACKREF METHOD FAILED**")
Returns the `nth` captured subexpression from the most recent
successful call of the ``.match()`` method for this object in the same
client or backend context, or a fallback string in case the capture
fails. Backref 0 indicates the entire matched string. Thus this
function behaves like the ``\n`` in the native VCL functions
``regsub`` and ``regsuball``, and the ``$1``, ``$2`` ... variables in
Perl. Unlike the regsubs, which limit the backref number to 0 through
9, ``backref`` permits any number that identifies a capturing group in
the pattern.
Since Varnish client and backend operations run in different threads,
``.backref()`` can only refer back to a ``.match()`` call in the same
thread. Thus a ``.backref()`` call in any of the ``vcl_backend_*``
subroutines -- the backend context -- refers back to a previous
``.match()`` in any of those same subroutines; and a call in any of
the other VCL subroutines -- the client context -- refers back to a
``.match()`` in the same client context.
After unsuccessful matches, the ``fallback`` string is returned for
any call to ``.backref()``. The default value of ``fallback`` is
``"**BACKREF METHOD FAILED**"``. ``.backref()`` always fails after a
failed match, even if ``.match()`` had been called successfully before
the failure.
``.backref()`` may also return ``fallback`` after a successful match,
if no captured group in the matching string corresponds to the backref
number. For example, when the pattern ``(a|(b))c`` matches the string
``ac``, there is no backref 2, since nothing matches ``b`` in the
string.
The VCL infix operators ``~`` and ``!~`` do not affect this method,
nor do the functions ``regsub`` or ``regsuball``. Nor is it affected
by the matches performed by any other method or function in this VMOD
(the ``match()`` function or the ``sub`` method or function).
``.backref()`` fails, returning ``fallback`` and writing an error
message to the Varnish log with the ``VCL_Error`` tag, under the
following conditions (even if a previous match was successful and a
substring could have been captured):
* Any of the match options are illegal (for example, if one of the
numeric limits was set to less than 0).
* The ``fallback`` string is undefined.
* ``ref`` (the backref number) is out of range -- if it is less than 0
or larger than the highest number for a capturing group in the
pattern.
* ``.match()`` was never called for this object in the task scope
prior to calling ``.backref()``.
Example::
if (domain.match(req.http.Host)) {
set req.http.X-Domain = domain.backref(1);
}
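The ``fallback`` parameter can supply a value for a group that did not
take part in a successful match. In this sketch, the object name, the
pattern and the ``X-Host`` and ``X-Port`` headers are assumptions for
illustration, and the fallback is expected to be returned for the
unset group as described above::

  import pcre2;

  sub vcl_init {
    # The port is optional, and is captured by the last group.
    new hostport = pcre2.regex("^([^:]+)(?::(\d+))?$");
  }

  sub vcl_recv {
    if (hostport.match(req.http.Host)) {
      set req.http.X-Host = hostport.backref(1);
      # If the Host header has no port, group 2 is not set, and the
      # fallback string is returned.
      set req.http.X-Port = hostport.backref(2, "80");
    }
  }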
.. _func_regex.namedref:
regex.namedref
......@@ -85,6 +771,49 @@ regex.namedref
STRING regex.namedref(STRING name, STRING fallback="**NAMEDREF METHOD FAILED**")
Returns the captured subexpression designated by ``name`` from the
most recent successful call to ``.match()`` in the current context
(client or backend), or ``fallback`` in case of failure. See
`pcre2pattern(3)`_ for details about the use of named subpatterns in
PCRE2 regexen.
Note that a named capturing group can also be referenced as a numbered
group -- the named groups are numbered exactly as if the names were
not present. So an expression returned by ``.namedref()`` will also be
returned by ``.backref()`` with the appropriate number.
``fallback`` is returned when ``.namedref()`` is called after an
unsuccessful match. The default fallback is ``"**NAMEDREF METHOD
FAILED**"``.
Like ``.backref()``, ``.namedref()`` is not affected by native VCL
regex operations, nor by any other matches performed by methods or
functions of the VMOD, except for a prior ``.match()`` for the same
object.
``.namedref()`` fails, returning ``fallback`` and logging a
``VCL_Error`` message, if:
* The ``fallback`` string is undefined.
* ``name`` is undefined.
* There is no such named group.
* ``.match()`` was not called for this object.
Example::
sub vcl_init {
new domain = pcre2.regex("^www\.(?<domain>[^.]+)\.com$");
}
sub vcl_recv {
if (domain.match(req.http.Host)) {
set req.http.X-Domain = domain.namedref("domain");
}
}
.. _func_regex.sub:
regex.sub
......@@ -94,6 +823,113 @@ regex.sub
STRING regex.sub(PRIV_CALL, PRIV_TASK, STRING subject, STRING replacement, INT len=0, BOOL anchored=0, INT match_limit=0, INT offset_limit=0, BOOL notbol=0, BOOL noteol=0, BOOL notempty=0, BOOL notempty_atstart=0, BOOL no_jit=0, BOOL no_utf_check=0, INT recursion_limit=0, BOOL suball=0, BOOL sub_extended=0, BOOL unknown_unset=0, BOOL unset_empty=0)
If the pattern represented by this object matches ``subject``, then
return a string formed by replacing the part that was matched by
``replacement``. If the pattern does not match, then return the
``subject`` string unchanged. The match and substitution options affect
these operations as described above.
This method is similar to the native VCL ``regsub`` function, or
``regsuball`` when the ``suball`` option is true, but the syntax of
the replacement string is different. In the replacement string, these
sequences can be used to insert strings:
``$$``
Inserts a dollar character.
``$<n>`` or ``${<n>}``
Inserts the contents of group ``<n>`` captured during the match,
where ``<n>`` can be a number or a name. The number can be 0 to
include the entire matched string. Braces are only required if the
following character would be interpreted as part of the number or
name.
``$*MARK`` or ``${*MARK}``
Insert the name of the last ``(*MARK)`` encountered in the match.
For example, to rewrite URLs with prefixes of the form ``"/~<user>"``
so that their prefix is ``"/u/<user>"`` (and leave other URLs
unchanged)::
sub vcl_init {
new user = pcre2.regex("/~([^/]+)(.*)", anchored=true);
}
sub vcl_recv {
set req.url = user.sub(req.url, "/u/${1}${2}");
}
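
The ``suball`` option can be sketched in the same way. The example
below is an illustration only (the object name ``slashes`` is
hypothetical): it replaces every run of two or more slashes in the
URL with a single slash, as ``regsuball`` would in native VCL::

  import pcre2;

  sub vcl_init {
    # Two or more consecutive slashes.
    new slashes = pcre2.regex("/{2,}");
  }

  sub vcl_recv {
    # With suball=true, every match in the subject is replaced, not
    # only the first one.
    set req.url = slashes.sub(req.url, "/", suball=true);
  }
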
When the ``sub_extended`` option is false, only the dollar character
is special in the replacement string. When ``sub_extended`` is true,
the replacement syntax also has these capabilities:
* Backslashes in the replacement string are interpreted as escapes,
and special backslash sequences are interpreted as for PCRE2
patterns. For example, ``\n`` denotes newline, and ``\x{ddd}``,
where each ``d`` is a digit, specifies a character code. A backslash
followed by a non-alphanumeric character quotes the character, and
``\Q`` and ``\E`` can be used to quote a longer sequence.
* Four additional escape sequences can be used to force the case of
inserted letters:
* ``\U`` forces upper case for all of the following text until
``\E``, or to the end of the string if there is no ``\E``.
* ``\L`` through ``\E`` or end of string forces lower case.
* ``\u`` and ``\l`` force the next character, if it is a letter, to
upper and lower case, respectively.
Case forcing applies to all inserted characters, including those from
captured groups and in sequences quoted by ``\Q`` through ``\E``.
Sequences ending in ``\E`` do not nest. So for example,
``"\Uaa\LBB\Ecc\E"`` results in ``"AAbbcc"``, and the final ``\E`` has
no effect.
* The "dollar" replacement expressions have an additional capability
inspired by Bash to handle unset capturing groups:
``${<n>:-<string>}``
As with ``${<n>}``, ``<n>`` can be a number or name. If group
``<n>`` is set, then its contents are inserted, otherwise
``<string>`` is expanded and inserted. ``<string>`` may, in turn,
include elements of the replacement syntax that are interpreted
accordingly.
``${<n>:+<string1>:<string2>}``
If group ``<n>`` is set, insert the result of expanding
``<string1>``, otherwise insert the result of expanding
``<string2>``.
Colons and escapes in the replacement strings can be escaped with
backslashes.
For example, to rewrite Host headers of the form
``www.<sub1>.<sub2>.<tld>`` to ``<sub2>.<tld>``, and of the form
``www.<sub>.<tld>`` to ``<sub>.<tld>``, while also normalizing the header
to lower-case, and leaving other Host headers unchanged::
sub vcl_init {
new hostsub = pcre2.regex(extended=true, pattern={"
^www\. # www. prefix
([^.]+) # group 1, "<sub1>"
(?: # non-capturing parentheses
\.([^.]+) # dot, then group 2, "<sub2>"
)? # 0 or 1 of group 2
\.([^.]+)$ # dot, then group 3, "<tld>"
"});
}
sub vcl_recv {
set req.http.Host = hostsub.sub(req.http.Host, sub_extended=true,
replacement="\L${2:+$2:$1}.$3");
}
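
The ``${<n>:-<string>}`` form can be sketched similarly. In this
illustration (the object name ``lang``, the header ``X-Lang`` and the
set of language codes are assumptions), an optional language prefix
in the URL is extracted, and ``en`` is used when the group for the
prefix is unset::

  import pcre2;

  sub vcl_init {
    # Group 1 is only set if the URL starts with a language prefix.
    new lang = pcre2.regex("^/(?:(en|de|fr)/)?.*$");
  }

  sub vcl_recv {
    # The whole URL is matched, so the result is just the expansion of
    # the replacement: group 1 if it is set, otherwise "en".
    set req.http.X-Lang = lang.sub(req.url, "${1:-en}", sub_extended=true);
  }
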
``.sub()`` fails, returning NULL while logging a ``VCL_Error`` message,
if ``replacement`` is undefined.
.. _func_regex.info_bool:
regex.info_bool
......@@ -103,6 +939,108 @@ regex.info_bool
BOOL regex.info_bool(ENUM {ALLOW_EMPTY_CLASS,ANCHORED,ALT_BSUX,ALT_CIRCUMFLEX,ALT_VERBNAMES,CASELESS,DOLLAR_ENDONLY,DOTALL,DUPNAMES,EXTENDED,FIRSTLINE,MATCH_UNSET_BACKREF,MULTILINE,NEVER_BACKSLASH_C,NEVER_UCP,NEVER_UTF,NO_AUTO_CAPTURE,NO_AUTO_POSSESS,NO_DOTSTAR_ANCHOR,NO_START_OPTIMIZE,NO_UTF_CHECK,UCP,UNGREEDY,USE_OFFSET_LIMIT,UTF,HAS_FIRSTCODEUNIT,MATCH_ATSTART,HAS_LASTCODEUNIT,HAS_BACKSLASHC,HAS_CRORLF,JCHANGED,MATCH_EMPTY}, BOOL compiled=1)
Return true or false about a property of the regex that the object
represents. This method and the other ``.info_*`` methods may be
helpful for debugging and optimizing regular expression matching, for
example by determining whether PCRE2 could enable certain
optimizations for the pattern.
The ENUM determines which property is to be inspected. If the ENUM is any
one of::
ALLOW_EMPTY_CLASS, ANCHORED, ALT_BSUX, ALT_CIRCUMFLEX,
ALT_VERBNAMES, CASELESS, DOLLAR_ENDONLY, DOTALL, DUPNAMES, EXTENDED,
FIRSTLINE, MATCH_UNSET_BACKREF, MULTILINE, NEVER_BACKSLASH_C,
NEVER_UCP, NEVER_UTF, NO_AUTO_CAPTURE, NO_AUTO_POSSESS,
NO_DOTSTAR_ANCHOR, NO_START_OPTIMIZE, NO_UTF_CHECK, UCP, UNGREEDY,
USE_OFFSET_LIMIT, UTF
then the return value of ``info_bool()`` indicates whether the
corresponding compile option is true for the pattern. If ``compiled``
is true, then the return indicates whether the option was set to true
after the pattern was compiled, even if it was specified differently
(or left to the default) in the object constructor. If ``compiled``
is false, then the method returns the value of the option as it was
provided in the constructor.
For example, if the compile option ``anchored`` was set to false in
the constructor (or left to the default), PCRE2 may nevertheless
determine that the pattern is anchored if certain conditions are
satisfied (which are described in detail in `pcre2api(3)`_). In that
case, ``info_bool()`` will return true if ``compiled`` is true, and
false if ``compiled`` is false.
``compiled`` is true by default, and is ignored for the other ENUM
values.
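
The effect of ``compiled`` might be observed with a sketch like the
following. The object name ``api`` is hypothetical, and whether PCRE2
actually treats this pattern as anchored depends on the conditions
described in `pcre2api(3)`_::

  import std;
  import pcre2;

  sub vcl_init {
    # The anchored option is left at its default (false).
    new api = pcre2.regex("^/api/");
  }

  sub vcl_recv {
    # The state of the option after compilation.
    if (api.info_bool(ANCHORED)) {
      std.log("PCRE2 treats the pattern as anchored");
    }
    # The value of the option as provided to the constructor.
    if (!api.info_bool(ANCHORED, compiled=false)) {
      std.log("anchored was not set in the constructor");
    }
  }
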
The other ENUMs are interpreted as follows:
``HAS_FIRSTCODEUNIT``
If the pattern is unanchored, PCRE2 may determine that there is a
unique code unit (a byte) that must appear at the start of the
matching part of a string. For example, the part of a string that
matches ``(cat|cow|coyote)`` must begin with a
``c``. ``info_bool(HAS_FIRSTCODEUNIT)`` returns true if there is
such a code unit, and false if the pattern is anchored or if no
unique first code unit could be determined. If there is such a first
code unit, it is returned by ``info_str(FIRSTCODEUNIT)``. Note that
in non-UTF mode, the first code unit is the same as the first
character, but for UTF-8 patterns, it may be the first byte in a
multibyte character.
``MATCH_ATSTART``
If the pattern is unanchored and no unique first code unit in the
matching part of the string is known, PCRE2 may determine that the
pattern is constrained to match at the start of the subject string,
or following a newline in the subject. In that case,
``info_bool(MATCH_ATSTART)`` returns true; it returns false if the
pattern is anchored, if a unique first code unit could be found, or
if the pattern could not be determined to match at the start.
``HAS_LASTCODEUNIT``
Under certain circumstances, PCRE2 may determine a rightmost literal
code unit that must exist in a matching string, other than at the
start. This is not necessarily the last byte in the matching part of
a string, but rather the last literal code unit known to be
required. For example, the ``b`` is recorded for this purpose for
the pattern ``ab\d+``, although the ``b`` must be followed by
digits. If there is such a last code unit,
``info_bool(HAS_LASTCODEUNIT)`` returns true, and that value can be
retrieved from ``info_str(LASTCODEUNIT)``. For anchored patterns,
PCRE2 records a possible last literal code unit only if a part of
the pattern that comes before it has variable length. For example,
``z`` is recorded for ``^a\d+z\d+`` (because one or more digits must
come before it), but none is recorded for ``^a\dz\d`` (because
matching strings have a fixed length). As with the first code unit,
the last code unit may be a byte in a multibyte UTF-8 character, if
UTF is enabled for the pattern.
``HAS_BACKSLASHC``
Return true if and only if ``\C`` appears in the pattern.
``HAS_CRORLF``
Return true if and only if the pattern contains explicit matches for
CR or LF characters. These can be literal carriage returns or
linefeeds in the pattern, or the escape sequences ``\r`` or ``\n``.
``JCHANGED``
Return true if and only if the pattern contains ``(?J)`` or ``(?-J)``
to set or unset the ``dupnames`` option locally.
``MATCH_EMPTY``
Return true if and only if PCRE2 determines that the pattern might
match the empty string. For certain complex patterns (with recursive
subroutines), it may not be possible to determine; in that case,
PCRE2 cautiously returns true.
Example::
# To determine if the FIRSTCODEUNIT optimization could be applied.
if (myregex.info_bool(HAS_FIRSTCODEUNIT)) {
std.log("First matching char in the pattern = "
+ myregex.info_str(FIRSTCODEUNIT));
}
.. _func_regex.info_int:
regex.info_int
......@@ -112,6 +1050,65 @@ regex.info_int
INT regex.info_int(ENUM {BACKREFMAX,CAPTURECOUNT,JITSIZE,MATCHLIMIT,MAXLOOKBEHIND,MINLENGTH,RECURSIONLIMIT,SIZE})
Return an integer that describes a property of the pattern that the
object represents, as determined by the ENUM.
``BACKREFMAX``
Return the highest back reference within the pattern. Remember that
named groups also acquire group numbers, and thus count towards the
highest backref. A conditional subpattern such as ``(?(3)a|b)``,
which checks if a capturing group is set, also counts as a
backref. If there are no backrefs, return 0.
``CAPTURECOUNT``
Return the highest capturing group number in the pattern. If the
``(?|`` construct (which allows duplicate group numbers, see
`pcre2pattern(3)`_) is not used in the pattern, then the value
returned is also the total number of capturing groups.
``JITSIZE``
Return the size of JIT-compiled code for the pattern. Returns 0 if
the pattern was not JIT-compiled.
``MATCHLIMIT``
If the pattern contains the construct ``(*LIMIT_MATCH=nnnn)`` to set
the match limit (see the match option ``match_limit`` above), then
return the limit that it sets. Returns -1 if no such value has been
set.
``MAXLOOKBEHIND``
Return the number of characters in the longest lookbehind assertion
in the pattern. Returns 0 if there are no lookbehinds.
``MINLENGTH``
If PCRE2 has determined that there is a lower bound for the length
of a string that may match the pattern, then return that
value. Returns 0 if no lower bound is known. This is not necessarily
the same as the shortest string that may possibly match; but any
string that does match must be at least that long.
``RECURSIONLIMIT``
If the pattern contains the construct ``(*LIMIT_RECURSION=nnnn)``
(see the match option ``recursion_limit`` above), then return the
value that was set. Returns -1 if no such value has been set.
``SIZE``
Return the size of the compiled pattern, as used for the
interpretive matcher, in bytes. This is independent of the value
returned by ``info_int(JITSIZE)``.
Example::
# To determine if a lower bound on the length of matching strings
# could be found.
if (myregex.info_int(MINLENGTH) != 0) {
std.log("Lower bound on matching string length = "
+ myregex.info_int(MINLENGTH));
}
else {
std.log("No lower bound for matching string lengths found");
}
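
In the same spirit, ``JITSIZE`` can be used to check whether a
pattern was JIT-compiled, since 0 is returned when it was not. A
short sketch, assuming an object ``myregex`` as in the example
above::

  # To determine whether the pattern was JIT-compiled.
  if (myregex.info_int(JITSIZE) == 0) {
    std.log("Pattern was not JIT-compiled");
  }
  else {
    std.log("Size of JIT-compiled code = " + myregex.info_int(JITSIZE));
  }
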
.. _func_regex.info_str:
regex.info_str
......@@ -121,6 +1118,64 @@ regex.info_str
STRING regex.info_str(ENUM {BSR,FIRSTCODEUNIT,FIRSTCODEUNITS,LASTCODEUNIT,NEWLINE}, STRING sep=" ")
Return a string that describes a property of the pattern represented
by the object, as determined by the ENUM. The ``sep`` parameter is
only relevant when the ENUM ``FIRSTCODEUNITS`` is used, as described
below.
``BSR``
Return ``"UNICODE"``, meaning that ``\R`` in the pattern matches any
Unicode line ending sequence, or ``"ANYCRLF"``, meaning that it
matches only CR, LF or CRLF.
``FIRSTCODEUNIT``
If PCRE2 determines that there is a unique first code unit that must
begin the matching part of a string (as described above for
``info_bool(HAS_FIRSTCODEUNIT)``), then return that code unit in a
string. Returns the empty string if no such code unit was
determined; this is also the case if the pattern is anchored. Recall
that a code unit corresponds to a character in non-UTF mode, but may
be a byte in a multibyte character when UTF-8 is enabled. The code
unit is not escaped in the return string.
``FIRSTCODEUNITS``
(Note the difference between ``FIRSTCODEUNIT``, singular, and
``FIRSTCODEUNITS``, plural.) For an unanchored pattern, if PCRE2
cannot determine a unique code unit that must appear at the start of
the matching part of a string, it may be able to determine a set of
such code units. For example, if the pattern starts with ``[abc]``,
then the matching part must begin with ``a``, ``b`` or ``c``. In
that case, ``info_str(FIRSTCODEUNITS)`` returns those code units in
a string, separated by the string given as ``sep``. The default
value of ``sep`` is ``" "`` (the string containing one space). If
the pattern is anchored, or if a unique first code unit could be
found, or if no set of first code units could be found, then return
the empty string.
``LASTCODEUNIT``
If PCRE2 has recorded a rightmost literal code unit that must exist
in a matching string, as described for ``info_bool(HAS_LASTCODEUNIT)``
above, then return that code unit in a string. Returns the empty
string if no such code unit was recorded.
``NEWLINE``
Return a string describing the default sequence recognized as a
"newline" for the pattern:
* ``"CR"`` (carriage return)
* ``"LF"`` (linefeed)
* ``"CRLF"`` (CR followed by LF)
* ``"ANYCRLF"`` (CR, LF or CRLF)
* ``"UNICODE"`` (any Unicode line-ending sequence)
Example::
# Determine if a set of first matching characters could be found.
std.log("First matching chars: " + myregex.info_str(FIRSTCODEUNITS));
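
A variation with a different separator, again assuming an object
``myregex``; here the code units in the returned string are separated
by a comma and a space instead of a single space::

  std.log("First matching chars: "
          + myregex.info_str(FIRSTCODEUNITS, sep=", "));
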
Regex functional interface
--------------------------
.. _func_match:
match
......@@ -130,6 +1185,31 @@ match
BOOL match(PRIV_CALL, PRIV_TASK, STRING pattern, STRING subject, BOOL allow_empty_class=0, BOOL anchored=0, ENUM {ANYCRLF,UNICODE} bsr=0, BOOL alt_bsux=0, BOOL alt_circumflex=0, BOOL alt_verbnames=0, BOOL caseless=0, BOOL dollar_endonly=0, BOOL dotall=0, BOOL dupnames=0, BOOL extended=0, BOOL firstline=0, STRING locale=0, BOOL match_unset_backref=0, INT max_pattern_len=0, BOOL multiline=0, BOOL never_backslash_c=0, BOOL never_ucp=0, BOOL never_utf=0, ENUM {CR,LF,CRLF,ANYCRLF,ANY} newline=0, BOOL no_auto_capture=0, BOOL no_auto_possess=0, BOOL no_dotstar_anchor=0, BOOL no_start_optimize=0, BOOL no_utf_check=0, INT parens_nest_limit=0, BOOL ucp=0, BOOL ungreedy=0, BOOL use_offset_limit=0, BOOL utf=0, INT len=0, INT match_limit=0, INT offset_limit=0, BOOL notbol=0, BOOL noteol=0, BOOL notempty=0, BOOL notempty_atstart=0, BOOL no_jit=0, INT recursion_limit=0)
Compile the ``pattern`` and return true if it matches
``subject``. Compilation and matching are subject to the given
options, or default options. The compiled pattern is discarded after
use, and ``pattern`` is compiled on every invocation.
The call fails, logging a ``VCL_Error`` message and returning false,
if:
* ``pattern`` is undefined.
* The compile fails (for example due to a syntax error).
* Any compile or match option is illegal as described above.
As with the ``.match()`` method, if ``subject`` is undefined, then it
is assumed to be the empty string.
Example::
# Match a request header against a pattern provided in a response
# header.
if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject)) {
call do_on_match;
}
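
Compile and match options can be given as named parameters after the
pattern and subject. In this sketch the header names are taken from
the example above, ``X-Matched`` is hypothetical, and the value for
``match_limit`` is arbitrary and not a recommendation::

  sub vcl_deliver {
    # caseless is a compile option, match_limit is a match option.
    if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject,
                    caseless=true, match_limit=10000)) {
      set resp.http.X-Matched = "true";
    }
  }
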
.. _func_backref:
backref
......@@ -139,6 +1219,33 @@ backref
STRING backref(PRIV_TASK, INT ref, STRING fallback="**BACKREF FUNCTION FAILED**")
Return the `nth` captured subexpression from the most recent
successful call of the ``match()`` function in the current client or
backend context, or a fallback string if the capture fails. The
default ``fallback`` is ``"**BACKREF FUNCTION FAILED**"``.
As with the ``regex.backref()`` method, ``fallback`` is returned
after any failed invocation of the ``match()`` function, or if there
is no captured group corresponding to the backref number. The function
is not affected by native VCL regex operations, or any other method or
function of the VMOD except for the ``match()`` function.
The function fails, returning ``fallback`` and logging a ``VCL_Error``
message, under the same conditions as the corresponding method:
* ``fallback`` is undefined.
* ``ref`` is out of range.
* The ``match()`` function was never called in this context.
* The pattern failed to compile for the previous ``match()`` call.
Example::
# Match against a pattern provided in a response header, and capture
# subexpression 1.
if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject)) {
set resp.http.X-Group-1 = pcre2.backref(1);
}
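
When the pattern is only known at runtime, the number of groups it
captures is unknown as well, so an explicit ``fallback`` may be
helpful. A sketch under that assumption (the header ``X-Group-2`` and
the fallback value ``"none"`` are chosen for illustration); note that
an out-of-range ``ref`` also logs a ``VCL_Error`` message, as
described above::

  sub vcl_deliver {
    if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject)) {
      # "none" is returned if the pattern has no group 2.
      set resp.http.X-Group-2 = pcre2.backref(2, fallback="none");
    }
  }
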
.. _func_namedref:
namedref
......@@ -148,6 +1255,29 @@ namedref
STRING namedref(PRIV_TASK, STRING name, STRING fallback="**NAMEDREF FUNCTION FAILED**")
Return the captured subexpression designated by ``name`` from the most
recent successful call of the ``match()`` function in the current
context, or ``fallback`` in case of failure. The default fallback is
``"**NAMEDREF FUNCTION FAILED**"``.
The function returns ``fallback`` when the previous invocation of the
``match()`` function failed, and is only affected by use of the
``match()`` function. The function fails, returning ``fallback`` and
logging a ``VCL_Error`` message, under the same conditions as the
corresponding method:
* ``fallback`` is undefined.
* ``name`` is undefined or the empty string.
* There is no such named group.
* ``match()`` was not called in this context.
* The pattern failed to compile for the previous ``match()`` call.
Example::
if (pcre2.match(resp.http.X-Pattern-With-Names, req.http.X-Subject)) {
set resp.http.X-Group-Foo = pcre2.namedref("foo");
}
.. _func_sub:
sub
......@@ -157,6 +1287,34 @@ sub
STRING sub(PRIV_CALL, PRIV_TASK, STRING pattern, STRING subject, STRING replacement, BOOL allow_empty_class=0, BOOL anchored=0, ENUM {ANYCRLF,UNICODE} bsr=0, BOOL alt_bsux=0, BOOL alt_circumflex=0, BOOL alt_verbnames=0, BOOL caseless=0, BOOL dollar_endonly=0, BOOL dotall=0, BOOL dupnames=0, BOOL extended=0, BOOL firstline=0, STRING locale=0, BOOL match_unset_backref=0, INT max_pattern_len=0, BOOL multiline=0, BOOL never_backslash_c=0, BOOL never_ucp=0, BOOL never_utf=0, ENUM {CR,LF,CRLF,ANYCRLF,ANY} newline=0, BOOL no_auto_capture=0, BOOL no_auto_possess=0, BOOL no_dotstar_anchor=0, BOOL no_start_optimize=0, BOOL no_utf_check=0, INT parens_nest_limit=0, BOOL ucp=0, BOOL ungreedy=0, BOOL use_offset_limit=0, BOOL utf=0, INT len=0, INT match_limit=0, INT offset_limit=0, BOOL notbol=0, BOOL noteol=0, BOOL notempty=0, BOOL notempty_atstart=0, BOOL no_jit=0, INT recursion_limit=0, BOOL suball=0, BOOL sub_extended=0, BOOL unknown_unset=0, BOOL unset_empty=0)
Compile ``pattern``, and if it matches ``subject``, then return a string
formed by replacing the part that matched by ``replacement``. If the
pattern does not match, return ``subject`` unchanged. The compile, match
and substitution options affect all of these operations, as described
above.
The syntax of the ``replacement`` string, as modified if the
``sub_extended`` option is true, is the same as documented above for
the ``.sub()`` method.
``sub()`` fails, returning NULL and logging a ``VCL_Error`` message,
if:
* Either of ``pattern`` or ``replacement`` is undefined.
* ``pattern`` cannot be compiled.
Example::
# If the beresp header X-Sub-Letters contains "b+", and Host contains
# "www.yabba.dabba.doo.com", then set X-Yada to
# "www.yada.dabba.doo.com".
set beresp.http.X-Yada = pcre2.sub(beresp.http.X-Sub-Letters,
bereq.http.Host, "d");
Library configuration
---------------------
.. _func_config_bool:
config_bool
......@@ -166,6 +1324,35 @@ config_bool
BOOL config_bool(ENUM {JIT,STACKRECURSE,UNICODE})
Return true or false about a property of the PCRE2 library to which
the VMOD is linked, identified by the ENUM. The ``config_*`` functions
make it possible to discover features of the library that were chosen
when it was built.
``JIT``
Return true if the library supports just-in-time compilation and
matching.
``STACKRECURSE``
Return true if internal recursion for the PCRE2 matcher uses the
system stack to maintain its state, which is the usual way the
library is built. If false is returned, PCRE2 uses blocks of data on
the heap rather than recursive function calls.
``UNICODE``
Return true if Unicode support is available. If so, then the compile
option ``utf`` can be used to define a pattern and the strings
against which it is matched as UTF-8 strings.
Example::
if (pcre2.config_bool(JIT)) {
std.log("JIT supported for PCRE2");
}
else {
std.log("JIT not supported for PCRE2");
}
.. _func_config_str:
config_str
......@@ -175,6 +1362,43 @@ config_str
STRING config_str(ENUM {BSR,JITTARGET,NEWLINE,UNICODE_VERSION,VERSION})
Return a string describing a property of the PCRE2 library.
``BSR``
Return a string indicating what the ``\R`` escape sequence matches
by default: ``UNICODE`` for Unicode line-ending sequences, or
``ANYCRLF`` for only CR, LF and CRLF. This is the default that holds
if no value is given for the compile option ``bsr``.
``JITTARGET``
Return a string identifying the architecture for which the JIT
compiler is configured. If JIT is not enabled, the returned string
contains the phrase ``"JIT not supported"``.
``NEWLINE``
Return a string identifying the character sequence that is recognized
by default as a newline:
* ``"CR"`` (carriage return)
* ``"LF"`` (linefeed)
* ``"CRLF"`` (CR followed by LF)
* ``"ANY"`` (any Unicode line ending)
* ``"ANYCRLF"`` (any of CR, LF or CRLF)
This is the default if no value is given for the compile option
``newline``.
``UNICODE_VERSION``
If Unicode is supported by the library, return the Unicode version
string. If not, return ``"Unicode not supported"``.
``VERSION``
Return the PCRE2 version string.
Example::
std.log("Linked to PCRE2 version " + pcre2.config_str(VERSION));
.. _func_config_int:
config_int
......@@ -184,6 +1408,32 @@ config_int
INT config_int(ENUM {LINKSIZE,MATCHLIMIT,PARENSLIMIT,RECURSIONLIMIT})
Return an integer describing a property of the PCRE2 library.
``LINKSIZE``
Return the number of bytes used for internal linkage (offsets) in
compiled regular expressions. This determines the size of the
largest possible pattern; the default link size of 2 allows for
patterns of up to 64K bytes.
``MATCHLIMIT``
Return the default value for the ``match_limit`` match option,
which limits the effort of the matcher when no match is found.
``PARENSLIMIT``
Return the default value of the ``parens_nest_limit`` compile
option, which limits the depth of parenthesis nesting in patterns,
and hence the use of the stack during compilation.
``RECURSIONLIMIT``
Return the default value of the ``recursion_limit`` match option,
which limits the depth of recursion, and hence stack usage, for the
interpretive (non-JIT) matcher.
Example::
std.log("Default PCRE2 match limit = " + pcre2.config_int(MATCHLIMIT));
.. _func_version:
version
......@@ -197,29 +1447,106 @@ Returns the version string for this VMOD.
Example::
std.log("Using VMOD pcre2 version " + pcre2.version());
REQUIREMENTS
============
This VMOD has been tested with Varnish version 5.1.2 and PCRE2 version
10.23.
INSTALLATION
============
See `INSTALL.rst <INSTALL.rst>`_ in the source repository.
LIMITATIONS
===========
The VMOD allocates Varnish workspace for a variety of purposes:
* The string returned by the ``sub`` method and function.
* Buffers for temporary data structures needed by the PCRE2 library,
for example to save information about a match for use by the
``backref`` and ``namedref`` methods and functions.
* A copy of the subject string for the ``match`` method and function,
if it is not already in workspace, so that it can be safely accessed
by ``backref`` and ``namedref``.
* Return strings for some uses of ``info_str`` and ``config_str``.
* Temporary buffers for error message strings from the PCRE2 library.
If VMOD operations fail with the "out of space" error message in the
Varnish log (with the ``VCL_Error`` tag), increase the varnishd runtime
parameters ``workspace_client`` and/or ``workspace_backend``.
The PCRE2 interpretive and JIT matchers are backtracking matchers, and
the interpretive matcher is recursive, using part of the stack on each
recursive call (in the default library configuration). For patterns
with large search spaces, this can lead to slow matches, high CPU
usage, and stack overflow due to deep recursion, which typically
causes Varnish to segfault. This has occasionally been the subject of
issues reported to the Varnish project.
For most common uses of regular expressions in VCL, PCRE2 is very fast
and has minimal resource consumption. This depends strongly on how the
regex is written -- a well-crafted pattern helps the matcher limit
backtracking, fail early on non-matches, and make use of some of the
optimizations that PCRE2 can apply. Some of the compile and match
options also help to optimize the match operation. Which of these
measures is possible depends, of course, on what you want the regex to
do.
Writing optimized regexen is a very broad subject, beyond the scope of
this manual. There is some advice in `pcre2perform(3)`_, and in many
other sources.
If your use case requires patterns and subject strings that can lead
to very large search spaces, consider using some of the options
available in the VMOD that limit excessive effort for unsuccessful
matches. In particular, consider lowering the match options
``match_limit`` and ``recursion_limit``. You can also use
``offset_limit`` to set a maximum length to search for a match in the
subject string (for which you will have to set the compile option
``use_offset_limit``). These may cause the matcher to halt before it
has exhausted all possibilities for a match (but it appears to be
common that, if the matcher has to search for a long time, then there
was never any match to be found).
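
A sketch of how these limits might be combined is shown below. The
object name ``risky``, the pattern, the header ``X-Risky-Match`` and
all of the numeric values are chosen only for illustration and are
not recommendations; suitable limits depend on the patterns and
subjects in your deployment::

  import pcre2;

  sub vcl_init {
    # use_offset_limit is required at compile time so that
    # offset_limit can be used in matches.
    new risky = pcre2.regex("(a+)+b", use_offset_limit=true);
  }

  sub vcl_recv {
    # Bound the effort for this match only; a match attempt that
    # exceeds a limit does not succeed.
    if (risky.match(req.url, match_limit=10000, recursion_limit=1000,
                    offset_limit=256)) {
      set req.http.X-Risky-Match = "true";
    }
  }
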
If you encounter stack overflow, it may help to increase the stack
size (by changing ``limits.conf`` or calling ``ulimit -s`` before
starting Varnish). Since Varnish 4.1, you can also increase the
varnishd parameter ``thread_pool_stack``. Bear in mind that this
increases the total RAM usage of Varnish.
ACKNOWLEDGEMENTS
================
A tip of the hat to Philip Hazel, who released the first version of
PCRE twenty years before this VMOD was developed.
A few sentences in this manual are identical to or very closely track
phrasings in the PCRE2 documentation, if there was simply no better
way to say what needs to be said.
SEE ALSO
========
* varnishd(1)
* vcl(7)
* pcre2(3)
* PCRE web site: http://www.pcre.org/
* VMOD source repository: https://code.uplex.de/uplex-varnish/libvmod-pcre2
.. _pcre2(3): http://www.pcre.org/current/doc/html/pcre2.html
.. _pcre2pattern(3): http://www.pcre.org/current/doc/html/pcre2pattern.html
.. _pcre2syntax(3): http://www.pcre.org/current/doc/html/pcre2syntax.html
.. _pcre2api(3): http://www.pcre.org/current/doc/html/pcre2api.html
.. _pcre2unicode(3): http://www.pcre.org/current/doc/html/pcre2unicode.html
.. _pcre2perform(3): http://www.pcre.org/current/doc/html/pcre2perform.html
COPYRIGHT
=========
......
......@@ -9,13 +9,602 @@ $Module pcre2 3 access the pcre2 regular expression library
::
# object interface
new OBJECT = pcre2.regex(STRING pattern [, compile options])
BOOL <OBJ>.match(STRING subject [, match options])
STRING <OBJ>.backref(INT ref)
STRING <OBJ>.namedref(STRING name)
STRING <OBJ>.sub(STRING subject, STRING replacement [, match options]
[, substitution options])
BOOL <OBJ>.info_bool(ENUM)
INT <OBJ>.info_int(ENUM)
STRING <OBJ>.info_str(ENUM)
# function interface
BOOL pcre2.match(STRING pattern, STRING subject [, compile options]
[, match options])
STRING pcre2.backref(INT ref)
STRING pcre2.namedref(STRING name)
STRING pcre2.sub(STRING pattern, STRING subject, STRING replacement
[, compile options] [, match options]
[, substitution options])
# library configuration
BOOL pcre2.config_bool(ENUM)
INT pcre2.config_int(ENUM)
STRING pcre2.config_str(ENUM)
DESCRIPTION
===========
This Varnish Module (VMOD) provides access to the PCRE2 regular
expression library. PCRE2 is the Perl-compatible regular expression
library with a revised API, the successor to the PCRE library that
implements native regexen in Varnish VCL. See `pcre2(3)`_ and the
manuals that it references for details about the PCRE2 library.
PCRE2, by itself, does not change regular expressions from the
perspective of the end user -- the syntax and semantics of patterns
and pattern matching remained largely the same at the time PCRE2 was
introduced. The new library is a refactoring of the internal API,
which is transparent to the user, and the VMOD endeavors to make use
of the new internal features advantageously for VCL.
Some of the differences between the VMOD and native VCL regexen are:
* The VMOD provides methods and functions to retrieve back references
after a match that are easier to use than the idiom with the
``regsub`` function that is necessary in native VCL. It also
provides the means to retrieve references to named capturing groups.
* The functional interface makes it possible to use patterns that are
not known until runtime.
* PCRE2 introduces a new native substitution function, similar to the
``regsub`` and ``regsuball`` functions in VCL, except that the
substitution syntax is different and provides more features.
* Parameters that limit the depth of recursion and backtracking in
match operations, which are set globally in Varnish, can be set for
individual matches in the VMOD.
* The VMOD can support matching against UTF-8 strings, if it is
running against a PCRE2 library that was built to support Unicode.
* The VMOD exposes considerably more functionality of the underlying
library. VCL provides a general-purpose regular expression facility
-- PCRE could be easily replaced as its regex engine. The VMOD is
meant to be specific to PCRE2, and makes a full range of its
features available in VCL.
* The VMOD provides methods and functions that allow you to inspect
properties of patterns and of the library. These are not likely to
be useful on the fast path of production deployments, and are not
optimized for that. But they may be useful during development to
debug and optimize regex matching.
Since the introduction of PCRE2, the original PCRE library is being
maintained for bugfixes, but development of new features and
optimizations is only being done for PCRE2. So the VMOD will make it
possible to take advantage of improvements in the library as they are
released.
Here are some simple usage examples::
# regex objects are created in vcl_init, and the regular expressions
# are compiled when VCL is loaded.
sub vcl_init {
# A regex to match the "foo" cookie, and capture its value.
new foo = pcre2.regex("\bfoo=([^;,\s]+\b)");
# A regex to match a URL beginning with the prefix "/bar/", and
# capture its suffix.
new bar = pcre2.regex("^/bar/(.+)");
}
sub vcl_recv {
# If the cookie header contains "foo", then assign its value
# to another header.
if (foo.match(req.http.Cookie)) {
set req.http.X-Foo-Value = foo.backref(1);
}
# If the URL begins with "/bar/", then replace the prefix with
# "/baz/quux/".
if (bar.match(req.url)) {
set req.url = "/baz/quux/" + bar.backref(1);
}
}
Object and functional interfaces
--------------------------------
The VMOD provides regular expression operations by way of the
``regex`` object interface and a functional interface. For ``regex``
objects, the pattern is compiled at VCL initialization time, and the
compiled pattern is re-used for each invocation of its
methods. Compilation failures (due to errors in the pattern) cause
failure at initialization time, and the VCL fails to load. The
``.backref()`` and ``.namedref()`` methods refer back to the last
invocation of the ``.match()`` method for the same object. The
``.sub()`` method also re-uses an object's compiled pattern.
The functional interface provides the same set of operations, but the
pattern is compiled at runtime on each invocation of the ``match()``
and ``sub()`` functions (and then discarded). Compilation failures are
reported as errors in the Varnish log. The ``backref()`` and
``namedref()`` functions refer back to the last invocation of the
``match()`` function, for any pattern.
Compiling a pattern at runtime on each invocation is considerably more
costly than re-using a compiled pattern. So for patterns that are
fixed and known at VCL initialization, the object interface should be
used. The functional interface should only be used for patterns whose
contents are not known until runtime.
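As a minimal sketch of the functional interface, assume a hypothetical
``X-Pattern`` request header that is set by a trusted component, so
that the pattern is only known at request time. The header names are
illustrative only and are not part of the VMOD::

    sub vcl_recv {
        # The pattern is compiled on every invocation and then discarded.
        if (pcre2.match(req.http.X-Pattern, req.url)) {
            # backref() refers back to the last match() function call.
            set req.http.X-Matched = pcre2.backref(0);
        }
    }

If the pattern were fixed, a ``regex`` object created in ``vcl_init``
would avoid the repeated compilation.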
Compile, match and substitution options
---------------------------------------
The VMOD has unusually long lists of parameters for its methods and
functions -- over 40 for the ``sub()`` function, for example. But
nearly all of these have default values, and it is only necessary to
specify options in VCL that differ from the defaults.
The optional parameters affect the interpretations of patterns and the
operation of matches and substitutions, and come in three groups:
* *Compile* options, used wherever a pattern is compiled: in the
``regex`` object constructor, and the ``match()`` and ``sub()``
functions.
* *Match* options, used wherever a match is performed: in the
``match`` and ``sub`` methods and functions.
* *Substitution* options, used in the ``sub`` method and function.
The options have call scope, meaning that they are evaluated only once
for each invocation of a function or method at its particular location
in the VCL source, on the first invocation after the VCL instance is
loaded. The options are then cached and re-used for all subsequent
invocations, and cannot be changed (until a new VCL instance is
loaded).
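To illustrate call scope, the following sketch reuses the ``bar``
object from the usage example above at two separate call sites. The
limit values and the ``X-Bar-*`` header names are arbitrary and only
serve the illustration::

    sub vcl_recv {
        # Call site 1: its options are evaluated on the first invocation
        # at this location, then cached for later invocations.
        if (bar.match(req.url, match_limit=1000)) {
            set req.http.X-Bar-Url = bar.backref(1);
        }
        # Call site 2: a different location in the VCL source, so it
        # has its own cached set of options.
        if (bar.match(req.http.Referer, match_limit=50000)) {
            set req.http.X-Bar-Referer = bar.backref(1);
        }
    }

Because the options are cached per call site, an option value that is
computed from request data would only be evaluated for the first
request that reaches that site.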
Compile options
~~~~~~~~~~~~~~~
Compile options define properties of patterns. See `pcre2pattern(3)`_
for details of PCRE2 pattern syntax, and `pcre2syntax(3)`_ for a quick
reference.
The default value of all of the BOOL options is **false**.
See also `JIT compilation and matching`_ below.
``allow_empty_class``
If true, then a pattern may include ``[]`` to denote an empty
character class. This, in part, supports compatibility with regexen
in ECMAscript (also known as Javascript). By default, a closing
square bracket after an opening one is interpreted as a character in
the class (and ``]`` must appear later in the pattern).
``alt_bsux``
(Referring to "backslash-u" and "backslash-x".) If true, then three
escape sequences are interpreted differently (for compatibility with
ECMAscript):
* ``\U`` matches an upper case ``U`` character. By default, ``\U``
causes a compile error.
* ``\u`` matches a lower case ``u``, unless it is followed by four
hexadecimal digits, in which case the hex number identifies the
code point to be matched. By default, ``\u`` causes a compile
error.
* ``\x`` matches a lower case ``x``, unless it is followed by two
hex digits, in which case it identifies the code point to match.
By default, ``\x`` must always be followed by zero to two hex
digits to identify a one-byte character (for example, ``\xz``
matches binary zero followed by ``z``).
``alt_circumflex``
If true, and if ``multiline`` is also true, then the ``^``
meta-character matches after a newline appearing as the last
character in a string. By default, ``^`` does not match after
a terminating newline.
``alt_verbnames``
If true, then backslash processing may be applied to verb names in
verb sequences such as ``(*MARK:NAME)``, so that the name can, for
example, include a closing parenthesis as ``\)`` or between ``\Q``
and ``\E``. By default, no processing is applied to verb names, and
they end at the first closing parenthesis (regardless of any
backslash).
``anchored``
If true, then the pattern is anchored, meaning that it is
constrained to match at the starting point of a string. This may
also be achieved with constructs in the pattern.
``bsr``
(For "backslash-R".) If this ENUM value is set, then it determines
which sequences are matched by ``\R``. If set to ``UNICODE``, then
``\R`` matches any UTF-8 newline sequence. If set to ``ANYCRLF``,
then it matches CR (carriage return, or ``\r``), LF (linefeed, or
``\n``), or CR followed by LF. By default, ``\R`` matches the
sequence chosen when the PCRE2 library was built, which can be
determined from ``config_str(BSR)`` (the library's own default, unless
changed at build time, is Unicode). See `pcre2pattern(3)`_ for details
about ``\R``.
``caseless``
If true, then matches for this pattern are case-insensitive. This
may also be achieved with ``(?i)`` in the pattern.
``dollar_endonly``
If true, then the ``$`` metacharacter matches only at the end of a
string. By default, ``$`` also matches immediately before a newline at
the end of the string (but not before any other newlines).
``dollar_endonly`` is ignored when ``multiline`` is true.
``dotall``
If true, then the ``.`` metacharacter matches any character,
including newlines. But it only ever matches one character, even if
newlines are coded as CRLF. By default, dots do not match
newlines. The effect of ``dotall`` can also be achieved with
``(?s)`` in the pattern.
``dupnames``
If true, then the names used for named capturing groups are not
required to be unique. By default, names for capturing groups may
only be used once.
``extended``
If true, then pattern syntax is permitted to contain constructs that
serve as self-documentation:
* Most whitespace is ignored, except when escaped or inside a
character class (and a few other exceptions detailed in
`pcre2api(3)`_).
* All characters between an unescaped ``#`` and the next newline are
ignored, and can be used as comments.
For example, this is a self-documenting declaration of a pattern
that matches IPv6 addresses::
new ipv6 = pcre2.regex(extended=true, caseless=true, pattern=
{"^(?!:) # colon disallowed at start
(?: # start of item
(?: [0-9a-f]{1,4} | # 1-4 hex digits or
(?(1)0 | () ) ) # fail if null previously matched
: # followed by colon
){1,7} # end item; 1-7 of them required
[0-9a-f]{1,4} $ # final hex number at end of string
(?(1)|.) # there was an empty component
"});
The effect of ``extended`` can also be achieved with the ``(?x)``
option in a pattern.
``firstline``
If true, an unanchored pattern must match before or at the first
newline in the subject string (though the matched text may continue
over a newline). If the ``offset_limit`` option is also set for a
match, then the match must occur within the offset limit and in the
first line.
``locale``
If ``locale`` is set to a string matching a locale that is available
on the system on which Varnish is running, then that locale is used
for the pattern to determine which characters are letters, digits,
upper and lower case, and so forth. Hence this option affects the
interpretation of constructs such as ``\w`` and ``\d``, the
``caseless`` option, and so on. This only applies to single-byte
characters.
If ``locale`` is set to a string that is not recognized as a locale,
then compilation fails.
By default, PCRE2 uses tables established when the library is built
to recognize character properties; normally, these only recognize
ASCII characters.
Quoting `pcre2api(3)`_:
The use of locales with Unicode is discouraged. If you are
handling characters with code points greater than 128, you should
either use Unicode support, or use locales, but not try to mix the
two.
``match_unset_backref``
If true, then a back reference to an unset capturing group matches
an empty string; thus ``(\1)(a)`` successfully matches ``a``. This
makes the pattern similar to an ECMAscript pattern. By default, an
unset backref causes the matcher to backtrack, and possibly fail.
``max_pattern_len``
If this INT value is greater than 0, then it sets a maximum length
for the pattern string to be compiled. If the pattern is longer, then
compilation fails.
``multiline``
If true, then the ``^`` and ``$`` meta-characters match immediately
after and before internal newlines in the subject string, respectively,
in addition to matching at the start and end of the string. By default,
the start and end anchors only match at the beginning and end of the
string, regardless of internal newlines. The effect of ``multiline``
can also be achieved with ``(?m)`` in the pattern.
``never_backslash_c``
If true, then ``\C`` may not be used in a pattern, and causes
compile failure. ``\C`` always matches exactly one byte, even in UTF
mode, and may lead to unpredictable effects if it matches in the
middle of a multibyte UTF-8 character. ``\C`` may have been
prohibited by a build-time option in the library, which can be
discovered by calling ``config_bool(NEVER_BACKSLASH_C)``.
``never_ucp``
If true, then Unicode properties are not used to interpret ``\B``,
``\b``, ``\D``, ``\d``, ``\S``, ``\s``, ``\W``, ``\w``, and some of
the POSIX character classes in the pattern. It is then impossible to
activate this facility by including ``(*UCP)`` at the start of the
pattern. If ``never_ucp`` and ``ucp`` are both set to true, then
the compile fails.
``newline``
If this ENUM value is set, it determines which characters are to be
matched as newlines in the pattern. It can be set to:
* ``CR`` (carriage return)
* ``LF`` (linefeed)
* ``CRLF`` (CR followed by LF)
* ``ANYCRLF`` (CR, LF or CRLF)
* ``UNICODE`` (any Unicode line-ending sequences)
By default, the newline sequence chosen for the PCRE2 library when
it was built is used, which can be determined from
``config_str(NEWLINE)``.
``no_auto_capture``
If true, then numbered capturing groups are disabled in the pattern.
Any opening parenthesis not followed by ``?`` is then interpreted as
if it were followed by ``?:`` (that is, it forms a non-capturing
group). Named capturing groups can still be used, and these also
acquire a capturing group number, so ``namedref`` and ``backref``
can still be used (but only for the named groups).
``no_auto_possess``
If true, then the "auto-possessification" optimization is disabled
for the pattern, which for example interprets ``a+b`` as ``a++b``,
using the "possessive quantifier", to prevent backtracks into ``a+``
that can never be successful. If the option is true, then the full
unoptimized search is run.
``no_start_optimize``
If true, then some optimizations for the start of the match are
disabled. This has the effect that certain constructs in the
pattern, such as ``(*COMMIT)`` or ``(*MARK)``, are evaluated at
every possible starting position in the string, while they may have
been skipped when the optimizations are applied. Thus this option
may change the result of ``match`` calls in patterns that include
such constructs. See `pcre2api(3)`_ for details.
``no_utf_check``
If this option and ``utf`` are both true, then validity checks to
determine if the pattern is a valid UTF string are disabled. This
may save CPU usage and time for the ``match()`` and ``sub()``
functions, which compile patterns on every invocation, and check UTF
strings for validity by default. But you should only do so if you
are sure that the inputs are valid, because running matches in UTF
mode against invalid strings is undefined, and may cause Varnish to
crash or loop. By default, invalid UTF strings in the pattern cause
the compile to fail in UTF mode. See `pcre2unicode(3)`_ for details.
``parens_nest_limit``
If this INT value is greater than 0, it sets the maximum depth of
parenthesis nesting in a pattern. It applies to all kinds of
parentheses, not just capturing groups. The limit prevents patterns
from using too much of the stack when compiled, and may be useful
for the functional interface, for which patterns are compiled at
runtime. By default, the nesting limit set for the PCRE2 library at
build time is imposed, which is returned by
``config_int(PARENSLIMIT)``.
``ucp``
If this option and ``utf`` are both true, then Unicode properties
are used to interpret ``\B``, ``\b``, ``\D``, ``\d``, ``\S``,
``\s``, ``\W``, ``\w``, and some of the POSIX character classes in
the pattern. The same effect can be achieved by including ``(*UCP)``
at the start of the pattern. By default, only ASCII characters are
considered for these constructs, which is faster than considering
Unicode properties. If Unicode was disabled at build time for the
PCRE2 library, which can be discovered by calling
``config_bool(UNICODE)``, then the compile fails when this option is
true. Compiles also fail if this option and ``never_ucp`` are both
true. See `pcre2unicode(3)`_ for details about Unicode character
properties.
``ungreedy``
If true, then the "greediness" of quantifiers in the pattern is
inverted, so that they are not greedy by default, but become
greedy when followed by ``?``. The same effect can be achieved
by including ``(?U)`` in the pattern.
``use_offset_limit``
This option must be set to true for a pattern if you intend to use
the ``offset_limit`` parameter in match and substitution operations
to limit how far a string is searched for an unanchored match. If an
``offset_limit`` is set for an invocation of the ``match`` or
``sub`` methods or functions, but this option was not set to true
for the pattern, then the match fails.
``utf``
If true, then both the pattern and the strings against which it is
matched are processed as UTF-8 strings. If Unicode support was
disabled when the PCRE2 library was built, which can be determined
from ``config_bool(UNICODE)``, then the compile fails when ``utf``
is true. See `pcre2unicode(3)`_ for details about Unicode support in
PCRE2.
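As a short sketch of how compile options are passed by name, the
following constructor calls set only the options that differ from the
defaults. The object names and patterns are hypothetical::

    sub vcl_init {
        # Host names are case-insensitive, so match caselessly.
        new host = pcre2.regex("^www\.example\.com$", caseless=true);

        # With no_auto_capture, the plain parentheses around the
        # alternation do not capture; only the named group does.
        new path = pcre2.regex("^/(v1|v2)/(?<rest>.*)$",
                               no_auto_capture=true);
    }

In the second pattern, the named group is the only capturing group,
so it should be retrievable both with ``path.namedref("rest")`` and
with ``path.backref(1)`` after a successful match.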
Match options
~~~~~~~~~~~~~
Match options affect the operation of matching in the ``match`` and
``sub`` methods and functions. By default, all of the BOOL options
are **false**. The INT options are 0 by default (meaning that they
are ignored, and the global defaults hold). The INT options MAY NOT
be less than 0; if they are, then the match fails.
``anchored``
If true, then the match is constrained to match at the start of the
string, regardless of whether the pattern is anchored. By default, a
match is searched for anywhere in the string if the pattern is not
anchored.
``len``
If this INT value is greater than 0, it sets the length of the
subject string to be matched. By default, the full string is matched.
``match_limit``
If this INT value is greater than 0, it sets a limit to the effort
used by the PCRE2 matching function to find a match. This can
prevent matches from excessive backtracking, if there is a very
large search space but a match is never found. It is equivalent to
the varnishd parameter ``pcre_match_limit``, except that it applies
only to the match operation in which it was set, not globally. The
varnishd parameters for PCRE have no effect on this VMOD. By
default, the match limit that was set for the PCRE2 library at build
time is imposed, which can be discovered from
``config_int(MATCHLIMIT)``.
``not_bol``
If true, the first character of the subject string is not
considered to be the beginning of a line, so the ``^`` metacharacter
does not match before it. If the compile option ``multiline`` was
not set to true for the pattern, then ``^`` never matches. This
option only affects the circumflex metacharacter.
``not_eol``
If true, the end of the subject string is not considered to be the
end of a line, so the ``$`` metacharacter does not match after it.
If ``multiline`` was not set to true for the pattern, then ``$``
never matches. This option only affects the dollar metacharacter.
``not_empty``
If true, then the empty string is not a valid match. If the matcher
finds an empty match, then it considers other alternatives, and if
no other valid matches are found, then the match fails.
``not_empty_atstart``
If true, then the empty string is not a valid match at the start of
the subject string. An empty string match later in the subject is
permitted.
``no_jit``
If true, then the just-in-time matcher is not used, even when the
pattern was compiled for JIT. In that case, PCRE2's "traditional"
interpretive matcher is used (as is always the case if JIT is not
available, or if the pattern was not JIT-compiled). If ``no_jit``
is true for an invocation of the ``match()`` or ``sub()`` functions,
which compile a pattern on every call, then the pattern is also not
JIT-compiled. See `JIT compilation and matching`_ below.
``no_utf_check``
If true, then the subject is not checked for validity as a UTF-8
string when matched against a pattern for which ``utf`` was set to
true. This may speed up matching, but should only be done if you
are sure that the inputs are valid UTF-8. By default, UTF validity
is checked for matches against patterns that were compiled with
``utf``.
``offset_limit``
If this INT value is greater than 0, it limits how far an unanchored
search can advance in the subject string. For example, if the
pattern ``abc`` is matched against the string ``"123abc"`` and the
offset limit is less than 3, the match fails. To use this parameter,
the compile option ``use_offset_limit`` must have been set to true
for the pattern at compile time; otherwise the match fails. By
default, unanchored matches are searched for until the end of the
string.
``recursion_limit``
If this INT value is greater than 0, then it limits the depth of
recursion for matches using the interpretive matcher. It is
equivalent to the varnishd parameter ``pcre_match_limit_recursion``,
but only applies to the individual match. This limits the depth of
recursion and use of the stack for matches that may cause excessive
recursion and stack overflow (which usually causes Varnish to
crash). The limit is not relevant to the JIT matcher, and is ignored
for JIT matching. By default, the recursion limit set for the PCRE2
library at build time applies, which can be determined from
``config_int(RECURSIONLIMIT)``.
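The following sketch combines two match options. The ``token`` object,
the ``X-Token`` header, and the limit values are assumptions made for
the illustration. Note that ``use_offset_limit`` has to be set when
the pattern is compiled, as described above::

    sub vcl_init {
        new token = pcre2.regex("token=(\w+)", use_offset_limit=true);
    }

    sub vcl_recv {
        # Do not search for the start of a match beyond offset 64 of
        # the URL, and bound the matching effort for this call only.
        if (token.match(req.url, offset_limit=64, match_limit=5000)) {
            set req.http.X-Token = token.backref(1);
        }
    }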
Substitution options
~~~~~~~~~~~~~~~~~~~~
The ``sub`` method and function use all of the match options (since
they run a match), and the following additional options. (The ``sub``
function also uses the compile options, since it compiles a pattern.)
``suball``
If true, then the substitution iterates over the subject string and
replaces every matching substring, making the substitution similar
to the native VCL ``regsuball`` function. By default, only the first
matching substring is replaced, making the substitution similar to
VCL's ``regsub`` function.
``sub_extended``
If true, then an extended syntax is enabled for the replacement
string. Details of the replacement syntax are documented for the
``.sub()`` method below.
``unknown_unset``
If true, then references to capturing groups in the replacement
string that do not appear in the pattern are treated as unset
groups. By default, unknown references cause the substitution to
fail. Use this option with care, because it causes misspelled group
names or numbers to be silently ignored.
``unset_empty``
If true, then unset capturing groups (including unknown groups when
``unknown_unset`` is also true) are replaced as empty strings. By
default, an attempt to insert an unset group causes the substitution
to fail.
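As a small sketch of the ``suball`` option, the following collapses
every run of consecutive slashes in the URL into a single slash. The
object name ``slashes`` is hypothetical::

    sub vcl_init {
        new slashes = pcre2.regex("/{2,}");
    }

    sub vcl_recv {
        # Without suball=true, only the first run would be replaced.
        set req.url = slashes.sub(req.url, "/", suball=true);
    }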
JIT compilation and matching
----------------------------
PCRE2 supports just-in-time compilation for patterns, and a matcher to
go with it. JIT is a heavyweight optimization that may greatly speed
up matching, but requires extra processing at pattern compilation
time. The VMOD supports JIT if it was enabled for the PCRE2 library
when it was built, which can be determined from ``config_bool(JIT)``.
If JIT is available, then it is always applied to the compilation of
patterns in the ``regex`` object constructor. By default it is also
applied when patterns are compiled at runtime in the ``match()`` and
``sub()`` functions, unless the ``no_jit`` option is true.
For patterns compiled at runtime, it may be worth it to turn off JIT,
if the overhead for JIT-compiles outweighs the advantage of JIT
matching.
If JIT is not available, then PCRE2 always uses the interpretive
matcher.
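A sketch of both sides of this trade-off, assuming a hypothetical
``X-Pattern`` header that carries a runtime pattern and a hypothetical
``X-PCRE2-JIT`` header used only for debugging::

    sub vcl_recv {
        # Skip the JIT compile for a pattern that is used once and
        # then discarded.
        if (pcre2.match(req.http.X-Pattern, req.url, no_jit=true)) {
            set req.http.X-Matched = "true";
        }

        # Report whether the linked PCRE2 library supports JIT at all.
        if (pcre2.config_bool(JIT)) {
            set req.http.X-PCRE2-JIT = "available";
        }
    }

Whether turning off JIT pays off for a given runtime pattern is best
determined by measurement.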
Unicode
-------
The VMOD only links to the 8-bit version of PCRE2, and hence can
support UTF-8 if Unicode was enabled when the library was built. The
VMOD does not support UTF-16 or UTF-32. Thus the term "code unit", as
used for Unicode and in the PCRE2 documentation, always refers to one
byte.
In UTF mode, characters in patterns and the strings to be matched are
interpreted as UTF-8 code points, and hence may correspond to one to
four bytes. When UTF is not enabled, characters in patterns and
strings are represented by exactly one byte.
See `pcre2unicode(3)`_ for the details of PCRE2 Unicode support.
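As a minimal sketch, a pattern can be compiled in UTF mode with
Unicode properties as follows. The object name ``word`` is
hypothetical, and the example assumes that the linked library was
built with Unicode support (otherwise, as described above, the compile
fails)::

    sub vcl_init {
        # With utf and ucp, \w is interpreted using Unicode properties,
        # so it is not restricted to ASCII letters and digits.
        new word = pcre2.regex("^/(\w+)$", utf=true, ucp=true);
    }

    sub vcl_recv {
        if (word.match(req.url)) {
            set req.http.X-Word = word.backref(1);
        }
    }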
$Object regex(STRING pattern, BOOL allow_empty_class=0, BOOL anchored=0,
ENUM {ANYCRLF, UNICODE} bsr=0, BOOL alt_bsux=0,
......@@ -30,19 +619,159 @@ $Object regex(STRING pattern, BOOL allow_empty_class=0, BOOL anchored=0,
BOOL no_utf_check=0, INT parens_nest_limit=0, BOOL ucp=0,
BOOL ungreedy=0, BOOL use_offset_limit=0, BOOL utf=0)
# XXX options for dfa_match, jit fast path, start_offset
# XXX option to make saving the match ctx with PRIV_CALL optional
Create a ``regex`` object from ``pattern`` according to the given
compile options (or option defaults). If the pattern is invalid, then
the VCL will fail to load, and the VCC compiler will emit an error
message.
Examples::
sub vcl_init {
# Match this pattern against the Host header (hence
# case-insensitively), and capture part of the domain name.
new domain = pcre2.regex("^www\.([^.]+)\.com$", caseless=true);
# Match a max-age tag and capture the number.
new maxage = pcre2.regex("max-age\s*=\s*(\d+)");
# Group possible subdomains without capturing
new submatcher = pcre2.regex("^www\.(domain1|domain2)\.com$",
no_auto_capture=true, caseless=true);
}
$Method BOOL .match(PRIV_CALL, PRIV_TASK, STRING subject, INT len=0,
BOOL anchored=0, INT match_limit=0, INT offset_limit=0,
BOOL notbol=0, BOOL noteol=0, BOOL notempty=0,
BOOL notempty_atstart=0, BOOL no_jit=0,
BOOL no_utf_check=0, INT recursion_limit=0)
Return ``true`` if the compiled regex matches the ``subject`` string,
as constrained by the given match options or option defaults.
The match may fail if any of the options are illegal for one of the
reasons given above, or if a limit such as the match or recursion
limit is reached. In that case, an error message is written to the
Varnish log using the ``VCL_Error`` tag, and the method returns
``false``.
If ``subject`` is undefined, for example if it is set from an unset
header variable, then it is assumed to be the empty string. This
follows VCL's handling of regex matching when the string to be matched
is unset.
Example::
if (domain.match(req.http.Host)) {
call do_on_match;
}
$Method STRING .backref(INT ref, STRING fallback = "**BACKREF METHOD FAILED**")
Returns the `nth` captured subexpression from the most recent
successful call of the ``.match()`` method for this object in the same
client or backend context, or a fallback string in case the capture
fails. Backref 0 indicates the entire matched string. Thus this
function behaves like the ``\n`` in the native VCL functions
``regsub`` and ``regsuball``, and the ``$1``, ``$2`` ... variables in
Perl. Unlike the regsubs, which limit the backref number to 0 through
9, ``backref`` permits any number that identifies a capturing group in
the pattern.
Since Varnish client and backend operations run in different threads,
``.backref()`` can only refer back to a ``.match()`` call in the same
thread. Thus a ``.backref()`` call in any of the ``vcl_backend_*``
subroutines -- the backend context -- refers back to a previous
``.match()`` in any of those same subroutines; and a call in any of
the other VCL subroutines -- the client context -- refers back to a
``.match()`` in the same client context.
After unsuccessful matches, the ``fallback`` string is returned for
any call to ``.backref()``. The default value of ``fallback`` is
``"**BACKREF METHOD FAILED**"``. ``.backref()`` always fails after a
failed match, even if ``.match()`` had been called successfully before
the failure.
``.backref()`` may also return ``fallback`` after a successful match,
if no captured group in the matching string corresponds to the backref
number. For example, when the pattern ``(a|(b))c`` matches the string
``ac``, there is no backref 2, since nothing matches ``b`` in the
string.
The VCL infix operators ``~`` and ``!~`` do not affect this method,
nor do the functions ``regsub`` or ``regsuball``. Nor is it affected
by the matches performed by any other method or function in this VMOD
(the ``match()`` function or the ``sub`` method or function).
``.backref()`` fails, returning ``fallback`` and writing an error
message to the Varnish log with the ``VCL_Error`` tag, under the
following conditions (even if a previous match was successful and a
substring could have been captured):
* Any of the match options are illegal (for example, if one of the
numeric limits was set to less than 0).
* The ``fallback`` string is undefined.
* ``ref`` (the backref number) is out of range -- if it is less than 0
or larger than the highest number for a capturing group in the
pattern.
* ``.match()`` was never called for this object in the task scope
prior to calling ``.backref()``.
Example::
if (domain.match(req.http.Host)) {
set req.http.X-Domain = domain.backref(1);
}
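A further sketch shows an explicit ``fallback``, using the pattern
``(a|(b))c`` discussed above. The object name ``ab`` and the header
name are hypothetical::

    sub vcl_init {
        new ab = pcre2.regex("(a|(b))c");
    }

    sub vcl_recv {
        if (ab.match(req.url)) {
            # If the match was against "ac", group 2 is unset, and the
            # fallback string is returned instead.
            set req.http.X-Group-2 = ab.backref(2, fallback="none");
        }
    }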
$Method STRING .namedref(STRING name,
STRING fallback = "**NAMEDREF METHOD FAILED**")
Returns the captured subexpression designated by ``name`` from the
most recent successful call to ``.match()`` in the current context
(client or backend), or ``fallback`` in case of failure. See
`pcre2pattern(3)`_ for details about the use of named subpatterns in
PCRE2 regexen.
Note that a named capturing group can also be referenced as a numbered
group -- the named groups are numbered exactly as if the names were
not present. So an expression returned by ``.namedref()`` will also be
returned by ``.backref()`` with the appropriate number.
``fallback`` is returned when ``.namedref()`` is called after an
unsuccessful match. The default fallback is ``"**NAMEDREF METHOD
FAILED**"``.
Like ``.backref()``, ``.namedref()`` is not affected by native VCL
regex operations, nor by any other matches performed by methods or
functions of the VMOD, except for a prior ``.match()`` for the same
object.
``.namedref()`` fails, returning ``fallback`` and logging a
``VCL_Error`` message, if:
* The ``fallback`` string is undefined.
* ``name`` is undefined.
* There is no such named group.
* ``.match()`` was not called for this object.
Example::
sub vcl_init {
new domain = pcre2.regex("^www\.(?<domain>[^.]+)\.com$");
}
sub vcl_recv {
if (domain.match(req.http.Host)) {
set req.http.X-Domain = domain.namedref("domain");
}
}
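Since named groups are also numbered, the same substring should be
available by number. This sketch assumes the ``domain`` object from
the example above, whose only capturing group is the named one; the
``X-Domain-By-Number`` header is hypothetical::

    sub vcl_recv {
        if (domain.match(req.http.Host)) {
            # Group "domain" is also group 1.
            set req.http.X-Domain-By-Number = domain.backref(1);
        }
    }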
$Method STRING .sub(PRIV_CALL, PRIV_TASK, STRING subject, STRING replacement,
INT len=0, BOOL anchored=0, INT match_limit=0,
INT offset_limit=0, BOOL notbol=0, BOOL noteol=0,
......@@ -51,6 +780,113 @@ $Method STRING .sub(PRIV_CALL, PRIV_TASK, STRING subject, STRING replacement,
BOOL sub_extended=0, BOOL unknown_unset=0,
BOOL unset_empty=0)
If the pattern represented by this object matches ``subject``, then
return a string formed by replacing the part that was matched by
``replacement``. If the pattern does not match, then return the
``subject`` string unchanged. The match and substitution options affect
these operations as described above.
This method is similar to the native VCL ``regsub`` function, or
``regsuball`` when the ``suball`` option is true, but the syntax of
the replacement string is different. In the replacement string, these
sequences can be used to insert strings:
``$$``
Inserts a dollar character.
``$<n>`` or ``${<n>}``
Inserts the contents of group ``<n>`` captured during the match,
where ``<n>`` can be a number or a name. The number can be 0 to
include the entire matched string. Braces are only required if the
following character would be interpreted as part of the number or
name.
``$*MARK`` or ``${*MARK}``
Insert the name of the last ``(*MARK)`` encountered in the match.
For example, to rewrite URLs with prefixes of the form ``"/~<user>"``
so that their prefix is ``"/u/<user>"`` (and leave other URLs
unchanged)::
sub vcl_init {
new user = pcre2.regex("/~([^/]+)(.*)", anchored=true);
}
sub vcl_recv {
set req.url = user.sub(req.url, "/u/${1}${2}");
}
When the ``sub_extended`` option is false, only the dollar character
is special in the replacement string. When ``sub_extended`` is true,
the replacement syntax also has these capabilities:
* Backslashes in the replacement string are interpreted as escapes,
and special backslash sequences are interpreted as for PCRE2
patterns. For example, ``\n`` denotes newline, and ``\x{ddd}``,
where each ``d`` is a digit, specifies a character code. A backslash
followed by a non-alphanumeric character quotes the character, and
``\Q`` and ``\E`` can be used to quote a longer sequence.
* Four additional escape sequences can be used to force the case of
inserted letters:
* ``\U`` forces upper case for all of the following text until
``\E``, or to the end of the string if there is no ``\E``.
* ``\L`` through ``\E`` or end of string forces lower case.
* ``\u`` and ``\l`` force the next character, if it is a letter, to
upper and lower case, respectively.
Case forcing applies to all inserted characters, including those from
captured groups and in sequences quoted by ``\Q`` through ``\E``.
Sequences ending in ``\E`` do not nest. So for example,
``"\Uaa\LBB\Ecc\E"`` results in ``"AAbbcc"``, and the final ``\E`` has
no effect.
* The "dollar" replacement expressions have an additional capability
inspired by Bash to handle unset capturing groups:
``${<n>:-<string>}``
As with ``${<n>}``, ``<n>`` can be a number or name. If group
``<n>`` is set, then its contents are inserted, otherwise
``<string>`` is expanded and inserted. ``<string>`` may, in turn,
include elements of the replacement syntax that are interpreted
accordingly.
``${<n>:+<string1>:<string2>}``
If group ``<n>`` is set, insert the result of expanding
``<string1>``, otherwise insert the result of expanding
``<string2>``.
Colons and escapes in the replacement strings can be escaped with
backslashes.
For example, to rewrite Host headers of the form
``www.<sub1>.<sub2>.<tld>`` to ``<sub2>.<tld>``, and of the form
``www.<sub>.<tld>`` to ``<sub>.<tld>``, while also normalizing the header
to lower-case, and leaving other Host headers unchanged::
sub vcl_init {
new hostsub = pcre2.regex(extended=true, pattern={"
^www\. # www. prefix
([^.]+) # group 1, "<sub1>"
(?: # non-capturing parentheses
\.([^.]+) # dot, then group 2, "<sub2>"
)? # 0 or 1 of group 2
\.([^.]+)$ # dot, then group 3, "<tld>"
"});
}
sub vcl_recv {
set req.http.Host = hostsub.sub(req.http.Host, sub_extended=true,
replacement="\L${2:+$2:$1}.$3");
}
``.sub()`` fails, returning NULL while logging a ``VCL_Error`` message,
if ``replacement`` is undefined.
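As a sketch of the ``${<n>:-<string>}`` form, the following inserts a
default language prefix when the optional first group did not take
part in the match. The object name ``lang`` and the URL layout are
assumptions made for the illustration::

    sub vcl_init {
        # Group 1 is an optional language prefix, group 2 is the rest.
        new lang = pcre2.regex("^/(?:(en|de)/)?(.*)$");
    }

    sub vcl_recv {
        # Insert "en" if group 1 is unset, otherwise keep its contents.
        set req.url = lang.sub(req.url, "/${1:-en}/$2",
                               sub_extended=true);
    }

With this replacement, a URL such as ``/de/foo`` should be left as it
is, while ``/foo`` should be rewritten to ``/en/foo``.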
$Method BOOL .info_bool(ENUM {ALLOW_EMPTY_CLASS, ANCHORED, ALT_BSUX,
ALT_CIRCUMFLEX, ALT_VERBNAMES, CASELESS,
DOLLAR_ENDONLY, DOTALL, DUPNAMES, EXTENDED,
......@@ -63,12 +899,231 @@ $Method BOOL .info_bool(ENUM {ALLOW_EMPTY_CLASS, ANCHORED, ALT_BSUX,
HAS_LASTCODEUNIT, HAS_BACKSLASHC, HAS_CRORLF,
JCHANGED, MATCH_EMPTY}, BOOL compiled=1)
Return true or false about a property of the regex that the object
represents. This method and the other ``.info_*`` methods may be
helpful for debugging and optimizing regular expression matching, for
example by determining whether PCRE2 could enable certain
optimizations for the pattern.
The ENUM determines which property is to be inspected. If the ENUM is any
one of::
ALLOW_EMPTY_CLASS, ANCHORED, ALT_BSUX, ALT_CIRCUMFLEX,
ALT_VERBNAMES, CASELESS, DOLLAR_ENDONLY, DOTALL, DUPNAMES, EXTENDED,
FIRSTLINE, MATCH_UNSET_BACKREF, MULTILINE, NEVER_BACKSLASH_C,
NEVER_UCP, NEVER_UTF, NO_AUTO_CAPTURE, NO_AUTO_POSSESS,
NO_DOTSTAR_ANCHOR, NO_START_OPTIMIZE, NO_UTF_CHECK, UCP, UNGREEDY,
USE_OFFSET_LIMIT, UTF
then the return value of ``info_bool()`` indicates whether the
corresponding compile option is true for the pattern. If ``compiled``
is true, then the return indicates whether the option was set to true
after the pattern was compiled, even if it was specified differently
(or left to the default) in the object constructor. If ``compiled``
is false, then the method returns the value of the option as it was
provided in the constructor.
For example, if the compile option ``anchored`` was set to false in
the constructor (or left to the default), PCRE2 may nevertheless
determine that the pattern is anchored if certain conditions are
satisfied (which are described in detail in `pcre2api(3)`_). In that
case, ``info_bool()`` will return true if ``compiled`` is true, and
false if ``compiled`` is false.
``compiled`` is true by default, and is ignored for the other ENUM
values.
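
A minimal sketch of the ``compiled`` parameter, assuming a
hypothetical object ``startfoo`` whose pattern begins with ``^``, so
that PCRE2 may anchor it automatically even though ``anchored`` was
left to the default::

    import std;
    import pcre2;

    sub vcl_init {
        # Hypothetical object; the anchored option is not set here.
        new startfoo = pcre2.regex("^foo");
    }

    sub vcl_recv {
        # True after compilation, but not as given in the constructor.
        if (startfoo.info_bool(ANCHORED)
            && !startfoo.info_bool(ANCHORED, compiled=false)) {
            std.log("PCRE2 anchored the pattern automatically");
        }
    }
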
The other ENUMs are interpreted as follows:
``HAS_FIRSTCODEUNIT``
If the pattern is unanchored, PCRE2 may determine that there is a
unique code unit (a byte) that must appear at the start of the
matching part of a string. For example, the part of a string that
matches ``(cat|cow|coyote)`` must begin with a
``c``. ``info_bool(HAS_FIRSTCODEUNIT)`` returns true if there is
such a code unit, and false if the pattern is anchored or if no
unique first code unit could be determined. If there is such a first
code unit, it is returned by ``info_str(FIRSTCODEUNIT)``. Note that
in non-UTF mode, the first code unit is the same as the first
character, but for UTF-8 patterns, it may be the first byte in a
multibyte character.
``MATCH_ATSTART``
If the pattern is unanchored and no unique first code unit in the
matching part of the string is known, PCRE2 may determine that the
pattern is constrained to match at the start of the subject string,
or following a newline in the subject. In that case,
``info_bool(MATCH_ATSTART)`` returns true; it returns false if the
pattern is anchored, if a unique first code unit could be found, or
if the pattern could not be determined to match at the start.
``HAS_LASTCODEUNIT``
Under certain circumstances, PCRE2 may determine a rightmost literal
code unit that must exist in a matching string, other than at the
start. This is not necessarily the last byte in the matching part of
a string, but rather the last literal code unit known to be
required. For example, the ``b`` is recorded for this purpose for
the pattern ``ab\d+``, although the ``b`` must be followed by
digits. If there is such a last code unit,
``info_bool(HAS_LASTCODEUNIT)`` returns true, and that value can be
retrieved from ``info_str(LASTCODEUNIT)``. For anchored patterns,
PCRE2 records a possible last literal code unit only if a part of
the pattern that comes before it has variable length. For example,
``z`` is recorded for ``^a\d+z\d+`` (because one or more digits must
come before it), but none is recorded for ``^a\dz\d`` (because
matching strings have a fixed length). As with the first code unit,
the last code unit may be a byte in a multibyte UTF-8 character, if
UTF is enabled for the pattern.
``HAS_BACKSLASHC``
Return true if and only if ``\C`` appears in the pattern.
``HAS_CRORLF``
Return true if and only if the pattern contains explicit matches for
CR or LF characters. These can be literal carriage returns or
linefeeds in the pattern, or the escape sequences ``\r`` or ``\n``.
``JCHANGED``
Return true if and only if the pattern contains ``(?J)`` or ``(?-J)``,
which set or unset the duplicate names option (``DUPNAMES``) within
the pattern.
``MATCH_EMPTY``
Return true if and only if PCRE2 determines that the pattern might
match the empty string. For certain complex patterns (with recursive
subroutines), it may not be possible to determine; in that case,
PCRE2 cautiously returns true.
Example::
# To determine if the FIRSTCODEUNIT optimization could be applied.
if (myregex.info_bool(HAS_FIRSTCODEUNIT)) {
std.log("First matching char in the pattern = "
+ myregex.info_str(FIRSTCODEUNIT));
}
$Method INT .info_int(ENUM {BACKREFMAX, CAPTURECOUNT, JITSIZE, MATCHLIMIT,
MAXLOOKBEHIND, MINLENGTH, RECURSIONLIMIT, SIZE})
Return an integer that describes a property of the pattern that the
object represents, as determined by the ENUM.
``BACKREFMAX``
Return the highest back reference within the pattern. Remember that
named groups also acquire group numbers, and thus count towards the
highest backref. A conditional subpattern such as ``(?(3)a|b)``,
which checks if a capturing group is set, also counts as a
backref. If there are no backrefs, return 0.
``CAPTURECOUNT``
Return the highest capturing group number in the pattern. If the
``(?|`` construct (which allows duplicate group numbers, see
`pcre2pattern(3)`_) is not used in the pattern, then the value
returned is also the total number of capturing groups.
``JITSIZE``
Return the size of JIT-compiled code for the pattern. Returns 0 if
the pattern was not JIT-compiled.
``MATCHLIMIT``
If the pattern contains the construct ``(*LIMIT_MATCH=nnnn)`` to set
the match limit (see the match option ``match_limit`` above), then
return the limit that it sets. Returns -1 if no such value has been
set.
``MAXLOOKBEHIND``
Return the number of characters in the longest lookbehind assertion
in the pattern. Returns 0 if there are no lookbehinds.
``MINLENGTH``
If PCRE2 has determined that there is a lower bound for the length
of a string that may match the pattern, then return that
value. Returns 0 if no lower bound is known. This is not necessarily
the same as the shortest string that may possibly match; but any
string that does match must be at least that long.
``RECURSIONLIMIT``
If the pattern contains the construct ``(*LIMIT_RECURSION=nnnn)``
(see the match option ``recursion_limit`` above), then return the
value that was set. Returns -1 if no such value has been set.
``SIZE``
Return the size of the compiled pattern, as used for the
interpretive matcher, in bytes. This is independent of the value
returned by ``info_int(JITSIZE)``.
Example::
# To determine if a lower bound on the length of matching strings
# could be found.
if (myregex.info_int(MINLENGTH) != 0) {
std.log("Lower bound on matching string length = "
+ myregex.info_int(MINLENGTH));
}
else {
std.log("No lower bound for matching string lengths found");
}
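
As a further sketch, ``JITSIZE`` and ``SIZE`` can be combined to check
whether a pattern was JIT-compiled. The object ``myregex`` is assumed
to have been created in ``vcl_init``, as in the example above::

    import std;
    import pcre2;

    sub vcl_recv {
        # JITSIZE is 0 if the pattern was not JIT-compiled.
        if (myregex.info_int(JITSIZE) == 0) {
            std.log("Pattern not JIT-compiled, compiled size = "
                    + myregex.info_int(SIZE));
        }
    }
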
$Method STRING .info_str(ENUM {BSR, FIRSTCODEUNIT, FIRSTCODEUNITS, LASTCODEUNIT,
NEWLINE}, STRING sep=" ")
Return a string that describes a property of the pattern represented
by the object, as determined by the ENUM. The ``sep`` parameter is
only relevant when the ENUM ``FIRSTCODEUNITS`` is used, as described
below.
``BSR``
Return ``"UNICODE"``, meaning that ``\R`` in the pattern matches any
Unicode line ending sequence, or ``"ANYCRLF"``, meaning that it
matches only CR, LF or CRLF.
``FIRSTCODEUNIT``
If PCRE2 determines that there is a unique first code unit that must
begin the matching part of a string (as described above for
``info_bool(HAS_FIRSTCODEUNIT)``), then return that code unit in a
string. Returns the empty string if no such code unit was
determined; this is also the case if the pattern is anchored. Recall
that a code unit corresponds to a character in non-UTF mode, but may
be a byte in a multibyte character when UTF-8 is enabled. The code
unit is not escaped in the return string.
``FIRSTCODEUNITS``
(Note the difference between ``FIRSTCODEUNIT``, singular, and
``FIRSTCODEUNITS``, plural.) For an unanchored pattern, if PCRE2
cannot determine a unique code unit that must appear at the start of
the matching part of a string, it may be able to determine a set of
such code units. For example, if the pattern starts with ``[abc]``,
then the matching part must begin with ``a``, ``b`` or ``c``. In
that case, ``info_str(FIRSTCODEUNITS)`` returns those code units in
a string, separated by the string given as ``sep``. The default
value of ``sep`` is ``" "`` (the string containing one space). If
the pattern is anchored, or if a unique first code unit could be
found, or if no set of first code units could be found, then return
the empty string.
``LASTCODEUNIT``
If PCRE2 has recorded a rightmost literal code unit that must exist
in a matching string, as described for ``info_bool(HAS_LASTCODEUNIT)``
above, then return that code unit in a string. Returns the empty
string if no such code unit was recorded.
``NEWLINE``
Return a string describing the default sequence recognized as a
"newline" for the pattern:
* ``"CR"`` (carriage return)
* ``"LF"`` (linefeed)
* ``"CRLF"`` (CR followed by LF)
* ``"ANYCRLF"`` (CR, LF or CRLF)
* ``"UNICODE"`` (any Unicode line-ending sequence)
Example::
# Determine if a set of first matching characters could be found.
std.log("First matching chars: " + myregex.info_str(FIRSTCODEUNITS));
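
The ``sep`` parameter can be used as in the following sketch, which
again assumes an object ``myregex`` created in ``vcl_init``::

    import std;
    import pcre2;

    sub vcl_recv {
        # Log the set of possible first code units separated by
        # commas; the result is the empty string if no such set
        # was found.
        std.log("First matching chars: "
                + myregex.info_str(FIRSTCODEUNITS, sep=","));
    }
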
Regex functional interface
--------------------------
$Function BOOL match(PRIV_CALL, PRIV_TASK, STRING pattern, STRING subject,
BOOL allow_empty_class=0, BOOL anchored=0,
ENUM {ANYCRLF, UNICODE} bsr=0, BOOL alt_bsux=0,
......@@ -89,12 +1144,87 @@ $Function BOOL match(PRIV_CALL, PRIV_TASK, STRING pattern, STRING subject,
BOOL notempty_atstart=0, BOOL no_jit=0,
INT recursion_limit=0)
Compile the ``pattern`` and return true if it matches
``subject``. Compilation and matching are subject to the given
options, or default options. The compiled pattern is discarded after
use, and ``pattern`` is compiled on every invocation.
The call fails, logging a ``VCL_Error`` message and returning false,
if:
* ``pattern`` is undefined.
* The compile fails (for example due to a syntax error).
* Any compile or match option is illegal as described above.
As with the ``.match()`` method, if ``subject`` is undefined, then it
is assumed to be the empty string.
Example::
# Match a request header against a pattern provided in a response
# header.
if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject)) {
call do_on_match;
}
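
Since the function compiles ``pattern`` on every invocation, a pattern
that is fixed when the VCL is written may be better served by the
object interface. The following sketch contrasts the two; the object
name ``apire`` and the header names are assumptions made for the
example::

    import pcre2;

    sub vcl_init {
        # Fixed pattern: compiled once, when the VCL is loaded.
        new apire = pcre2.regex("^/api/");
    }

    sub vcl_recv {
        if (apire.match(req.url)) {
            set req.http.X-Is-Api = "true";
        }
    }

    sub vcl_deliver {
        # Pattern only known at runtime: compiled on each call.
        if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject)) {
            set resp.http.X-Matched = "true";
        }
    }
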
$Function STRING backref(PRIV_TASK, INT ref,
STRING fallback = "**BACKREF FUNCTION FAILED**")
Return the `nth` captured subexpression from the most recent
successful call of the ``match()`` function in the current client or
backend context, or a fallback string if the capture fails. The
default ``fallback`` is ``"**BACKREF FUNCTION FAILED**"``.
As with the ``regex.backref()`` method, ``fallback`` is returned
after any failed invocation of the ``match()`` function, or if there
is no captured group corresponding to the backref number. The function
is not affected by native VCL regex operations, or any other method or
function of the VMOD except for the ``match()`` function.
The function fails, returning ``fallback`` and logging a ``VCL_Error``
message, under the same conditions as the corresponding method:
* ``fallback`` is undefined.
* ``ref`` is out of range.
* The ``match()`` function was never called in this context.
* The pattern failed to compile for the previous ``match()`` call.
Example::
# Match against a pattern provided in a response header, and capture
# subexpression 1.
if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject)) {
set resp.http.X-Group-1 = pcre2.backref(1);
}
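
A custom ``fallback`` may be given in place of the default string, as
in this sketch (the header names and the fallback value ``"none"`` are
chosen only for illustration)::

    import pcre2;

    sub vcl_deliver {
        if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject)) {
            # "none" is returned if group 2 cannot be retrieved.
            set resp.http.X-Group-2 = pcre2.backref(2, fallback="none");
        }
    }
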
$Function STRING namedref(PRIV_TASK, STRING name,
STRING fallback = "**NAMEDREF FUNCTION FAILED**")
Return the captured subexpression designated by ``name`` from the most
recent successful call of the ``match()`` function in the current
context, or ``fallback`` in case of failure. The default fallback is
``"**NAMEDREF FUNCTION FAILED**"``.
The function returns ``fallback`` when the previous invocation of the
``match()`` function failed, and is only affected by use of the
``match()`` function. The function fails, returning ``fallback`` and
logging a ``VCL_Error`` message, under the same conditions as the
corresponding method:
* ``fallback`` is undefined.
* ``name`` is undefined or the empty string.
* There is no such named group.
* ``match()`` was not called in this context.
* The pattern failed to compile for the previous ``match()`` call.
Example::
if (pcre2.match(resp.http.X-Pattern-With-Names, req.http.X-Subject)) {
set resp.http.X-Group-Foo = pcre2.namedref("foo");
}
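
A self-contained sketch with a literal pattern; the URL layout, the
group name ``id`` and the header ``X-User-Id`` are assumptions made
for the example::

    import pcre2;

    sub vcl_recv {
        # Hypothetical pattern with a named group "id".
        if (pcre2.match("^/user/(?<id>\d+)", req.url)) {
            set req.http.X-User-Id = pcre2.namedref("id",
                                                    fallback="unknown");
        }
    }
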
$Function STRING sub(PRIV_CALL, PRIV_TASK, STRING pattern, STRING subject,
STRING replacement, BOOL allow_empty_class=0,
BOOL anchored=0, ENUM {ANYCRLF, UNICODE} bsr=0,
......@@ -116,42 +1246,239 @@ $Function STRING sub(PRIV_CALL, PRIV_TASK, STRING pattern, STRING subject,
INT recursion_limit=0, BOOL suball=0, BOOL sub_extended=0,
BOOL unknown_unset=0, BOOL unset_empty=0)
Compile ``pattern``, and if it matches ``subject``, then return a string
formed by replacing the part that matched with ``replacement``. If the
pattern does not match, return ``subject`` unchanged. The compile, match
and substitution options affect all of these operations, as described
above.
The syntax of the ``replacement`` string, as modified if the
``sub_extended`` option is true, is the same as documented above for
the ``.sub()`` method.
``sub()`` fails, returning NULL and logging a ``VCL_Error`` message,
if:
* Either of ``pattern`` or ``replacement`` is undefined.
* ``pattern`` cannot be compiled.
Example::
# If the beresp header X-Sub-Letters contains "b+", and Host contains
# "www.yabba.dabba.doo.com", then set X-Yada to
# "www.yada.dabba.doo.com".
set beresp.http.X-Yada = pcre2.sub(beresp.http.X-Sub-Letters,
bereq.http.Host, "d");
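
A sketch of the ``suball`` substitution option, under the assumption
that it replaces every match rather than only the first one. With the
Host header from the example above, the result would then be expected
to be ``"www.yada.dada.doo.com"``. The header ``X-Yada-All`` is
hypothetical::

    import pcre2;

    sub vcl_backend_response {
        # With the default suball=false, only the first match
        # is replaced.
        set beresp.http.X-Yada-All = pcre2.sub("b+", bereq.http.Host, "d",
                                               suball=true);
    }
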
Library configuration
---------------------
$Function BOOL config_bool(ENUM {JIT, STACKRECURSE, UNICODE})
Return true or false about a property of the PCRE2 library to which
the VMOD is linked, identified by the ENUM. The ``config_*`` functions
make it possible to discover features of the library that were chosen
when it was built.
``JIT``
Return true if the library supports just-in-time compilation and
matching.
``STACKRECURSE``
Return true if internal recursion for the PCRE2 matcher uses the
system stack to maintain its state, which is the usual way the
library is built. If false is returned, PCRE2 uses blocks of data on
the heap rather than recursive function calls.
``UNICODE``
Return true if Unicode support is available. If so, then the compile
option ``utf`` can be used to define a pattern and the strings
against which it is matched as UTF-8 strings.
Example::
if (pcre2.config_bool(JIT)) {
std.log("JIT supported for PCRE2");
}
else {
std.log("JIT not supported for PCRE2");
}
$Function STRING config_str(ENUM {BSR, JITTARGET, NEWLINE, UNICODE_VERSION,
VERSION})
Return a string describing a property of the PCRE2 library.
``BSR``
Return a string indicating what the ``\R`` escape sequence matches
by default: ``UNICODE`` for Unicode line-ending sequences, or
``ANYCRLF`` for only CR, LF and CRLF. This is the default that holds
if no value is given for the compile option ``bsr``.
``JITTARGET``
Return a string identifying the architecture for which the JIT
compiler is configured. If JIT is not enabled, the returned string
contains the phrase ``"JIT not supported"``.
``NEWLINE``
Return a string identifying the character sequence that is recognized
by default as a newline:
* ``"CR"`` (carriage return)
* ``"LF"`` (linefeed)
* ``"CRLF"`` (CR followed by LF)
* ``"ANY"`` (any Unicode line ending)
* ``"ANYCRLF"`` (any of CR, LF or CRLF)
This is the default if no value is given for the compile option
``newline``.
``UNICODE_VERSION``
If Unicode is supported by the library, return the Unicode version
string. If not, return ``"Unicode not supported"``.
``VERSION``
Return the PCRE2 version string.
Example::
std.log("Linked to PCRE2 version " + pcre2.config_str(VERSION));
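
The ``config_bool`` and ``config_str`` functions can be combined, for
instance to log details only for features that are available. This is
a sketch; calling these functions from ``vcl_recv`` is an assumption
made for the example::

    import std;
    import pcre2;

    sub vcl_recv {
        if (pcre2.config_bool(JIT)) {
            std.log("PCRE2 JIT target: " + pcre2.config_str(JITTARGET));
        }
        if (pcre2.config_bool(UNICODE)) {
            std.log("PCRE2 Unicode version: "
                    + pcre2.config_str(UNICODE_VERSION));
        }
    }
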
$Function INT config_int(ENUM {LINKSIZE, MATCHLIMIT, PARENSLIMIT,
RECURSIONLIMIT})
Return an integer describing a property of the PCRE2 library.
``LINKSIZE``
Return the number of bytes used for internal linkage (offsets) in
compiled regular expressions. This determines the size of the
largest possible pattern; the default link size of 2 allows for
patterns of up to 64K bytes.
``MATCHLIMIT``
Return the default value for the ``match_limit`` match option,
which limits the effort of the matcher when no match is found.
``PARENSLIMIT``
Return the default value of the ``parens_nest_limit`` compile
option, which limits the depth of parenthesis nesting in patterns,
and hence the use of the stack during compilation.
``RECURSIONLIMIT``
Return the default value of the ``recursion_limit`` match option,
which limits the depth of recursion, and hence stack usage, for the
interpretive (non-JIT) matcher.
Example::
std.log("Default PCRE2 match limit = " + pcre2.config_int(MATCHLIMIT));
$Function STRING version()
Returns the version string for this VMOD.
Example::
std.log("Using VMOD pcre2 version " + pcre2.version());
REQUIREMENTS
============
This VMOD has been tested with Varnish version 5.1.2 and PCRE2 version
10.23.
INSTALLATION
============
See `INSTALL.rst <INSTALL.rst>`_ in the source repository.
LIMITATIONS
===========
The VMOD allocates Varnish workspace for a variety of purposes:
* The string returned by the ``sub`` method and function.
* Buffers for temporary data structures needed by the PCRE2 library,
for example to save information about a match for use by the
``backref`` and ``namedref`` methods and functions.
* A copy of the subject string for the ``match`` method and function,
if it is not already in workspace, so that it can be safely accessed
by ``backref`` and ``namedref``.
* Return strings for some uses of ``info_str`` and ``config_str``.
* Temporary buffers for error message strings from the PCRE2 library.
If VMOD operations fail with the "out of space" error message in the
Varnish log (with the ``VCL_Error`` tag), increase the varnishd runtime
parameters ``workspace_client`` and/or ``workspace_backend``.
The PCRE2 interpretive and JIT matchers are backtracking matchers, and
the interpretive matcher is recursive, using part of the stack on each
recursive call (in the default library configuration). For patterns
with large search spaces, this can lead to slow matches, high CPU
usage, and stack overflow due to deep recursion, which typically
causes Varnish to segfault. This has occasionally been the subject of
issues reported to the Varnish project.
For most common uses of regular expressions in VCL, PCRE2 is very fast
and has minimal resource consumption. This depends strongly on how the
regex is written -- a well-crafted pattern helps the matcher limit
backtracking, fail early on non-matches, and make use of some of the
optimizations that PCRE2 can apply. Some of the compile and match
options also help to optimize the match operation. Which of these
measures is possible depends, of course, on what you want the regex to
do.
Writing optimized regexen is a very broad subject, beyond the scope of
this manual. There is some advice in `pcre2perform(3)`_, and in many
other sources.
If your use case requires patterns and subject strings that can lead
to very large search spaces, consider using some of the options
available in the VMOD that limit excessive effort for unsuccessful
matches. In particular, consider lowering the match options
``match_limit`` and ``recursion_limit``. You can also use
``offset_limit`` to set a maximum length to search for a match in the
subject string (for which you will have to set the compile option
``use_offset_limit``). These may cause the matcher to halt before it
has exhausted all possibilities for a match (but it appears to be
common that, if the matcher has to search for a long time, then there
was never any match to be found).
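
Lowered limits might be applied as in the following sketch. The
pattern, the object name ``risky``, the header names and the limit
values are illustrative assumptions, not recommendations; suitable
values depend on the patterns and subjects in use::

    import pcre2;

    sub vcl_init {
        # Hypothetical pattern with nested quantifiers, which can
        # lead to a very large search space for non-matching subjects.
        new risky = pcre2.regex("^(a+)+$");
    }

    sub vcl_recv {
        # Limit the effort and recursion depth for this match.
        if (risky.match(req.http.X-Subject, match_limit=10000,
                        recursion_limit=1000)) {
            set req.http.X-Risky-Match = "true";
        }
    }
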
If you encounter stack overflow, it may help to increase the stack
size (by changing ``limits.conf`` or calling ``ulimit -s`` before
starting Varnish). Since Varnish 4.1, you can also increase the
varnishd parameter ``thread_pool_stack``. Bear in mind that this
increases the total RAM usage of Varnish.
ACKNOWLEDGEMENTS
================
A tip of the hat to Philip Hazel, who released the first version of
PCRE twenty years before this VMOD was developed.
A few sentences in this manual are identical to or very closely track
phrasings in the PCRE2 documentation, if there was simply no better
way to say what needs to be said.
SEE ALSO
========
* varnishd(1)
* vcl(7)
* pcre2(3)
* PCRE web site: http://www.pcre.org/
* VMOD source repository: https://code.uplex.de/uplex-varnish/libvmod-pcre2
.. _pcre2(3): http://www.pcre.org/current/doc/html/pcre2.html
.. _pcre2pattern(3): http://www.pcre.org/current/doc/html/pcre2pattern.html
.. _pcre2syntax(3): http://www.pcre.org/current/doc/html/pcre2syntax.html
.. _pcre2api(3): http://www.pcre.org/current/doc/html/pcre2api.html
.. _pcre2unicode(3): http://www.pcre.org/current/doc/html/pcre2unicode.html
.. _pcre2perform(3): http://www.pcre.org/current/doc/html/pcre2perform.html
$Event event