libvmod-pcre2

Varnish Module (VMOD) to access the PCRE2 regular expression library.

Latest commit: 3ffe3995c0 ("Fix a typo"), Geoff Simmons, 2017-07-05 00:03:32 +02:00

..
.. NB:  This file is machine generated, DO NOT EDIT!
..
.. Edit vmod.vcc and run make instead
..

.. role:: ref(emphasis)

.. _vmod_pcre2(3):

==========
vmod_pcre2
==========

-------------------------------------------
access the pcre2 regular expression library
-------------------------------------------

:Manual section: 3

SYNOPSIS
========

import pcre2 [from "path"] ;


::

  # object interface
  new OBJECT = pcre2.regex(STRING pattern [, compile options])
  BOOL <OBJ>.match(STRING subject [, match options])
  STRING <OBJ>.backref(INT ref)
  STRING <OBJ>.namedref(STRING name)
  STRING <OBJ>.sub(STRING subject, STRING replacement [, match options]
                   [, substitution options])
  BOOL <OBJ>.info_bool(ENUM)
  INT <OBJ>.info_int(ENUM)
  STRING <OBJ>.info_str(ENUM)

  # function interface
  BOOL pcre2.match(STRING pattern, STRING subject [, compile options]
                   [, match options])
  STRING pcre2.backref(INT ref)
  STRING pcre2.namedref(STRING name)
  STRING pcre2.sub(STRING pattern, STRING subject, STRING replacement
                   [, compile options] [, match options]
                   [, substitution options])

  # library configuration
  BOOL pcre2.config_bool(ENUM)
  INT pcre2.config_int(ENUM)
  STRING pcre2.config_str(ENUM)

DESCRIPTION
===========

This Varnish Module (VMOD) provides access to the PCRE2 regular
expression library. PCRE2 is the Perl-compatible regular expression
library with a revised API, the successor to the PCRE library that
implements native regexen in Varnish VCL. See `pcre2(3)`_ and the
manuals that it references for details about the PCRE2 library.

PCRE2, by itself, does not change regular expressions from the
perspective of the end user -- the syntax and semantics of patterns
and pattern matching remained largely the same at the time PCRE2 was
introduced. The new library is a refactoring of the internal API,
which is transparent to the user, and the VMOD endeavors to make use
of the new internal features advantageously for VCL.

Some of the differences between the VMOD and native VCL regexen are:

* The VMOD provides methods and functions to retrieve back references
  after a match that are easier to use than the idiom with the
  ``regsub`` function that is necessary in native VCL. It also
  provides the means to retrieve references to named capturing groups.

* The functional interface makes it possible to use patterns that are
  not known until runtime.

* PCRE2 introduces a new native substitution function, similar to the
  ``regsub`` and ``regsuball`` functions in VCL, except that the
  substitution syntax is different and provides more features.

* Parameters that limit the depth of recursion and backtracking in
  match operations, which are set globally in Varnish, can be set for
  individual matches in the VMOD.

* The VMOD can support matching against UTF-8 strings, if it is
  running against a PCRE2 library that was built to support Unicode.

* The VMOD exposes considerably more functionality of the underlying
  library. VCL provides a general-purpose regular expression facility
  -- PCRE could be easily replaced as its regex engine. The VMOD is
  meant to be specific to PCRE2, and makes a full range of its
  features available in VCL.

* The VMOD provides methods and functions that allow you to inspect
  properties of patterns and of the library. These are not likely to
  be useful on the fast path of production deployments, and are not
  optimized for that. But they may be useful during development to
  debug and optimize regex matching.

Since the introduction of PCRE2, the original PCRE library has been
maintained only for bugfixes, while new features and optimizations
are developed only for PCRE2. The VMOD therefore makes it possible to
take advantage of improvements in the library as they are released.

Here are some simple usage examples::

  # regex objects are created in vcl_init, and the regular expressions
  # are compiled when VCL is loaded.
  sub vcl_init {
      # A regex to match the "foo" cookie, and capture its value.
      new foo = pcre2.regex("\bfoo=([^;,\s]+\b)");

      # A regex to match a URL beginning with the prefix "/bar/", and
      # capture its suffix.
      new bar = pcre2.regex("^/bar/(.+)");
  }

  sub vcl_recv {
      # If the cookie header contains "foo", then assign its value
      # to another header.
      if (foo.match(req.http.Cookie)) {
          set req.http.X-Foo-Value = foo.backref(1);
      }

      # If the URL begins with "/bar/", then replace the prefix with
      # "/baz/quux/".
      if (bar.match(req.url)) {
          set req.url = "/baz/quux/" + bar.backref(1);
      }
  }

Object and functional interfaces
--------------------------------

The VMOD provides regular expression operations by way of the
``regex`` object interface and a functional interface. For ``regex``
objects, the pattern is compiled at VCL initialization time, and the
compiled pattern is re-used for each invocation of its
methods. Compilation failures (due to errors in the pattern) cause
failure at initialization time, and the VCL fails to load. The
``.backref()`` and ``.namedref()`` methods refer back to the last
invocation of the ``.match()`` method for the same object. The
``.sub()`` method also re-uses an object's compiled pattern.

The functional interface provides the same set of operations, but the
pattern is compiled at runtime on each invocation of the ``match()``
and ``sub()`` functions (and then discarded). Compilation failures are
reported as errors in the Varnish log. The ``backref()`` and
``namedref()`` functions refer back to the last invocation of the
``match()`` function, for any pattern.

Compiling a pattern at runtime on each invocation is considerably more
costly than re-using a compiled pattern. So for patterns that are
fixed and known at VCL initialization, the object interface should be
used. The functional interface should only be used for patterns whose
contents are not known until runtime.
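
As a hedged sketch of a case where the functional interface fits, the
fragment below matches the ``Referer`` header against a pattern that
is assembled from the ``Host`` header, and so cannot be known when the
VCL is loaded. The header ``X-Same-Site`` is a hypothetical name chosen
for illustration, and the ``\Q...\E`` quoting assumes that the Host
value does not itself contain ``\E``::

  import pcre2;

  sub vcl_recv {
      # The pattern depends on the request, so it is compiled here on
      # every invocation and then discarded.
      if (pcre2.match("^https?://\Q" + req.http.Host + "\E(/|$)",
                      req.http.Referer)) {
          set req.http.X-Same-Site = "true";
      }
  }

A pattern that is the same for every request belongs in a ``regex``
object instead, so that it is compiled only once.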

Compile, match and substitution options
---------------------------------------

The VMOD has unusually long lists of parameters for its methods and
functions -- over 40 for the ``sub()`` function, for example. But
nearly all of these have default values, and it is only necessary to
specify options in VCL that differ from the defaults.

The optional parameters affect the interpretations of patterns and the
operation of matches and substitutions, and come in three groups:

* *Compile* options, used wherever a pattern is compiled: in the
  ``regex`` object constructor, and the ``match()`` and ``sub()``
  functions.

* *Match* options, used wherever a match is performed: in the
  ``match`` and ``sub`` methods and functions.

* *Substitution* options, used in the ``sub`` method and function.

The options have call scope, meaning that they are evaluated only once
for each invocation of a function or method at its particular location
in the VCL source, on the first invocation after the VCL instance is
loaded. The options are then cached and re-used for all subsequent
invocations, and cannot be changed (until a new VCL instance is
loaded).
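
The call-scope rule can be illustrated with a sketch like the
following, in which one ``regex`` object is used at two locations in
the VCL source, each with its own options. The object name ``path``
and the ``X-`` headers are hypothetical::

  import pcre2;

  sub vcl_init {
      new path = pcre2.regex("/api/");
  }

  sub vcl_recv {
      # Call site 1: anchored=true is evaluated on the first invocation
      # at this location, and cached for all later invocations here.
      if (path.match(req.url, anchored=true)) {
          set req.http.X-API = "true";
      }

      # Call site 2: a different location with its own cached options.
      # This match is unanchored, with a per-match effort limit.
      if (path.match(req.http.X-Next, match_limit=10000)) {
          set req.http.X-Next-API = "true";
      }
  }

Since the options at each location are fixed after the first
invocation, an option value that varies from request to request is
only honored as it was on that first invocation.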

Compile options
~~~~~~~~~~~~~~~

Compile options define properties of patterns. See `pcre2pattern(3)`_
for details of PCRE2 pattern syntax, and `pcre2syntax(3)`_ for a quick
reference.

The default value of all of the BOOL options is **false**.

See also `JIT compilation and matching`_ below.

``allow_empty_class``
  If true, then a pattern may include ``[]`` to denote an empty
  character class. This, in part, supports compatibility with regexen
  in ECMAscript (also known as Javascript). By default, a closing
  square bracket after an opening one is interpreted as a character in
  the class (and ``]`` must appear later in the pattern).

``alt_bsux``
  (Referring to "backslash-u" and "backslash-x".) If true, then three
  escape sequences are interpreted differently (for compatibility with
  ECMAscript):

  * ``\U`` matches an upper case ``U`` character. By default, ``\U``
    causes a compile error.

  * ``\u`` matches a lower case ``u``, unless it is followed by four
    hexadecimal digits, in which case the hex number identifies the
    code point to be matched. By default, ``\u`` causes a compile
    error.

  * ``\x`` matches a lower case ``x``, unless it is followed by two
    hex digits, in which case it identifies the code point to match.
    By default, ``\x`` is always followed by zero to two hex digits
    that identify a one-byte character (for example, ``\xz`` matches
    binary zero followed by ``z``).

``alt_circumflex``
  If true, and if ``multiline`` is also true, then the ``^``
  meta-character matches after a newline appearing as the last
  character in a string. By default, ``^`` does not match after
  a terminating newline.

``alt_verbnames``
  If true, then backslash processing may be applied to verb names in
  verb sequences such as ``(*MARK:NAME)``, so that the name can, for
  example, include a closing parenthesis as ``\)`` or between ``\Q``
  and ``\E``. By default, no processing is applied to verb names, and
  they end at the first closing parenthesis (regardless of any
  backslash).

``anchored``
  If true, then the pattern is anchored, meaning that it is
  constrained to match at the starting point of a string. This may
  also be achieved with constructs in the pattern.

``bsr``
  (For "backslash-R".) If this ENUM value is set, then it determines
  which sequences are matched by ``\R``. If set to ``UNICODE``, then
  ``\R`` matches any UTF-8 newline sequence. If set to ``ANYCRLF``,
  then it matches CR (carriage return, or ``\r``), LF (linefeed, or
  ``\n``), or CR followed by LF. By default, ``\R`` matches the
  sequence chosen when the PCRE2 library was built, which can be
  determined from ``config_str(BSR)`` (the library's own build default
  is Unicode). See `pcre2pattern(3)`_ for details about ``\R``.

``caseless``
  If true, then matches for this pattern are case-insensitive. This
  may also be achieved with ``(?i)`` in the pattern.

``dollar_endonly``
  If true, then the ``$`` metacharacter matches only at the end of a
  string. By default, ``$`` also matches immediately before a newline
  that is the last character of the string (but not before any other
  newline). ``dollar_endonly`` is ignored when ``multiline`` is true.

``dotall``
  If true, then the ``.`` metacharacter matches any character,
  including newlines. But it only ever matches one character, even if
  newlines are coded as CRLF. By default, dots do not match
  newlines. The effect of ``dotall`` can also be achieved with
  ``(?s)`` in the pattern.

``dupnames``
  If true, then the names used for named capturing groups are not
  required to be unique. By default, names for capturing groups may
  only be used once.

``extended``
  If true, then pattern syntax is permitted to contain constructs that
  serve as self-documentation:

  * Most whitespace is ignored, except when escaped or inside a
    character class (and a few other exceptions detailed in
    `pcre2api(3)`_).

  * All characters between an unescaped ``#`` and the next newline are
    ignored, and can be used as comments.

    For example, this is a self-documenting declaration of a pattern
    that matches IPv6 addresses::

      new ipv6 = pcre2.regex(extended=true, caseless=true, pattern=
      {"^(?!:)                 # colon disallowed at start
        (?:                    # start of item
          (?: [0-9a-f]{1,4} |  # 1-4 hex digits or
          (?(1)0 | () ) )      # fail if null previously matched
          :                    # followed by colon
        ){1,7}                 # end item; 1-7 of them required
        [0-9a-f]{1,4} $        # final hex number at end of string
        (?(1)|.)               # there was an empty component
      "});

  The effect of ``extended`` can also be achieved with the ``(?x)``
  option in a pattern.

``firstline``
  If true, an unanchored pattern must match before or at the first
  newline in the subject string (though the matched text may continue
  over a newline). If the ``offset_limit`` option is also set for a
  match, then the match must occur within the offset limit and in the
  first line.

``locale``
  If ``locale`` is set to a string matching a locale that is available
  on the system on which Varnish is running, then that locale is used
  for the pattern to determine which characters are letters, digits,
  upper and lower case, and so forth. Hence this option affects the
  interpretation of constructs such as ``\w`` and ``\d``, the
  ``caseless`` option, and so on. This only applies to single-byte
  characters.

  If ``locale`` is set to a string that is not recognized as a locale,
  then compilation fails.

  By default, PCRE2 uses tables established when the library is built
  to recognize character properties; normally, these only recognize
  ASCII characters.

  Quoting `pcre2api(3)`_:

    The use of locales with Unicode is discouraged.  If you are
    handling characters with code points greater than 128, you should
    either use Unicode support, or use locales, but not try to mix the
    two.

``match_unset_backref``
  If true, then a back reference to an unset capturing group matches
  an empty string; thus ``(\1)(a)`` successfully matches ``a``. This
  makes the pattern similar to an ECMAscript pattern. By default, an
  unset backref causes the matcher to backtrack, and possibly fail.

``max_pattern_len``
  If this INT value is greater than 0, then it sets a maximum length
  for the pattern string to be compiled. If the pattern is longer, then
  compilation fails.

``multiline``
  If true, then the ``^`` and ``$`` meta-characters match immediately
  after and before internal newlines in the subject string, respectively,
  in addition to matching at the start and end of the string. By default,
  the start and end anchors only match at the beginning and end of the
  string, regardless of internal newlines. The effect of ``multiline``
  can also be achieved with ``(?m)`` in the pattern.

``never_backslash_c``
  If true, then ``\C`` may not be used in a pattern, and causes
  compile failure. ``\C`` always matches exactly one byte, even in UTF
  mode, and may lead to unpredictable effects if it matches in the
  middle of a multibyte UTF-8 character. ``\C`` may have been
  prohibited by a build-time option in the library, which can be
  discovered by calling ``config_bool(NEVER_BACKSLASH_C)``.

``never_ucp``
  If true, then Unicode properties are not used to interpret ``\B``,
  ``\b``, ``\D``, ``\d``, ``\S``, ``\s``, ``\W``, ``\w``, and some of
  the POSIX character classes in the pattern. It is then impossible to
  activate this facility by including ``(*UCP)`` at the start of the
  pattern. If ``never_ucp`` and ``ucp`` are both set to true, then
  the compile fails.

``newline``
  If this ENUM value is set, it determines which characters are to be
  matched as newlines in the pattern. It can be set to:

  * ``CR`` (carriage return)
  * ``LF`` (linefeed)
  * ``CRLF`` (CR followed by LF)
  * ``ANYCRLF`` (CR, LF or CRLF)
  * ``ANY`` (any Unicode line-ending sequence)

  By default, the newline sequence chosen for the PCRE2 library when
  it was built is used, which can be determined from
  ``config_str(NEWLINE)``.

``no_auto_capture``
  If true, then numbered capturing groups are disabled in the pattern.
  Any opening parenthesis not followed by ``?`` is then interpreted as
  if it were followed by ``?:`` (that is, it forms a non-capturing
  group).  Named capturing groups can still be used, and these also
  acquire a capturing group number, so ``namedref`` and ``backref``
  can still be used (but only for the named groups).

``no_auto_possess``
  If true, then the "auto-possessification" optimization is disabled
  for the pattern. That optimization interprets ``a+b``, for example,
  as ``a++b``, using the "possessive quantifier" to prevent backtracks
  into ``a+`` that can never be successful. If the option is true,
  then the full unoptimized search is run.

``no_start_optimize``
  If true, then some optimizations for the start of the match are
  disabled. This has the effect that certain constructs in the
  pattern, such as ``(*COMMIT)`` or ``(*MARK)``, are evaluated at
  every possible starting position in the string, while they may have
  been skipped when the optimizations are applied. Thus this option
  may change the result of ``match`` calls in patterns that include
  such constructs. See `pcre2api(3)`_ for details.

``no_utf_check``
  If this option and ``utf`` are both true, then validity checks to
  determine if the pattern is a valid UTF string are disabled. This
  may save CPU usage and time for the ``match()`` and ``sub()``
  functions, which compile patterns on every invocation, and check UTF
  strings for validity by default. But you should only do so if you
  are sure that the inputs are valid, because running matches in UTF
  mode against invalid strings is undefined, and may cause Varnish to
  crash or loop.  By default, invalid UTF strings in the pattern cause
  the compile to fail in UTF mode. See `pcre2unicode(3)`_ for details.

``parens_nest_limit``
  If this INT value is greater than 0, it sets the maximum depth of
  parenthesis nesting in a pattern. It applies to all kinds of
  parentheses, not just capturing groups. The limit prevents patterns
  from using too much of the stack when compiled, and may be useful
  for the functional interface, for which patterns are compiled at
  runtime. By default, the nesting limit set for the PCRE2 library at
  build time is imposed, which is returned by
  ``config_int(PARENSLIMIT)``.

``ucp``
  If this option and ``utf`` are both true, then Unicode properties
  are used to interpret ``\B``, ``\b``, ``\D``, ``\d``, ``\S``,
  ``\s``, ``\W``, ``\w``, and some of the POSIX character classes in
  the pattern. The same effect can be achieved by including ``(*UCP)``
  at the start of the pattern. By default, only ASCII characters are
  considered for these constructs, which is faster than considering
  Unicode properties. If Unicode was disabled at build time for the
  PCRE2 library, which can be discovered by calling
  ``config_bool(UNICODE)``, then the compile fails when this option is
  true. Compiles also fail if this option and ``never_ucp`` are both
  true. See `pcre2unicode(3)`_ for details about Unicode character
  properties.

``ungreedy``
  If true, then the "greediness" of quantifiers in the pattern is
  inverted, so that they are not greedy by default, but become
  greedy when followed by ``?``. The same effect can be achieved
  by including ``(?U)`` in the pattern.

``use_offset_limit``
  This option must be set to true for a pattern if you intend to use
  the ``offset_limit`` parameter in match and substitution operations
  to limit how far a string is searched for an unanchored match. If an
  ``offset_limit`` is set for an invocation of the ``match`` or
  ``sub`` methods or functions, but this option was not set to true
  for the pattern, then the match fails.

``utf``
  If true, then both the pattern and the strings against which it is
  matched are processed as UTF-8 strings. If Unicode support was
  disabled when the PCRE2 library was built, which can be determined
  from ``config_bool(UNICODE)``, then the compile fails when ``utf``
  is true. See `pcre2unicode(3)`_ for details about Unicode support in
  PCRE2.
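
Compile options are passed as named parameters wherever a pattern is
compiled. As an illustrative sketch, the object below combines
``caseless`` with ``no_auto_capture``, so that only the named group
captures. The object name ``lang``, the group name ``tag`` and the
header ``X-Lang`` are hypothetical::

  import pcre2;

  sub vcl_init {
      # Due to no_auto_capture, (en|de|fr) and (-[a-z]{2}) only group;
      # the named group "tag" is the sole capturing group (number 1).
      new lang = pcre2.regex("^(?<tag>(en|de|fr)(-[a-z]{2})?)\b",
                             caseless=true, no_auto_capture=true);
  }

  sub vcl_recv {
      if (lang.match(req.http.Accept-Language)) {
          set req.http.X-Lang = lang.namedref("tag");
      }
  }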

Match options
~~~~~~~~~~~~~

Match options affect the operation of matching in the ``match`` and
``sub`` methods and functions. By default, all of the BOOL options
are **false**. The INT options are 0 by default (meaning that they
are ignored, and the global defaults hold). The INT options MAY NOT
be less than 0; if they are, then the match fails.

``anchored``
  If true, then the match is constrained to match at the start of the
  string, regardless of whether the pattern is anchored. By default, a
  match is searched for anywhere in the string if the pattern is not
  anchored.

``len``
  If this INT value is greater than 0, it sets the length of the
  subject string to be matched. By default, the full string is matched.

``match_limit``
  If this INT value is greater than 0, it sets a limit to the effort
  used by the PCRE2 matching function to find a match. This can
  prevent matches from excessive backtracking, if there is a very
  large search space but a match is never found. It is equivalent to
  the varnishd parameter ``pcre_match_limit``, except that it applies
  only to the match operation in which it was set, not globally. The
  varnishd parameters for PCRE have no effect on this VMOD. By
  default, the match limit that was set for the PCRE2 library at
  build time is imposed, which can be discovered from
  ``config_int(MATCHLIMIT)``.

``not_bol``
  If true, the first character of the subject string is not
  considered to be the beginning of a line, so the ``^`` metacharacter
  does not match before it. If the compile option ``multiline`` was
  not set to true for the pattern, then ``^`` never matches. This
  option only affects the circumflex metacharacter.

``not_eol``
  If true, the end of the subject string is not considered to be the
  end of a line, so the ``$`` metacharacter does not match there.
  If ``multiline`` was not set to true for the pattern, then ``$``
  never matches. This option only affects the dollar metacharacter.

``not_empty``
  If true, then the empty string is not a valid match. If the matcher
  finds an empty match, then it considers other alternatives, and if
  no other valid matches are found, then the match fails.

``not_empty_atstart``
  If true, then the empty string is not a valid match at the start of
  the subject string. An empty string match later in the subject is
  permitted.

``no_jit``
  If true, then the just-in-time matcher is not used, even when the
  pattern was compiled for JIT. In that case, PCRE2's "traditional"
  interpretive matcher is used (as is always the case if JIT is not
  available, or if the pattern was not JIT-compiled). If ``no_jit`` is
  true for an invocation of the ``match()`` or ``sub()`` functions,
  which compile a pattern on every call, then the pattern is also not
  JIT-compiled. See `JIT compilation and matching`_ below.

``no_utf_check``
  If true, then the subject is not checked for validity as a UTF-8
  string when matched against a pattern for which ``utf`` was set to
  true. This may speed up matching, but should only be done if you
  are sure that the inputs are valid UTF-8. By default, UTF validity
  is checked for matches against patterns that were compiled with
  ``utf``.

``offset_limit``
  If this INT value is greater than 0, it limits how far an unanchored
  search can advance in the subject string. For example, if the
  pattern ``abc`` is matched against the string ``"123abc"`` and the
  offset limit is less than 3, the match fails. To use this parameter,
  the compile option ``use_offset_limit`` must have been set to true
  for the pattern at compile time; otherwise the match fails. By
  default, unanchored matches are searched for until the end of the
  string.

``recursion_limit``
  If this INT value is greater than 0, then it limits the depth of
  recursion for matches using the interpretive matcher. It is
  equivalent to the varnishd parameter ``pcre_match_limit_recursion``,
  but only applies to the individual match. This limits the depth of
  recursion and use of the stack for matches that may cause excessive
  recursion and stack overflow (which usually causes Varnish to
  crash). The limit is not relevant to the JIT matcher, and is ignored
  for JIT matching. By default, the recursion limit set for the PCRE2
  library at build time applies, which can be determined from
  ``config_int(RECURSIONLIMIT)``.
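
As a sketch of how a match option can depend on a compile option, the
following restricts an unanchored search to the beginning of the URL
and caps the matching effort for that one call. The object name
``admin`` and the numeric limits are arbitrary choices for
illustration::

  import pcre2;

  sub vcl_init {
      # use_offset_limit must be true at compile time; otherwise any
      # match that sets offset_limit fails.
      new admin = pcre2.regex("/admin/", use_offset_limit=true);
  }

  sub vcl_recv {
      # The search for a match does not advance beyond the offset
      # limit, and match_limit applies to this invocation only.
      if (admin.match(req.url, offset_limit=8, match_limit=5000)) {
          return (synth(403));
      }
  }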

Substitution options
~~~~~~~~~~~~~~~~~~~~

The ``sub`` method and function use all of the match options (since
they run a match), and the following additional options. (The ``sub``
function also uses the compile options, since it compiles a pattern.)

``suball``
  If true, then the substitution iterates over the subject string and
  replaces every matching substring, making the substitution similar
  to the native VCL ``regsuball`` function. By default, only the first
  matching substring is replaced, making the substitution similar to
  VCL's ``regsub`` function.

``sub_extended``
  If true, then an extended syntax is enabled for the replacement
  string. Details of the replacement syntax are documented for the
  ``.sub()`` method below.

``unknown_unset``
  If true, then references to capturing groups in the replacement
  string that do not appear in the pattern are treated as unset
  groups.  By default, unknown references cause the substitution to
  fail. Use this option with care, because it causes misspelled group
  names or numbers to be silently ignored.

``unset_empty``
  If true, then unset capturing groups (including unknown groups when
  ``unknown_unset`` is also true) are replaced as empty strings. By
  default, an attempt to insert an unset group causes the substitution
  to fail.
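
A minimal sketch of ``suball``, assuming a hypothetical object named
``slashes``: every run of repeated slashes in the URL is collapsed to
a single slash. The replacement string is a plain ``/``, so the sketch
does not depend on any details of the replacement syntax::

  import pcre2;

  sub vcl_init {
      new slashes = pcre2.regex("/{2,}");
  }

  sub vcl_recv {
      # Replace every run of two or more slashes with a single one.
      # Without suball=true, only the first run would be replaced.
      if (slashes.match(req.url)) {
          set req.url = slashes.sub(req.url, "/", suball=true);
      }
  }

The guard with ``.match()`` is a precaution, since this section does
not state what ``.sub()`` returns when nothing matches.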

JIT compilation and matching
----------------------------

PCRE2 supports just-in-time compilation for patterns, and a matcher to
go with it. JIT is a heavyweight optimization that may greatly speed
up matching, but requires extra processing at pattern compilation
time.  The VMOD supports JIT if it was enabled for the PCRE2 library
when it was built, which can be determined from ``config_bool(JIT)``.

If JIT is available, then it is always applied to the compilation of
patterns in the ``regex`` object constructor. By default it is also
applied when patterns are compiled at runtime in the ``match()`` and
``sub()`` functions, unless the ``no_jit`` option is true.
For patterns compiled at runtime, it may be worth it to turn off JIT,
if the overhead for JIT-compiles outweighs the advantage of JIT
matching.

If JIT is not available, then PCRE2 always uses the interpretive
matcher.
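
The trade-off can be sketched as follows. The header ``X-Pattern`` is
hypothetical and is assumed to be set by a trusted component, not by
the client; the response header is likewise only an illustration of
inspecting the library configuration during development::

  import pcre2;

  sub vcl_recv {
      # The pattern is compiled on every call and then discarded, so
      # no_jit=true skips the JIT compile and uses the interpretive
      # matcher for this invocation.
      if (pcre2.match(req.http.X-Pattern, req.url, no_jit=true)) {
          set req.http.X-Pattern-Matched = "true";
      }
  }

  sub vcl_deliver {
      # Development aid: report whether the library supports JIT.
      if (pcre2.config_bool(JIT)) {
          set resp.http.X-PCRE2-JIT = "available";
      } else {
          set resp.http.X-PCRE2-JIT = "unavailable";
      }
  }

Whether turning off JIT pays off depends on the pattern and the
subject strings, so it is worth measuring both variants.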

Unicode
-------

The VMOD only links to the 8-bit version of PCRE2, and hence can
support UTF-8 if Unicode was enabled when the library was built. The
VMOD does not support UTF-16 or UTF-32. Thus the term "code unit", as
used for Unicode and in the PCRE2 documentation, always refers to one
byte.

In UTF mode, characters in patterns and the strings to be matched are
interpreted as UTF-8 code points, and hence may correspond to one to
four bytes. When UTF is not enabled, characters in patterns and
strings are represented by exactly one byte.

See `pcre2unicode(3)`_ for the details of PCRE2 Unicode support.

CONTENTS
========

* regex(STRING, BOOL, BOOL, ENUM {ANYCRLF,UNICODE}, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, STRING, BOOL, INT, BOOL, BOOL, BOOL, BOOL, ENUM {CR,LF,CRLF,ANYCRLF,ANY}, BOOL, BOOL, BOOL, BOOL, BOOL, INT, BOOL, BOOL, BOOL, BOOL)
* BOOL match(PRIV_CALL, PRIV_TASK, STRING, STRING, BOOL, BOOL, ENUM {ANYCRLF,UNICODE}, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, STRING, BOOL, INT, BOOL, BOOL, BOOL, BOOL, ENUM {CR,LF,CRLF,ANYCRLF,ANY}, BOOL, BOOL, BOOL, BOOL, BOOL, INT, BOOL, BOOL, BOOL, BOOL, INT, INT, INT, BOOL, BOOL, BOOL, BOOL, BOOL, INT)
* STRING backref(PRIV_TASK, INT, STRING)
* STRING namedref(PRIV_TASK, STRING, STRING)
* STRING sub(PRIV_CALL, PRIV_TASK, STRING, STRING, STRING, BOOL, BOOL, ENUM {ANYCRLF,UNICODE}, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, BOOL, STRING, BOOL, INT, BOOL, BOOL, BOOL, BOOL, ENUM {CR,LF,CRLF,ANYCRLF,ANY}, BOOL, BOOL, BOOL, BOOL, BOOL, INT, BOOL, BOOL, BOOL, BOOL, INT, INT, INT, BOOL, BOOL, BOOL, BOOL, BOOL, INT, BOOL, BOOL, BOOL, BOOL)
* BOOL config_bool(ENUM {JIT,STACKRECURSE,UNICODE})
* STRING config_str(ENUM {BSR,JITTARGET,NEWLINE,UNICODE_VERSION,VERSION})
* INT config_int(ENUM {LINKSIZE,MATCHLIMIT,PARENSLIMIT,RECURSIONLIMIT})
* STRING version()

.. _obj_regex:

regex
-----

::

	new OBJ = regex(STRING pattern, BOOL allow_empty_class=0, BOOL anchored=0, ENUM {ANYCRLF,UNICODE} bsr=0, BOOL alt_bsux=0, BOOL alt_circumflex=0, BOOL alt_verbnames=0, BOOL caseless=0, BOOL dollar_endonly=0, BOOL dotall=0, BOOL dupnames=0, BOOL extended=0, BOOL firstline=0, STRING locale=0, BOOL match_unset_backref=0, INT max_pattern_len=0, BOOL multiline=0, BOOL never_backslash_c=0, BOOL never_ucp=0, BOOL never_utf=0, ENUM {CR,LF,CRLF,ANYCRLF,ANY} newline=0, BOOL no_auto_capture=0, BOOL no_auto_possess=0, BOOL no_dotstar_anchor=0, BOOL no_start_optimize=0, BOOL no_utf_check=0, INT parens_nest_limit=0, BOOL ucp=0, BOOL ungreedy=0, BOOL use_offset_limit=0, BOOL utf=0)

Create a ``regex`` object from ``pattern`` according to the given
compile options (or option defaults). If the pattern is invalid, then
the VCL will fail to load, and the VCC compiler will emit an error
message.

Examples::

  sub vcl_init {

      # Match this pattern against the Host header (hence
      # case-insensitively), and capture part of the domain name.
      new domain = pcre2.regex("^www\.([^.]+)\.com$", caseless=true);

      # Match a max-age tag and capture the number.
      new maxage = pcre2.regex("max-age\s*=\s*(\d+)");

      # Group possible subdomains without capturing
      new submatcher = pcre2.regex("^www\.(domain1|domain2)\.com$",
                                   no_auto_capture=true, caseless=true);
  }

.. _func_regex.match:

regex.match
-----------

::

	BOOL regex.match(PRIV_CALL, PRIV_TASK, STRING subject, INT len=0, BOOL anchored=0, INT match_limit=0, INT offset_limit=0, BOOL notbol=0, BOOL noteol=0, BOOL notempty=0, BOOL notempty_atstart=0, BOOL no_jit=0, BOOL no_utf_check=0, INT recursion_limit=0)

Return ``true`` if the compiled regex matches the ``subject`` string,
as constrained by the given match options or option defaults.

The match may fail if any of the options are illegal for one of the
reasons given above, or if a limit such as the match or recursion
limit is reached. In that case, an error message is written to the
Varnish log using the ``VCL_Error`` tag, and the method returns
``false``.

If ``subject`` is undefined, for example if it is set from an unset
header variable, then it is assumed to be the empty string. This
follows VCL's handling of regex matching when the string to be matched
is unset.

Example::

  if (domain.match(req.http.Host)) {
     call do_on_match;
  }

.. _func_regex.backref:

regex.backref
-------------

::

	STRING regex.backref(INT ref, STRING fallback="**BACKREF METHOD FAILED**")

Returns the `nth` captured subexpression from the most recent
successful call of the ``.match()`` method for this object in the same
client or backend context, or a fallback string in case the capture
fails. Backref 0 indicates the entire matched string. Thus this
function behaves like the ``\n`` in the native VCL functions
``regsub`` and ``regsuball``, and the ``$1``, ``$2`` ... variables in
Perl. Unlike the regsubs, which limit the backref number to 0 through
9, ``backref`` permits any number that identifies a capturing group in
the pattern.

Since Varnish client and backend operations run in different threads,
``.backref()`` can only refer back to a ``.match()`` call in the same
thread. Thus a ``.backref()`` call in any of the ``vcl_backend_*``
subroutines -- the backend context -- refers back to a previous
``.match()`` in any of those same subroutines; and a call in any of
the other VCL subroutines -- the client context -- refers back to a
``.match()`` in the same client context.
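
One possible way to make a capture from the client context available
in the backend context is to copy it into a request header, since
request headers are carried over to the backend request. A sketch,
assuming the ``domain`` object from the examples above and a
hypothetical header name ``X-Domain``::

  sub vcl_recv {
      if (domain.match(req.http.Host)) {
          # Save the capture while still in the client context.
          set req.http.X-Domain = domain.backref(1);
      }
      else {
          # Do not trust a value sent by the client.
          unset req.http.X-Domain;
      }
  }

  sub vcl_backend_response {
      # domain.backref(1) called here would not refer to the match in
      # vcl_recv, so read the header copied to the backend request.
      if (bereq.http.X-Domain) {
          set beresp.http.X-Domain = bereq.http.X-Domain;
      }
  }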

After unsuccessful matches, the ``fallback`` string is returned for
any call to ``.backref()``. The default value of ``fallback`` is
``"**BACKREF METHOD FAILED**"``. ``.backref()`` always fails after a
failed match, even if ``.match()`` had been called successfully before
the failure.

``.backref()`` may also return ``fallback`` after a successful match,
if no captured group in the matching string corresponds to the backref
number. For example, when the pattern ``(a|(b))c`` matches the string
``ac``, there is no backref 2, since nothing matches ``b`` in the
string.
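
The case just described can be sketched as follows. The object name
``ab`` and the header names are hypothetical, and ``"none"`` is an
arbitrary fallback chosen for illustration::

  sub vcl_init {
      new ab = pcre2.regex("(a|(b))c");
  }

  sub vcl_recv {
      if (ab.match(req.http.X-Subject)) {
          # If X-Subject is "ac", group 1 captured "a".
          set req.http.X-Group-1 = ab.backref(1, "none");
          # Group 2 captured nothing for "ac", so the fallback
          # "none" is returned.
          set req.http.X-Group-2 = ab.backref(2, "none");
      }
  }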

The VCL infix operators ``~`` and ``!~`` do not affect this method,
nor do the functions ``regsub`` or ``regsuball``. Nor is it affected
by the matches performed by any other method or function in this VMOD
(the ``match()`` function or the ``sub`` method or function).

``.backref()`` fails, returning ``fallback`` and writing an error
message to the Varnish log with the ``VCL_Error`` tag, under the
following conditions (even if a previous match was successful and a
substring could have been captured):

* Any of the match options are illegal (for example, if one of the
  numeric limits was set to less than 0).

* The ``fallback`` string is undefined.

* ``ref`` (the backref number) is out of range -- if it is less than 0
  or larger than the highest number for a capturing group in the
  pattern.

* ``.match()`` was never called for this object in the task scope
  prior to calling ``.backref()``.

Example::

  if (domain.match(req.http.Host)) {
     set req.http.X-Domain = domain.backref(1);
  }

.. _func_regex.namedref:

regex.namedref
--------------

::

	STRING regex.namedref(STRING name, STRING fallback="**NAMEDREF METHOD FAILED**")

Returns the captured subexpression designated by ``name`` from the
most recent successful call to ``.match()`` in the current context
(client or backend), or ``fallback`` in case of failure. See
`pcre2pattern(3)`_ for details about the use of named subpatterns in
PCRE2 regexen.

Note that a named capturing group can also be referenced as a numbered
group -- the named groups are numbered exactly as if the names were
not present. So an expression returned by ``.namedref()`` will also be
returned by ``.backref()`` with the appropriate number.
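
As a sketch of this equivalence, with a hypothetical object name
``site`` and hypothetical header names, both headers below should
receive the same value after a successful match, because the named
group ``domain`` is also group number 1::

  sub vcl_init {
      new site = pcre2.regex("^www\.(?<domain>[^.]+)\.com$");
  }

  sub vcl_recv {
      if (site.match(req.http.Host)) {
          set req.http.X-By-Name = site.namedref("domain");
          set req.http.X-By-Number = site.backref(1);
      }
  }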

``fallback`` is returned when ``.namedref()`` is called after an
unsuccessful match. The default fallback is ``"**NAMEDREF METHOD
FAILED**"``.

Like ``.backref()``, ``.namedref()`` is not affected by native VCL
regex operations, nor by any other matches performed by methods or
functions of the VMOD, except for a prior ``.match()`` for the same
object.

``.namedref()`` fails, returning ``fallback`` and logging a
``VCL_Error`` message, if:

* The ``fallback`` string is undefined.

* ``name`` is undefined.

* There is no such named group.

* ``.match()`` was not called for this object.

Example::

  sub vcl_init {
      new domain = pcre2.regex("^www\.(?<domain>[^.]+)\.com$");
  }

  sub vcl_recv {
      if (domain.match(req.http.Host)) {
          set req.http.X-Domain = domain.namedref("domain");
      }
  }

.. _func_regex.sub:

regex.sub
---------

::

	STRING regex.sub(PRIV_CALL, PRIV_TASK, STRING subject, STRING replacement, INT len=0, BOOL anchored=0, INT match_limit=0, INT offset_limit=0, BOOL notbol=0, BOOL noteol=0, BOOL notempty=0, BOOL notempty_atstart=0, BOOL no_jit=0, BOOL no_utf_check=0, INT recursion_limit=0, BOOL suball=0, BOOL sub_extended=0, BOOL unknown_unset=0, BOOL unset_empty=0)

If the pattern represented by this object matches ``subject``, then
return a string formed by replacing the part that was matched by
``replacement``.  If the pattern does not match, then return the
``subject`` string unchanged. The match and substitution options affect
these operations as described above.

This method is similar to the native VCL ``regsub`` function, or
``regsuball`` when the ``suball`` option is true, but the syntax of
the replacement string is different. In the replacement string, these
sequences can be used to insert strings:

``$$``
  Inserts a dollar character.

``$<n>`` or ``${<n>}``
  Inserts the contents of group ``<n>`` captured during the match,
  where ``<n>`` can be a number or a name. The number can be 0 to
  include the entire matched string. Braces are only required if the
  following character would be interpreted as part of the number or
  name.

``$*MARK`` or ``${*MARK}``
  Insert the name of the last ``(*MARK)`` encountered in the match.

For example, to rewrite URLs with prefixes of the form ``"/~<user>"``
so that their prefix is ``"/u/<user>"`` (and leave other URLs
unchanged)::

  sub vcl_init {
      new user = pcre2.regex("/~([^/]+)(.*)", anchored=true);
  }
  
  sub vcl_recv {
      set req.url = user.sub(req.url, "/u/${1}${2}");
  }
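
To replace every match rather than only the first, the ``suball``
option can be set. A minimal sketch, with a hypothetical object name
``slashes``, that collapses runs of slashes in the URL (so that
``/a//b///c`` becomes ``/a/b/c``)::

  sub vcl_init {
      # Two or more consecutive slashes.
      new slashes = pcre2.regex("//+");
  }

  sub vcl_recv {
      set req.url = slashes.sub(req.url, "/", suball=true);
  }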

When the ``sub_extended`` option is false, only the dollar character
is special in the replacement string. When ``sub_extended`` is true,
the replacement syntax also has these capabilities:

* Backslashes in the replacement string are interpreted as escapes,
  and special backslash sequences are interpreted as for PCRE2
  patterns.  For example, ``\n`` denotes newline, and ``\x{ddd}``,
  where each ``d`` is a digit, specifies a character code. A backslash
  followed by a non-alphanumeric character quotes the character, and
  ``\Q`` and ``\E`` can be used to quote a longer sequence.

* Four additional escape sequences can be used to force the case of
  inserted letters:

  * ``\U`` forces upper case for all of the following text until
    ``\E``, or to the end of the string if there is no ``\E``.

  * ``\L`` through ``\E`` or end of string forces lower case.

  * ``\u`` and ``\l`` force the next character, if it is a letter, to
    upper and lower case, respectively.

  Case forcing applies to all inserted characters, including those from
  captured groups and in sequences quoted by ``\Q`` through ``\E``.

  Sequences ending in ``\E`` do not nest. So for example,
  ``"\Uaa\LBB\Ecc\E"`` results in ``"AAbbcc"``, and the final ``\E`` has
  no effect.

* The "dollar" replacement expressions have an additional capability
  inspired by Bash to handle unset capturing groups:

  ``${<n>:-<string>}``
    As with ``${<n>}``, ``<n>`` can be a number or name. If group
    ``<n>`` is set, then its contents are inserted, otherwise
    ``<string>`` is expanded and inserted. ``<string>`` may, in turn,
    include elements of the replacement syntax that are interpreted
    accordingly.

  ``${<n>:+<string1>:<string2>}``
    If group ``<n>`` is set, insert the result of expanding
    ``<string1>``, otherwise insert the result of expanding
    ``<string2>``.

  Colons and closing braces in the replacement strings can be escaped
  with backslashes.

For example, to rewrite Host headers of the form
``www.<sub1>.<sub2>.<tld>`` to ``<sub2>.<tld>``, and of the form
``www.<sub>.<tld>`` to ``<sub>.<tld>``, while also normalizing the header
to lower-case, and leaving other Host headers unchanged::

  sub vcl_init {
      new hostsub = pcre2.regex(extended=true, pattern={"
                    ^www\.		# www. prefix
                    ([^.]+)		# group 1, "<sub1>"
                    (?:			# non-capturing parentheses
                      \.([^.]+)		# dot, then group 2, "<sub2>"
                    )?			# 0 or 1 of group 2
                    \.([^.]+)$		# dot, then group 3, "<tld>"
                    "});
  }

  sub vcl_recv {
      set req.http.Host = hostsub.sub(req.http.Host, sub_extended=true,
                                      replacement="\L${2:+$2:$1}.$3");
  }
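
The ``${<n>:-<string>}`` form can be sketched in a similar way. The
example below assumes a hypothetical URL scheme in which paths may
begin with a two-letter language code, and inserts ``en`` as a default
when that prefix is missing. Under those assumptions, ``/de/page``
would be left as it is, and ``/page`` would become ``/en/page``. The
object name ``lang`` is illustrative::

  sub vcl_init {
      # Group 1 is the optional language code, group 2 is the rest
      # of the path.
      new lang = pcre2.regex("^/(?:([a-z]{2})/)?(.*)$");
  }

  sub vcl_recv {
      # Insert group 1 if it is set, otherwise the default "en".
      set req.url = lang.sub(req.url, "/${1:-en}/$2",
                             sub_extended=true);
  }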

``.sub()`` fails, returning NULL while logging a ``VCL_Error`` message,
if ``replacement`` is undefined.

.. _func_regex.info_bool:

regex.info_bool
---------------

::

	BOOL regex.info_bool(ENUM {ALLOW_EMPTY_CLASS,ANCHORED,ALT_BSUX,ALT_CIRCUMFLEX,ALT_VERBNAMES,CASELESS,DOLLAR_ENDONLY,DOTALL,DUPNAMES,EXTENDED,FIRSTLINE,MATCH_UNSET_BACKREF,MULTILINE,NEVER_BACKSLASH_C,NEVER_UCP,NEVER_UTF,NO_AUTO_CAPTURE,NO_AUTO_POSSESS,NO_DOTSTAR_ANCHOR,NO_START_OPTIMIZE,NO_UTF_CHECK,UCP,UNGREEDY,USE_OFFSET_LIMIT,UTF,HAS_FIRSTCODEUNIT,MATCH_ATSTART,HAS_LASTCODEUNIT,HAS_BACKSLASHC,HAS_CRORLF,JCHANGED,MATCH_EMPTY}, BOOL compiled=1)

Return true or false about a property of the regex that the object
represents.  This method and the other ``.info_*`` methods may be
helpful for debugging and optimizing regular expression matching, for
example by determining whether PCRE2 could enable certain
optimizations for the pattern.

The ENUM determines which property is to be inspected. If the ENUM is any
one of::

  ALLOW_EMPTY_CLASS, ANCHORED, ALT_BSUX, ALT_CIRCUMFLEX,
  ALT_VERBNAMES, CASELESS, DOLLAR_ENDONLY, DOTALL, DUPNAMES, EXTENDED,
  FIRSTLINE, MATCH_UNSET_BACKREF, MULTILINE, NEVER_BACKSLASH_C,
  NEVER_UCP, NEVER_UTF, NO_AUTO_CAPTURE, NO_AUTO_POSSESS,
  NO_DOTSTAR_ANCHOR, NO_START_OPTIMIZE, NO_UTF_CHECK, UCP, UNGREEDY,
  USE_OFFSET_LIMIT, UTF

then the return value of ``info_bool()`` indicates whether the
corresponding compile option is true for the pattern. If ``compiled``
is true, then the return indicates whether the option was set to true
after the pattern was compiled, even if it was specified differently
(or left to the default) in the object constructor.  If ``compiled``
is false, then the method returns the value of the option as it was
provided in the constructor.

For example, if the compile option ``anchored`` was set to false in
the constructor (or left to the default), PCRE2 may nevertheless
determine that the pattern is anchored if certain conditions are
satisfied (which are described in detail in `pcre2api(3)`_). In that
case, ``info_bool()`` will return true if ``compiled`` is true, and
false if ``compiled`` is false.

``compiled`` is true by default, and is ignored for the other ENUM
values.
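
The difference between the two values of ``compiled`` can be sketched
as follows, assuming ``import std`` and a hypothetical object ``foo``.
The pattern begins with ``^`` and the ``anchored`` option is left at
its default, so PCRE2 may mark the compiled pattern as anchored on its
own::

  sub vcl_init {
      new foo = pcre2.regex("^foo");
  }

  sub vcl_recv {
      # True after compilation, but not set in the constructor.
      if (foo.info_bool(ANCHORED)
          && !foo.info_bool(ANCHORED, compiled=false)) {
          std.log("PCRE2 anchored the pattern automatically");
      }
  }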

The other ENUMs are interpreted as follows:

``HAS_FIRSTCODEUNIT``
  If the pattern is unanchored, PCRE2 may determine that there is a
  unique code unit (a byte) that must appear at the start of the
  matching part of a string. For example, the part of a string that
  matches ``(cat|cow|coyote)`` must begin with a
  ``c``. ``info_bool(HAS_FIRSTCODEUNIT)`` returns true if there is
  such a code unit, and false if the pattern is anchored or if no
  unique first code unit could be determined. If there is such a first
  code unit, it is returned by ``info_str(FIRSTCODEUNIT)``. Note that
  in non-UTF mode, the first code unit is the same as the first
  character, but for UTF-8 patterns, it may be the first byte in a
  multibyte character.

``MATCH_ATSTART``
  If the pattern is unanchored and no unique first code unit in the
  matching part of the string is known, PCRE2 may determine that the
  pattern is constrained to match at the start of the subject string,
  or following a newline in the subject. In that case,
  ``info_bool(MATCH_ATSTART)`` returns true; it returns false if the
  pattern is anchored, if a unique first code unit could be found, or
  if the pattern could not be determined to match at the start.

``HAS_LASTCODEUNIT``
  Under certain circumstances, PCRE2 may determine a rightmost literal
  code unit that must exist in a matching string, other than at the
  start. This is not necessarily the last byte in the matching part of
  a string, but rather the last literal code unit known to be
  required. For example, the ``b`` is recorded for this purpose for
  the pattern ``ab\d+``, although the ``b`` must be followed by
  digits. If there is such a last code unit,
  ``info_bool(HAS_LASTCODEUNIT)`` returns true, and that value can be
  retrieved from ``info_str(LASTCODEUNIT)``. For anchored patterns,
  PCRE2 records a possible last literal code unit only if a part of
  the pattern that comes before it has variable length. For example,
  ``z`` is recorded for ``^a\d+z\d+`` (because one or more digits must
  come before it), but none is recorded for ``^a\dz\d`` (because
  matching strings have a fixed length). As with the first code unit,
  the last code unit may be a byte in a multibyte UTF-8 character, if
  UTF is enabled for the pattern.

``HAS_BACKSLASHC``
  Return true if and only if ``\C`` appears in the pattern.

``HAS_CRORLF``
  Return true if and only if the pattern contains explicit matches for
  CR or LF characters. These can be literal carriage returns or
  linefeeds in the pattern, or the escape sequences ``\r`` or ``\n``.

``JCHANGED``
  Return true if and only if the pattern contains ``(?J)`` or ``(?-J)``,
  which set or unset the ``dupnames`` option locally within the pattern.

``MATCH_EMPTY``
  Return true if and only if PCRE2 determines that the pattern might
  match the empty string. For certain complex patterns (with recursive
  subroutines), it may not be possible to determine; in that case,
  PCRE2 cautiously returns true.

Example::

  # To determine if the FIRSTCODEUNIT optimization could be applied.
  if (myregex.info_bool(HAS_FIRSTCODEUNIT)) {
      std.log("First matching char in the pattern = "
              + myregex.info_str(FIRSTCODEUNIT));
  }

.. _func_regex.info_int:

regex.info_int
--------------

::

	INT regex.info_int(ENUM {BACKREFMAX,CAPTURECOUNT,JITSIZE,MATCHLIMIT,MAXLOOKBEHIND,MINLENGTH,RECURSIONLIMIT,SIZE})

Return an integer that describes a property of the pattern that the
object represents, as determined by the ENUM.

``BACKREFMAX``
  Return the highest back reference within the pattern. Remember that
  named groups also acquire group numbers, and thus count towards the
  highest backref. A conditional subpattern such as ``(?(3)a|b)``,
  which checks if a capturing group is set, also counts as a
  backref. If there are no backrefs, return 0.

``CAPTURECOUNT``
  Return the highest capturing group number in the pattern. If the
  ``(?|`` construct (which allows duplicate group numbers, see
  `pcre2pattern(3)`_) is not used in the pattern, then the value
  returned is also the total number of capturing groups.

``JITSIZE``
  Return the size of JIT-compiled code for the pattern. Returns 0 if
  the pattern was not JIT-compiled.

``MATCHLIMIT``
  If the pattern contains the construct ``(*LIMIT_MATCH=nnnn)`` to set
  the match limit (see the match option ``match_limit`` above), then
  return the limit that it sets. Returns -1 if no such value has been
  set.

``MAXLOOKBEHIND``
  Return the number of characters in the longest lookbehind assertion
  in the pattern. Returns 0 if there are no lookbehinds.

``MINLENGTH``
  If PCRE2 has determined that there is a lower bound for the length
  of a string that may match the pattern, then return that
  value. Returns 0 if no lower bound is known. This is not necessarily
  the same as the length of the shortest string that may possibly
  match; but any string that does match must be at least that long.

``RECURSIONLIMIT``
  If the pattern contains the construct ``(*LIMIT_RECURSION=nnnn)``
  (see the match option ``recursion_limit`` above), then return the
  value that was set. Returns -1 if no such value has been set.

``SIZE``
  Return the size of the compiled pattern, as used for the
  interpretive matcher, in bytes. This is independent of the value
  returned by ``info_int(JITSIZE)``.

Example::

  # To determine if a lower bound on the length of matching strings
  # could be found.
  if (myregex.info_int(MINLENGTH) != 0) {
      std.log("Lower bound on matching string length = "
              + myregex.info_int(MINLENGTH));
  }
  else {
      std.log("No lower bound for matching string lengths found");
  }
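
Another sketch in the same style, assuming ``import std`` and the same
illustrative object ``myregex``, checks whether the pattern was
JIT-compiled and logs the number of capturing groups::

  if (myregex.info_int(JITSIZE) == 0) {
      std.log("Pattern was not JIT-compiled");
  }
  else {
      std.log("Size of JIT-compiled code = "
              + myregex.info_int(JITSIZE));
  }
  std.log("Highest capturing group number = "
          + myregex.info_int(CAPTURECOUNT));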

.. _func_regex.info_str:

regex.info_str
--------------

::

	STRING regex.info_str(ENUM {BSR,FIRSTCODEUNIT,FIRSTCODEUNITS,LASTCODEUNIT,NEWLINE}, STRING sep=" ")

Return a string that describes a property of the pattern represented
by the object, as determined by the ENUM. The ``sep`` parameter is
only relevant when the ENUM ``FIRSTCODEUNITS`` is used, as described
below.

``BSR``
  Return ``"UNICODE"``, meaning that ``\R`` in the pattern matches any
  Unicode line ending sequence, or ``"ANYCRLF"``, meaning that it
  matches only CR, LF or CRLF.

``FIRSTCODEUNIT``
  If PCRE2 determines that there is a unique first code unit that must
  begin the matching part of a string (as described above for
  ``info_bool(HAS_FIRSTCODEUNIT)``), then return that code unit in a
  string.  Returns the empty string if no such code unit was
  determined; this is also the case if the pattern is anchored. Recall
  that a code unit corresponds to a character in non-UTF mode, but may
  be a byte in a multibyte character when UTF-8 is enabled. The code
  unit is not escaped in the return string.

``FIRSTCODEUNITS``
  (Note the difference between ``FIRSTCODEUNIT``, singular, and
  ``FIRSTCODEUNITS``, plural.) For an unanchored pattern, if PCRE2
  cannot determine a unique code unit that must appear at the start of
  the matching part of a string, it may be able to determine a set of
  such code units. For example, if the pattern starts with ``[abc]``,
  then the matching part must begin with ``a``, ``b`` or ``c``. In
  that case, ``info_str(FIRSTCODEUNITS)`` returns those code units in
  a string, separated by the string given as ``sep``. The default
  value of ``sep`` is ``" "`` (the string containing one space). If
  the pattern is anchored, or if a unique first code unit could be
  found, or if no set of first code units could be found, then return
  the empty string.

``LASTCODEUNIT``
  If PCRE2 has recorded a rightmost literal code unit that must exist
  in a matching string, as described for ``info_bool(HAS_LASTCODEUNIT)``
  above, then return that code unit in a string. Returns the empty
  string if no such code unit was recorded.

``NEWLINE``
  Return a string describing the default sequence recognized as a
  "newline" for the pattern:

  * ``"CR"`` (carriage return)
  * ``"LF"`` (linefeed)
  * ``"CRLF"`` (CR followed by LF)
  * ``"ANYCRLF"`` (CR, LF or CRLF)
  * ``"UNICODE"`` (any Unicode line-ending sequence)

Example::

  # Determine if a set of first matching characters could be found.
  std.log("First matching chars: " + myregex.info_str(FIRSTCODEUNITS));
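
The ``sep`` parameter can be given as a named argument. A brief
sketch, assuming ``import std`` and the same illustrative object
``myregex``, that separates the code units with commas instead of
spaces::

  std.log("First matching chars: "
          + myregex.info_str(FIRSTCODEUNITS, sep=","));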

Regex functional interface
--------------------------

.. _func_match:

match
-----

::

	BOOL match(PRIV_CALL, PRIV_TASK, STRING pattern, STRING subject, BOOL allow_empty_class=0, BOOL anchored=0, ENUM {ANYCRLF,UNICODE} bsr=0, BOOL alt_bsux=0, BOOL alt_circumflex=0, BOOL alt_verbnames=0, BOOL caseless=0, BOOL dollar_endonly=0, BOOL dotall=0, BOOL dupnames=0, BOOL extended=0, BOOL firstline=0, STRING locale=0, BOOL match_unset_backref=0, INT max_pattern_len=0, BOOL multiline=0, BOOL never_backslash_c=0, BOOL never_ucp=0, BOOL never_utf=0, ENUM {CR,LF,CRLF,ANYCRLF,ANY} newline=0, BOOL no_auto_capture=0, BOOL no_auto_possess=0, BOOL no_dotstar_anchor=0, BOOL no_start_optimize=0, BOOL no_utf_check=0, INT parens_nest_limit=0, BOOL ucp=0, BOOL ungreedy=0, BOOL use_offset_limit=0, BOOL utf=0, INT len=0, INT match_limit=0, INT offset_limit=0, BOOL notbol=0, BOOL noteol=0, BOOL notempty=0, BOOL notempty_atstart=0, BOOL no_jit=0, INT recursion_limit=0)

Compile the ``pattern`` and return true if it matches
``subject``. Compilation and matching are subject to the given
options, or default options. The compiled pattern is discarded after
use, and ``pattern`` is compiled on every invocation.

The call fails, logging a ``VCL_Error`` message and returning false,
if:

* ``pattern`` is undefined.

* The compile fails (for example due to a syntax error).

* Any compile or match option is illegal as described above.

As with the ``.match()`` method, if ``subject`` is undefined, then it
is assumed to be the empty string.
  
Example::

  # Match a request header against a pattern provided in a response
  # header.
  if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject)) {
      call do_on_match;
  }
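
Compile options can be passed to the function as named arguments, just
as in the object constructor. The sketch below uses hypothetical header
names. Since the pattern is compiled on every invocation, the object
interface is likely the better choice when the pattern is known at VCL
load time::

  sub vcl_deliver {
      # The pattern is only known at runtime, so it is compiled here,
      # case-insensitively, on each call.
      if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject,
                      caseless=true)) {
          set resp.http.X-Matched = "true";
      }
  }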

.. _func_backref:

backref
-------

::

	STRING backref(PRIV_TASK, INT ref, STRING fallback="**BACKREF FUNCTION FAILED**")

Return the `nth` captured subexpression from the most recent
successful call of the ``match()`` function in the current client or
backend context, or a fallback string if the capture fails. The
default ``fallback`` is ``"**BACKREF FUNCTION FAILED**"``.

As with the ``regex.backref()`` method, ``fallback`` is returned
after any failed invocation of the ``match()`` function, or if there
is no captured group corresponding to the backref number. The function
is not affected by native VCL regex operations, or any other method or
function of the VMOD except for the ``match()`` function.

The function fails, returning ``fallback`` and logging a ``VCL_Error``
message, under the same conditions as the corresponding method:

* ``fallback`` is undefined.
* ``ref`` is out of range.
* The ``match()`` function was never called in this context.
* The pattern failed to compile for the previous ``match()`` call.

Example::

  # Match against a pattern provided in a response header, and capture
  # subexpression 1.
  if (pcre2.match(resp.http.X-Pattern, req.http.X-Subject)) {
     set resp.http.X-Group-1 = pcre2.backref(1);
  }

.. _func_namedref:

namedref
--------

::

	STRING namedref(PRIV_TASK, STRING name, STRING fallback="**NAMEDREF FUNCTION FAILED**")

Return the captured subexpression designated by ``name`` from the most
recent successful call of the ``match()`` function in the current
context, or ``fallback`` in case of failure. The default fallback is
``"**NAMEDREF FUNCTION FAILED**"``.

The function returns ``fallback`` when the previous invocation of the
``match()`` function failed, and is only affected by use of the
``match()`` function. The function fails, returning ``fallback`` and
logging a ``VCL_Error`` message, under the same conditions as the
corresponding method:

* ``fallback`` is undefined.
* ``name`` is undefined or the empty string.
* There is no such named group.
* ``match()`` was not called in this context.
* The pattern failed to compile for the previous ``match()`` call.

Example::

  if (pcre2.match(resp.http.X-Pattern-With-Names, req.http.X-Subject)) {
     set resp.http.X-Group-Foo = pcre2.namedref("foo");
  }

.. _func_sub:

sub
---

::

	STRING sub(PRIV_CALL, PRIV_TASK, STRING pattern, STRING subject, STRING replacement, BOOL allow_empty_class=0, BOOL anchored=0, ENUM {ANYCRLF,UNICODE} bsr=0, BOOL alt_bsux=0, BOOL alt_circumflex=0, BOOL alt_verbnames=0, BOOL caseless=0, BOOL dollar_endonly=0, BOOL dotall=0, BOOL dupnames=0, BOOL extended=0, BOOL firstline=0, STRING locale=0, BOOL match_unset_backref=0, INT max_pattern_len=0, BOOL multiline=0, BOOL never_backslash_c=0, BOOL never_ucp=0, BOOL never_utf=0, ENUM {CR,LF,CRLF,ANYCRLF,ANY} newline=0, BOOL no_auto_capture=0, BOOL no_auto_possess=0, BOOL no_dotstar_anchor=0, BOOL no_start_optimize=0, BOOL no_utf_check=0, INT parens_nest_limit=0, BOOL ucp=0, BOOL ungreedy=0, BOOL use_offset_limit=0, BOOL utf=0, INT len=0, INT match_limit=0, INT offset_limit=0, BOOL notbol=0, BOOL noteol=0, BOOL notempty=0, BOOL notempty_atstart=0, BOOL no_jit=0, INT recursion_limit=0, BOOL suball=0, BOOL sub_extended=0, BOOL unknown_unset=0, BOOL unset_empty=0)

Compile ``pattern``, and if it matches ``subject``, then return a string
formed by replacing the part that matched by ``replacement``. If the
pattern does not match, return ``subject`` unchanged. The compile, match
and substitution options affect all of these operations, as described
above.

The syntax of the ``replacement`` string, as modified if the
``sub_extended`` option is true, is the same as documented above for
the ``.sub()`` method.

``sub()`` fails, returning NULL and logging a ``VCL_Error`` message,
if:

* Either of ``pattern`` or ``replacement`` is undefined.

* ``pattern`` cannot be compiled.

Example::

  # If the beresp header X-Sub-Letters contains "b+", and Host contains
  # "www.yabba.dabba.doo.com", then set X-Yada to
  # "www.yada.dabba.doo.com".
  set beresp.http.X-Yada = pcre2.sub(beresp.http.X-Sub-Letters,
                                     bereq.http.Host, "d");
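
With the ``suball`` option, every match is replaced rather than only
the first. A sketch using the same inputs as the example above, with a
hypothetical header name ``X-Yada-All``::

  # With X-Sub-Letters "b+" and Host "www.yabba.dabba.doo.com",
  # X-Yada-All is set to "www.yada.dada.doo.com".
  set beresp.http.X-Yada-All = pcre2.sub(beresp.http.X-Sub-Letters,
                                         bereq.http.Host, "d",
                                         suball=true);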

Library configuration
---------------------

.. _func_config_bool:

config_bool
-----------

::

	BOOL config_bool(ENUM {JIT,STACKRECURSE,UNICODE})

Return true or false about a property of the PCRE2 library to which
the VMOD is linked, identified by the ENUM. The ``config_*`` functions
make it possible to discover features of the library that were chosen
when it was built.

``JIT``
  Return true if the library supports just-in-time compilation and
  matching.

``STACKRECURSE``
  Return true if internal recursion for the PCRE2 matcher uses the
  system stack to maintain its state, which is the usual way the
  library is built. If false is returned, PCRE2 uses blocks of data on
  the heap rather than recursive function calls.

``UNICODE``
  Return true if Unicode support is available. If so, then the compile
  option ``utf`` can be used to define a pattern and the strings
  against which it is matched as UTF-8 strings.

Example::

  if (pcre2.config_bool(JIT)) {
      std.log("JIT supported for PCRE2");
  }
  else {
      std.log("JIT not supported for PCRE2");
  }

.. _func_config_str:

config_str
----------

::

	STRING config_str(ENUM {BSR,JITTARGET,NEWLINE,UNICODE_VERSION,VERSION})

Return a string describing a property of the PCRE2 library.

``BSR``
  Return a string indicating what the ``\R`` escape sequence matches
  by default: ``UNICODE`` for Unicode line-ending sequences, or
  ``ANYCRLF`` for only CR, LF and CRLF. This is the default that holds
  if no value is given for the compile option ``bsr``.

``JITTARGET``
  Return a string identifying the architecture for which the JIT
  compiler is configured. If JIT is not enabled, the returned string
  contains the phrase ``"JIT not supported"``.

``NEWLINE``
  Return a string identifying the character sequence that is recognized
  by default as a newline:

  * ``"CR"`` (carriage return)
  * ``"LF"`` (linefeed)
  * ``"CRLF"`` (CR followed by LF)
  * ``"ANY"`` (any Unicode line ending)
  * ``"ANYCRLF"`` (any of CR, LF or CRLF)

  This is the default if no value is given for the compile option
  ``newline``.

``UNICODE_VERSION``
  If Unicode is supported by the library, return the Unicode version
  string. If not, return ``"Unicode not supported"``.

``VERSION``
  Return the PCRE2 version string.

Example::

  std.log("Linked to PCRE2 version " + pcre2.config_str(VERSION));
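
The ``config_*`` functions can be combined. As a sketch, assuming
``import std``, the JIT target is logged only when the library reports
JIT support::

  if (pcre2.config_bool(JIT)) {
      std.log("PCRE2 JIT target: " + pcre2.config_str(JITTARGET));
  }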

.. _func_config_int:

config_int
----------

::

	INT config_int(ENUM {LINKSIZE,MATCHLIMIT,PARENSLIMIT,RECURSIONLIMIT})

Return an integer describing a property of the PCRE2 library.

``LINKSIZE``
  Return the number of bytes used for internal linkage (offsets) in
  compiled regular expressions. This determines the size of the
  largest possible pattern; the default link size of 2 allows for
  patterns of up to 64K bytes.

``MATCHLIMIT``
  Return the default value for the ``match_limit`` match option,
  which limits the effort of the matcher when no match is found.

``PARENSLIMIT``
  Return the default value of the ``parens_nest_limit`` compile
  option, which limits the depth of parenthesis nesting in patterns,
  and hence the use of the stack during compilation.

``RECURSIONLIMIT``
  Return the default value of the ``recursion_limit`` match option,
  which limits the depth of recursion, and hence stack usage, for the
  interpretive (non-JIT) matcher.

Example::

  std.log("Default PCRE2 match limit = " + pcre2.config_int(MATCHLIMIT));
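
It may be useful to log the library defaults once before choosing
lower limits for individual matches. A sketch, assuming ``import
std``::

  std.log("Default PCRE2 recursion limit = "
          + pcre2.config_int(RECURSIONLIMIT));
  std.log("Default PCRE2 parens nest limit = "
          + pcre2.config_int(PARENSLIMIT));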

.. _func_version:

version
-------

::

	STRING version()

Returns the version string for this VMOD.

Example::

  std.log("Using VMOD pcre2 version " + pcre2.version());

REQUIREMENTS
============

This VMOD has been tested with Varnish version 5.1.2 and PCRE2 version
10.23.

INSTALLATION
============

See `INSTALL.rst <INSTALL.rst>`_ in the source repository.

LIMITATIONS
===========

The VMOD allocates Varnish workspace for a variety of purposes:

* The string returned by the ``sub`` method and function.

* Buffers for temporary data structures needed by the PCRE2 library,
  for example to save information about a match for use by the
  ``backref`` and ``namedref`` methods and functions.

* A copy of the subject string for the ``match`` method and function,
  if it is not already in workspace, so that it can be safely accessed
  by ``backref`` and ``namedref``.

* Return strings for some uses of ``info_str`` and ``config_str``.

* Temporary buffers for error message strings from the PCRE2 library.

If VMOD operations fail with the "out of space" error message in the
Varnish log (with the ``VCL_Error`` tag), increase the varnishd runtime
parameters ``workspace_client`` and/or ``workspace_backend``.

The PCRE2 interpretive and JIT matchers are backtracking matchers, and
the interpretive matcher is recursive, using part of the stack on each
recursive call (in the default library configuration). For patterns
with large search spaces, this can lead to slow matches, high CPU
usage, and stack overflow due to deep recursion, which typically
causes Varnish to segfault. This has occasionally been the subject of
issues reported to the Varnish project.

For most common uses of regular expressions in VCL, PCRE2 is very fast
and has minimal resource consumption. This depends strongly on how the
regex is written -- a well-crafted pattern helps the matcher limit
backtracking, fail early on non-matches, and make use of some of the
optimizations that PCRE2 can apply. Some of the compile and match
options also help to optimize the match operation. Which of these
measures is possible depends, of course, on what you want the regex to
do.

Writing optimized regexen is a very broad subject, beyond the scope of
this manual. There is some advice in `pcre2perform(3)`_, and in many
other sources.

If your use case requires patterns and subject strings that can lead
to very large search spaces, consider using some of the options
available in the VMOD that limit excessive effort for unsuccessful
matches. In particular, consider lowering the match options
``match_limit`` and ``recursion_limit``. You can also use
``offset_limit`` to set a maximum length to search for a match in the
subject string (for which you will have to set the compile option
``use_offset_limit``). These may cause the matcher to halt before it
has exhausted all possibilities for a match (but it appears to be
common that, if the matcher has to search for a long time, then there
was never any match to be found).
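
The options mentioned above might be combined as in the following
sketch. The object name ``risky``, the header names and all of the
limit values are hypothetical and chosen only for illustration.
Suitable values depend on the patterns and subjects of the deployment.
The pattern is a typical example of one that can backtrack heavily on
non-matching input::

  sub vcl_init {
      # use_offset_limit must be set at compile time so that
      # offset_limit can be used in the match.
      new risky = pcre2.regex("^(a+)+$", use_offset_limit=true);
  }

  sub vcl_recv {
      # If a limit is reached, match() returns false and logs a
      # VCL_Error message.
      if (risky.match(req.http.X-Input, match_limit=10000,
                      recursion_limit=1000, offset_limit=256)) {
          set req.http.X-Input-OK = "true";
      }
  }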

If you encounter stack overflow, it may help to increase the stack
size (by changing ``limits.conf`` or calling ``ulimit -s`` before
starting Varnish). Since Varnish 4.1, you can also increase the
varnishd parameter ``thread_pool_stack``. Bear in mind that this
increases the total RAM usage of Varnish.

ACKNOWLEDGEMENTS
================

A tip of the hat to Philip Hazel, who released the first version of
PCRE twenty years before this VMOD was developed.

A few sentences in this manual are identical to or very closely track
phrasings in the PCRE2 documentation, if there was simply no better
way to say what needs to be said.

SEE ALSO
========

* varnishd(1)
* vcl(7)
* pcre2(3)
* PCRE web site: http://www.pcre.org/
* VMOD source repository: https://code.uplex.de/uplex-varnish/libvmod-pcre2

.. _pcre2(3): http://www.pcre.org/current/doc/html/pcre2.html
.. _pcre2pattern(3): http://www.pcre.org/current/doc/html/pcre2pattern.html
.. _pcre2syntax(3): http://www.pcre.org/current/doc/html/pcre2syntax.html
.. _pcre2api(3): http://www.pcre.org/current/doc/html/pcre2api.html
.. _pcre2unicode(3): http://www.pcre.org/current/doc/html/pcre2unicode.html
.. _pcre2perform(3): http://www.pcre.org/current/doc/html/pcre2perform.html

COPYRIGHT
=========

::

  This document is licensed under the same conditions
  as the libvmod-pcre2 project. See LICENSE for details.
 
  Author: Geoffrey Simmons <geoffrey.simmons@uplex.de>