
..
.. NB:  This file is machine generated, DO NOT EDIT!
..
.. Edit ./vmod_re.vcc and run make instead
..

.. role:: ref(emphasis)

=======
vmod_re
=======

-------------------------------------------------------------------------
Varnish Module for Regular Expression Matching with Subexpression Capture
-------------------------------------------------------------------------

:Manual section: 3

SYNOPSIS
========

::

  import re;

  # object interface
  new <obj> = re.regex(STRING [, INT limit] [, INT limit_recursion]
                       [, BOOL forbody])
  BOOL <obj>.match(STRING [, INT limit] [, INT limit_recursion])
  STRING <obj>.backref(INT [, STRING fallback])
  BOOL <obj>.match_body(req_body | bereq_body | resp_body
                        [, INT limit] [, INT limit_recursion])

  # Iterators
  BOOL <obj>.foreach(STRING, SUB [, INT limit] [, INT limit_recursion])
  BOOL <obj>.foreach_body(req_body | bereq_body | resp_body, SUB
                          [, INT limit] [, INT limit_recursion])

  # filter interface (includes all of the above)
  new <obj> = re.regex(STRING [, INT limit] [, INT limit_recursion]
                       , asfilter=true)
  <obj>.substitute_match(INT, STRING)
  set [be]resp.filters = "<obj>"

  # function interface
  BOOL re.match_dyn(STRING [, INT limit] [, INT limit_recursion])
  STRING re.backref_dyn(INT [, STRING fallback])

  STRING re.version()

DESCRIPTION
===========

.. _regsub(): https://varnish-cache.org/docs/trunk/reference/vcl.html#regsub-str-regex-sub

.. _regsuball(): https://varnish-cache.org/docs/trunk/reference/vcl.html#regsuball-str-regex-sub

.. _beresp.filters: https://varnish-cache.org/docs/trunk/reference/vcl-var.html#beresp-filters

.. _resp.filters: https://varnish-cache.org/docs/trunk/reference/vcl-var.html#resp-filters

Varnish Module (VMOD) for matching strings against regular expressions,
and for extracting captured substrings after matches.

Regular expression matching as implemented by the VMOD is equivalent
to VCL's infix operator ``~``. The VMOD is motivated by the fact that
backreference capture in standard VCL requires verbose and suboptimal
use of the `regsub()`_ or `regsuball()`_ functions. For example, this
common idiom in VCL captures a string of digits following the
substring ``"bar"`` from one request header into another::

	sub vcl_recv {
		if (req.http.Foo ~ "bar\d+") {
			set req.http.Baz = regsub(req.http.Foo,
			    "^.*bar(\d+).*$", "\1");
		}
	}

This idiom requires two regex executions when a match is found. The
second is less efficient than the first, since it must match the entire
string to be replaced while capturing a substring, and the whole
construct is cumbersome to write.

The equivalent solution with the VMOD looks like this::

	import re;

	sub vcl_init {
		new myregex = re.regex("bar(\d+)");
	}

	sub vcl_recv {
		if (myregex.match(req.http.Foo)) {
			set req.http.Baz = myregex.backref(1);
		}
	}

For an example of body matching, see `xregex.match_body()`_.

The object is created at VCL initialization with the regex containing
the capture expression, only describing the substring to be
matched. When a match with the ``match`` or ``match_body`` method
succeeds, then a captured string can be obtained from the ``backref``
method.

Calls to the ``backref`` method refer back to the most recent call to ``match``
or ``match_body`` for the same object in the same task scope; that is, in the
same client or backend context. For example if ``match`` is called for an object
in one of the ``vcl_backend_*`` subroutines and returns ``true``, then
subsequent calls to ``backref`` in the same backend scope extract substrings
from the matched substring. For an unsuccessful match, all back references are
cleared.
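
As a sketch of this scoping, the following fragment uses one regex
object in both a client and a backend subroutine. The object name
``maxage`` and the ``X-*`` header names are assumptions made for
illustration only::

	import re;

	sub vcl_init {
		new maxage = re.regex("max-age\s*=\s*(\d+)");
	}

	sub vcl_recv {
		# Client task: this match is what backref() refers to in
		# the client-side subroutines of the same request.
		if (maxage.match(req.http.Cache-Control)) {
			set req.http.X-Req-Max-Age = maxage.backref(1);
		}
	}

	sub vcl_backend_response {
		# Backend task: match() must be called again here; the
		# client-side match above is not visible in this scope.
		if (maxage.match(beresp.http.Cache-Control)) {
			set beresp.http.X-Beresp-Max-Age = maxage.backref(1);
		}
	}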

By setting the ``asfilter`` parameter to true, a regex object can also
be configured to add a filter for performing substitutions on
bodies. See `xregex.substitute_match()`_ for details and examples.

The VMOD also supports dynamic regex matching with the ``match_dyn``
and ``backref_dyn`` functions::

	import re;

	sub vcl_backend_response {
		if (re.match_dyn(beresp.http.Bar + "(\d+)",
		    bereq.http.Foo)) {
			set beresp.http.Baz = re.backref_dyn(1);
		}
	}

In ``match_dyn``, the regex in the first argument is compiled when it
is called, and matched against the string in the second
argument. Subsequent calls to ``backref_dyn`` extract substrings from
the matched string for the most recent successful call to
``match_dyn`` in the same task scope.

As with the constructor, the regex argument to ``match_dyn`` should
contain any capturing expressions needed for calls to ``backref_dyn``.

``match_dyn`` makes it possible to construct regexen whose contents
are not fully known until runtime, but ``match`` is more efficient,
since it re-uses the compiled expression obtained at VCL
initialization. So if you are matching against a fixed pattern that
never changes during the lifetime of VCL, use ``match``.
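
The trade-off can be sketched as follows. The object name ``apiver``,
the URL layout and the ``X-*`` headers are hypothetical. Note that a
header value concatenated into a pattern is interpreted as regex
syntax, so this form is only appropriate for values you control::

	import re;

	sub vcl_init {
		# Fixed pattern: compiled once, when the VCL is loaded.
		new apiver = re.regex("^/v(\d+)/");
	}

	sub vcl_recv {
		if (apiver.match(req.url)) {
			set req.http.X-Api-Version = apiver.backref(1);
		}

		# The pattern depends on a request header, so it is only
		# known at runtime and is compiled on every call.
		if (re.match_dyn("^/" + req.http.X-Tenant + "/(\d+)",
		    req.url)) {
			set req.http.X-Tenant-Id = re.backref_dyn(1);
		}
	}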

.. _re.regex():

new xregex = re.regex(STRING, INT limit, INT limit_recursion, BOOL forbody, BOOL asfilter)
------------------------------------------------------------------------------------------

::

   new xregex = re.regex(
      STRING,
      INT limit=1000,
      INT limit_recursion=1000,
      BOOL forbody=0,
      BOOL asfilter=0
   )

Description
	Create a regex object with the given regular expression. The
	expression is compiled when the constructor is called. It
	should include any capturing parentheses that will be needed
	for extracting backreferences.

	If the regular expression fails to compile, then the VCL
	load fails with an error message describing the problem.

	The optional parameters ``limit`` and ``limit_recursion`` are
	per-object defaults for the respective parameters of the
	`xregex.match()`_ method.

	The optional parameter ``forbody`` is required if the
	`xregex.match_body()`_ method is to be called on the
	object.

	If the optional ``asfilter`` parameter is true, the vmod
	registers itself as a Varnish Fetch Processor (VFP) for use in
	`beresp.filters`_ and as a Varnish Delivery Processor (VDP)
	for use in `resp.filters`_. In this setup, the
	`xregex.substitute_match()`_ and `xregex.substitute_all()`_
	methods can be used to define replacements for matches on the
	body.

Example
	``new myregex = re.regex("\bmax-age\s*=\s*(\d+)");``
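
A sketch of the optional constructor parameters, passed as named
arguments. The object names, patterns and the limit value 200 are
arbitrary choices for illustration::

	import re;

	sub vcl_init {
		# Per-object defaults that are tighter than the
		# constructor's default of 1000.
		new strict = re.regex("^(a+)+$", limit=200,
		    limit_recursion=200);

		# forbody is needed before match_body() can be called
		# on the object.
		new bodyre = re.regex("id=(\d+)", forbody=true);
	}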

.. _xregex.match():

BOOL xregex.match(STRING, INT limit, INT limit_recursion)
---------------------------------------------------------

::

      BOOL xregex.match(STRING, INT limit=0, INT limit_recursion=0)

Description
	Determines whether the given string matches the regex compiled
	by the constructor; functionally equivalent to VCL's infix
	operator ``~``.

	The optional parameter ``limit`` restricts the number of
	internal matching function calls in a ``pcre_exec()``
	execution, analogous to the varnishd ``pcre_match_limit``
	parameter. For the default value 0, the ``limit`` given to the
	constructor `re.regex()`_ is used.

	The optional parameter ``limit_recursion`` restricts the
	number of internal matching function recursions in a
	``pcre_exec()`` execution, analogous to the varnishd
	``pcre_match_limit_recursion`` parameter.  For the default
	value 0, the ``limit_recursion`` given to the constructor
	`re.regex()`_ is used.

Example
	``if (myregex.match(beresp.http.Surrogate-Control)) { # ...``
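
The per-call limits might be used as in the following sketch, which
assumes the ``myregex`` object from the constructor example; the value
100 and the ``X-Max-Age`` header are illustrative only::

	sub vcl_backend_response {
		# Non-zero values apply to this call only; with the
		# default 0, the limits given to the constructor are used.
		if (myregex.match(beresp.http.Surrogate-Control,
		    limit=100, limit_recursion=100)) {
			set beresp.http.X-Max-Age = myregex.backref(1);
		}
	}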

.. _xregex.foreach():

BOOL xregex.foreach(STRING, SUB sub, INT limit, INT limit_recursion)
--------------------------------------------------------------------

::

      BOOL xregex.foreach(
            STRING,
            SUB sub,
            INT limit=0,
            INT limit_recursion=0
      )

Description
	Calls subroutine *sub* as if `xregex.match()`_ was run for all
	matches on the given string. If there are no matches, the
	subroutine is not called. `xregex.backref()`_ can be used to
	retrieve the match constituents.

Example::

	sub vcl_init {
		new myregex = re.regex("bar(\d+)");
	}

	sub myregex_collect {
		set resp.http.all += myregex.backref(0);
	}

	sub vcl_synth {
		unset resp.http.all;
		myregex.foreach(req.http.input, myregex_collect);
	}

	sub vcl_recv {
		return (synth(200));
	}

*Note:* This is a toy example. If the purpose really is to collect
all matches, `regsuball()`_ is far more efficient.

.. _xregex.match_body():

BOOL xregex.match_body(ENUM which, INT limit, INT limit_recursion)
------------------------------------------------------------------

::

      BOOL xregex.match_body(
            ENUM {req_body, bereq_body, resp_body} which,
            INT limit=0,
            INT limit_recursion=0
      )

.. _multi segment matching: https://pcre.org/current/doc/html/pcre2partial.html#SEC4

Description
	Like `xregex.match()`_, except that it operates on the named body.

	For a regular expression to be used with this method, it needs
	to be constructed with the ``forbody`` flag set in the
	`re.regex()`_ constructor. Calling this method when the flag
	was unset results in a VCL failure.

	PCRE2 `multi segment matching`_ is used to implement this
	method to reduce memory requirements. In particular, unlike
	implementations in other vmods, this implementation does *not*
	read the full body object into a contiguous memory region. It
	might, however, require as much temporary heap space as the
	combined size of all body segments spanned by a match.

	Under ideal conditions, when the pattern spans only a single
	segment of a cached object, the `xregex.match_body()`_ method
	does not create copies of the body data.

	When used with a ``req_body`` or ``bereq_body`` *which*
	argument, this method consumes the request body. If it is to
	be used again (for example, to send it to a backend), it
	should first be cached by calling ``std.cache_req_body(<size>)``.

	Lookarounds are not supported.

Example::

	sub vcl_init {
		new pattern = re.regex("(a|b)=([^&]*).*&(a|b)=([^&]*)",
		    forbody=true);
	}

	sub vcl_recv {
		if (pattern.match_body(req_body)) {
			return (synth(42200));
		}
	}

	sub vcl_synth {
		if (resp.status == 42200) {
			set resp.http.n1 = pattern.backref(1, "");
			set resp.http.v1 = pattern.backref(2, "");
			set resp.http.n2 = pattern.backref(3, "");
			set resp.http.v2 = pattern.backref(4, "");
			set resp.body = "";
			return (deliver);
		}
	}

	# response contains first parameter named a or b from the body as n1,
	# first value as v1, and the second parameter and value as n2
	# and v2
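
When the request body must still reach the backend after matching, it
can be cached first, as described above. The following is a minimal
sketch; the object name ``token``, the 64KB size and the ``X-Token``
header are assumptions, not part of the VMOD::

	import std;
	import re;

	sub vcl_init {
		new token = re.regex("token=([^&]*)", forbody=true);
	}

	sub vcl_recv {
		if (req.method == "POST") {
			# Cache the body so that it is still available for
			# the backend request after match_body() has read it.
			if (std.cache_req_body(64KB) &&
			    token.match_body(req_body)) {
				set req.http.X-Token = token.backref(1, "");
			}
		}
	}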

.. _xregex.foreach_body():

BOOL xregex.foreach_body(ENUM which, SUB sub, INT limit, INT limit_recursion)
-----------------------------------------------------------------------------

::

      BOOL xregex.foreach_body(
            ENUM {req_body, bereq_body, resp_body} which,
            SUB sub,
            INT limit=0,
            INT limit_recursion=0
      )

Description
	Calls subroutine *sub* as if `xregex.match()`_ was run for all
	matches on the given body. If there are no matches, the
	subroutine is not called. `xregex.backref()`_ can be used to
	retrieve the match constituents.

	See also `xregex.match_body()`_.

Example::

	# for key=value separated by &, collect two a and/or b key pairs
	#
	# sample output: a=1,b=22;b=333,a=4444;
	#

	sub vcl_init {
		new pattern = re.regex("(?:^|&)(a|b)=([^&]*).*?&(a|b)=([^&]*)",
		    forbody=true);
	}

	sub collect {
		set resp.http.all +=
		    pattern.backref(1) + "=" + pattern.backref(2) + "," +
		    pattern.backref(3) + "=" + pattern.backref(4) + ";";
	}

	sub vcl_synth {
		unset resp.http.all;
		if (pattern.foreach_body(req_body, collect)) {
			set resp.status = 200;
		}
		return (deliver);
	}

	sub vcl_recv {
		return (synth(400));
	}

.. _xregex.backref():

STRING xregex.backref(INT, STRING fallback)
-------------------------------------------

::

      STRING xregex.backref(
            INT,
            STRING fallback="**BACKREF METHOD FAILED**"
      )

Description
	Extracts the `nth` subexpression of the most recent successful
	call of the ``match`` method for this object in the same task
	scope (client or backend context), or a fallback string in
	case the extraction fails.  Backref 0 indicates the entire
	matched string. Thus this function behaves like the ``\n``
	symbols in `regsub()`_ and `regsuball()`_, and the ``$1``,
	``$2`` ...  variables in Perl.

	After unsuccessful matches, the ``fallback`` string is returned
	for any call to ``backref``. The default value of ``fallback``
	is ``"**BACKREF METHOD FAILED**"``.

	The VCL infix operators ``~`` and ``!~`` do not affect this
	method, nor do the functions `regsub()`_ or `regsuball()`_.

	If ``backref`` is called without any prior call to ``match``
	for this object in the same task scope, then an error message
	is emitted to the Varnish log using the ``VCL_Error`` tag, and
	the fallback string is returned.

	Lookarounds are not supported.

Example
        ``set beresp.ttl = std.duration(myregex.backref(1, "120"), 120s);``

.. _xregex.substitute_match():

VOID xregex.substitute_match(INT, STRING)
-----------------------------------------

Description
	This method defines substitutions for regular expression
	replacement ("regsub") operations on HTTP bodies.

	It can only be used on `re.regex()`_ objects created with
	the ``asfilter`` argument set to ``true``; otherwise a VCL
	failure is triggered.

	The INT argument defines to which match the substitution is to
	be applied: For ``1``, it applies to the first match, for
	``2`` to the second etc. A value of ``0`` defines the default
	substitution which is applied if a specific substitution is
	not defined. Negative values trigger a VCL failure.

	If no substitution is defined for a match (and there is no
	default), the matched sub-string is left unchanged.

	The STRING argument defines the substitution to apply, exactly
	like the ``sub`` (third) argument of the `regsub()`_ built-in
	VCL function: ``\0`` (which can also be spelled ``\&``) is
	replaced with the entire matched string, and ``\n`` is
	replaced with the contents of subgroup *n* in the matched
	string.

	To have any effect, the regex object must be used as a fetch
	or delivery filter.

Example
	For occurrences of the string "reiher" in the response body,
	replace the first with "czapla", the second with "eier" and
	all others with "heron". The response is returned uncompressed
	even if the client supported compression because there
	currently is no ``gzip`` VDP in Varnish-Cache::

	    sub vcl_init {
		new reiher = re.regex("r(ei)h(er)", asfilter = true);
	    }
	    sub vcl_deliver {
		unset req.http.Accept-Encoding;
		set resp.filters += " reiher";
		reiher.substitute_match(1, "czapla");
		reiher.substitute_match(2, "\1\2");
		reiher.substitute_match(0, "heron");
	    }
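
The same object can in principle be used on the fetch side through
`beresp.filters`_. The following sketch assumes that the backend
response is uncompressed and reuses the ``reiher`` object from the
example above; it has not been verified against a particular Varnish
version::

	sub vcl_backend_response {
		# Append the regex object as a fetch processor, so the
		# substitution is applied once, to the stored object.
		set beresp.filters += " reiher";
		reiher.substitute_match(0, "heron");
	}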

.. _xregex.substitute_all():

VOID xregex.substitute_all(STRING)
----------------------------------

Description
	This method instructs the named filter object to replace all
	matches with the STRING argument.

	It is a shorthand for calling::

	  xregex.clear_substitutions();
	  xregex.substitute_match(0, STRING);

	See `xregex.substitute_match()`_ for when to use this method.
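
A short sketch of the shorthand in use, assuming a filter object
``reiher`` created with ``asfilter = true`` as in the
`xregex.substitute_match()`_ example::

	sub vcl_deliver {
		unset req.http.Accept-Encoding;
		set resp.filters += " reiher";
		# Replace every match with the same string.
		reiher.substitute_all("heron");
	}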

.. _xregex.clear_substitutions():

VOID xregex.clear_substitutions()
---------------------------------

Description
	This method clears all previous substitution definitions made
	through `xregex.substitute_match()`_ and
	`xregex.substitute_all()`_.

	It is not strictly required, because VCL code could always be
	written such that only one code path ever calls
	`xregex.substitute_match()`_ and `xregex.substitute_all()`_,
	but it is provided to allow for simpler VCL for handling
	exceptional cases.

	See `xregex.substitute_match()`_ for when to use this method.
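
One way such an exceptional case might look is sketched below. The
``X-No-Rewrite`` request header is hypothetical, and ``reiher`` is
assumed to be a filter object created with ``asfilter = true``::

	sub vcl_deliver {
		unset req.http.Accept-Encoding;
		set resp.filters += " reiher";

		# Common path: define the substitutions.
		reiher.substitute_match(1, "czapla");
		reiher.substitute_match(0, "heron");

		# Exceptional path: drop them again. With no
		# substitution defined, matches are left unchanged.
		if (req.http.X-No-Rewrite) {
			reiher.clear_substitutions();
		}
	}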

.. _re.match_dyn():

BOOL match_dyn(STRING, STRING, INT limit, INT limit_recursion)
--------------------------------------------------------------

::

   BOOL match_dyn(
      STRING,
      STRING,
      INT limit=1000,
      INT limit_recursion=1000
   )

Description
	Compiles the regular expression given in the first argument,
	and determines whether it matches the string in the second
	argument.

	If the regular expression fails to compile, then an error
	message describing the problem is emitted to the Varnish log
	with the tag ``VCL_Error``, and ``match_dyn`` returns
	``false``.

	For parameters ``limit`` and ``limit_recursion`` see
	`xregex.match()`_, except that there is no object to inherit
	defaults from.

Example
	``if (re.match_dyn(bereq.http.Foo + "(\d+)", beresp.http.Bar)) { # ...``
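
Because a compile failure also yields ``false``, VCL code cannot tell
it apart from a non-match by the return value alone. A sketch with
hypothetical header names and arbitrary limit values::

	import re;

	sub vcl_recv {
		if (re.match_dyn("^/" + req.http.X-Section + "/(\d+)",
		    req.url, limit=100, limit_recursion=100)) {
			set req.http.X-Item = re.backref_dyn(1);
		} else {
			# Reached for a non-match, and also when the
			# pattern failed to compile. Only the VCL_Error
			# entry in the log distinguishes the two.
			unset req.http.X-Item;
		}
	}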

.. _re.backref_dyn():

STRING backref_dyn(INT, STRING fallback)
----------------------------------------

::

   STRING backref_dyn(
      INT,
      STRING fallback="**BACKREF FUNCTION FAILED**"
   )

Description
	Similar to the ``backref`` method, this function extracts the
	`nth` subexpression of the most recent successful call of the
	``match_dyn`` function in the same task scope, or a fallback
	string in case the extraction fails.

	After unsuccessful matches, the ``fallback`` string is returned
	for any call to ``backref_dyn``. The default value of ``fallback``
	is ``"**BACKREF FUNCTION FAILED**"``.

	If ``backref_dyn`` is called without any prior call to ``match_dyn``
	in the same task scope, then a ``VCL_Error`` message is logged, and
	the fallback string is returned.
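
The fallback can serve as a default value, as in this sketch. The
``X-Langs`` and ``X-Lang`` headers and the fallback ``"en"`` are
assumptions for illustration::

	import re;

	sub vcl_recv {
		# match_dyn() is always called first, so backref_dyn() has
		# a prior call to refer to. After an unsuccessful match,
		# the fallback "en" is returned.
		re.match_dyn("^/(" + req.http.X-Langs + ")/", req.url);
		set req.http.X-Lang = re.backref_dyn(1, "en");
	}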

.. _re.version():

STRING version()
----------------

Description
        Returns the version string for this vmod.

Example
        ``set resp.http.X-re-version = re.version();``

REQUIREMENTS
============

The VMOD requires Varnish version 6.0.0 or later, or the master
branch. See the project repository for versions that are compatible
with other versions of Varnish.

LIMITATIONS
===========

The VMOD allocates memory for captured subexpressions from Varnish
workspaces, whose sizes are determined by the runtime parameters
``workspace_backend``, for calls within the ``vcl_backend_*``
subroutines, and ``workspace_client``, for the other VCL subs. The
VMOD copies the string to be matched into the workspace, if it's not
already in the workspace, and also uses workspace to save data about
backreferences.

For typical usage, the default workspace sizes are probably enough;
but if you are matching against many long strings in each client or
backend context, you might need to increase the Varnish parameters for
workspace sizes. If the VMOD cannot allocate enough workspace, then a
``VCL_Error`` message is emitted, and the match methods as well as
``backref`` will fail. (If you're just using the regexen for matching
and not to capture backrefs, then you might as well just use the
standard VCL operators ``~`` and ``!~``, and save the workspace.)

``backref`` can extract up to 10 subexpressions, in addition to the
full expression indicated by backref 0. If a ``match`` or
``match_dyn`` operation would have resulted in more than 11 captures
(10 substrings and the full string), then a ``VCL_Error`` message is
emitted to the Varnish log, and the captures are limited to 11.
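
One way to stay within this limit is to use non-capturing groups
``(?:...)`` for any grouping whose contents are not needed later. A
sketch with a hypothetical object name and headers::

	import re;

	sub vcl_init {
		# Only the host and the first path segment are captured.
		# The scheme and the optional port use (?:...) and so do
		# not count as subexpressions.
		new urlre = re.regex(
		    "^(?:https?)://([^/:]+)(?::\d+)?/([^/]*)");
	}

	sub vcl_recv {
		if (urlre.match(req.http.Referer)) {
			set req.http.X-Ref-Host = urlre.backref(1);
			set req.http.X-Ref-Seg = urlre.backref(2);
		}
	}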

SEE ALSO
========

* varnishd(1)
* vcl(7)
* pcre(3)
* source repository: https://code.uplex.de/uplex-varnish/libvmod-re


COPYRIGHT
=========

::

  Copyright 2014-2023 UPLEX Nils Goroll Systemoptimierung
  All rights reserved
 
  This document is licensed under the same conditions as the libvmod-re
  project. See LICENSE for details.
 
  Authors: Geoffrey Simmons <geoffrey.simmons@uplex.de>
           Nils Goroll <nils.goroll@uplex.de>