Add filters: regex objects to perform substitutions on bodies #8

Merged
slink merged 2 commits from filter into master 2023-06-01 12:33:34 +00:00
slink commented 2023-04-05 20:24:12 +00:00 (Migrated from code.uplex.de)

If the optional asfilter parameter to re.regex() is true, the
vmod registers itself as a Varnish Fetch Processor (VFP) for use in
beresp.filters and as a Varnish Delivery Processor (VDP) for use in
resp.filters. In this setup, the xregex.substitute_match() and
xregex.substitute_all() methods can be used to define replacements
for matches on the body.

Example:

    sub vcl_init {
	new reiher = re.regex("r(ei)h(er)", asfilter = true);
    }
    sub vcl_deliver {
	unset req.http.Accept-Encoding;
	set resp.filters += " reiher";
	reiher.substitute_match(1, "czapla");
	reiher.substitute_match(2, "\1\2");
	reiher.substitute_match(0, "heron");
    }
If the optional ``asfilter`` parameter to ``re.regex()`` is true, the vmod registers itself as a Varnish Fetch Processor (VFP) for use in `beresp.filters` and as a Varnish Delivery Processor (VDP) for use in `resp.filters`. In this setup, the `xregex.substitute_match()` and `xregex.substitute_all()` methods can be used to define replacements for matches on the body. Example: sub vcl_init { new reiher = re.regex("r(ei)h(er)", asfilter = true); } sub vcl_deliver { unset req.http.Accept-Encoding; set resp.filters += " reiher"; reiher.substitute_match(1, "czapla"); reiher.substitute_match(2, "\1\2"); reiher.substitute_match(0, "heron"); }
slink commented 2023-04-05 20:57:29 +00:00 (Migrated from code.uplex.de)

added 1 commit

Compare with previous version

added 1 commit <ul><li>4e493c55 - Update README</li></ul> [Compare with previous version](https://code.uplex.de/uplex-varnish/libvmod-re/merge_requests/3/diffs?diff_id=88&start_sha=3ff525e81ffdc233af23fdb1a7a111ca72f71d47)
slink commented 2023-04-06 16:06:08 +00:00 (Migrated from code.uplex.de)

added 2 commits

  • cae4eaab - Add filters: regex objects to perform substitutions on bodies
  • 4420ca08 - Update README

Compare with previous version

added 2 commits <ul><li>cae4eaab - Add filters: regex objects to perform substitutions on bodies</li><li>4420ca08 - Update README</li></ul> [Compare with previous version](https://code.uplex.de/uplex-varnish/libvmod-re/merge_requests/3/diffs?diff_id=89&start_sha=4e493c5518df1ce76907c07673900b2f32fd75ab)
slink commented 2023-04-17 18:19:23 +00:00 (Migrated from code.uplex.de)

assigned to @geoff

assigned to @geoff
slink commented 2023-05-02 12:15:56 +00:00 (Migrated from code.uplex.de)

added 5 commits

  • 4420ca08...446a5900 - 2 commits from branch uplex-varnish:master
  • bffebc04 - Add filters: regex objects to perform substitutions on bodies
  • f06afb07 - Update README
  • d0bceb53 - Remove assertion on buffer length > 0

Compare with previous version

added 5 commits <ul><li>4420ca08...446a5900 - 2 commits from branch <code>uplex-varnish:master</code></li><li>bffebc04 - Add filters: regex objects to perform substitutions on bodies</li><li>f06afb07 - Update README</li><li>d0bceb53 - Remove assertion on buffer length &gt; 0</li></ul> [Compare with previous version](https://code.uplex.de/uplex-varnish/libvmod-re/merge_requests/3/diffs?diff_id=90&start_sha=4420ca0869a042bbb33bda67d4792040741949ae)
slink commented 2023-05-29 14:15:26 +00:00 (Migrated from code.uplex.de)

added 5 commits

  • d0bceb53...2c683cd4 - 3 commits from branch uplex-varnish:master
  • f8fb9d5c - Add filters: regex objects to perform substitutions on bodies
  • a25e5b63 - Update README

Compare with previous version

added 5 commits <ul><li>d0bceb53...2c683cd4 - 3 commits from branch <code>uplex-varnish:master</code></li><li>f8fb9d5c - Add filters: regex objects to perform substitutions on bodies</li><li>a25e5b63 - Update README</li></ul> [Compare with previous version](https://code.uplex.de/uplex-varnish/libvmod-re/merge_requests/3/diffs?diff_id=91&start_sha=d0bceb533f9823c89a0f4fe7de59708c0f1725c4)
geoff commented 2023-06-01 11:30:45 +00:00 (Migrated from code.uplex.de)

Sorry, the previous comment was invalid, I was looking at the wrong branch.

Sorry, the previous comment was invalid, I was looking at the wrong branch.
geoff commented 2023-06-01 12:33:20 +00:00 (Migrated from code.uplex.de)

@slink I think the code is fine. I'll press the green button, and that's all that needs to be said for an MR review.

A few thoughts that might be the subject of further documentation, or at least to consider moving forward.

Do I understand correctly that the memory management that the filters might do (malloc, realloc, memmove, etc) could happen because a partial match might extend across discontinuous buffers? And that the bet is that this will be relatively uncommon, especially since a matched text is unlikely to extend across several buffers at a time?

It would be interesting to know how well the filters perform when the memory management becomes substantial. And to just see what happens on a pathological case, say when a match extends across more than 32K.

This and match_body have reminded me of why the re2 library was originally developed -- they were going to try Google code search with PCRE, but bad cases of backtracking became very severe when matches were run against whole files.

PCRE(2) is optimized for strings about as long as header contents, or a URL, or a line in a text file. Even if the regex is not optimally efficient, it still doesn't (usually) get that bad on a short string. When users of the VMOD start running matches against request and response bodies, performance issues might become noticeable, especially for users who aren't skilled at writing efficient regexen (which is unfortunately much more common than I like).

Should we add some kind of note about that subject in the documentation?

AFAICT the re2 library doesn't have something like PCRE2_PARTIAL, so I don't see a way to do something similar in VMOD re2.

@slink I think the code is fine. I'll press the green button, and that's all that needs to be said for an MR review. A few thoughts that might be the subject of further documentation, or at least to consider moving forward. Do I understand correctly that the memory management that the filters might do (malloc, realloc, memmove, etc) could happen because a partial match might extend across discontinuous buffers? And that the bet is that this will be relatively uncommon, especially since a matched text is unlikely to extend across several buffers at a time? It would be interesting to know how well the filters perform when the memory management becomes substantial. And to just see what happens on a pathological case, say when a match extends across more than 32K. This and match_body have reminded me of why the re2 library was originally developed -- they were going to try Google code search with PCRE, but bad cases of backtracking became very severe when matches were run against whole files. PCRE(2) is optimized for strings about as long as header contents, or a URL, or a line in a text file. Even if the regex is not optimally efficient, it still doesn't (usually) get that bad on a short string. When users of the VMOD start running matches against request and response bodies, performance issues might become noticeable, especially for users who aren't skilled at writing efficient regexen (which is unfortunately much more common than I like). Should we add some kind of note about that subject in the documentation? AFAICT the re2 library doesn't have something like PCRE2_PARTIAL, so I don't see a way to do something similar in VMOD re2.
geoff commented 2023-06-01 12:33:34 +00:00 (Migrated from code.uplex.de)

merged

merged
geoff commented 2023-06-01 12:33:34 +00:00 (Migrated from code.uplex.de)

mentioned in commit c963694419

mentioned in commit c963694419f0836122d9ebd59846507f58b94a5b
slink commented 2023-06-05 09:33:32 +00:00 (Migrated from code.uplex.de)

@geoff thank you for the review and merge.

malloc / realloc

I actually did ponder this a bit: This is similar to the partial match with match_iter_f(), but there we can reserve the workspace (because we are inside a VMOD function/method call from VCL), which, IIUC, we can not for the filter, because other filters could also request workspace.

So with match_iter_f(), one could run into out-of-workspace conditions (and I wonder if also using malloc() there would be the better option), while with the filter, we could request arbitrarily big buffers on the heap.

At least the code uses memmove() to avoid alloc/free cycles except for one case.

because a partial match might extend across discontinuous buffers?

Yes. When pcre tells us there is a partial match at the end of the buffer, we need to save that and call pcre again when we have more data or "EOF" at the next filter invocation.

And to just see what happens on a pathological case

I did spend quite some time with these pathological cases, but by artificially decreasing the buffer length rather than increasing the span of the pcre match. See uplex-varnish/libvmod-re@c963694419/src/vmod_re.c (L1096-1100)

why the re2 library was originally developed

IIUC for now, the RE2 devs dismissed this use case.

Should we add some kind of note about that subject in the documentation?

Feel free!

@geoff thank you for the review and merge. > malloc / realloc I actually did ponder this a bit: This is similar to the partial match with [`match_iter_f()`](https://code.uplex.de/uplex-varnish/libvmod-re/blob/c963694419f0836122d9ebd59846507f58b94a5b/src/vmod_re.c#L455), but there we can reserve the workspace (because we are inside a VMOD function/method call from VCL), which, IIUC, we can not for the filter, because other filters could also request workspace. So with `match_iter_f()`, one could run into out-of-workspace conditions (and I wonder if also using `malloc()` there would be the better option), while with the filter, we could request arbitrarily big buffers on the heap. At least the code uses `memmove()` to avoid alloc/free cycles except for [one case](https://code.uplex.de/uplex-varnish/libvmod-re/blob/c963694419f0836122d9ebd59846507f58b94a5b/src/vmod_re.c#L1144). > because a partial match might extend across discontinuous buffers? Yes. When pcre tells us there is a partial match at the end of the buffer, we need to save that and call pcre again when we have more data or "EOF" at the next filter invocation. > And to just see what happens on a pathological case I did spend quite some time with these pathological cases, but by artificially decreasing the buffer length rather than increasing the span of the pcre match. See https://code.uplex.de/uplex-varnish/libvmod-re/blob/c963694419f0836122d9ebd59846507f58b94a5b/src/vmod_re.c#L1096-1100 > why the re2 library was originally developed IIUC for now, the [RE2 devs dismissed this use case](https://github.com/google/re2/issues/126#issuecomment-267242496). > Should we add some kind of note about that subject in the documentation? Feel free!
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
uplex-varnish/libvmod-re!8
No description provided.