Document the benchmark test data.

Commit 2c73efba authored Sep 19, 2020 by Geoff Simmons

Showing with 99 additions and 0 deletions

README.md src/tests/bench/README.md +99 -0

--- a/src/tests/bench/README.md
+++ b/src/tests/bench/README.md
@@ -199,3 +199,102 @@ Example:
 # urlpfx_input.txt, with default 1000 iterations and no shuffling.
 ./bench_qp -m p -i urlpfx_input.txt url.txt
 ```
+
+## Test data
+
+The remaining files in the directory contain sample test data and inputs.
+
+`set.txt` contains 8500 words chosen randomly from the words list
+(`/usr/share/dict/words`).
+
+`inputs.txt` contains the words in `set.txt` repeated four times, and
+8500 random strings, all shuffled randomly. So this benchmark tests PH
+with an 80% hit rate and 20% miss rate:
+
+```
+$ ./bench_ph -i inputs.txt set.txt
+```
+
+`url.txt` simulates a set of URL path prefixes of the form `/<string>`
+-- 500 random choices from the words list, each with a leading `/`.
+
+`urlmatch_input.txt` and `urlpfx_input.txt` can be used as inputs with
+`url.txt` for exact and prefix matches, respectively.
+`urlmatch_input.txt` contains the strings in `url.txt` repeated four
+times, and 500 URL prefixes with random strings.
+`urlpfx_input.txt` contains 2500 URL paths with five path
+components, 2000 of which have the same prefixes as in `url.txt`, the
+rest generated randomly.
+
+So these benchmarks test the set in `url.txt` with 80% hit and 20% miss
+rates, for exact matches and prefix matches:
+
+```
+# exact URL matches
+$ ./bench_ph -i urlmatch_input.txt url.txt
+
+# URL prefix matches
+$ ./bench_qp -m p -i urlpfx_input.txt url.txt
+```
+
+The set in `hosts.txt` simulates 500 host names, each starting with `www.`
+and ending with one of the nine most common TLDs, with a random choice from
+the words list for the "subdomain" in between.
+
+`hosts_input.txt` contains the strings in `hosts.txt` repeated four
+times, and 500 additional simulated hosts, all shuffled:
+
+```
+# Benchmark 80% hits and 20% misses for Host matches
+$ ./bench_ph -i hosts_input.txt hosts.txt
+```
+
+`moz500.txt` contains the "top 500 domains" from moz.com, downloaded
+on March 10, 2020. `moz500_input.txt` contains the host names in
+`moz500.txt` repeated four times, and 500 additional simulated host
+names, all shuffled.
+
+```
+# Another "80-20" benchmark for Host matches
+$ ./bench_ph -i moz500_input.txt moz500.txt
+```
+
+`methods.txt` contains the nine standard HTTP methods (GET, POST, etc.),
+and `methods_input.txt` contains 1000 method names, with each of the
+nine in approximately equal distribution, randomly shuffled.
+
+`allowed_methods.txt` contains only GET, HEAD and
+POST. `rest_methods.txt` contains the six standard methods for a REST
+API.
+
+These can be used to benchmark matches against the request methods,
+for example to generate a 405 "Method Not Allowed" synthetic response
+when the method does not match. This would override the logic in
+builtin VCL, which checks the method in `vcl_recv` using a sequence of
+comparisons, and returns `pipe` if the method is not one of eight standard
+methods.
+
+```
+# Benchmark matches against the nine standard methods, with 10,000
+# iterations
+$ ./bench_ph -i methods_input.txt -n 10000 methods.txt
+
+# Benchmark matches against only GET, HEAD and POST
+$ ./bench_ph -i methods_input.txt -n 10000 allowed_methods.txt
+
+# Benchmark matches against standard REST methods
+$ ./bench_ph -i methods_input.txt -n 10000 rest_methods.txt
+```
+
+`compressible.txt` contains prefixes for the Content-Type header for
+content that may be compressible. `mediatypes.txt` contains all of the
+IANA standard media types as of June 10, 2020.
+
+So this benchmark simulates running a prefix match against the
+Content-Type header to decide if compression should be applied to a
+backend response:
+
+```
+# Benchmark prefix matches against media types, with 10,000 iterations
+$ ./bench_qp -m p -i mediatypes.txt -n 10000 compressible.txt
+```
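
Note on the "80-20" input files: the README describes them as the set repeated four times plus an equal number of strings outside the set, all shuffled. The commit does not include the script that generated them, so the following is only a minimal sketch of that recipe in Python. The function names `make_inputs` and `hit_rate`, and the choice of lowercase random strings of 5 to 15 characters for the misses, are assumptions made for illustration.

```python
import random
import string


def make_inputs(members, n_misses, repeats=4, seed=None):
    """Build a shuffled input list: every set member `repeats` times,
    plus `n_misses` random strings that are not in the set."""
    rng = random.Random(seed)
    member_set = set(members)
    misses = []
    while len(misses) < n_misses:
        # Assumed miss format: 5 to 15 random lowercase letters.
        length = rng.randint(5, 15)
        candidate = "".join(rng.choices(string.ascii_lowercase, k=length))
        if candidate not in member_set:
            misses.append(candidate)
    inputs = list(members) * repeats + misses
    rng.shuffle(inputs)
    return inputs


def hit_rate(members, inputs):
    """Fraction of input strings that are exact members of the set."""
    member_set = set(members)
    hits = sum(1 for s in inputs if s in member_set)
    return hits / len(inputs)
```

With a set of N strings and `n_misses=N`, the result has 4N hits and N misses, which gives the 80% hit and 20% miss rates described above. The same check applied to an existing pair such as `set.txt` and `inputs.txt` (one string per line) would be one way to confirm the ratio for exact matches; prefix matches as in `urlpfx_input.txt` would need a prefix test instead of set membership.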