Commit 72e8490b authored by Geoff Simmons

Document the benchmark test data.

parent 9244e0f7
@@ -199,3 +199,102 @@ Example:
# urlpfx_input.txt, with default 1000 iterations and no shuffling.
./bench_qp -m p -i urlpfx_input.txt url.txt
```
## Test data
The remaining files in the directory contain sample test data and inputs.
`set.txt` contains 8500 words chosen randomly from the words list
(`/usr/share/dict/words`).
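The shipped files were generated in advance, but a set of this shape
can be sketched with standard tools (assuming GNU `shuf` is available;
the random selection will of course differ from the shipped `set.txt`):
```
# Hypothetical regeneration of a word set like set.txt: 8500 random
# words from the system words list -- illustrative only.
shuf -n 8500 /usr/share/dict/words > set.txt
```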
`inputs.txt` contains the words in `set.txt` repeated four times,
together with 8500 random strings, all randomly shuffled. So this
benchmark tests PH with an 80% hit rate and a 20% miss rate (34,000 of
the 42,500 lookups are members of the set):
```
$ ./bench_ph -i inputs.txt set.txt
```
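An input file with the same 80/20 mix could be reproduced along these
lines (a sketch only, assuming GNU coreutils; the random miss strings
below are arbitrary and not those in the shipped `inputs.txt`):
```
# Four copies of the set (hits) plus 8500 random strings (misses),
# shuffled together -- illustrative only.
{
  cat set.txt set.txt set.txt set.txt
  for i in $(seq 8500); do
    LC_ALL=C tr -dc 'a-z' < /dev/urandom | head -c 12
    echo
  done
} | shuf > inputs.txt
```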
`url.txt` simulates a set of URL path prefixes of the form `/<string>`
-- 500 random choices from the words list, each with a leading `/`.
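A set of this form could be sketched as follows (illustrative, not the
script that produced the shipped `url.txt`):
```
# 500 random words, each with a leading "/" -- illustrative only.
shuf -n 500 /usr/share/dict/words | sed 's|^|/|' > url.txt
```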
`urlmatch_input.txt` and `urlpfx_input.txt` can be used as inputs with
`url.txt` for exact and prefix matches,
respectively. `urlmatch_input.txt` contains the strings in `url.txt`
repeated four times, plus 500 URL prefixes formed from random
strings. `urlpfx_input.txt` contains 2500 URL paths with five path
components each, 2000 of which begin with the prefixes in `url.txt`;
the rest are generated randomly.
So these benchmarks test the set in `url.txt` with 80% hit and 20% miss
rates, for exact matches and prefix matches:
```
# exact URL matches
$ ./bench_ph -i urlmatch_input.txt url.txt
# URL prefix matches
$ ./bench_qp -m p -i urlpfx_input.txt url.txt
```
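The prefix-match input could be rebuilt along similar lines (a sketch,
assuming GNU `shuf` and `paste`; the shipped `urlpfx_input.txt` was
generated separately):
```
# 2000 five-component paths that extend prefixes from url.txt (hits)
# plus 500 random five-component paths (misses), shuffled -- illustrative.
{
  for i in $(seq 4); do
    shuf url.txt | while read -r pfx; do
      echo "${pfx}/$(shuf -n 4 /usr/share/dict/words | paste -sd/ -)"
    done
  done
  for i in $(seq 500); do
    echo "/$(shuf -n 5 /usr/share/dict/words | paste -sd/ -)"
  done
} | shuf > urlpfx_input.txt
```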
The set in `hosts.txt` simulates 500 host names, starting with `www.`
and ending with one of the nine most common TLDs, with random choices
from the words list for the "subdomain" in between.
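A host set of this form could be sketched like this (the TLD list
below is an assumption for illustration; the shipped `hosts.txt` may
use a different one):
```
# 500 names of the form www.<word>.<tld> -- the TLD list is illustrative.
tlds="com org net de uk ru info br it"
grep -E '^[a-z]+$' /usr/share/dict/words | shuf -n 500 | while read -r word; do
  tld=$(echo "$tlds" | tr ' ' '\n' | shuf -n 1)
  echo "www.${word}.${tld}"
done > hosts.txt
```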
`hosts_input.txt` contains the strings in `hosts.txt` repeated four
times, and 500 additional simulated hosts, all shuffled:
```
# Benchmark 80% hits and 20% misses for Host matches
$ ./bench_ph -i hosts_input.txt hosts.txt
```
`moz500.txt` contains the "top 500 domains" from moz.com, downloaded
on March 10, 2020. `moz500_input.txt` contains the host names in
`moz500.txt` repeated four times, and 500 additional simulated host
names, all shuffled:
```
# Another "80-20" benchmark for Host matches
$ ./bench_ph -i moz500_input.txt moz500.txt
```
`methods.txt` contains the nine standard HTTP methods (GET, POST, etc.),
and `methods_input.txt` contains 1000 method names, with each of the
nine in approximately equal proportion, randomly
shuffled. `allowed_methods.txt` contains only GET, HEAD and
POST. `rest_methods.txt` contains the six standard methods for a REST
API.
These can be used to benchmark matches against the request method,
for example to generate a 405 "Method Not Allowed" synthetic response
when the method does not match. This would override the logic in the
builtin VCL, which checks the method in `vcl_recv` using a sequence of
comparisons, and returns `pipe` if the method is not one of eight
standard methods.
```
# Benchmark matches against the nine standard methods, with 10,000
# iterations
$ ./bench_ph -i methods_input.txt -n 10000 methods.txt
# Benchmark matches against only GET, HEAD and POST
$ ./bench_ph -i methods_input.txt -n 10000 allowed_methods.txt
# Benchmark matches against standard REST methods
$ ./bench_ph -i methods_input.txt -n 10000 rest_methods.txt
```
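An input with the same shape as `methods_input.txt` could be rebuilt
roughly as follows (a sketch; the shipped file was generated
separately):
```
# ~112 copies of the nine methods, shuffled, trimmed to 1000 lines, so
# each method ends up in approximately equal proportion -- illustrative.
for i in $(seq 112); do cat methods.txt; done | shuf | head -n 1000 > methods_input.txt
```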
`compressible.txt` contains prefixes of Content-Type header values for
content that may be compressible. `mediatypes.txt` contains all of the
IANA standard media types as of June 10, 2020.
So this benchmark simulates running a prefix match against the
Content-Type header to decide if compression should be applied to a
backend response:
```
# Benchmark prefix matches against media types, with 10,000 iterations
$ ./bench_qp -m p -i mediatypes.txt -n 10000 compressible.txt
```