===============================================
SLASH/ Storage Engines for Varnish-Cache Master
===============================================
This branch tracks Varnish-Cache Master **after** 8.0. See other
branches if you want to use SLASH/ with other releases of
Varnish-Cache.
.. role:: ref(emphasis)
.. _Varnish-Cache: https://varnish-cache.org/
This project provides storage engines (stevedores) for `Varnish-Cache`_.
PROJECT RESOURCES
=================
* The primary repository is at https://code.uplex.de/uplex-varnish/slash
This server does not accept user registrations, so please use ...
* the mirror at https://gitlab.com/uplex/varnish/slash for issues,
merge requests and all other interactions.
INTRODUCTION
============
Two storage engines are provided by this project:
* `buddy` for in-memory, volatile storage
* `fellow` for persistent storage backed by flash memory / SSDs
  (recommended) or even disk drives (not recommended).
Both storage engines implement waiting allocations, which, when LRU
cache eviction is active due to low memory conditions, serve
allocation requests fairly on a first-come, first-served basis and
thus solve `The LRU fairness issue`_ of all storage engines bundled
with varnish-cache.
.. _vmod_slash.man.rst: src/vmod_slash.man.rst
This README is only intended as a first introduction; for more
details, please refer to the :ref:`vmod_slash(3)` man page. If you are
reading this file online, it should also be available as
`vmod_slash.man.rst`_.
Basic storage routers (called loadmasters) are provided to facilitate
usage of multiple storage instances.
The :ref:`slashmap(1)` tool provides a non-intrusive live view of the
memory allocations of both storage engines.
buddy
-----
.. _buddy memory allocator: https://en.wikipedia.org/wiki/Buddy_memory_allocation
.. _jemalloc: https://jemalloc.net/
.. _Storage Backend: https://varnish-cache.org/docs/trunk/reference/varnishd.html#storage-backend
The `buddy` storage engine is an advanced, high performance stevedore
with a fixed memory size based on a new `buddy memory allocator`_
implementation from first principles.
In comparison with the malloc storage engine bundled with
varnish-cache, it has the following advantages in addition to solving
`The LRU fairness issue`_:
* The gross amount of memory is fixed: the storage engine will never
  take more system memory than the configured amount plus a fixed
  amount for metadata (typically 0.4% of the configured size).
* Storage allocation always implies fragmentation: any freed
  allocation leaves a hole of exactly its own size, unless neighboring
  regions also happen to be free.
  The `buddy memory allocator`_ implements a simple and efficient way
  to join free regions (see the sketch after this list).
  In addition, the buddy storage engine's *cram* tuning parameter
  offers fine-grained control over the tradeoff between fragmentation
  and wasted space which the user is willing to accept.
* The buddy storage engine uses the expected expiry time of objects to
  preferably place objects with similar expiry times near each other,
  which reduces fragmentation because subsequent object expiries are
  more likely to result in a larger free memory area.
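
To illustrate how free regions are joined, here is a toy sketch of
the classic buddy technique; it is illustrative only, not SLASH/
code, and all names in it are made up::

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  /*
   * Toy buddy coalescing over a 1 MiB arena of 4 KiB minimum
   * blocks. A block's buddy is found by XOR-ing its offset with
   * the block size; free buddies of equal size are merged.
   */
  #define MIN_ORDER 12                      /* 4 KiB */
  #define MAX_ORDER 20                      /* 1 MiB arena */
  #define NBLOCKS (1u << (MAX_ORDER - MIN_ORDER))

  static bool     is_free[NBLOCKS];         /* indexed by off >> MIN_ORDER */
  static unsigned order_of[NBLOCKS];        /* log2 size of block heads */

  /* free the block at off of the given order, merging buddies */
  static void
  buddy_free(size_t off, unsigned order)
  {
      while (order < MAX_ORDER) {
          size_t buddy = off ^ ((size_t)1 << order);
          size_t bi = buddy >> MIN_ORDER;

          /* stop unless the buddy is free and of the same size */
          if (!is_free[bi] || order_of[bi] != order)
              break;
          is_free[bi] = false;              /* consume the buddy */
          if (buddy < off)
              off = buddy;                  /* merged block starts lower */
          order++;
      }
      is_free[off >> MIN_ORDER] = true;
      order_of[off >> MIN_ORDER] = order;
  }

  int
  main(void)
  {
      /* freeing two adjacent 4 KiB buddies yields one 8 KiB block */
      buddy_free(0x0000, MIN_ORDER);
      buddy_free(0x1000, MIN_ORDER);
      printf("order at offset 0: %u\n", order_of[0]);   /* prints 13 */
      return (0);
  }
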
Compared to other memory allocators, including `jemalloc`_ (which is
recommended for use with Varnish-Cache and, if used, also underlies
the Varnish-Cache `malloc` `Storage Backend`_), the SLASH/ storage
engines have the following advantages, which are expected to result
in comparably lower fragmentation:
* By default, they always align allocations on the allocated size,
  rounded up or down to the next power of two, depending on the *cram*
  parameter (see the sketch after this list). This usually leaves
  smaller memory regions free for other requests; if most requests are
  of a certain minimal size, the smaller "cutoff" is likely to remain
  free and be merged into a bigger region when the original allocation
  is returned.
* With the chunk size (``chunk_bytes`` tunable) set appropriately and
``reserve_chunks`` configured, least recently used objects will be
removed ("nuked" in Varnish-Cache lingo) until the configured number
of chunks of the configured size are available as contiguous
memory/storage regions. This mechanism is primarily intended to
lower the latency of allocation requests, but also acts as a kind of
background de-fragmentation job.
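
The following sketch illustrates the power-of-two rounding referred
to in the first item above; the precise semantics of the *cram*
parameter are documented in :ref:`vmod_slash(3)`, so this helper is
illustrative only, not SLASH/ code::

  #include <assert.h>
  #include <stddef.h>

  /* round a request size up or down to the nearest power of two */
  static size_t
  round_pow2(size_t sz, int down)
  {
      size_t p = 1;

      assert(sz > 0);
      while (p < sz)
          p <<= 1;          /* p is sz rounded up to a power of two */
      if (down && p > sz)
          p >>= 1;          /* round down to the next lower power */
      return (p);
  }

For example, a 3000 byte request is rounded up to 4096 bytes, or down
to 2048 bytes when rounding down is allowed.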
fellow
------
The `fellow` storage engine is an advanced, high performance,
eventually persistent, always consistent implementation based on the
same allocator as the buddy storage engine.
It offers the same features as the buddy storage engine except for the
expiry-time based placement.
In addition, it eventually persists all objects on stable
storage. Both raw devices and files are supported, but NVMe or other
flash based storage with high random I/O throughput and low latency
is recommended as the underlying medium.
Persisted objects present on the configured storage are loaded at
startup and paged into memory on demand.
Object bodies are stored in segments of configurable size, which are
paged into memory and LRU'ed independently. This allows, for example,
better cache usage if some regions of an object (typically the
beginning) are accessed more frequently than others, as is usually
the case with audio and video streaming.
.. _xxhash: https://github.com/Cyan4973/xxHash
Storage is managed with an always consistent log (given that the
underlying medium respects write order).
All stored data is checksummed and checksums are verified with each
read. If this module is built with the recommended `xxhash`_ support,
highly efficient checksum algorithms are available which should not
impact performance in any significant way (the default is ``xxh3_64``,
also called *XXH3* on the `xxhash`_ website).
Both the checksum algorithm and the behavior on checksum mismatches
can be configured.
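
As a self-contained illustration (not fellow code), the one-shot
`xxhash`_ C API computes the default ``xxh3_64`` checksum like this;
compile with ``-lxxhash``::

  #include <stdio.h>
  #include <string.h>
  #include <xxhash.h>

  int
  main(void)
  {
      const char *data = "segment payload";
      /* one-shot 64-bit XXH3 hash of the buffer */
      XXH64_hash_t h = XXH3_64bits(data, strlen(data));

      printf("xxh3_64: %016llx\n", (unsigned long long)h);
      return (0);
  }
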
Asynchronous I/O is used whenever possible for maximum throughput and
minimal latencies.
Discard requests can be issued for freed storage regions to optimize
wear and performance on flash based storage.
*Side note:* Despite the recommendation to not actually deploy disk
drives, throughout the fellow documentation and code, the stable
storage ("storage which is not RAM") is referred to as "disk" for
simplicity, abbreviated as "dsk".
IO INTERFACES
=============
As a memory-based storage, the `buddy` storage engine has no IO
interface requirements and does not issue any IO.
.. _liburing: https://github.com/axboe/liburing
The `fellow` storage has been designed and implemented to make heavy
use of asynchronous IO whenever deemed advantageous and has been
developed primarily for the Linux io_uring interface through
`liburing`_.
Alternative IO interfaces have also been implemented, but with less
focus on performance:
.. _Solaris AIO: https://smartos.org/man/3c/aiowait
* An implementation based on `Solaris AIO`_ as a proof of concept
* An implementation based on Varnish-Cache workers as a fallback
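
For readers unfamiliar with io_uring, the following self-contained
`liburing`_ sketch shows the submit/complete pattern fellow builds
on; it is plain example code, not taken from fellow (compile with
``-luring``)::

  #include <fcntl.h>
  #include <stdio.h>
  #include <liburing.h>

  int
  main(int argc, char **argv)
  {
      struct io_uring ring;
      struct io_uring_sqe *sqe;
      struct io_uring_cqe *cqe;
      char buf[4096];
      int fd;

      if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
          return (1);
      if (io_uring_queue_init(8, &ring, 0) < 0)
          return (1);

      /* queue one asynchronous read of the first 4 KiB */
      if ((sqe = io_uring_get_sqe(&ring)) == NULL)
          return (1);
      io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);
      io_uring_submit(&ring);

      /* wait for the completion and inspect the result */
      if (io_uring_wait_cqe(&ring, &cqe) == 0) {
          printf("read returned %d\n", cqe->res);
          io_uring_cqe_seen(&ring, cqe);
      }
      io_uring_queue_exit(&ring);
      return (0);
  }
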
INSTALLATION
============
.. _INSTALL.rst: https://code.uplex.de/uplex-varnish/slash/blob/master/INSTALL.rst
See `INSTALL.rst`_.
SUPPORT
=======
.. _gitlab.com issues: https://gitlab.com/uplex/varnish/slash/-/issues
To report bugs, use `gitlab.com issues`_.
For enquiries about professional service and support, please contact
info@uplex.de\ .
CONTRIBUTING
============
.. _merge requests on gitlab.com: https://gitlab.com/uplex/varnish/slash/-/merge_requests
To contribute to the project, please use `merge requests on gitlab.com`_.
To support the project's development and maintenance, there are
several options:
.. _paypal: https://www.paypal.com/donate/?hosted_button_id=BTA6YE2H5VSXA
.. _github sponsor: https://github.com/sponsors/nigoroll
* Donate money through `paypal`_. If you wish to receive a commercial
invoice, please add your details (address, email, any requirements
on the invoice text) to the message sent with your donation.
* Become a `github sponsor`_.
* Contact info@uplex.de to receive a commercial invoice for SWIFT payment.
KNOWN LIMITATIONS
=================
If you are interested in supporting work to lift these known
limitations, please consider supporting the project (see above).
.. _#21: https://gitlab.com/uplex/varnish/slash/-/issues/21
* The IO subsystem is configured at compile time and there is no
runtime fallback. For example, if io_uring is configured at compile
time, but not available at runtime, SLASH/fellow will fail rather
than falling back to an alternative IO subsystem. (`#21`_)
* SLASH/fellow data on persistent storage is stored in native byte
order, so storage can not be transported to systems with a different
byte order and loaded there.
Note that the disk layout is prepared to support endianness
conversion, so this limitation is also one which would "just" need a
sponsor to remove.
* The last access time to objects (internally called ``last_lru``) is not
  persisted; it is reset to the time the cache is loaded. Consequently, the
  ``obj.last_hit`` ban attribute can not be used to select objects based on a
  last hit time before a restart.
PERFORMANCE
===========
For recommendations on performance tuning, see `INSTALL.rst`_.
This project has been designed to use the full potential of the
hardware available to it and we will treat any relevant sub-optimal
hardware use as a bug, as long as the performance tuning
recommendations are followed - in particular, as long as io_uring and
xxhash are being used.
Benchmark
---------
.. _wrk: https://github.com/wg/wrk
On a test system
* HPE ProLiant DL380 Gen10 Plus
* 512GiB RAM
* 2x Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
* 8x Samsung PM173X NVMe SSD
configured with
* 26TB storage on a RAID-0 raw device over the SSDs
* 32GB RAM cache using 1GB huge pages
under
* Ubuntu 20.04.6 LTS (focal)
* kernel 5.15.0-94-generic
and benchmarked with
* uniformly distributed object sizes from 0 to 4 MiB
* consequently, objsize_hint = 2MB
* `wrk`_ requests to localhost over http
* the tuning parameters::
chunk_bytes = 2MB,
hash_obj=xxh3_64,
hash_log=xxh3_64,
mem_reserve_chunks=512,
discard_immediate=2MB
fellow typically exceeds the following minimum performance
figures:
* 50 - 60 GiB/s for memory cache hits
* 14 GiB/s for cache misses on an empty storage
* 7 GiB/s for cache misses on a full cache with constant LRU and
discard requests
These figures are in Gibibytes per second as reported by varnishstat
and `wrk`_ and are not to be confused with figures in Gigabits per
second.
.. _dev-03.rst: doc/perf/dev-03.rst
See `dev-03.rst`_ for details.
ACKNOWLEDGEMENTS
================
First and foremost, we thank our unnamed sponsor for financing the
development of SLASH/.
Other contributors deserving honorary mention are:
* Shohei Tanaka aka @xcir for excellent bug reporting
(to be continued)
ADDITIONAL BACKGROUND
=====================
Talks
-----
.. _talks: doc/talks/talks.rst
I gave some `talks`_ about SLASH/.
The LRU fairness issue
----------------------
.. _2480: https://github.com/varnishcache/varnish-cache/issues/2480#issuecomment-342186669
One of the problems present in all storage engines bundled with
varnish-cache, and which the storage engines in this module solve, is
that LRU eviction as implemented in varnish-cache is unfair and might
not even work. This issue has long been known and is documented in
ticket `2480`_ and others:
Each time a thread requiring storage encounters an allocation
failure, it triggers LRU eviction of an existing object and retries,
until the allocation request either succeeds or the number of retries
reaches ``lru_limit``, in which case the request requiring storage
ultimately fails with a ``FetchError``.
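
In simplified form, the bundled strategy amounts to the following
sketch; it is illustrative C, not actual varnish-cache source, and
``try_alloc`` and ``nuke_one_lru_object`` are hypothetical stand-ins
for the real internals::

  #include <stddef.h>
  #include <stdlib.h>

  /* hypothetical stand-ins for the real storage internals */
  static void *try_alloc(size_t sz) { return (malloc(sz)); }
  static void nuke_one_lru_object(void) { /* evict one object */ }

  /* retry after each single eviction, up to lru_limit times */
  static void *
  storage_alloc_retry(size_t sz, unsigned lru_limit)
  {
      unsigned retries = 0;
      void *p;

      while ((p = try_alloc(sz)) == NULL) {
          if (++retries > lru_limit)
              return (NULL);          /* request fails: FetchError */
          nuke_one_lru_object();      /* evicts one object, any size */
      }
      return (p);
  }
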
The first issue with this strategy is that objects in varnish-cache
can substantially differ in size, so a very high number of small
objects might need to be evicted to make room for a single large
object. For example, storing a 10GB object could require eviction of
roughly one million 10KB objects - a number much higher than sensible
``lru_limit`` settings.
The second issue is the actual fairness issue: After one thread has
made room by nuking objects, any other thread might issue an
allocation and grab the memory. Thus, backend fetches might fail even
when, in principle, the space freed by LRU nuking would have been
sufficient.
The storage engines in this module solve the issue with a fair queue
for waiting allocation requests whenever memory is tight. The actual
LRU is implemented in a separate thread and whenever memory becomes
available, waiting requests are served in order.
Also, to minimize waiting times when LRU is active, the LRU thread
maintains a reserve: It pre-nukes objects until it has filled up the
reserve, which is then used first whenever storage is requested.
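
A minimal sketch of such a fair wait queue follows; it assumes a
single LRU thread, ignores size matching and the reserve, and uses
made-up names, so it shows the principle rather than the actual
SLASH/ implementation::

  #include <pthread.h>
  #include <stddef.h>

  struct waiter {
      size_t          sz;     /* requested size */
      void            *mem;   /* filled in when served */
      struct waiter   *next;
      pthread_cond_t  cv;
  };

  static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
  static struct waiter *head = NULL, **tail = &head;

  /* called by a thread whose allocation just failed */
  void *
  wait_alloc(size_t sz)
  {
      struct waiter w = { .sz = sz, .mem = NULL, .next = NULL };

      pthread_cond_init(&w.cv, NULL);
      pthread_mutex_lock(&mtx);
      *tail = &w;             /* append: first come, first served */
      tail = &w.next;
      while (w.mem == NULL)
          pthread_cond_wait(&w.cv, &mtx);
      pthread_mutex_unlock(&mtx);
      pthread_cond_destroy(&w.cv);
      return (w.mem);
  }

  /* called by the LRU thread when memory becomes available */
  void
  serve_head(void *mem)
  {
      struct waiter *w;

      pthread_mutex_lock(&mtx);
      if ((w = head) != NULL) {
          if ((head = w->next) == NULL)
              tail = &head;   /* queue is now empty */
          w->mem = mem;       /* hand over the freed memory */
          pthread_cond_signal(&w->cv);
      }
      pthread_mutex_unlock(&mtx);
  }
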
TRIVIA
======
Names
-----
The project name *SLASH/* is an acronym of something boastful and
"Storage Hierarchy", intended as a reminiscence of the UNIX root of
the file system, the leftmost slash.
`buddy` was named after the memory allocation technique of the `buddy
memory allocator`_.
The name `fellow` was chosen to have a similar meaning as "buddy", but
for someone (in this case something) you can rely on long term.