===============================================
SLASH/ Storage Engines for Varnish-Cache Master
===============================================

This branch tracks Varnish-Cache Master **after** 8.0. See other
branches if you want to use SLASH/ with other releases of
Varnish-Cache.

.. role:: ref(emphasis)

.. _Varnish-Cache: https://varnish-cache.org/

This project provides storage engines (stevedores) for `Varnish-Cache`_.

PROJECT RESOURCES
=================

* The primary repository is at https://code.uplex.de/uplex-varnish/slash

  This server does not accept user registrations, so please use ...

* the mirror at https://gitlab.com/uplex/varnish/slash for issues,
  merge requests and all other interactions.

INTRODUCTION
============

Two storage engines are provided by this project:

* `buddy` for in-memory, volatile storage

* `fellow` for persistent storage backed by flash memory / SSDs
  (recommended) or even disk drives (not recommended).

Both storage engines implement waiting allocations, which, when LRU
cache eviction is active due to low memory conditions, fairly serve
allocation requests on a first-come first-served basis and thus solve
`The LRU fairness issue`_ of all varnish-cache bundled storage
engines.

.. _vmod_slash.man.rst: src/vmod_slash.man.rst

This README is only intended as a first introduction; for more
details, please refer to the :ref:`vmod_slash(3)` man page. If you are
reading this file online, it should also be available as
`vmod_slash.man.rst`_.

Basic storage routers (called loadmasters) are provided to facilitate
usage of multiple storage instances.

The :ref:`slashmap(1)` tool provides a non-intrusive live view on the
memory allocations of both storage engines.

buddy
-----

.. _buddy memory allocator: https://en.wikipedia.org/wiki/Buddy_memory_allocation

.. _jemalloc: https://jemalloc.net/

.. _Storage Backend: https://varnish-cache.org/docs/trunk/reference/varnishd.html#storage-backend

The `buddy` storage engine is an advanced, high performance stevedore
with a fixed memory size based on a new `buddy memory allocator`_
implementation from first principles.

In comparison with the malloc storage engine bundled with
varnish-cache, it has the following advantages in addition to solving
`The LRU fairness issue`_:

* The gross amount of memory is fixed: the storage engine will never
  take more system memory than the configured amount plus a fixed
  amount for metadata (typically 0.4% of the configured size).

* Storage allocation always implies fragmentation: Any freed storage
  allocation leaves a hole of that size only, unless neighboring
  regions also happen to be free.

  The `buddy memory allocator`_ implements a simple and efficient way
  to join free regions.

  In addition, the buddy storage engine's *cram* tuning parameter
  offers fine grained control over the tradeoff between fragmentation
  and wasted space which the user is willing to accept.

* The buddy storage engine uses the expected expiry time of objects to
  preferably place objects with similar expiry times close to each
  other, which reduces fragmentation because subsequent object
  expiries are more likely to result in a larger free memory area.

Compared to other memory allocators, including `jemalloc`_, which is
recommended for use with Varnish-Cache and, if used, is also the
implementation underlying the Varnish-Cache `malloc` `Storage
Backend`_, the SLASH/ storage engines have the following advantages,
which are expected to lead to comparably lower fragmentation:

* By default, they always align allocations on the allocated size,
  rounded up or down to the next power of two, depending on the *cram*
  parameter (see the sketch after this list). This will usually leave
  smaller memory regions free for other requests, but if most requests
  are of a certain minimal size, the smaller "cutoff" is likely to
  remain free and be merged into a bigger region when the original
  allocation is returned.

* With the chunk size (``chunk_bytes`` tunable) set appropriately and
  ``reserve_chunks`` configured, least recently used objects will be
  removed ("nuked" in Varnish-Cache lingo) until the configured number
  of chunks of the configured size are available as contiguous
  memory/storage regions. This mechanism is primarily intended to
  lower the latency of allocation requests, but also acts as a kind of
  background de-fragmentation job.
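
As an illustration of the power-of-two rounding mentioned in the first
point above, here is a small, hypothetical C sketch of rounding a
request size up (no cramming) or down (maximum cramming); the actual
buddy implementation manages free lists per size class and is
considerably more involved::

  #include <stdint.h>
  #include <stdio.h>

  /* smallest power of two >= sz (round up, no cramming) */
  static uint64_t
  round_up_pow2(uint64_t sz)
  {
          uint64_t p = 1;

          while (p < sz)
                  p <<= 1;
          return (p);
  }

  /* largest power of two <= sz (round down, maximum cramming) */
  static uint64_t
  round_down_pow2(uint64_t sz)
  {
          uint64_t p = round_up_pow2(sz);

          return (p == sz ? p : p >> 1);
  }

  int
  main(void)
  {
          uint64_t sz = 1280;     /* hypothetical 1280 byte request */

          printf("rounded up:   %ju\n", (uintmax_t)round_up_pow2(sz));
          printf("rounded down: %ju\n", (uintmax_t)round_down_pow2(sz));
          return (0);
  }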

fellow
------

The `fellow` storage engine is an advanced, high performance,
eventually persistent, always consistent implementation based on the
same allocator as the buddy storage engine.

It offers the same features as the buddy storage engine except for the
expiry-time based placement.

In addition, it eventually persists all objects on stable
storage. Both raw devices and files are supported, but NVMe or other
flash based storage with high random I/O throughput at low latency
times is recommended as the underlying medium.

Persisted objects present on the configured storage are loaded at
startup and paged into memory on demand.

Object bodies are stored in segments of configurable size, which are
paged into memory and LRU'ed independently. This allows, for example,
better cache usage if some regions of objects (typically the
beginning) are accessed more frequently than others, as is usually the
case with audio and video streaming.
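
For illustration only, a small hypothetical C sketch of the arithmetic
involved: with a fixed segment size, only the segments overlapping a
requested byte range need to be paged in (names, the segment size and
the real segment layout in fellow differ)::

  #include <stdint.h>
  #include <stdio.h>

  #define SEGMENT_SIZE (2ULL * 1024 * 1024)      /* example: 2MB segments */

  int
  main(void)
  {
          uint64_t off = 100ULL * 1024 * 1024;   /* requested offset ... */
          uint64_t len = 5ULL * 1024 * 1024;     /* ... and length */
          uint64_t first = off / SEGMENT_SIZE;
          uint64_t last = (off + len - 1) / SEGMENT_SIZE;

          /* only these segments get paged in and touched on the LRU */
          printf("page in segments %ju..%ju\n",
              (uintmax_t)first, (uintmax_t)last);
          return (0);
  }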

.. _xxhash: https://github.com/Cyan4973/xxHash

Storage is managed with an always consistent log (given that the
underlying medium respects write order).

All stored data is checksummed and checksums are verified with each
read. If this module is built with the recommended `xxhash`_ support,
highly efficient checksum algorithms are available which should not
impact performance in any significant way (the default is ``xxh3_64``,
also called *XXH3* on the `xxhash`_ website).

Both the checksum algorithm and the behavior on checksum mismatch can
be configured.
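
For readers unfamiliar with `xxhash`_, the following stand-alone,
hypothetical C sketch shows the basic store-and-verify pattern using
the one-shot ``XXH3_64bits()`` API; it is not the fellow code, which
integrates checksums into its on-disk format::

  #include <stdio.h>
  #include <string.h>
  #include <xxhash.h>            /* link with -lxxhash */

  int
  main(void)
  {
          const char data[] = "segment payload read back from disk";
          XXH64_hash_t stored, computed;

          /* at write time, the checksum is stored alongside the data */
          stored = XXH3_64bits(data, strlen(data));

          /* at read time, it is recomputed and compared */
          computed = XXH3_64bits(data, strlen(data));
          if (computed != stored) {
                  fprintf(stderr, "checksum mismatch\n");
                  return (1);
          }
          printf("checksum ok: %016llx\n", (unsigned long long)stored);
          return (0);
  }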

Asynchronous I/O is used whenever possible for maximum throughput and
minimal latencies.

Discard requests can be issued for freed storage regions to optimize
wear and performance on flash based storage.

*Side note:* Despite the recommendation to not actually deploy disk
drives, throughout the fellow documentation and code, the stable
storage ("storage which is not RAM") is referred to as "disk" for
simplicity, abbreviated as "dsk".

IO INTERFACES
=============

As a memory-based storage, the `buddy` storage engine has no IO
interface requirements and does not issue any IO.

.. _liburing: https://github.com/axboe/liburing

The `fellow` storage has been designed and implemented to make heavy
use of asynchronous IO whenever deemed advantageous and has been
developed primarily for the Linux io_uring interface through
`liburing`_.
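
To give an idea of the interface, the following stand-alone
`liburing`_ sketch submits a single read and waits for its completion;
it is only a minimal illustration and not representative of fellow's
actual IO code, which batches requests and handles completions
asynchronously::

  #include <fcntl.h>
  #include <liburing.h>          /* link with -luring */
  #include <stdio.h>

  int
  main(int argc, char **argv)
  {
          struct io_uring ring;
          struct io_uring_sqe *sqe;
          struct io_uring_cqe *cqe;
          char buf[4096];
          int fd;

          if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 ||
              io_uring_queue_init(8, &ring, 0) < 0)
                  return (1);

          /* queue one read of the first 4096 bytes ... */
          sqe = io_uring_get_sqe(&ring);
          io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);
          io_uring_submit(&ring);

          /* ... and wait for its completion */
          if (io_uring_wait_cqe(&ring, &cqe) == 0) {
                  printf("read returned %d\n", cqe->res);
                  io_uring_cqe_seen(&ring, cqe);
          }
          io_uring_queue_exit(&ring);
          return (0);
  }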

Alternative IO interfaces have also been implemented, but with less
focus on performance:

.. _Solaris AIO: https://smartos.org/man/3c/aiowait

* An implementation based on `Solaris AIO`_ as a proof of concept

* An implementation based on Varnish-Cache workers as a fallback

INSTALLATION
============

.. _INSTALL.rst: https://code.uplex.de/uplex-varnish/slash/blob/master/INSTALL.rst

See `INSTALL.rst`_.

SUPPORT
=======

.. _gitlab.com issues: https://gitlab.com/uplex/varnish/slash/-/issues

To report bugs, use `gitlab.com issues`_.

For enquiries about professional service and support, please contact
info@uplex.de\ .

CONTRIBUTING
============

.. _merge requests on gitlab.com: https://gitlab.com/uplex/varnish/slash/-/merge_requests

To contribute to the project, please use `merge requests on gitlab.com`_.

To support the project's development and maintenance, there are
several options:

.. _paypal: https://www.paypal.com/donate/?hosted_button_id=BTA6YE2H5VSXA

.. _github sponsor: https://github.com/sponsors/nigoroll

* Donate money through `paypal`_. If you wish to receive a commercial
  invoice, please add your details (address, email, any requirements
  on the invoice text) to the message sent with your donation.

* Become a `github sponsor`_.

* Contact info@uplex.de to receive a commercial invoice for SWIFT payment.

KNOWN LIMITATIONS
=================

If you are interested in having these known limitations lifted, please
consider supporting the project (see above).

.. _#21: https://gitlab.com/uplex/varnish/slash/-/issues/21

* The IO subsystem is configured at compile time and there is no
  runtime fallback. For example, if io_uring is configured at compile
  time, but not available at runtime, SLASH/fellow will fail rather
  than falling back to an alternative IO subsystem. (`#21`_)

* SLASH/fellow data on persistent storage is stored in native byte
  order, so storage can not be transported to systems with a different
  byte order and loaded there.

  Note that the disk layout is prepared to support endianness
  conversion, so this limitation is also one which would "just" need a
  sponsor to remove.

* The last access time to objects (internally called ``last_lru``) is not
  persisted; it is reset to the time the cache is loaded. Consequently, the
  ``obj.last_hit`` ban attribute can not be used to select objects based on a
  last hit time before a restart.

PERFORMANCE
===========

For recommendations on performance tuning, see `INSTALL.rst`_.

This project has been designed to use the full potential of the
hardware available to it and we will treat any relevant sub-optimal
hardware use as a bug, as long as the performance tuning
recommendations are followed - in particular, as long as io_uring and
xxhash are being used.

Benchmark
---------

.. _wrk: https://github.com/wg/wrk

On a test system

* HPE ProLiant DL380 Gen10 Plus
* 512GiB RAM
* 2x Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
* 8x Samsung PM173X NVMe SSD

configured with

* 26TB storage on a RAID-0 raw device over the SSDs
* 32GB RAM cache using 1GB huge pages

under

* Ubuntu 20.04.6 LTS (focal)
* kernel 5.15.0-94-generic

and benchmarked with

* uniformly distributed object sizes from 0 to 4 MiB

  * consequently, objsize_hint = 2MB

* `wrk`_ requests to localhost over http

* the tuning parameters::

    chunk_bytes = 2MB,
    hash_obj=xxh3_64,
    hash_log=xxh3_64,
    mem_reserve_chunks=512,
    discard_immediate=2MB

fellow typically exceeds the following minimum performance
figures:

* 50 - 60 GiB/s for memory cache hits

* 14 GiB/s for cache misses on an empty storage

*  7 GiB/s for cache misses on a full cache with constant LRU and
   discard requests

These figures are in Gibibytes per second, as reported by varnishstat
and `wrk`_, and are not to be confused with figures in Gigabits per
second.

.. _dev-03.rst: doc/perf/dev-03.rst

See `dev-03.rst`_ for details.

ACKNOWLEDGEMENTS
================

First and foremost, we thank our unnamed sponsor for financing the
development of SLASH/.

Other contributors deserving honorary mention are:

* Shohei Tanaka aka @xcir for excellent bug reporting

(to be continued)

ADDITIONAL BACKGROUND
=====================

Talks
-----

.. _talks: doc/talks/talks.rst

I gave some `talks`_ about SLASH/.

The LRU fairness issue
----------------------

.. _2480: https://github.com/varnishcache/varnish-cache/issues/2480#issuecomment-342186669

One of the problems which is present in all storage engines bundled
with varnish-cache, and which the storage engines in this module
solve, is that LRU eviction as implemented in varnish-cache is unfair
and might not even work. This issue has been known for a long time and
is documented in ticket `2480`_ and others:

Each time a thread requiring storage encounters an allocation failure,
it triggers LRU-eviction of an existing object until retrying the
allocation request either succeeds, or the number of retries reaches
``lru_limit``, in which case the request requiring storage ultimately
fails with a ``FetchError``.

The first issue with this strategy is that objects in varnish-cache
can substantially differ in size, so, to store a large object,
eviction of a very high number of small objects might be necessary.
For example, storing a 10GB object could require eviction of roughly
one million 10KB objects - a number much higher than sensible
``lru_limit`` settings.

The second issue is the actual fairness issue: After one thread has
made room by nuking objects, any other thread might issue an
allocation and grab the memory. Thus, backend fetches might fail even
when, in principle, the space freed by LRU nuking would have been
sufficient.

The storage engines in this module solve the issue with a fair queue
for waiting allocation requests whenever memory is tight. The actual
LRU is implemented in a separate thread and whenever memory becomes
available, waiting requests are served in order.

Also, to minimize waiting times when LRU is active, the LRU thread
maintains a reserve: It pre-nukes objects until it has filled up the
reserve, which is then used first whenever storage is requested.
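
The following hypothetical C sketch illustrates the first-come
first-served idea with a simple ticket queue over a byte counter; the
names are made up, and the actual SLASH/ implementation, which
integrates with the allocator, the LRU thread and the reserve, is
considerably more involved::

  #include <pthread.h>
  #include <stddef.h>

  struct fair_space {
          pthread_mutex_t mtx;    /* PTHREAD_MUTEX_INITIALIZER */
          pthread_cond_t  cond;   /* PTHREAD_COND_INITIALIZER */
          size_t          avail;  /* bytes currently free */
          unsigned long   head, tail;     /* ticket numbers */
  };

  /* requesting thread: wait for our turn, then take the space */
  void
  fair_alloc(struct fair_space *fs, size_t sz)
  {
          unsigned long ticket;

          pthread_mutex_lock(&fs->mtx);
          ticket = fs->tail++;
          /* wait until we are first in line AND enough space is free */
          while (ticket != fs->head || fs->avail < sz)
                  pthread_cond_wait(&fs->cond, &fs->mtx);
          fs->avail -= sz;
          fs->head++;             /* next waiter's turn */
          pthread_cond_broadcast(&fs->cond);
          pthread_mutex_unlock(&fs->mtx);
  }

  /* LRU thread: whenever eviction frees space, wake waiters in order */
  void
  fair_free(struct fair_space *fs, size_t sz)
  {
          pthread_mutex_lock(&fs->mtx);
          fs->avail += sz;
          pthread_cond_broadcast(&fs->cond);
          pthread_mutex_unlock(&fs->mtx);
  }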

TRIVIA
======

Names
-----

The project name *SLASH/* is an acronym of something boastful and
"Storage Hierarchy", intended as a reference to the UNIX root of the
file system, the leftmost slash.

`buddy` was named after the memory allocation technique of the `buddy
memory allocator`_.

The name `fellow` was chosen to have a similar meaning to "buddy", but
for someone (in this case something) you can rely on long term.