..
.. NB: This file is machine generated, DO NOT EDIT!
..
.. Edit ./vmod_slash.vcc and run make instead
..
.. role:: ref(emphasis)
==========
vmod_slash
==========
---------------------------------------------------------------------------------
Varnish-Cache SLASH/ stevedores (buddy, fellow) and loadmasters (storage routers)
---------------------------------------------------------------------------------
:Manual section: 3
SYNOPSIS
========
global storages
---------------
* Make ``vmod_slash`` available::

    varnishd -E /path/to/vmod_slash.so

* Configure a buddy (memory) storage::

    varnishd -s<name>=buddy,<size>[,<minpage>]

* Configure a fellow (persistent disk with memory cache) storage::

    varnishd -s<name>=fellow,<path>,<dsksize>,<memsize>[=<storage>],<objsize_hint>
vcl storage objects and methods
-------------------------------
.. parsed-literal::

  import slash [as name] [from "path"]

  new xbuddy = slash.buddy(BYTES size, BYTES minpage)
  STRING xbuddy.tune([INT chunk_exponent], [BYTES chunk_bytes], [INT reserve_chunks], [INT cram])
  STEVEDORE xbuddy.storage()
  VOID xbuddy.as_transient()

  new xfellow = slash.fellow(STRING filename, BYTES dsksize, BYTES memsize, BYTES objsize_hint, BOOL delete)
  STRING xfellow.tune([INT logbuffer_size], [DURATION logbuffer_flush_interval], [REAL log_rewrite_ratio], [INT chunk_exponent], [BYTES chunk_bytes], [INT dsk_reserve_chunks], [INT mem_reserve_chunks], [BYTES objsize_max], [INT objsize_lw_exponent], [INT objsize_hw_exponent], [INT cram], [INT readahead], [BYTES discard_immediate], [INT io_batch_min], [INT io_batch_max], [ENUM hash_obj], [ENUM hash_log], [ENUM ioerr_obj], [ENUM ioerr_log], [ENUM allocerr_obj], [ENUM allocerr_log])
  STEVEDORE xfellow.storage()
  VOID xfellow.as_transient()
vcl functions
-------------
.. parsed-literal::

  import slash [as name] [from "path"]

  VOID as_transient(STEVEDORE)
  STRING tune_buddy(STEVEDORE storage, [INT chunk_exponent], [BYTES chunk_bytes], [INT reserve_chunks], [INT cram])
  STRING tune_fellow(STEVEDORE storage, [INT logbuffer_size], [DURATION logbuffer_flush_interval], [REAL log_rewrite_ratio], [INT chunk_exponent], [BYTES chunk_bytes], [INT dsk_reserve_chunks], [INT mem_reserve_chunks], [BYTES objsize_max], [INT objsize_lw_exponent], [INT objsize_hw_exponent], [INT cram], [INT readahead], [BYTES discard_immediate], [INT io_batch_min], [INT io_batch_max], [ENUM hash_obj], [ENUM hash_log], [ENUM ioerr_obj], [ENUM ioerr_log], [ENUM allocerr_obj], [ENUM allocerr_log])
vcl loadmasters (storage routers)
---------------------------------
.. parsed-literal::

  import slash [as name] [from "path"]

  new xloadmaster_rr = slash.loadmaster_rr()
  VOID xloadmaster_rr.add_storage(STEVEDORE)
  STEVEDORE xloadmaster_rr.storage()
EXAMPLES
========
* Configure a global buddy (memory only) storage of 1 GB named ``mem``::

    varnishd -E /path/to/libvmod_slash.so \
        -s mem=buddy,1g

  Use this storage with VCL code like this::

    sub vcl_backend_response {
        set beresp.storage = storage.mem;
    }

    sub vcl_backend_error {
        set beresp.storage = storage.mem;
    }

    # ... more of your own VCL code
* Configure two global fellow (persistent, disk-backed) storages,

  * one named ``fast`` of 1TB on a raw device
    ``/dev/mapper/ssd-volume`` using 100GB memory cache with an
    expected object size of 10MB, and

  * one named ``slow`` of 10TB on a file ``/hugefs/varnish-storage``,
    which shares the memory cache with the ``fast`` storage and also
    has the same expected object size::

      varnishd -E /path/to/libvmod_slash.so \
          -s fast=fellow,/dev/mapper/ssd-volume,1TB,100GB,10MB \
          -s slow=fellow,/hugefs/varnish-storage,10TB,100GB=fast,10MB
  Use these storages with VCL code, where responses to requests on
  paths beginning with ``/archive/`` go to the ``slow`` storage::

    sub vcl_backend_response {
        if (bereq.url ~ "^/archive/") {
            set beresp.storage = storage.slow;
        }
        else {
            set beresp.storage = storage.fast;
        }
    }
* Configure a round-robin storage router in VCL::

    # assumes that storages A .. C have been defined globally
    sub vcl_init {
        new storageX = slash.loadmaster_rr();
        storageX.add_storage(storage.A);
        storageX.add_storage(storage.B);
        storageX.add_storage(storage.C);
    }

  and use it::

    sub vcl_backend_response {
        set beresp.storage = storageX.storage();
    }
DESCRIPTION
===========
.. _buddy_memory_allocator: https://en.wikipedia.org/wiki/Buddy_memory_allocation
.. _README.rst: https://code.uplex.de/uplex-varnish/slash/blob/master/README.rst
.. _INSTALL.rst: https://code.uplex.de/uplex-varnish/slash/blob/master/INSTALL.rst
This module can be used both as a varnish extension (VEXT) and a
VCL module (VMOD).
It provides the two storage engines `buddy` and `fellow`, which can be
configured at ``varnishd`` startup and, with limitations, from VCL.
The `buddy` storage engine is an advanced, high performance stevedore
with a fixed memory size based on a new `buddy_memory_allocator`_
implementation from first principles.
The `fellow` storage engine is an advanced, high performance, eventually
persistent, always consistent implementation based on the same
allocator as the buddy storage engine.
See `README.rst`_ for more details.
Installation instructions can be found in `INSTALL.rst`_.
STORAGE VEXT INTERFACES
=======================
The two storage engines `buddy` and `fellow` should preferably be
configured globally by loading ``vmod_slash.so`` through the
``varnishd -E`` option and adding global storages with ``-s`` as shown
in `SYNOPSIS`_.
buddy
-----
For `buddy`, the ``-s`` parameter syntax is::

  -s<name>=buddy,<size>[,<minpage>]
with
* *<name>* being a given name for the storage instance, which will
become available from vcl as ``storage.``\ *<name>*,
* *<size>* being a size expression like ``100m`` or ``5g`` for the
storage size to be configured,
* the optional *<minpage>* argument being a size expression for the
minimal allocation unit of the storage instance. See
`slash.buddy()`_ for details.
A global `buddy` storage can be tuned from VCL using
`slash.tune_buddy()`_ with ``storage.``\ *<name>* as the first
argument.
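For example, a minimal sketch of tuning such a global buddy storage
from VCL, assuming it was configured as ``-s mem=buddy,1g`` (the
parameter value is illustrative only)::

  import slash;

  sub vcl_init {
      slash.tune_buddy(storage.mem, reserve_chunks = 2);
  }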
fellow
------
For `fellow`, the ``-s`` parameter syntax is::

  -s<name>=fellow,<path>,<dsksize>,<memsize>[=<storage>],<objsize_hint>
with
* *<name>* being a given name for the storage instance, which will
become available from vcl as ``storage.``\ *<name>*,
* *<path>* being the path to the storage file or device. Permissions
  and ownership of *path* are changed during startup using the
  Varnish-Cache `jail`_ facility,
* *<dsksize>* being a size expression like ``100m`` or ``5g`` for
the storage size to be configured,
* *<memsize>* being a size expression for the memory cache size to
be configured,
* optionally, *<storage>* being the name of a previously defined
fellow storage to share the memory cache with, and
* *<objsize_hint>* being a size expression for the expected average
object size with which the storage instance is being used.
See `slash.fellow()`_ for additional details.
A global `fellow` storage can be tuned from VCL using
`slash.tune_fellow()`_ with ``storage.``\ *<name>* as the first
argument.
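For example, a minimal sketch of tuning the global ``fast`` storage
from `EXAMPLES`_ (the chosen parameter value is illustrative only)::

  import slash;

  sub vcl_init {
      slash.tune_fellow(storage.fast,
          logbuffer_flush_interval = 0.5s);
  }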
Memory Cache Sharing
~~~~~~~~~~~~~~~~~~~~
When memory cache sharing with the ``<memsize>[=<storage>]`` syntax is
configured, *<memsize>* is ignored. The actual memory size is always
that of the referenced storage.
LRU with memory cache sharing is cooperative. Whenever memory is
needed by any storage, all storages using the shared cache are asked
to make room. Consequently, more frequently used storages are likely
to keep more of the shared memory cache.
STORAGE VMOD INTERFACES
=======================
.. _slash.buddy():
new xbuddy = slash.buddy(BYTES size, BYTES minpage=64)
------------------------------------------------------
Create or reference a buddy storage of size *size* with the given vmod
object name. The storage will remain in existence as long as
- any loaded VCL has an object by that name
- there are objects using it
The *minpage* argument can be used to define the smallest possible
allocation unit. The default and lowest possible *minpage* argument is
64B. The *minpage* argument will be rounded up to the next power of
two. Larger *minpage* arguments improve efficiency at the cost of
memory overhead.
The *size* argument will be rounded down to a multiple of the
(possibly rounded) *minpage* argument.
Besides the configured memory size, approximately 1 / ( *minpage* *
4) of it is additionally required for metadata (bitmaps) in the
varnish home directory and in memory. For the default *minpage* of 64
Bytes, this amounts to approximately 0.4%. The actual figure is output
at startup as ``buddy: metadata (bitmap) size``.
This storage can *not* be used via ``storage.``\ *<name>*.
If the last vcl using this vmod is discarded before the storage is
empty, all its memory will remain allocated until a varnish restart.
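A minimal sketch of defining and using a buddy storage from VCL (the
object name and size are illustrative only)::

  import slash;

  sub vcl_init {
      new mybuddy = slash.buddy(256MB);
  }

  sub vcl_backend_response {
      set beresp.storage = mybuddy.storage();
  }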
.. _xbuddy.tune():
STRING xbuddy.tune([INT chunk_exponent], [BYTES chunk_bytes], [INT reserve_chunks], [INT cram])
-----------------------------------------------------------------------------------------------
::

  STRING xbuddy.tune(
      [INT chunk_exponent],
      [BYTES chunk_bytes],
      [INT reserve_chunks],
      [INT cram]
  )
Using the `xbuddy.tune()`_ method, the following parameters of the
buddy storage can be fine-tuned:
* *chunk_exponent* / *chunk_bytes*
- unit: bytes as a power of two / bytes
- default: 20 / 1 MB
- minimum: 6 / 64 B
- maximum: 28 / 256 MB
*chunk_bytes* and *chunk_exponent* are alternative ways to configure
the chunk size. If *chunk_bytes* is used, the value is rounded up to
the next power of two and used as if *chunk_exponent* was used with
the 2-logarithm of that value.
Using both arguments at the same time triggers a VCL error.
*chunk_exponent* / *chunk_bytes* are very similar to the
``fetch_maxchunksize`` varnishd parameter, but can be configured per
storage instance: They specify the maximum contiguous memory region
which the storage will return for a single allocation request. The
default is the smaller of 1/16 of the *size* of the storage and
256MB. The value is capped at 1/4 of the *size* of the storage,
rounded down to the previous power of two.
* *reserve_chunks*
- unit: scalar
- default: 1
- minimum: 0
specifies a number of chunks to reserve in memory. The reserve is
used to immediately fulfill requests while LRU cache eviction is
running: When the cache is full, allocation requests need to wait
until LRU eviction has made room, and the reserve can help reduce
latencies in these situations at the expense of some memory
unavailable for caching.
* *cram*
- unit: powers of two
- default: 1
- minimum: -64
- maximum: 64
specifies to which extent the allocator should return regions
smaller than requested when it would need to wait for LRU to make
room.
Its unit is powers of two, valid values are -64 to 64, but sensible
values are much smaller.
* cram = 0: Always allocate the requested size
* cram != 0: Also return abs(*cram*) powers of two less than the
roundup of the requested size.
For example, with a *cram* value of 1 (the default) or -1, a request
for 129 to 255 bytes could also be fulfilled with 128 bytes. With a
*cram* value of 2 or -2, 64 bytes could also be returned for such a
request.
* For positive *cram* value, page splits are avoided - that is, if a
larger memory region would need to be split to fulfill all of the
request, but a memory region that is up to *cram* powers of two
smaller is available, the smaller memory region is returned.
* A negative *cram* value means that smaller memory regions are only
returned if the request could not be fulfilled otherwise.
Higher absolute *cram* values generally lead to higher fragmentation
in return for less unused space. Higher fragmentation is generally
bad for performance.
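To illustrate, a hedged sketch continuing the hypothetical ``mybuddy``
object from above (values are for illustration only)::

  sub vcl_init {
      # prefer returning a region up to one power of two smaller
      # over splitting a larger page, and keep two chunks reserved
      mybuddy.tune(cram = 1, reserve_chunks = 2);
  }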
.. _xbuddy.storage():
STEVEDORE xbuddy.storage()
--------------------------
Return the buddy storage. Can be used to set it for storing a
backend response::

  set beresp.storage = mybuddy.storage();
.. _xbuddy.as_transient():
VOID xbuddy.as_transient()
--------------------------
Set this buddy storage as the transient storage.
Restricted to: ``vcl_init``.
.. _slash.tune_buddy():
STRING tune_buddy(STEVEDORE storage, [INT chunk_exponent], [BYTES chunk_bytes], [INT reserve_chunks], [INT cram])
-----------------------------------------------------------------------------------------------------------------
::

  STRING tune_buddy(
      STEVEDORE storage,
      [INT chunk_exponent],
      [BYTES chunk_bytes],
      [INT reserve_chunks],
      [INT cram]
  )
Tune the given globally defined buddy storage; for all other
parameters, see `xbuddy.tune()`_.
.. _slash.fellow():
new xfellow = slash.fellow(STRING path, BYTES dsksize, BYTES memsize, BYTES objsize_hint, BOOL delete)
------------------------------------------------------------------------------------------------------
::

  new xfellow = slash.fellow(
      STRING path,
      BYTES dsksize,
      BYTES memsize,
      BYTES objsize_hint=262144,
      BOOL delete=0
  )
Create or reference a fellow storage on *path* of size *dsksize*
with a memory cache of size *memsize*. See `slash_fellow_resize`_
below for information on changing sizes.
A VCL-defined fellow storage can not load persisted objects, so to
avoid accidentally emptying a storage, either the storage referenced
by *path* must be empty, or the *delete* argument must be ``true``.
*path* has to be either a regular file, or a block device. If *path*
does not exist, it is created as a regular file. Checks on *path* are
conducted in order to not accidentally create or use a file where
block devices reside (e.g. on ``/dev/``). The environment variable
``slash_fellow_options`` can be set to contain ``skip-path-check``
where, for whatever exotic reason, this check needs to be skipped.
.. _jail: https://varnish-cache.org/docs/trunk/reference/varnishd.html#jail
Permissions and ownership on *path* need to be set such that the
``varnishd`` worker process has read/write access (see ``workuser`` in
the `jail`_ option documentation). On a system where ``varnishd``
starts as root with the default unix jail configuration (``vcache``
workuser), the permissions can be set using::

  my_fellow_path=... # REPLACE ... with path
  chown vcache $my_fellow_path
  chmod 600 $my_fellow_path
When a VCL-defined fellow storage goes out of scope because the last
VCL referencing it is discarded, all of its objects are removed from
the cache, but remain on disk. They can be loaded again by configuring
a global fellow storage. *Note* that this kind of dynamic storage
removal is a new feature first introduced with `fellow` and might not
work perfectly yet.
When it comes to cache sizes, "too big" generally does not exist -
more cache is always better, but `fellow` only supports a memory cache
up to the size of the disk cache. For more information, see
`slash_fellow_size`_.
On Linux, the memory cache will be allocated from huge pages, if
available and if *memsize* is larger than a huge page. *memsize* will
then be rounded up to a multiple of the respective huge page size.
Besides the configured memory cache size, approximately 1 / 256 (0.4%)
of *memsize* plus 1 / 16384 (0.006%) of *dsksize* will be required in
the varnish home directory and in memory. For example, for
``dsksize=1t`` and ``memsize=1g``, this amounts to roughly 70MB. The
actual figures are output at startup as ``fellow: metadata (bitmap)
memory``.
*objsize_hint* (default 256KB) is used to sanity check *memsize* in
relation to *dsksize* and to size the fixed log regions. It should be
set to a value **lower** than the expected average object size. If
*memsize* is configured too low with respect to *dsksize* and
*objsize_hint*, a higher *memsize* value will be used (which might
fail if insufficient memory is available).
For an already populated storage, the configured *memsize* is checked
against the minimum amount of memory required for the actual average
object size. If it is too low, fellow will not start, emitting a fatal
error with sizing requirements.
*delete* specifies if the storage is to be emptied.
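A minimal sketch of defining a fellow storage from VCL (path and sizes
are placeholders; remember that a VCL-defined fellow storage can not
load persisted objects)::

  import slash;

  sub vcl_init {
      new myfellow = slash.fellow("/var/lib/varnish/fellow.dsk",
          dsksize = 10GB, memsize = 1GB, delete = true);
  }

  sub vcl_backend_response {
      set beresp.storage = myfellow.storage();
  }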
.. _slash_fellow_size:
Sizing fellow storage
~~~~~~~~~~~~~~~~~~~~~
This section is intended to provide guidance on cache sizing by
explaining the overall cache organization and ballpark figures for
object sizes.
A simple, yet fundamental insight is that, with `fellow`, there is no
such thing as "delivering objects directly from disk". While hardware
architectures exist which allow DMA directly from flash storage,
`fellow` implements a "disk" and "memory" tier, with all reads and
writes going through RAM first. This architecture has been shown to be
most efficient both in terms of performance and price/performance, but
it establishes a fundamental principle for sizing: The memory cache
should be big enough to hold all actively/frequently accessed
data. Writes happen to memory, and need to be written to disk before
the memory can be re-used. Reads go into memory, from where data can
be accessed.
Besides the always consistent, eventually persistent log, the central
disk structure is the ``fellow_disk_obj``. It contains the fixed and
variable object attributes defined by Varnish-Cache (most importantly
headers) and pointers to the first body segments. For efficiency (log,
memory) this structure is addressed by a single 64bit value. Because
`fellow` uses a minimum disk block size of 4KB, the object can have
sizes between 4KB and just under 16MB. Under optimal circumstances, a
``fellow_disk_obj`` takes only 4KB, but needs to grow bigger if longer
headers or vary specifications need to be stored.
When read into memory, a companion data structure named
``fellow_cache_obj`` is created. Under ideal circumstances (small
headers), both data structures fit into a single 4KB
allocation or even less, but as a rule of thumb, the amount of memory
needed per actively accessed object should be assumed to be 4KB plus
the size of the headers and vary specification. Both
``fellow_disk_obj`` and ``fellow_cache_obj`` remain in memory for as
long as any part of the object is accessed.
The object body is organized in chunks of 2^\ *chunk_exponent* bytes,
called segments. Segments are the smallest I/O units of object bodies
and are lru-cached individually, allowing `fellow` to handle objects
bigger than *memsize*: When an object body is iterated over, up to
*readahead* segments are referenced and, if necessary, asynchronously
read into cache in advance. Segments outside the readahead window,
which are not concurrently accessed by other threads, either reside in
memory on the LRU or only on disk. The amount of disk and memory
storage in addition to the actual data amounts to roughly 64 bytes per
segment on disk and another 64 bytes per segment in memory, organized
in larger units called segment lists, which are sized between 4KB for
63 segments and 4MB for 65534 segments. Segment lists are read
asynchronously and LRU'd together with the respective
``fellow_cache_obj``.
Consequently, the *chunk_bytes* / *chunk_exponent* parameter is chosen
such that a typical object needs only a small number of chunks, which
requires an appropriately sized memory cache: To ensure that the cache
can always move data, the parameter is hard capped at 1/1024 of the
memory cache size, so, for example, for 1MB chunks, a memory cache of
at least 1GB is needed.
Extended attributes (currently only used for ESI data) use a separate
segment, which is only read on demand and also LRU'd with the
respective object.
"Busy" objects going into cache while being fetched from a backend
have the same memory requirements as "finished" objects, but need
another 8KB of memory on top while being created.
To achieve high efficiency and to support Direct I/O, the buddy
allocator used to organize both the disk and memory cache only ever
makes allocations at multiples of the requested size, rounded up to
the next power of two. For this reason, it is normal for
:ref:`slashmap(1)` to show substantial amounts of free memory (like
30-40%) in smaller page sizes below 4KB even if LRU is active.
To summarize, one should assume for memory sizing at least the amount
of data actively accessed, plus 4KB per object, plus 8KB per "busy"
object.
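As a hedged, back-of-envelope illustration of this rule (all figures
are hypothetical)::

  1,000,000 active objects x 100KB average size  ~ 100GB  data
  1,000,000 active objects x 4KB                 ~   4GB  per-object overhead
  10,000 concurrent "busy" objects x 8KB         ~  80MB  fetch overhead
  ------------------------------------------------------
  memsize                                        ~ 104GB  at least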
.. _slash_fellow_resize:
Resizing fellow storage
~~~~~~~~~~~~~~~~~~~~~~~
In general, resizing a fellow storage is supported by restarting
varnishd with different parameters (be it on the command line or in
VCL), but for size reductions, cache contents may be lost, possibly in
their entirety. Read this section for details.
Before applying any size change, it is strongly recommended to cleanly
shut down fellow using ``varnishadm stop``.
Increasing ``memsize``
Increasing ``memsize`` up to ``dsksize`` should never cause any
issues: Administrators should make sure that the amount of memory is
actually available (which might need additional consideration if huge
pages are used, see `INSTALL.rst`_), change the parameter and restart
:ref:`varnishd(1)`. Configuring ``memsize`` larger than ``dsksize`` is
not supported.
Decreasing ``memsize``
When decreasing ``memsize``, first and foremost consider that
performance might significantly degrade, depending on access
patterns. As a simple rule, it is recommended to only reduce
``memsize`` of an existing cache by halving at most and then letting
the cache contents rotate.
Consider that a dynamic minimum applies to ``memsize`` (see the
paragraph on *objsize_hint* in `slash.fellow()`_), so it can not be
made arbitrarily small. ``memsize`` also caps some tunables (see
`xfellow.tune()`_), of which *chunk_exponent* / *chunk_bytes*
deserve special consideration: At any time of fellow serving
requests for object bodies, some number of chunks needs to fit in
memory. Obviously, fellow can not work if a new ``memsize`` is
chosen too small to fit existing disk chunks. To be on the safe
side, *chunk_exponent* / *chunk_bytes* should thus be reduced to at
most 1 / 1024 of the planned new ``memsize`` *before* the
reduction is applied. Then, ideally, all of the cache contents
should be recreated. Keep in mind that smaller chunk sizes are
generally less efficient.
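For illustration, assume ``memsize`` is to be halved from 1GB to
512MB: 512MB / 1024 = 512KB, so the default 1MB chunks would no
longer fit and should be reduced beforehand, for example (a sketch,
assuming a global storage named ``fast``)::

  import slash;

  sub vcl_init {
      slash.tune_fellow(storage.fast, chunk_bytes = 256KB);
  }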
Increasing ``dsksize``
Increasing ``dsksize`` is generally not an issue. Keep in mind that
memory required for metadata and the minimum ``memsize`` will also
increase (see `slash.fellow()`_). It is recommended to increase
``dsksize`` in steps of at least 10% to ensure that free space can
be used to accommodate grown log regions (otherwise objects need to
be removed until enough contiguous space is available).
If the configured storage path points to a file, fellow will make an
attempt to change its size using :ref:`posix_fallocate(3)`. Success
and failure will be reported as ``fellow: ... grown to ...`` or
``fellow: ... warning, fallocate failed ...``.
If the configured storage path points to a block device, the
administrator needs to ensure that it is at least as large as
``dsksize``, or fellow will not start.
Once the storage is loaded, the log regions will be recreated to
accommodate the now higher number of objects possible to store.
Decreasing ``dsksize``
Decreasing ``dsksize`` is also supported and fellow will make an
effort to load as many objects from the shrunken storage as
possible, but it will not move data. That is to say, objects
residing entirely within the shrunken storage region will be loaded,
and others will simply be ignored.
This also applies to the log: If log blocks reside outside the
shrunken storage, the respective objects will not be loaded. Log
regions are reported when fellow starts up, so it is possible to
configure a reduced ``dsksize`` preserving the log, but this is not
a well supported operation. Consider getting professional support if
you require help with such advanced reconfigurations on a regular
basis.
Once a shrunken storage is loaded, the log regions will also be
shrunk according to the now projected number of objects possible to
store.
The actual size change will be applied to files using
:ref:`posix_fallocate(3)` as with increases. The size of block
devices can not be changed by fellow.
.. _xfellow.tune():
STRING xfellow.tune([INT logbuffer_size], [DURATION logbuffer_flush_interval], [REAL log_rewrite_ratio], [INT chunk_exponent], [BYTES chunk_bytes], [INT wait_table_exponent], [INT lru_exponent], [INT dsk_reserve_chunks], [INT mem_reserve_chunks], [BYTES objsize_max], [INT objsize_update_min_log2_ratio], [INT objsize_update_max_log2_ratio], [INT objsize_update_min_occupancy], [INT objsize_update_max_occupancy], [INT cram], [INT readahead], [BYTES discard_immediate], [INT io_batch_min], [INT io_batch_max], [ENUM hash_obj], [ENUM hash_log], [ENUM ioerr_obj], [ENUM ioerr_log], [ENUM allocerr_obj], [ENUM allocerr_log], [INT panic_flags])
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
::

  STRING xfellow.tune(
      [INT logbuffer_size],
      [DURATION logbuffer_flush_interval],
      [REAL log_rewrite_ratio],
      [INT chunk_exponent],
      [BYTES chunk_bytes],
      [INT wait_table_exponent],
      [INT lru_exponent],
      [INT dsk_reserve_chunks],
      [INT mem_reserve_chunks],
      [BYTES objsize_max],
      [INT objsize_update_min_log2_ratio],
      [INT objsize_update_max_log2_ratio],
      [INT objsize_update_min_occupancy],
      [INT objsize_update_max_occupancy],
      [INT cram],
      [INT readahead],
      [BYTES discard_immediate],
      [INT io_batch_min],
      [INT io_batch_max],
      [ENUM {sha256, xxh32, xxh3_64, xxh3_128} hash_obj],
      [ENUM {sha256, xxh32, xxh3_64, xxh3_128} hash_log],
      [ENUM {panic, purge} ioerr_obj],
      [ENUM {panic, fail} ioerr_log],
      [ENUM {panic, purge} allocerr_obj],
      [ENUM {panic, fail} allocerr_log],
      [INT panic_flags]
  )
Using the `xfellow.tune()`_ method, the following parameters of the
fellow storage can be fine-tuned:
* *logbuffer_size*
- unit: scalar
- default: 24336
- minimum: 28
specifies an approximate number of objects to hold in a
logbuffer. Once a logbuffer is full, it is flushed if possible, so
this parameter constitutes an approximate upper bound on the number
of objects to hold unpersisted.
* *logbuffer_flush_interval*
- unit: duration
- default: 2.0s
- minimum: 0s
specifies the interval between regular logbuffer flushes,
persisting objects to disk. Logbuffer flushes can happen more often
if required.
* *log_rewrite_ratio*
- unit: ratio
- default: 0.5
- minimum: 0.001
specifies the minimum ratio of deleted to added objects (n_del /
n_add) in the log which triggers a log rewrite.
* *chunk_exponent* / *chunk_bytes*
- unit: bytes as a power of two / bytes
- default: 20 / 1 MB
- minimum: 12 / 4 KB
- maximum: 28 / 256 MB or <1/1024 of memsize
*chunk_bytes* and *chunk_exponent* are alternative ways to configure
the chunk size. If *chunk_bytes* is used, the value is rounded up to
the next power of two and used as if *chunk_exponent* was used with
the 2-logarithm of that value.
*chunk_bytes* / *chunk_exponent* are hard capped to less than 1/1024
of the memory cache size.
Using both arguments at the same time triggers a VCL error.
See `xbuddy.tune()`_ for additional details.
* *wait_table_exponent*
TL;DR: 2-logarithm of concurrency for initial reads of objects from
disk.
- unit: wait table entries as a power of two
- default: 10
- minimum: 6
- maximum: 32
When objects are initially read from disk after a cold start or
eviction from memory, condition variables are used to serialize
parallel requests to the same object, similar in effect to the
waitinglist mechanism in Varnish-Cache.
These condition variables are organized in a hash table. This
parameter specifies the 2-logarithm of that table's size.
Two to the power of this value represents an upper limit to the
number of objects read from disk in parallel. The actual limit can
be lower when hash collisions occur. The amount of memory used is
roughly 128 bytes times two to the power of this value.
Note: The wait table only concerns objects initially read from
disk. Once an object is read, its body data is read in parallel
independent of this limit.
* *lru_exponent*
TL;DR: 2-logarithm of number of LRU lists
- unit: number of LRU lists as a power of two
- default: 0
- minimum: 0
- maximum: 6
On large systems with mostly memory-bound access patterns, the LRU
list becomes the main point of contention, as segments are frequently
removed from and re-added to LRU.
A single LRU (``lru_exponent=0``) is most fair: only the absolute
least recently used segment is ever evicted. But more LRUs reduce
contention on the LRU lists significantly and improve parallelism of
evictions.
* *dsk_reserve_chunks*
- unit: scalar
- default: 4
- minimum: 2 MB / chunk_bytes
- maximum: dsksize / 8 / chunk_bytes
specifies a number of chunks to reserve on disk. The reserve is used
to fulfill storage requests when storage is otherwise full. Because
LRU cache eviction of disk objects is an expensive process involving
disk IO, a reserve helps keep response times for cache misses
low. It is also needed for the LRU algorithm itself, which, when the
fixed log space is full, might momentarily require additional space
before making room.
The value is always raised to a dynamic minimum such that the disk
reserve is at least 2MB.
The value is capped such that the number of reserved chunks times
the chunk size does not exceed 1/8 of the disk size.
* *mem_reserve_chunks*
- unit: scalar
- default: 1
- minimum: 0
- maximum: memsize / 8 / chunk_bytes
specifies a number of chunks to reserve in memory per LRU. The
reserve is used to provide memory for new objects or objects staged
from disk to memory when memory is otherwise full. It can help
reduce latencies in these situations at the expense of some memory
unavailable for caching.
The value is capped such that the number of reserved chunks times
the chunk size does not exceed 1/8 of the memory size.
* *objsize_max*
- unit: bytes
- default: 0
specifies the maximum object size which fellow will accept.
The default of ``0`` represents 1/4th of *dsksize*. It is strongly
recommended to not use a value higher than that.
The effectively enforced value is rounded up to 4KB.
* *objsize_update_min_log2_ratio*
- unit: bytes ratio as a power of two
- default: 1
- minimum: 1
- maximum: 64
**This parameter should only be changed if advised by a developer.**
It specifies the minimum binary logarithmic ratio between the
expected object size and the actual average object size to trigger
an update of the expectation.
fellow uses an expected object size to determine the required
capacity of vital data structures, in particular the fixed log
regions on disk. This object size estimate is initially set by the
administrator as the *objsize_hint* parameter (see
`slash.fellow()`_) and then possibly updated based on the actual
size of objects stored.
Updates of the internal *objsize_hint* are important to ensure that
fixed log regions are large enough to hold meta data about all
stored objects. On the other hand, they incur relevant cost because
recreation of fixed log regions may require a high number of cache
objects to be removed in order to free contiguous regions of disk
space.
Thus, the expected object size is only lowered when the rounded-down
2-logarithm of the actual average object size is at least
*objsize_update_min_log2_ratio* less than the rounded-down 2-logarithm
of the expected object size.
To illustrate: Suppose *objsize_hint* is given as 65KB, but the
actual average object size is 63KB. Then, with the
*objsize_update_min_log2_ratio* default of 1, the internal
*objsize_hint* will be lowered to 32KB. If
*objsize_update_min_log2_ratio* was set to 2, it would remain
unchanged.
* *objsize_update_max_log2_ratio*
- unit: bytes ratio as a power of two
- default: 3
- minimum: 1
- maximum: 64
**This parameter should only be changed if advised by a developer.**
It specifies the maximum binary logarithmic ratio between the actual
average object size and the expected object size to trigger an
update of the expectation.
The parameter concerns the opposite end of
*objsize_update_min_log2_ratio*: When the internal *objsize_hint* is
smaller than the actual average object size, disk space is wasted
for unused fixed log regions. Yet disk space is relatively cheap,
the amount of log space needed per object is relatively low
(typically ~400 to ~900 bytes) and the cost of recreating fixed log
regions is high, so the expected object size should only be
increased for a considerable space saving.
This parameter's default of 3 triggers an increase of the internal
*objsize_hint* only if the actual average object size is at least
2^3 = 8 times the current *objsize_hint*, comparing rounded-down
2-logarithms.
* *objsize_update_min_occupancy*
* *objsize_update_max_occupancy*
- unit: percent of disk storage occupied
- default: 25 / 75
- minimum: 0
- maximum: 100
These parameters specify the minimum and maximum percentage of used
disk space within which internal *objsize_hint* updates, as described
above, may be triggered. The minimum exists to ensure statistical
significance of the actual average object size; the maximum avoids
costly updates when the storage is highly occupied.
* *cram* is documented in `xbuddy.tune()`_
* *readahead*
- unit: scalar
- default: 5
- minimum: 0
- maximum: 31 or 1/16th of *memsize*
specifies how many additional segments of an object's body should be
staged into memory asynchronously before being required. This
parameter helps keep response times low and throughput high for
objects which are not already present in the memory cache.
The maximum is the lower of 31 or the value corresponding to 1/16th
of *memsize* divided by *chunk_bytes*.
Read ahead triggers whenever the number of read ahead segments is at
readahead / 2 (rounded down) or less. Thus, for the default value of
5, read ahead will, after the initial read of 5 segments, read 2
segments whenever 2 segments have been sent.
Note that, on a system with a decently sized memory cache, no disk
IO will happen for most requests. When segments are still in memory
cache, read ahead only references them. Disk IO is only needed for
segments which are accessed for the first time after a cache load or
LRU eviction.
* *discard_immediate*
- unit: bytes
- default: 256KB
- minimum: 4KB
minimum size for which to attempt to issue immediate discards of
disk blocks to be freed.
To disable immediate discards, use a number higher than your storage
size. For most users, 42PB will work to disable.
The discard implementation attempts these methods in order:
- ``ioctl(x, BLKDISCARD, ...)``
- ``fallocate(x, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, ...)``
Methods are tried once and disabled upon failure, until a tune
operation is executed which re-enables discard.
If possible, discard commands are issued asynchronously, but they
need to be completed before disk space can be re-used, so discards
can impose additional latency.
Discard operations are skipped when a space deficit exists.
The potential advantage is improved performance and reduced wear on
flash storage.
See :ref:`fallocate(2)` and :ref:`blkdiscard(8)`, which contain
related information, because there exists no man page for the
``BLKDISCARD`` :ref:`ioctl(2)`.
* *io_batch_min*, *io_batch_max*
- unit: I/O operations
- default: 8, 512
- minimum: 1
Minimum and maximum number of IO operations to batch within a single
submission to the kernel, where applicable.
Larger values save on system calls, but can increase latency.
* *hash_obj*, *hash_log*
- value: one of ``sha256``, ``xxh32``, ``xxh3_64``, ``xxh3_128``
- default: ``xxh3_64`` if xxhash > 0.8.0 has been compiled in,
``xxh32`` if xxhash > 0.7.3 has been compiled in,
``sha256`` otherwise
*hash_obj* specifies the hash algorithm to ensure data integrity of
objects and their data.
*hash_log* specifies the hash algorithm to ensure data integrity of
the log.
* *ioerr_obj*
- value: ``panic`` or ``purge``
- default: ``panic``
*ioerr_obj* selects the action taken when an IO error is encountered
while reading or writing object data, or when a checksum mismatch is
found for object data:
- ``panic`` aborts varnish with a panic
- ``purge`` purges the object from the cache
With ``purge``, consider the following consequences:
* Read errors may lead to delivery of truncated object bodies and/or
other hard delivery errors such as early connection closure.
.. XXX implement .prefetch() from VCL to allow control over it
* Depending on whether or not the object's segment list is present
in RAM, storage may remain allocated until the next restart.
* *ioerr_log*
*NOTE:* As of this release, this feature is not fully
implemented. IO errors may trigger ``panic`` mode even if another
mode is selected.
- value: ``panic`` or ``fail``
- default: ``panic``
*ioerr_log* selects the action taken when an IO error is encountered
while reading or writing the log, or when a checksum mismatch is
found for log data:
- ``panic`` aborts varnish with a panic
- ``fail`` causes all allocation requests to the stevedore to fail
  (`xfellow.storage()`_ returns ``NULL``)
* *allocerr_obj*
- value: ``panic`` or ``purge``
- default: ``panic``
*allocerr_obj* selects the action taken when insufficient
memory or storage is available for reading or writing object data:
- ``panic`` aborts varnish with a panic
- ``purge`` purges the object from the cache
For ``purge``, depending on whether or not the object's segment list
is present in RAM, storage may remain allocated until a restart.
Because the fellow storage is designed to not fail allocations under
normal circumstances and instead wait for LRU to make room,
``panic`` is intended also for production use.
* *allocerr_log*
*NOTE:* As of this release, this feature is not fully
implemented. IO errors may trigger ``panic`` mode even if another
mode is selected.
- value: ``panic`` or ``fail``
- default: ``panic``
*allocerr_log* selects the action taken when insufficient
memory or storage is available for reading or writing the log:
- ``panic`` aborts varnish with a panic
- ``fail`` causes all allocation requests to the stevedore to fail
  (`xfellow.storage()`_ returns ``NULL``)
Because the fellow storage is designed to not fail allocations under
normal circumstances and instead wait for LRU to make room,
``panic`` is intended also for production use.
* *panic_flags*
Used to increase verbosity of panic messages, read as a bit field:

- 0x01: dump the full fellow_cache_seg
- 0x02: dump the full fellow_cache_seglist / fellow_disk_seglist
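As a hedged sketch, tuning a handful of these parameters on a fellow
object like the hypothetical ``myfellow`` from `slash.fellow()`_
above (values are illustrative only)::

  sub vcl_init {
      myfellow.tune(
          readahead = 8,
          mem_reserve_chunks = 2,
          ioerr_obj = purge);
  }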
.. _xfellow.storage():
STEVEDORE xfellow.storage()
---------------------------
Return the fellow storage. Can be used to set it for storing a
backend response::

  set beresp.storage = myfellow.storage();
.. _xfellow.as_transient():
VOID xfellow.as_transient()
---------------------------
Set this fellow storage as the transient storage.
Restricted to: ``vcl_init``.
.. _slash.as_transient():
VOID as_transient(STEVEDORE)
----------------------------
Set this storage as the transient storage.
Restricted to: ``vcl_init``.
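A minimal sketch, assuming a global buddy storage named ``mem``::

  import slash;

  sub vcl_init {
      slash.as_transient(storage.mem);
  }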
.. _slash.tune_fellow():
STRING tune_fellow(STEVEDORE storage, [INT logbuffer_size], [DURATION logbuffer_flush_interval], [REAL log_rewrite_ratio], [INT chunk_exponent], [BYTES chunk_bytes], [INT wait_table_exponent], [INT lru_exponent], [INT dsk_reserve_chunks], [INT mem_reserve_chunks], [BYTES objsize_max], [INT objsize_update_min_log2_ratio], [INT objsize_update_max_log2_ratio], [INT objsize_update_min_occupancy], [INT objsize_update_max_occupancy], [INT cram], [INT readahead], [BYTES discard_immediate], [INT io_batch_min], [INT io_batch_max], [ENUM hash_obj], [ENUM hash_log], [ENUM ioerr_obj], [ENUM ioerr_log], [ENUM allocerr_obj], [ENUM allocerr_log], [INT panic_flags])
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
::

  STRING tune_fellow(
      STEVEDORE storage,
      [INT logbuffer_size],
      [DURATION logbuffer_flush_interval],
      [REAL log_rewrite_ratio],
      [INT chunk_exponent],
      [BYTES chunk_bytes],
      [INT wait_table_exponent],
      [INT lru_exponent],
      [INT dsk_reserve_chunks],
      [INT mem_reserve_chunks],
      [BYTES objsize_max],
      [INT objsize_update_min_log2_ratio],
      [INT objsize_update_max_log2_ratio],
      [INT objsize_update_min_occupancy],
      [INT objsize_update_max_occupancy],
      [INT cram],
      [INT readahead],
      [BYTES discard_immediate],
      [INT io_batch_min],
      [INT io_batch_max],
      [ENUM {sha256, xxh32, xxh3_64, xxh3_128} hash_obj],
      [ENUM {sha256, xxh32, xxh3_64, xxh3_128} hash_log],
      [ENUM {panic, purge} ioerr_obj],
      [ENUM {panic, fail} ioerr_log],
      [ENUM {panic, purge} allocerr_obj],
      [ENUM {panic, fail} allocerr_log],
      [INT panic_flags]
  )
Tune the given globally defined fellow storage; for all other
parameters, see `xfellow.tune()`_.
STATISTICS / COUNTERS
=====================
`buddy` and `fellow` expose statistics and counters which can be
observed with VSC clients like :ref:`varnishstat(1)`.
The counter documentation is available through :ref:`varnishstat(1)`
and the :ref:`slash-counters(7)` man page.
The ``g_dsk_*`` and ``g_mem_*`` gauges are updated at regular
intervals of *logbuffer_flush_interval*.
Interpreting Gauges and Background on Cache Behavior
----------------------------------------------------
The gauges ``g_mem_space`` and ``g_dsk_space`` give the number of free
bytes in memory and on disk; the ``*_bytes`` statistics give the
number of used bytes.
On a typical system which uses all of the available cache and evicts
objects mostly through LRU, these gauges should more or less stabilize
over time, which should become obvious when logging and graphing the
above values over longer time spans. But depending on how the cache is
used and tuned, that point might well be in the region of 70% and
below.
The fact that `fellow` does not, by default, attempt to use each and
every byte of the available cache is a deliberate decision:
To achieve optimal disk and network I/O throughput, object data should
be stored in contiguous regions. However, such a region might not
always be available, and `fellow` needs to make a decision if
returning a smaller region or waiting for LRU to make room is the
better option. Also, it might be better to return a smaller region
than to split a larger region, which could instead be used for a
larger object coming in later.
The *cram* parameter controls this trade off: If *cram* allows a
smaller segment, it is returned, otherwise the allocator needs to wait
for LRU to make room.
While higher absolute *cram* values improve space usage, they lead to
higher fragmentation and might negatively impact performance. Positive
*cram* values avoid using larger free regions for smaller
requests. Negative *cram* values do not.
See `xbuddy.tune()`_ for additional explanations on *cram*, tuning for
`fellow` happens through `xfellow.tune()`_.
Another factor is that the LRU algorithm pre-evicts segments and
objects from cache until ``mem_reserve_chunks`` have been reserved.
The important aspect here is that the reserved chunks are contiguous
in order to counteract fragmentation: LRU runs until there happens to
be enough contiguous space for each of the reserved chunks.
The smaller objects are relative to the chunk size, the more objects
need to be evicted for a contiguous chunk to become available.
This behavior can be controlled by adjusting ``chunk_exponent`` /
``chunk_bytes``. We recommend setting the chunk size larger than the
expected object size such that typical new objects will fit into
reserved chunks. However, if the goal is to maximize RAM cache usage,
the chunk size can be reduced at the expense of somewhat higher I/O
overhead and fragmentation.
The higher ``mem_reserve_chunks`` is set, the more aggressively LRU
will pre-evict objects in order to have space available for new
requests.
FELLOW DIAGNOSTICS
==================
`fellow` writes diagnostic information about initialization, the
initial load and log rewrites to :ref:`vsl(7)`.
To extract the relevant information, query the log in raw mode for
lines with tag ``Storage`` and no vxid (``vxid == 0``), as for example
with :ref:`varnishlog(1)`::

  varnishlog -t off -g raw -i Storage -q 'vxid == 0'
During startup, additional diagnostic information is written to
standard error (stderr).
Explanation of some commonly seen startup errors:
* ``open(...) failed: Permission denied``
Permissions on the storage path are not set correctly. See
`slash.fellow()`_ for how to set them.
* ``... is not a fellow file``
The first 4KB of the storage path are neither zero nor written by
fellow. This is a safeguard in order to avoid overwriting
potentially precious data. Either recreate the file/device or
overwrite the first 4KB with zeroes (as always, entirely at your own
risk, replace ``...`` with the fellow path)::

  dd if=/dev/zero of=... bs=4096 count=1
FELLOW CACHE LOADING
====================
Upon :ref:`varnishd(1)` startup with a globally configured `fellow`,
the log is read to recreate all persisted objects sparsely as *vampire
objects* (that is, only minimal metadata is added to the cache).
Until `fellow` is fully initialized and the cache loaded, the varnish
instance remains unusable. This is because free space on the storage
is implicitly defined as not being used by any object. Further
improvements of the initial load time might be possible, though.
To wait for cache loading to complete, the following methods can be
used:
* Wait for the ``FELLOW.<name>.b_happy`` bitfield from
:ref:`slash-counters(7)` to become non-zero.
* Wait for the ``storage.<name>.happy`` VCL variable to become true.
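For example, VCL along these lines could be used to fail fast until
loading completes (a sketch; adapt the storage name and status code)::

  sub vcl_recv {
      if (!storage.fast.happy) {
          return (synth(503, "cache still loading"));
      }
  }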
Cache loading can be observed using the following methods:
* By observing the :ref:`varnish-counters(7)` ``MAIN.n_objectcore``
and ``MAIN.n_vampireobject``. Note that to see the latter with
:ref:`varnishstat(1)` in interactive mode, the ``v`` key needs to be
pressed to select at least ``DIAG`` verbosity.
* By running :ref:`slashmap(1)` to observe how the disk space shown as
allocated fills up as the log is processed.
* By running the :ref:`varnishlog(1)` command given under `FELLOW
  DIAGNOSTICS`_. It will continuously display updates on the number of
  loaded objects like in this example::

    ...
    0 Storage - fellow fellow: resurrected 8231700
    0 Storage - fellow fellow: resurrected 8416700
    ...
  When loading is complete, a summary will be shown like::

    0 Storage - ... done: 53.485482s
    0 Storage - fellow fellow: first load t1 = 0.154415
    0 Storage - fellow fellow: 10010027 resurrected in 53.485892s (187160.720410/s), 431 already expired
FELLOW PLANNED BUT MISSING FEATURES
===================================
The following features are planned for implementation:
* Support some successor of xkey (additional cache keys)
* Further improve cache loading speed
Please see `README.rst`_ for how to support the project in order to
get them implemented.
FELLOW KNOWN ISSUES
===================
* With `fellow` storage on XFS, spurious read errors - most likely
short reads - have been observed. While short reads are technically
legal, handling them would complicate the `fellow` code
substantially, so a fix is currently not planned. If you require a
fix, please support the project and let us know; see ``CONTRIBUTING``
in `README.rst`_ for details.
For best performance, it is recommended to use `fellow` storage on a
raw device.
* On Linux with ``io_uring``, by default, `fellow` registers all of
the memory cache as buffers using
:ref:`io_uring_register_buffers(3)` to achieve optimal performance
at runtime, if supported by the system. Where supported, this
enables *zero-copy* IO, where the hardware performs DMA directly
into the `fellow` memory cache.
Buffer registrations happen in multiple threads in parallel, one for
each io ring.
During initialization, however, this takes considerable amounts of
time for larger memory caches.
If this is an issue for you, please ask the kernel developers to
make buffer registration more efficient.
If you are willing to sacrifice runtime performance for a faster
startup, :ref:`varnishd(1)` can be started with the environment
variable ``slash_fellow_options`` set to contain
``skip-uring-register-buffers``.
If the variable contains ``sync-uring-register-buffers``, buffer
registration is forced to serial, synchronous registration
operations.
Note that even with registered buffers, ``io_uring`` has nothing to
do with how the `fellow` memory cache and LRU on it work.
* Bug 3940_ causes :ref:`varnishd(1)` to hang if storage
initialization takes longer than the ``cli_timeout``.
For varnish-cache versions with the fix 3941_, set
``startup_timeout`` to a duration sufficient for `fellow` startup,
e.g. add to the :ref:`varnishd(1)` arguments::

  -p startup_timeout=3600
For varnish-cache versions without this fix, set ``cli_timeout``
instead, e.g. add to the :ref:`varnishd(1)` arguments::

  -p cli_timeout=3600
.. _3940: https://github.com/varnishcache/varnish-cache/issues/3940
.. _3941: https://github.com/varnishcache/varnish-cache/pull/3941
* Because `fellow` might use varnish threads for some or all IOs and
those might be issued in huge bursts, the infamous *Worker Pool
Queue does not move* panic is more likely to occur when there is
otherwise no problem. It is thus recommended to set the
``thread_pool_watchdog`` parameter to a value significantly higher
than the default, e.g. by adding to the :ref:`varnishd(1)`
arguments::

  -p 'thread_pool_watchdog=600'
FELLOW ADDITIONAL TUNING KNOBS
==============================
These options are not expected to ever require tuning, but exist just
in case:
* The environment variable ``fellow_log_io_entries`` can be used to
set the log io ring size, which is configured when the storage
engine starts. The default is 1024, values below 128 are not
generally recommended, and for higher values, the stack size will
likely need to be adjusted or stack overflows might occur.
Three leased log IO rings are used for reading and writing log data.
* Likewise, the environment variable ``fellow_cache_io_entries`` can
be used to set the cache io ring size.
A single shared IO ring is used for reading and writing object data.
Both options affect all IO backends, but in different ways:
* For io_uring, they set the submission and completion ring sizes,
which, simply put, define the maximum number of IOs to be handled
through a single system call. With io_uring, this specifically does
not affect the maximum number of IOs "in flight".
* For the other IO backends, they define the maximum number of IOs "in
flight".
LOADMASTER VMOD INTERFACES
==========================
We call storage routers loadmasters because they coordinate
stevedores.
.. _slash.loadmaster_rr():
new xloadmaster_rr = slash.loadmaster_rr()
------------------------------------------
Defines a round-robin loadmaster which allocates objects from
associated storages in turn. If the preferred, round-robin selected
storage fails, other storages are tried in order until one succeeds,
if at all.
For performance reasons, the implementation does not serialize
requests, so concurrent requests might momentarily receive object
allocations from the same storage. This effect should average out.
.. _xloadmaster_rr.add_storage():
VOID xloadmaster_rr.add_storage(STEVEDORE)
------------------------------------------
Add a storage to the loadmaster.
Restricted to: ``vcl_init``.
.. _xloadmaster_rr.storage():
STEVEDORE xloadmaster_rr.storage()
----------------------------------
Return a reference to the loadmaster, mostly for use with ``set
beresp.storage = loadmaster.storage()``.
.. _slash.loadmaster_hash():
new xloadmaster_hash = slash.loadmaster_hash()
----------------------------------------------
Defines a hashing loadmaster which selects the preferred storage by
taking the first four bytes of the object's hash key (basically
``req.hash``) modulo the number of storages defined.
As with `slash.loadmaster_rr()`_, if the preferred storage fails,
other storages are tried in order until one succeeds, if at all.
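Usage mirrors the round-robin loadmaster; a minimal sketch, assuming
global storages ``A`` and ``B``::

  import slash;

  sub vcl_init {
      new lmh = slash.loadmaster_hash();
      lmh.add_storage(storage.A);
      lmh.add_storage(storage.B);
  }

  sub vcl_backend_response {
      set beresp.storage = lmh.storage();
  }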
.. _xloadmaster_hash.add_storage():
VOID xloadmaster_hash.add_storage(STEVEDORE)
--------------------------------------------
Same as `xloadmaster_rr.add_storage()`_.
Restricted to: ``vcl_init``.
.. _xloadmaster_hash.storage():
STEVEDORE xloadmaster_hash.storage()
------------------------------------
Same as `xloadmaster_rr.storage()`_.
SEE ALSO
========
:ref:`vcl(7)`, :ref:`varnishd(1)`