Rework fellow_cache_obj_iter and read ahead

Issue #41 revealed a deadlock scenario in which various object
iterators would wait for memory.

While reviewing this issue, we noticed a couple of shortcomings in the
existing code:

* fellow_cache_seg_ref_in() would always wait for allocation requests
  for readahead segments. Yet, under memory pressure, readahead should
  not wait for memory at all.

* fellow_cache_obj_iter() would keep references on already sent
  segments even while waiting for synchronous I/O and memory
  allocations.

To address these shortcomings and further optimize the code, parts of
fellow_cache_obj_iter() and all of the readahead code have been
rewritten. The improvements comprise the following:

* For read ahead, we now use asynchronous memory allocations. If they
  succeed right away, we also issue I/O right away, but if allocations
  are delayed, we continue delivery and check back later. With any
  luck, the allocations will have completed by then. (A sketch of this
  pattern follows after this list.)

* We decouple memory allocations from specific segments and only care
  about the allocation having the right size. Because many segments
  will be of chunk_bytes size, this allows more efficient use of the
  available asynchronous allocations.

* We now de-reference already sent segments whenever we need to wait
  for anything, be it a memory allocation or I/O. This should help
  overall efficiency and reduce memory pressure, because already sent
  segments can be LRUd earlier.

  The drawback is that we flush the VDP pipeline more often (we need
  to flush before we can dereference segments).
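
The gist of the non-blocking readahead flow is sketched below. This is
a minimal, self-contained illustration only, not the actual
fellow_cache.c code: the names (ra_slot, ra_slot_poll, RA_SLOTS,
CHUNK_BYTES) are made up, malloc() stands in for the asynchronous
allocator, and the deref-before-wait handling is not shown.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_BYTES	(1UL << 20)	/* example size only */
#define RA_SLOTS	5		/* e.g. the readahead parameter */

/* one pending allocation; bound to a size, not to a particular segment */
struct ra_slot {
	void	*mem;	/* NULL while the asynchronous allocation is pending */
};

/*
 * Stand-in for the asynchronous allocator: the real code queues a
 * request which completes later; here we just try malloc() once per
 * poll and treat NULL as "still pending".
 */
static void
ra_slot_poll(struct ra_slot *s)
{
	if (s->mem == NULL)
		s->mem = malloc(CHUNK_BYTES);
}

int
main(void)
{
	struct ra_slot pool[RA_SLOTS];
	unsigned seg, i, issued;

	memset(pool, 0, sizeof pool);

	for (seg = 0; seg < 8; seg++) {
		for (i = 0; i < RA_SLOTS; i++)
			ra_slot_poll(&pool[i]);
		issued = 0;
		for (i = 0; i < RA_SLOTS; i++) {
			if (pool[i].mem == NULL)
				continue;
			/* memory is ready: issue readahead I/O right
			 * away and hand the buffer to a segment */
			free(pool[i].mem);	/* placeholder for the read */
			pool[i].mem = NULL;
			issued = 1;
			break;
		}
		/* never wait for readahead memory: deliver and retry later */
		printf("segment %u delivered, readahead %s\n",
		    seg, issued ? "issued" : "deferred");
	}
	for (i = 0; i < RA_SLOTS; i++)
		free(pool[i].mem);
	return (0);
}
```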

We also cap the readahead parameter at the equivalent of 1/16 of the
memory cache in order to avoid inefficiencies caused by single
requests holding too much of the memory cache hostage.

An additional hard cap of 31 is required to keep the default ESI
depth supported with the default stack size of varnish-cache.
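
To make the cap arithmetic concrete, here is a small standalone
illustration (example numbers only), mirroring the
sz = tune->memsz >> (tune->chunk_exponent + 4) computation from the
stvfe_tune_check() change below; with larger memory caches, the
absolute maximum of 31 from the TUNE definition is what effectively
limits the parameter.

```c
#include <stdio.h>

int
main(void)
{
	/* example values only: 256 MiB memory cache, 1 MiB chunk_bytes */
	unsigned long long memsz = 256ULL << 20;
	unsigned chunk_exponent = 20;
	unsigned readahead = 31;	/* already at the absolute maximum */

	/* number of chunks in 1/16 of the memory cache */
	unsigned long long cap = memsz >> (chunk_exponent + 4);

	if (readahead > cap)
		readahead = (unsigned)cap;

	/* prints: cap = 16 chunks, effective readahead = 16 */
	printf("cap = %llu chunks, effective readahead = %u\n",
	    cap, readahead);
	return (0);
}
```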

@@ -25,6 +25,19 @@ fellow
 * Improved code coverage and added Coverity for additional linting.
+* Added an absolute maximum of 31 and dynamic maximum to the readahead
+  parameter to avoid single object deliveries holding more than 1/16
+  of the available memory cache.
+* The readahead implementation has been changed to only run when less
+  than or equal to half (rounded down) the configured read ahead
+  segments are already available.
+* The ``readahead`` parameter default has been changed from 2 to 5 to
+  enable the efficiency improvement by the aforementioned change: As 5
+  / 2 = 2, read ahead will trigger for every 2 segments, instead of
+  for every segment.
+* Added a dynamic minimum to the dsk_reserve_chunks parameter to
+  always keep the reserve at 2MB minimum. This is required for stable
+  operation of LRU when the log is full.

@@ -88,10 +88,21 @@ stvfe_tune_check(struct stvfe_tune *tune)
 	l = (unsigned)sz;
 	if (tune->mem_reserve_chunks > l) {
 		fprintf(stderr,"fellow: mem_reserve_chunks limited to %u "
-		    "(less than 1/8 of memory size)\n", l);
+		    "(less than 1/8 of memory size per lru)\n", l);
 		tune->mem_reserve_chunks = l;
 	}
+	sz = tune->memsz >> (tune->chunk_exponent + 4);
+	if (tune->readahead > sz) {
+		assert(sz <= UINT_MAX);
+		l = (unsigned)sz;
+		fprintf(stderr,"fellow: readahead limited to "
+		    "%u chunks * %zu chunk_bytes (%u chunk_exponent)"
+		    " be less than 1/16 of memory\n",
+		    l, (size_t)1 << tune->chunk_exponent, tune->chunk_exponent);
+		tune->readahead = l;
+	}
 	// 2MB
 	if (tune->chunk_exponent < 21U) {
 		l = 1U << (21U - tune->chunk_exponent);

@@ -48,7 +48,8 @@ TUNE(unsigned, mem_reserve_chunks, 1, 0, UINT_MAX);
 TUNE(size_t, objsize_hint, 256 * 1024, 4096, SIZE_MAX);
 TUNE(size_t, objsize_max, 0, 0, SIZE_MAX);
 TUNE(size_t, discard_immediate, 256 * 1024, 4096, SIZE_MAX);
-TUNE(unsigned, readahead, 2, 0, UINT_MAX);
+// 31 is safe max for stack usage, further limited by memsz
+TUNE(unsigned, readahead, 5, 0, 31);
 TUNE(unsigned, io_batch_min, 8, 1, UINT_MAX);
 // right now, the io ring size is hardcoded to 1024, so 512 is half that
 TUNE(unsigned, io_batch_max, 512, 1, UINT_MAX);

@@ -672,14 +672,29 @@ fellow storage can be fine tuned:
 * *readahead*
 
   - unit: scalar
-  - default: 2
+  - default: 5
   - minimum: 0
+  - maximum: 31 or 1/16th of *memsize*
 
   specifies how many additional segments of an object's body should be
   staged into memory asynchronously before being required. This
   parameter helps keeping response times low and throughput high for
   objects which are not already present in the memory cache.
 
+  The maximum is the lower of 31 or the value corresponding to 1/16th
+  of *memsize* divided by *chunk_bytes*.
+
+  Read ahead triggers whenever the number of read ahead segments is at
+  readahead / 2 (rounded down) or less. Thus, for the default value of
+  5, read ahead will, after the initial read of 5 segments, read 2
+  segments whenever 2 segments have been sent.
+
+  Note that, on a system with a decently sized memory cache, no disk
+  IO will happen for most requests. When segments are still in memory
+  cache, read ahead only references them. Disk IO is only needed for
+  segments which are accessed for the first time after a cache load or
+  LRU eviction.
+
 * *discard_immediate*
 
   - unit: bytes
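
The trigger rule documented above boils down to a one-line condition.
The helper below uses a hypothetical name purely to illustrate the
documented behaviour; it is not part of the fellow API.

```c
#include <stdio.h>

/* readahead_should_run() is a made-up name for illustration only */
static int
readahead_should_run(unsigned configured, unsigned available)
{
	/* run only while at most half (rounded down) of the configured
	 * readahead segments are still available */
	return (available <= configured / 2);
}

int
main(void)
{
	unsigned avail;

	/* with the default of 5: read ahead for 0, 1 and 2 available,
	 * skip for 3, 4 and 5 */
	for (avail = 0; avail <= 5; avail++)
		printf("available=%u -> %s\n", avail,
		    readahead_should_run(5, avail) ? "read ahead" : "skip");
	return (0);
}
```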

@@ -608,14 +608,29 @@ fellow storage can be fine tuned:
 * *readahead*
 
   - unit: scalar
-  - default: 2
+  - default: 5
   - minimum: 0
+  - maximum: 31 or 1/16th of *memsize*
 
   specifies how many additional segments of an object's body should be
   staged into memory asynchronously before being required. This
   parameter helps keeping response times low and throughput high for
   objects which are not already present in the memory cache.
 
+  The maximum is the lower of 31 or the value corresponding to 1/16th
+  of *memsize* divided by *chunk_bytes*.
+
+  Read ahead triggers whenever the number of read ahead segments is at
+  readahead / 2 (rounded down) or less. Thus, for the default value of
+  5, read ahead will, after the initial read of 5 segments, read 2
+  segments whenever 2 segments have been sent.
+
+  Note that, on a system with a decently sized memory cache, no disk
+  IO will happen for most requests. When segments are still in memory
+  cache, read ahead only references them. Disk IO is only needed for
+  segments which are accessed for the first time after a cache load or
+  LRU eviction.
+
 * *discard_immediate*
 
   - unit: bytes