Rework fellow_cache_obj_iter and read ahead

Issue #41 revealed a deadlock scenario in which various object
iterators would wait for memory.

While reviewing this issue, we noticed a couple of shortcomings in the
existing code:

* fellow_cache_seg_ref_in() would always wait for allocation requests
  for readahead segments. Yet, under memory pressure, readahead should
  not wait for memory at all.

* fellow_cache_obj_iter() would keep references on already sent
  segments even while waiting for synchronous I/O and memory
  allocations.

To address these shortcomings and further optimize the code, parts of
fellow_cache_obj_iter() and all of the readahead code have been
rewritten. The improvements comprise the following:

* For read ahead, we now use asynchronous memory allocations. If they
  succeed right away, we also issue I/O right away, but if allocations
  are delayed, we continue delivery and check back later. With any
  luck, the allocations will have completed by then. (A sketch of this
  pattern follows after this list.)

* We decouple memory allocations from specific segments and only care
  about the allocation having the right size. Because many segments
  will be of chunk_bytes size, this allows more efficient use of the
  available asynchronous allocations.

* We now de-reference already sent segments whenever we need to wait
  for anything, be it a memory allocation or I/O. This should help
  overall efficiency and reduce memory pressure, because already sent
  segments can be LRUd earlier.

  The drawback is that we flush the VDP pipeline more often (we need
  to flush before we can dereference segments).
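
The gist of the non-blocking readahead flow is sketched below. This is
a minimal, self-contained illustration only, not the actual
fellow_cache.c code: the names (ra_slot, ra_slot_poll, RA_SLOTS,
CHUNK_BYTES) are made up, malloc() stands in for the asynchronous
allocator, and the deref-before-wait handling is not shown.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_BYTES	(1UL << 20)	/* example size only */
#define RA_SLOTS	5		/* e.g. the readahead parameter */

/* one pending allocation; bound to a size, not to a particular segment */
struct ra_slot {
	void	*mem;	/* NULL while the asynchronous allocation is pending */
};

/*
 * Stand-in for the asynchronous allocator: the real code queues a
 * request which completes later; here we just try malloc() once per
 * poll and treat NULL as "still pending".
 */
static void
ra_slot_poll(struct ra_slot *s)
{
	if (s->mem == NULL)
		s->mem = malloc(CHUNK_BYTES);
}

int
main(void)
{
	struct ra_slot pool[RA_SLOTS];
	unsigned seg, i, issued;

	memset(pool, 0, sizeof pool);

	for (seg = 0; seg < 8; seg++) {
		for (i = 0; i < RA_SLOTS; i++)
			ra_slot_poll(&pool[i]);
		issued = 0;
		for (i = 0; i < RA_SLOTS; i++) {
			if (pool[i].mem == NULL)
				continue;
			/* memory is ready: issue readahead I/O right
			 * away and hand the buffer to a segment */
			free(pool[i].mem);	/* placeholder for the read */
			pool[i].mem = NULL;
			issued = 1;
			break;
		}
		/* never wait for readahead memory: deliver and retry later */
		printf("segment %u delivered, readahead %s\n",
		    seg, issued ? "issued" : "deferred");
	}
	for (i = 0; i < RA_SLOTS; i++)
		free(pool[i].mem);
	return (0);
}
```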

We also cap the readahead parameter at the equivalent of 1/16 of the
memory cache in order to avoid inefficiencies caused by single
requests holding too much of the memory cache hostage.

An additional hard cap of 31 is required to keep the default ESI
depth supported with the default stack size of varnish-cache.
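
To make the cap arithmetic concrete, here is a small standalone
illustration (example numbers only), mirroring the
sz = tune->memsz >> (tune->chunk_exponent + 4) computation from the
stvfe_tune_check() change below; with larger memory caches, the
absolute maximum of 31 from the TUNE definition is what effectively
limits the parameter.

```c
#include <stdio.h>

int
main(void)
{
	/* example values only: 256 MiB memory cache, 1 MiB chunk_bytes */
	unsigned long long memsz = 256ULL << 20;
	unsigned chunk_exponent = 20;
	unsigned readahead = 31;	/* already at the absolute maximum */

	/* number of chunks in 1/16 of the memory cache */
	unsigned long long cap = memsz >> (chunk_exponent + 4);

	if (readahead > cap)
		readahead = (unsigned)cap;

	/* prints: cap = 16 chunks, effective readahead = 16 */
	printf("cap = %llu chunks, effective readahead = %u\n",
	    cap, readahead);
	return (0);
}
```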

@@ -25,6 +25,19 @@ fellow
 * Improved code coverage and added Coverity for additional linting.
+* Added an absolute maximum of 31 and dynamic maximum to the readahead
+  parameter to avoid single object deliveries holding more than 1/16
+  of the available memory cache.
+* The readahead implementation has been changed to only run when less
+  than or equal to half (rounded down) the configured read ahead
+  segments are already available.
+* The ``readahead`` parameter default has been changed from 2 to 5 to
+  enable the efficiency improvement by the aforementioned change: As 5
+  / 2 = 2, read ahead will trigger for every 2 segments, instead of
+  for every segment.
+* Added a dynamic minimum to the dsk_reserve_chunks parameter to
+  always keep the reserve at 2MB minimum. This is required for stable
+  operation of LRU when the log is full.

@@ -88,10 +88,21 @@ stvfe_tune_check(struct stvfe_tune *tune)
 	l = (unsigned)sz;
 	if (tune->mem_reserve_chunks > l) {
 		fprintf(stderr,"fellow: mem_reserve_chunks limited to %u "
-		    "(less than 1/8 of memory size)\n", l);
+		    "(less than 1/8 of memory size per lru)\n", l);
 		tune->mem_reserve_chunks = l;
 	}
+	sz = tune->memsz >> (tune->chunk_exponent + 4);
+	if (tune->readahead > sz) {
+		assert(sz <= UINT_MAX);
+		l = (unsigned)sz;
+		fprintf(stderr,"fellow: readahead limited to "
+		    "%u chunks * %zu chunk_bytes (%u chunk_exponent)"
+		    " be less than 1/16 of memory\n",
+		    l, (size_t)1 << tune->chunk_exponent, tune->chunk_exponent);
+		tune->readahead = l;
+	}
 	// 2MB
 	if (tune->chunk_exponent < 21U) {
 		l = 1U << (21U - tune->chunk_exponent);

@@ -48,7 +48,8 @@ TUNE(unsigned, mem_reserve_chunks, 1, 0, UINT_MAX);
 TUNE(size_t, objsize_hint, 256 * 1024, 4096, SIZE_MAX);
 TUNE(size_t, objsize_max, 0, 0, SIZE_MAX);
 TUNE(size_t, discard_immediate, 256 * 1024, 4096, SIZE_MAX);
-TUNE(unsigned, readahead, 2, 0, UINT_MAX);
+// 31 is safe max for stack usage, further limited by memsz
+TUNE(unsigned, readahead, 5, 0, 31);
 TUNE(unsigned, io_batch_min, 8, 1, UINT_MAX);
 // right now, the io ring size is hardcoded to 1024, so 512 is half that
 TUNE(unsigned, io_batch_max, 512, 1, UINT_MAX);

@@ -672,14 +672,29 @@ fellow storage can be fine tuned:
 * *readahead*
 
   - unit: scalar
-  - default: 2
+  - default: 5
   - minimum: 0
+  - maximum: 31 or 1/16th of *memsize*
 
   specifies how many additional segments of an object's body should be
   staged into memory asynchronously before being required. This
   parameter helps keeping response times low and throughput high for
   objects which are not already present in the memory cache.
 
+  The maximum is the lower of 31 or the value corresponding to 1/16th
+  of *memsize* divided by *chunk_bytes*.
+
+  Read ahead triggers whenever the number of read ahead segments is at
+  readahead / 2 (rounded down) or less. Thus, for the default value of
+  5, read ahead will, after the initial read of 5 segments, read 2
+  segments whenever 2 segments have been sent.
+
+  Note that, on a system with a decently sized memory cache, no disk
+  IO will happen for most requests. When segments are still in memory
+  cache, read ahead only references them. Disk IO is only needed for
+  segments which are accessed for the first time after a cache load or
+  LRU eviction.
+
 * *discard_immediate*
 
   - unit: bytes
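
The trigger rule documented above boils down to a one-line condition.
The helper below uses a hypothetical name purely to illustrate the
documented behaviour; it is not part of the fellow API.

```c
#include <stdio.h>

/* readahead_should_run() is a made-up name for illustration only */
static int
readahead_should_run(unsigned configured, unsigned available)
{
	/* run only while at most half (rounded down) of the configured
	 * readahead segments are still available */
	return (available <= configured / 2);
}

int
main(void)
{
	unsigned avail;

	/* with the default of 5: read ahead for 0, 1 and 2 available,
	 * skip for 3, 4 and 5 */
	for (avail = 0; avail <= 5; avail++)
		printf("available=%u -> %s\n", avail,
		    readahead_should_run(5, avail) ? "read ahead" : "skip");
	return (0);
}
```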

@@ -608,14 +608,29 @@ fellow storage can be fine tuned:
 * *readahead*
 
   - unit: scalar
-  - default: 2
+  - default: 5
   - minimum: 0
+  - maximum: 31 or 1/16th of *memsize*
 
   specifies how many additional segments of an object's body should be
   staged into memory asynchronously before being required. This
   parameter helps keeping response times low and throughput high for
   objects which are not already present in the memory cache.
 
+  The maximum is the lower of 31 or the value corresponding to 1/16th
+  of *memsize* divided by *chunk_bytes*.
+
+  Read ahead triggers whenever the number of read ahead segments is at
+  readahead / 2 (rounded down) or less. Thus, for the default value of
+  5, read ahead will, after the initial read of 5 segments, read 2
+  segments whenever 2 segments have been sent.
+
+  Note that, on a system with a decently sized memory cache, no disk
+  IO will happen for most requests. When segments are still in memory
+  cache, read ahead only references them. Disk IO is only needed for
+  segments which are accessed for the first time after a cache load or
+  LRU eviction.
+
 * *discard_immediate*
 
   - unit: bytes