- 07 Feb, 2024 40 commits
-
-
Nils Goroll authored
-
Nils Goroll authored
-
Nils Goroll authored
-
Nils Goroll authored
... such that LRU, which is operating on the temporary log, can make room. Ref #28
-
Nils Goroll authored
Ref #28
-
Nils Goroll authored
Hopefully, this also contributes to a solution for #28
-
Nils Goroll authored
Otherwise it looks like a rewrite would leak log blocks.
-
Nils Goroll authored
-
Nils Goroll authored
-
Nils Goroll authored
it is more important than objects.
Should also contribute to a fix for #28
-
Nils Goroll authored
This, hopefully, is part of a possible solution to the nasty issue #28:

When we do not have a sufficiently large pre-allocated log (log region) as determined by objsize_hint in relation to the storage size, we need to dynamically allocate disk blocks while we flush the log. When the log flush includes object deletions (in particular when triggered from the disk LRU), we run into a typical deadlock: To complete the transaction to free space, we need the space...

This commit is part of an attempt to make this work by allocating space early on: When we only have 20% of the log region left, we start to reserve more blocks for the log.

The problem can, for example, be reproduced with an objsize_hint of 1MB and an actual object size on the order of 32KB.

Ref #28
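The early-reservation rule described in this message can be sketched as follows. This is a minimal illustration only, not the code from the commit: `struct log_region`, its fields and `log_reserve_want()` are hypothetical names, and the batch size is an assumed parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a pre-allocated log region (not fellow's struct) */
struct log_region {
	size_t total_n;		/* blocks in the pre-allocated log region */
	size_t free_n;		/* blocks of the region still unused */
	size_t reserve_n;	/* extra blocks already reserved for the log */
};

/*
 * Return how many additional blocks should be reserved now.
 * Nothing is reserved while more than 20% of the region is free;
 * at or below 20%, the reserve is topped up to batch_n blocks.
 */
static size_t
log_reserve_want(const struct log_region *lr, size_t batch_n)
{
	assert(lr->free_n <= lr->total_n);

	/* free_n / total_n > 20%, written without floating point */
	if (lr->free_n * 5 > lr->total_n)
		return (0);
	if (lr->reserve_n >= batch_n)
		return (0);
	return (batch_n - lr->reserve_n);
}
```

Reserving while the region still has free blocks means the blocks are already in hand when a flush containing deletions later needs them, which is the point of allocating early.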
-
Nils Goroll authored
Manually tested with this modification:

diff --git a/src/fellow_log.c b/src/fellow_log.c
index 6075d81..45da269 100644
--- a/src/fellow_log.c
+++ b/src/fellow_log.c
@@ -1696,6 +1696,9 @@ fellow_io_regions_discard(struct fellow_fd *ffd, void *ioctx,
 		r = fallocate(ffd->fd,
 		    FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
 		    (off_t)todo->offset, (off_t)todo->len);
+		// XXX TEST
+		r = 1;
+		errno = EOPNOTSUPP;
 		if (r == 0) {
 			if ((ffd->cap & FFD_CAN_FALLOCATE_PUNCH_URING) == 0) {
 				ffd->diag("fellow: fallocate punch"

Fixes #38
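For context, the test modification above simulates a file system that rejects hole punching. A minimal, Linux-specific sketch of how a caller might classify the outcome of such a discard is shown here. `discard_range()` and `enum discard_result` are hypothetical names for illustration and do not reflect how fellow_io_regions_discard() is actually structured.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <sys/types.h>

/* Hypothetical result classification, for illustration only */
enum discard_result {
	DISCARD_OK,		/* hole was punched */
	DISCARD_UNSUPPORTED,	/* file system cannot punch holes */
	DISCARD_ERROR		/* any other failure, see errno */
};

/*
 * Try to punch a hole without changing the file size.
 * EOPNOTSUPP is reported separately so that the caller can stop
 * trying (or use another discard method) instead of failing hard.
 */
static enum discard_result
discard_range(int fd, off_t offset, off_t len)
{
	int r;

	r = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE,
	    offset, len);
	if (r == 0)
		return (DISCARD_OK);
	if (errno == EOPNOTSUPP)
		return (DISCARD_UNSUPPORTED);
	return (DISCARD_ERROR);
}
```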
-
Nils Goroll authored
to make clear that we understand exactly what is happening.
-
Nils Goroll authored
For streaming busy objects, we basically rely on the varnish-cache ObjExtend() / ObjWaitExtend() API to never read past the object: In fellow_stream_f(), we always wait for more data (or the end of the object) before returning, such that fellow_cache_obj_iter(), which iterates over segments, should never touch a segment past the final FCS_BUSY segment.

Yet it did, by means of the read-ahead and the peek-ahead to determine whether or not OBJ_ITER_END should be signaled.

We fix this issue by reading/peeking ahead only for segments with a state beyond FCS_BUSY.

There is now also extensive test infrastructure to specifically test concurrent access to busy objects. To keep layers separate, fellow_cache_test uses a lightweight signal/wait implementation analogous to the ObjExtend() / ObjWaitExtend() Varnish-Cache interface.

An earlier version of t_busyobj() had run on my dev laptop for 3.5 hours without crashing, while without the fixes it had run into assertion failures within seconds.

Fixes #35 and #36 (I hope)
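The rule "read/peek ahead only for segments with a state beyond FCS_BUSY" can be illustrated with a small sketch. The enum, struct and function below are hypothetical stand-ins, not fellow's data structures: `SEG_BUSY` plays the role of FCS_BUSY and `SEG_READY` stands for any later state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical segment states, ordered like a state machine */
enum seg_state {
	SEG_INIT = 0,	/* not yet used */
	SEG_BUSY,	/* still being written by the fetch */
	SEG_READY	/* any state beyond busy: safe to read */
};

struct seg {
	enum seg_state state;
	size_t len;
};

/*
 * Return how many segments after segs[cur] may be read or peeked
 * ahead, at most max_ahead. The scan stops at the first segment
 * which is busy or not yet in use, so the iterator never touches
 * a segment the writer may still modify.
 */
static size_t
seg_readahead_limit(const struct seg *segs, size_t n, size_t cur,
    size_t max_ahead)
{
	size_t ahead = 0;

	assert(cur < n);
	while (ahead < max_ahead && cur + ahead + 1 < n &&
	    segs[cur + ahead + 1].state > SEG_BUSY)
		ahead++;
	return (ahead);
}
```

With a limit of zero, the iterator cannot decide about the end of the object by peeking and has to wait for the writer instead, which matches the waiting behaviour described for fellow_stream_f().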
-
Nils Goroll authored
-
Nils Goroll authored
-
Nils Goroll authored
... to make it easier to follow the code in fellow_cache_test.
Motivated by #35
-
Nils Goroll authored
-
Nils Goroll authored
... such that the total reserve is no less than 2MB. This is required for stable operation of LRU when the log is full. Ref #28
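As a worked example of the 2MB lower bound: the number of reserved blocks can be clamped so that their total size never drops below 2MB. The function name and the block size parameter are assumptions made for this sketch and are not taken from the fellow sources.

```c
#include <assert.h>
#include <stddef.h>

/* Lower bound for the total reserve as stated above: 2MB */
#define MIN_RESERVE_BYTES ((size_t)2 << 20)

/*
 * Clamp the wanted number of reserve blocks such that
 * n * blksz >= 2MB. blksz is the (assumed) block size in bytes.
 */
static size_t
reserve_blocks(size_t want_n, size_t blksz)
{
	size_t min_n;

	assert(blksz > 0);
	/* round up: a partial block still needs a whole block */
	min_n = (MIN_RESERVE_BYTES + blksz - 1) / blksz;
	return (want_n < min_n ? min_n : want_n);
}
```

For an assumed block size of 4096 bytes, the minimum works out to 512 blocks.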
-
Nils Goroll authored
-
Nils Goroll authored
Should be irrelevant in practice, because we would not flush a single block during startup.
-
Nils Goroll authored
When some blocks were already allocated, we would fail to use all of the log region, that is, the newly added assertion

	if (n > 0) AZ(logreg->free_n);

would fail.

This left some blocks of the log region unused, but was insignificant otherwise.
-
Nils Goroll authored
Unfortunately, this was present even in the initial public release 58ec40f9.

This issue should have had no production impact, but it made hunting down bugs unnecessarily hard.
-
Nils Goroll authored
When we work on the last segment, the remaining length is zero, but we still have a current pointer and length.

This was a particularly annoying glitch because I wrote almost the same code for varnish-cache with the equivalent assertion in the right place :( Sorry

Ref https://github.com/varnishcache/varnish-cache/pull/4013/commits/8ec77190d91603c8f0dead0cee013e3c9ca8fa78#diff-f79cfeda8456789ae873270aefa58e8f1e94213ee16d32ea96b8db8a7013ebf8R790

Closes #34
-
Nils Goroll authored
it is planned to replace the "inuse" tri-state and might turn out helpful for debugging.
-
Nils Goroll authored
-
Nils Goroll authored
https://github.com/varnishcache/varnish-cache/pull/4013 fixes two issues in Varnish-Cache, which are relevant for SLASH/fellow and of which the first is the root cause of #33.

This commit works around these issues until the fix gets merged:

Because of the wrong use of the .objtrimstore API function by varnish-cache, we remove it from our obj_methods and exploit the fact that varnish-cache always sets the OA_LEN attribute when the object is complete: We move the trimstore function there, effectively calling it at the right time only.

The inefficient memory allocation fixed in the second commit of VC#4013 is particularly relevant for fellow, because it causes the allocation code to assume that the object might grow up to the maximum possible size, which causes a substantial over-allocation. We work around this issue for the case that a 304 copy is made from fellow to fellow by using private thread-local storage to emulate basically the same function as the #4013 fix.

Closes #33
Ref https://github.com/varnishcache/varnish-cache/pull/4013
-
Nils Goroll authored
Ref #33 Ref https://github.com/varnishcache/varnish-cache/pull/4013
-
Nils Goroll authored
-
Nils Goroll authored
-
Nils Goroll authored
-
Nils Goroll authored
These would have made analyzing #33 much easier. :|
-
Nils Goroll authored
motivated by #32
-
Nils Goroll authored
-
Nils Goroll authored
-
Nils Goroll authored
-
Nils Goroll authored
-
Nils Goroll authored
Spotted by Thomas Gleixner <tglx@linutronix.de>, THANK YOU

forkrun() never properly handled the case that a child exited before the timeout expired, because we had failed to block the signal and hence never received a SIGCHLD.

This was overlooked because this functionality was never relevant (it only delayed test execution) and because we did not explicitly test it.

Related to #31
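The general pattern behind this fix, blocking SIGCHLD before fork() so that the signal stays pending and can be collected with sigtimedwait(), can be sketched as below. This is not the forkrun() implementation: `run_with_timeout()`, the return value convention and the two example child functions are assumptions for illustration, and the sketch relies on Linux behaviour where a blocked SIGCHLD is kept pending even with the default disposition.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Run fn() in a child process and wait at most timeout_s seconds.
 * Returns the child's exit status, -1 if it was killed after the
 * timeout, or -2 on other errors.
 */
static int
run_with_timeout(int (*fn)(void), time_t timeout_s)
{
	struct timespec ts = { timeout_s, 0 };
	sigset_t chld, old;
	int status = 0, r;
	pid_t pid;

	sigemptyset(&chld);
	sigaddset(&chld, SIGCHLD);
	/* block before fork(): a fast child's SIGCHLD stays pending */
	if (sigprocmask(SIG_BLOCK, &chld, &old) != 0)
		return (-2);

	pid = fork();
	if (pid < 0) {
		(void)sigprocmask(SIG_SETMASK, &old, NULL);
		return (-2);
	}
	if (pid == 0)
		_exit(fn());

	/* simplification: an EINTR restarts the full timeout */
	do
		r = sigtimedwait(&chld, NULL, &ts);
	while (r < 0 && errno == EINTR);

	if (r < 0)	/* EAGAIN: the timeout expired */
		(void)kill(pid, SIGKILL);
	while (waitpid(pid, &status, 0) < 0 && errno == EINTR)
		continue;
	(void)sigprocmask(SIG_SETMASK, &old, NULL);

	if (r < 0)
		return (-1);
	return (WIFEXITED(status) ? WEXITSTATUS(status) : -2);
}

/* Example children: one exits right away, one outlives the timeout */
static int child_quick(void) { return (7); }
static int child_slow(void) { sleep(5); return (0); }
```

Without the sigprocmask() call, the wait would never see the signal, so a child exiting early would only be noticed once the full timeout had passed, which is the delay described above.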
-
Nils Goroll authored
Should fix #32
-
Nils Goroll authored
See #31
-