    Various performance and stability improvements, hash/data table separation
    Authored by Nils Goroll

    major changes
    =============
    
    hash/data table
    ---------------
    
    The hash table is now only used for _OPEN records, and the actual data
    is stored in a data table. Upon submit, hash entries are cleared and
    data continues to live in the data table until it gets freed by a
    worker (or upon submit if it is a NODATA record).
    
    This drastically reduces the hash table load and significantly
    improves worst-case performance. In particular, the hash table load
    is now independent of ActiveMQ backend performance (read: stalls).
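
    To illustrate the separation (a minimal sketch; the struct and field
    names are hypothetical, not the actual trackrdrd definitions): the
    hash table only tracks open XIDs and points into the data table,
    where the record payload lives until a worker frees it.

      /* hypothetical illustration of the hash/data split */
      typedef enum { DATA_EMPTY = 0, DATA_OPEN, DATA_DONE } data_state_t;

      typedef struct {
          data_state_t state;
          unsigned     xid;
          char         data[1024];   /* logged payload */
      } dataentry_t;                 /* lives in the data table until freed */

      typedef struct {
          unsigned     xid;          /* 0 = slot free */
          unsigned     data_idx;     /* index into the data table */
          time_t       insert_time;  /* for hash_ttl / evacuation checks */
      } hashentry_t;                 /* only exists while the record is _OPEN */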
    
    Preliminary recommendations for table sizing:
    
    * hash table: double max_sessions from varnish
    
      e.g.
    
      maxopen.scale = 16
    
      for 64K hash table entries to support >32K sessions
      (safely and efficiently)
    
    * data table: max(req/s) * max(ActiveMQ stall time)
    
      e.g. to survive 8000 req/s with 60 seconds ActiveMQ stall time,
      the data table should be >480K in size, so

      maxdone.scale = 19

      (= 512K entries) should be on the safe side and also provide a
      sufficient buffer for temporary load peaks
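
    Putting both recommendations together, the relevant fragment of the
    config file would look like this (using the example values above,
    not defaults):

      maxopen.scale = 16
      maxdone.scale = 19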
    
    hash table performance
    ----------------------
    
    Previously, the maximum number of probes to the hash table was set to
    the hash table size - which resulted in bad insert performance and
    even worse lookup performance.
    
    Now that the hash table only holds _OPEN records, we can remove this
    burden and limit the maximum number of probes to a sensible value (10
    to start with, configurable as hash_max_probes).
    
    As another consequence, as we don't require 100% capacity on the hash
    table, we don't need to run an exhaustive search upon insert. Thus,
    probing has been changed from linear probing to hash-based probing
    (using h2()).
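
    A minimal sketch of the bounded probe sequence described above (the
    hash functions, struct layout and names are placeholders, not the
    actual trackrdrd code):

      #include <stddef.h>

      /* minimal stand-in for the real hash entry */
      typedef struct { unsigned xid; unsigned data_idx; } hashentry_t;

      /* placeholder hash functions, not the real ones */
      static unsigned h1(unsigned xid) { return (xid * 2654435761u); }
      static unsigned h2(unsigned xid) { return ((xid >> 16) | 1u); }

      /* probe at most max_probes slots, stepping by a second hash */
      static hashentry_t *
      hash_insert(hashentry_t *tbl, unsigned tblsz, unsigned xid,
                  unsigned max_probes)
      {
          unsigned h = h1(xid) % tblsz;
          unsigned step = h2(xid) % tblsz;

          if (step == 0)
              step = 1;                      /* always make progress */
          for (unsigned i = 0; i < max_probes; i++) {
              hashentry_t *entry = &tbl[(h + i * step) % tblsz];
              if (entry->xid == 0) {         /* free slot */
                  entry->xid = xid;
                  return (entry);
              }
          }
          return (NULL);                     /* caller drops or evacuates */
      }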
    
    only ever insert on ReqStart - and drop if we can't
    ---------------------------------------------------
    
    Keeping up with the VSL is essential. Once we fall behind, we are in
    real trouble:
    
    - If we miss ReqEnd, we will clobber our hash, with drastic effects:
      - hash lookups become inefficient
      - inserts become more likely to fail
      - before we had HASH_Exp (see below), the hash would become useless
    
    - When the VSL writer overtakes our reader, we will see corrupt data
      and miss _many_ VCL Logs and ReqEnds (as many as can be found in the
      whole VSL), so, again, our hash and data arrays will get clobbered
      with incomplete data (which needs to be cleaned up by HASH_Exp).
    
    The latter point is the most relevant: corrupt records are likely to
    trigger assertions.
    
    Thus, keeping up with the VSL needs to be our primary objective. When
    the VSL overtakes, we will lose a massive amount of records anyway
    (and we won't even know how many). As long as we don't stop Varnish
    when we fall behind, we can't avoid losing records under certain
    circumstances anyway (for instance, when the backend stalls and the
    data table runs full), so we should rather drop early, in a controlled
    manner - and without a drastic performance penalty.
    
    Under this doctrine, it does not make sense to insert records for
    VCL_Log or ReqEnd, so if an xid can't be found for these tags, the
    respective events will get dropped (and logged).
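
    A minimal sketch of that dispatch policy (the tag enum and the helper
    functions are hypothetical stand-ins, not the actual trackrdrd code):

      #include <stddef.h>

      typedef enum { TAG_REQSTART, TAG_VCL_LOG, TAG_REQEND } tag_t;
      typedef struct hashentry hashentry_t;

      /* assumed helpers -- stand-ins for the real hash/data routines */
      hashentry_t *hash_add(unsigned xid);     /* NULL if no slot could be found */
      hashentry_t *hash_find(unsigned xid);    /* NULL if the xid is not _OPEN   */
      void append_data(hashentry_t *, const char *, unsigned);
      void submit(hashentry_t *);              /* ReqEnd: hand off to a worker   */
      void log_drop(tag_t, unsigned xid);

      static void
      dispatch(tag_t tag, unsigned xid, const char *data, unsigned len)
      {
          hashentry_t *he;

          switch (tag) {
          case TAG_REQSTART:
              if (hash_add(xid) == NULL)       /* hash full: drop, don't fall behind */
                  log_drop(tag, xid);
              break;
          case TAG_VCL_LOG:
          case TAG_REQEND:
              /* never insert for these tags: an unknown xid is dropped and logged */
              if ((he = hash_find(xid)) == NULL) {
                  log_drop(tag, xid);
                  break;
              }
              append_data(he, data, len);
              if (tag == TAG_REQEND)
                  submit(he);                  /* clears the hash entry */
              break;
          }
      }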
    
    performance optimizations
    =========================
    
    spmcq reader/writer synchronization
    -----------------------------------
    
    Various measures have been implemented to reduce syscall and general
    function call overhead for reader/writer synchronization on the
    spmcq. Previously, the writer would issue a pthread_cond_signal to
    potentially wake up a reader, irrespective of whether or not a reader
    was actually blocking on the CV.
    
    - Now, the number of waiting readers (workers) is modified inside a
      lock, but queried first from outside the lock, so if there are no
      readers waiting, the CV is not signalled (see the sketch below).

    - The number of running readers is (attempted to be) kept proportional
      to the queue length for queue lengths between 0 and
      2^qlen_goal.scale, to further reduce the number of worker thread
      block/wakeup transitions under low to average load.
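
    A minimal sketch of the writer-side signalling described in the first
    point above (names are illustrative, not the actual spmcq code):

      #include <pthread.h>

      static pthread_mutex_t spmcq_mtx = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  spmcq_cv  = PTHREAD_COND_INITIALIZER;
      static unsigned        spmcq_waiters = 0;  /* modified only under spmcq_mtx */

      /* writer side: after enqueueing, signal only if a reader is blocked */
      static void
      spmcq_signal_if_needed(void)
      {
          /* unlocked peek: a stale 0 only delays the wakeup to the next enqueue */
          if (spmcq_waiters == 0)
              return;
          pthread_mutex_lock(&spmcq_mtx);
          if (spmcq_waiters > 0)
              pthread_cond_signal(&spmcq_cv);
          pthread_mutex_unlock(&spmcq_mtx);
      }

      /* reader side: register as a waiter before blocking on the CV
         (the real code would also re-check the queue under the lock) */
      static void
      spmcq_wait(void)
      {
          pthread_mutex_lock(&spmcq_mtx);
          spmcq_waiters++;
          pthread_cond_wait(&spmcq_cv, &spmcq_mtx);
          spmcq_waiters--;
          pthread_mutex_unlock(&spmcq_mtx);
      }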
    
    pthread_mutex / pthread_condvar attributes
    ------------------------------------------
    
    Attributes are now being used to allow the O/S implementation to
    choose more efficient low-level synchronization primitives because we
    know that we are using these only within one multi-threaded process.
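
    A minimal sketch of what this looks like (the attribute choice shown
    here is an assumption based on the intent described above, not a copy
    of the actual initialization code):

      #include <pthread.h>

      static pthread_mutex_t mtx;
      static pthread_cond_t  cv;

      static void
      sync_init(void)
      {
          pthread_mutexattr_t ma;
          pthread_condattr_t  ca;

          /* the primitives are never shared across processes, so tell the
             implementation it may pick cheaper process-private variants */
          pthread_mutexattr_init(&ma);
          pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_PRIVATE);
          pthread_mutex_init(&mtx, &ma);
          pthread_mutexattr_destroy(&ma);

          pthread_condattr_init(&ca);
          pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_PRIVATE);
          pthread_cond_init(&cv, &ca);
          pthread_condattr_destroy(&ca);
      }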
    
    data table freelist
    -------------------
    
    To allow for efficient allocation of new data table entries, a free
    list with local caches is maintained (sketched in code after the
    list below):
    
    - The data writer (VSL reader thread) maintains its own freelist and
      serves requests from it without any synchronization overhead.
    
    - Only when the data writer's own freelist is exhausted will it
      access the global freelist (under a lock). It will take the whole
      list at once and resume serving new records from its own cache.
    
    - Workers also maintain their own freelist of entries to be returned
      to the global freelist as long as
    
      - they are running
      - there are entries on the global list.
    
      Before a worker thread goes to block on the spmcq condvar, it
      returns all its freelist entries to the global freelist. Also, it
      will always check if the global list is empty and return any entries
      immediately if it is.
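
    A minimal sketch of the handoff described above, assuming a simple
    singly linked freelist (illustrative only; the real code also covers
    the worker-side caching and return rules):

      #include <pthread.h>
      #include <stddef.h>

      typedef struct freeent {
          struct freeent *next;
      } freeent_t;

      static pthread_mutex_t  freelist_mtx = PTHREAD_MUTEX_INITIALIZER;
      static freeent_t       *global_free  = NULL;  /* protected by freelist_mtx */

      /* data writer (VSL reader thread): private cache, no locking needed */
      static freeent_t *writer_free = NULL;

      static freeent_t *
      data_writer_alloc(void)
      {
          freeent_t *e;

          if (writer_free == NULL) {
              /* local cache exhausted: grab the whole global list at once */
              pthread_mutex_lock(&freelist_mtx);
              writer_free = global_free;
              global_free = NULL;
              pthread_mutex_unlock(&freelist_mtx);
          }
          if (writer_free == NULL)
              return (NULL);            /* data table full: caller must drop */
          e = writer_free;
          writer_free = e->next;
          return (e);
      }

      /* worker: return a batch of freed entries to the global list */
      static void
      worker_return(freeent_t *head, freeent_t *tail)
      {
          pthread_mutex_lock(&freelist_mtx);
          tail->next = global_free;
          global_free = head;
          pthread_mutex_unlock(&freelist_mtx);
      }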
    
    stability improvements
    ======================
    
    record timeouts
    ---------------
    
    Every hash entry gets added to the insert_list, ordered by insertion
    time. No more often than every x seconds (currently hard-coded to
    x=10, and the check is only performed when a ReqStart is seen), the
    list is checked for records which have reached their ttl (configured
    by hash_ttl, default 120 seconds). These get submitted even though no
    ReqEnd has been seen - under the assumption that no ReqEnd is to be
    expected once a certain amount of time has passed.
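
    A minimal sketch of the expiry check, assuming insert_list is kept in
    insertion order (the sys/queue.h list handling and names here are
    illustrative, not the actual code):

      #include <sys/queue.h>
      #include <stddef.h>
      #include <time.h>

      typedef struct hashentry {
          unsigned xid;
          time_t   insert_time;
          TAILQ_ENTRY(hashentry) insert_list;
      } hashentry_t;

      TAILQ_HEAD(insert_head, hashentry);

      void submit(hashentry_t *);       /* submit despite the missing ReqEnd */

      /* called at most every 10s, from the ReqStart path */
      static void
      hash_exp(struct insert_head *head, time_t now, time_t hash_ttl)
      {
          hashentry_t *he;

          /* oldest entries are at the head; stop at the first one still young */
          while ((he = TAILQ_FIRST(head)) != NULL &&
                 now - he->insert_time > hash_ttl) {
              TAILQ_REMOVE(head, he, insert_list);
              submit(he);
          }
      }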
    
    hash evacuation
    ---------------
    
    If no free entry is found when probing all possible locations for an
    insert, the oldest record is evacuated from the hash and submitted to
    the backend, provided its lifetime has exceeded hash_mlt - under the
    assumption that it is better to submit records early (which are likely
    to carry useful log information already) than to throw records away.

    If this behavior is not desired, hash_mlt can be set to hash_ttl.
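
    A minimal sketch of the evacuation decision on a failed insert,
    reusing the hypothetical names from the record-timeout sketch above:

      /* all probe slots taken: submit the oldest _OPEN record early if allowed */
      static int
      hash_evacuate(struct insert_head *head, time_t now, time_t hash_mlt)
      {
          hashentry_t *oldest = TAILQ_FIRST(head);

          if (oldest == NULL || now - oldest->insert_time < hash_mlt)
              return (0);               /* too young to evacuate: insert fails */
          TAILQ_REMOVE(head, oldest, insert_list);
          submit(oldest);               /* submit early rather than lose the record */
          return (1);                   /* hash slot freed */
      }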
    
    various code changes
    ====================
    
    * statistics have been reorganized to separate out
      - hash
      - data writer/VSL reader
      - data reader/worker (partially shared with writer)
      statistics
    
    * print the native thread ID for workers (to allow correlation
      with prstat/top output)
    
    * workers have a new state when blocking on the spmcq CV: WRK_WAITING
      / "waiting" in monitor output
    
    * because falling behind with VSL reading (the VSL writer overtaking
      our reader) is so bad, notices are logged whenever the new VSL data
      pointer is less than the previous one, in other words whenever the
      VSL ring buffer wraps.

      This is not the same as detecting the VSL writer overtaking our
      reader (which would require varnishapi changes), but logging this
      information and some statistics about VSL wraps can (and did) help
      track down strange issues to VSL overtaking.
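
    A minimal sketch of the wrap notice (variable names and the use of
    syslog are illustrative, not the actual logging code):

      #include <syslog.h>
      #include <stddef.h>

      /* log a notice whenever the VSL data pointer moves backwards,
         i.e. the VSL ring buffer has wrapped */
      static const char *vsl_prev = NULL;
      static unsigned    vsl_wraps = 0;

      static void
      vsl_check_wrap(const char *vsl_ptr)
      {
          if (vsl_prev != NULL && vsl_ptr < vsl_prev) {
              vsl_wraps++;
              syslog(LOG_NOTICE, "VSL wrap #%u", vsl_wraps);
          }
          vsl_prev = vsl_ptr;
      }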
    
    config file changes
    ===================
    
    * The _scale options
    
      maxopen.scale
      maxdone.scale (new, see below)
      maxdata.scale
    
      are now being used directly, rather than in addition to a base value
      of 10 as before.
    
      10 is now the minimum value and an EINVAL error will get thrown
      when lower values are used in the config file.
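
    A minimal sketch of the scale validation (MIN_TABLE_SCALE and the
    function name are assumptions, not the actual config parsing code):

      #include <errno.h>

      #define MIN_TABLE_SCALE 10

      /* reject maxopen.scale / maxdone.scale / maxdata.scale below 10 */
      static int
      conf_set_scale(unsigned val, unsigned *scale)
      {
          if (val < MIN_TABLE_SCALE)
              return (EINVAL);
          *scale = val;
          return (0);
      }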
    
    new config options
    ==================
    
    see trackrdrd.h for documentation in comments:
    
    * maxdone.scale
    
      Scale for records in _DONE states, determines size of
      - the data table (which is maxopen + maxdone)
      - the spmcq
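
      For the example values used earlier (maxopen.scale = 16,
      maxdone.scale = 19), the data table would have 2^16 + 2^19 =
      64K + 512K = 576K entries.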
    
    * qlen_goal.scale
    
      Scale for the spmcq queue length goal. All worker threads will be
      used when the queue length corresponding to the scale is reached.
    
      For shorter queue lengths, the number of worker threads will be
      scaled proportionally.
    
    * hash_max_probes
    
      Maximum number of probes to the hash.
    
      Smaller values increase efficiency, but reduce the capacity of the
      hash (more ReqStart records may get lost) - and vice versa for
      higher values.
    
    * hash_ttl
    
      Maximum time to live for records in the _OPEN state.
    
      Entries which are older than this ttl _may_ get expired from the
      trackrdrd state.
    
      This should get set to a value significantly longer than your
      maximum session lifetime in Varnish.
    
    * hash_mlt
    
      Minimum lifetime for entries in HASH_OPEN before they can get
      evacuated.
    
      Entries are guaranteed to remain in trackrdrd for this duration.
      Once the mlt is reached, they _may_ get expired when trackrdrd needs
      space in the hash.