    Add a regression test for #1762 · 20362bf8
    Nils Goroll authored
    Further investigation into possible root cause scenarios yielded the
    following insights:
    
    * the bad vxid must have reached `vtx->key.vxid` by way of
      `vtx_parse_link`
    
    * which is only called for `SLT_Begin` (`vtx_scan_begin()`) and
      `SLT_Link` (`vtx_scan_link()`)
    
    (actually this was known before, but I am now confident that these are
    the only cases)
    
    As of the 4.0.3 release, there is no case in the code where `SLT_Begin`
    is emitted with an unmasked vxid, so the root cause must lie in an
    `SLT_Link` record.
    
    In both cases where unmasked vxids are emitted for `SLT_Link`, the id
    comes directly from `VXID_Get()`:
    
    * `cache_fetch.c`
    
      wid = VXID_Get(&wrk->vxid_pool);
      VSLb(bo->vsl, SLT_Link, "bereq %u retry", wid);
    
    * `cache_req_fsm.c`
    
      wid = VXID_Get(&wrk->vxid_pool);
      // XXX: ReqEnd + ReqAcct ?
      VSLb_ts_req(req, "Restart", W_TIM_real(wrk));
      VSLb(req->vsl, SLT_Link, "req %u restart", wid);
    
    So unless I have overlooked anything significant, the root cause must
    have been a vxid spill, which was fixed with
    0dd8c0b8 (master) /
    171f3ac5 (4.0)
    
    `VXID()` masking would have prevented the issue from surfacing.
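
    To illustrate why masking matters, here is a minimal standalone
    sketch. It is not the varnishd code: all `SKETCH_*` / `sketch_*`
    names are hypothetical, and the layout (low 30 bits hold the id, top
    two bits are client/backend markers) is an assumption made for this
    example. It also assumes that "spill" means an id growing past the
    identifier bits into the marker bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed layout for this sketch only: 30 id bits, 2 marker bits. */
#define SKETCH_CLIENTMARKER  (1U << 30)
#define SKETCH_BACKENDMARKER (1U << 31)
#define SKETCH_IDENTMASK     (~(3U << 30))
#define SKETCH_VXID(u)       ((u) & SKETCH_IDENTMASK)

/* Non-zero if a raw id has spilled into the marker bits. */
static int
sketch_has_marker_bits(uint32_t raw)
{
	return ((raw & (SKETCH_CLIENTMARKER | SKETCH_BACKENDMARKER)) != 0);
}

/*
 * Format a Link-style payload ("req <id> restart").
 * With masked != 0 the id is passed through SKETCH_VXID() first,
 * so marker bits can never leak into the logged number.
 * Returns the length snprintf() reports.
 */
static int
sketch_format_link(char *buf, size_t len, uint32_t raw, int masked)
{
	uint32_t id;

	assert(buf != NULL);
	id = masked ? SKETCH_VXID(raw) : raw;
	return (snprintf(buf, len, "req %u restart", (unsigned)id));
}
```

    In this sketch, a raw id of `0x40000001` logged unmasked yields
    "req 1073741825 restart", which a reader of the log would take for
    an id that does not exist; logged masked it yields "req 1 restart".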
    
    This insight is consistent with two observations:
    
    * the issue only surfaced after `varnishd` had been running for longer
      periods of time
    
    * the issue didn't go away after a restart of the vsl client; a
      `varnishd` restart was required
    
    This gives confidence that the issue is now fully understood and that
    the root cause has been fixed.