
varnish-be: rate of accepted sessions keeps on increasing
Closed, Duplicate · Public

Description

The rate of sessions accepted (MAIN.sess_conn) keeps increasing on varnish backends since the upgrade to varnish 5, without any significant change in frontend traffic patterns:

rate(varnish_main_sessions{layer="$layer", type=~"conn", job="varnish-$cache_type",instance=~"($server):.*"}[5m])

esams-backend-accepted-session-rate.png

The "steps" there happen when a varnish backend is restarted by cron.

The increasing rate does not seem to correlate with the number of established connections on the host. There is, however, a significant increase in the number of connections in time-wait. See for example cp3040:

cp3040-tcp-stats.png
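
For a quick look at the TIME_WAIT build-up on a given cache host, something along these lines works (a sketch using standard iproute2 tooling, not an existing check of ours):

ss -s                                  # overall socket summary, including timewait counts
ss -tan state time-wait | wc -l        # sockets currently in TIME_WAIT (plus one header line)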

As an immediate mitigation in text-esams, I've rebooted all hosts (which had to be done anyway as part of T188092). That resulted in a decrease in the fe<->be session rate as well as an increase in the fe<->be session reuse rate.

session-rate-with-reboots.png

session-reuse-rate.png

Open questions:

  • What is causing the increase?
  • Is this related to T181315 and T174932?
  • Is the varnish-be-restart switch from weekly to twice a week going to cause harm in this context?
  • Would upgrading to varnish 5.1.3-wm4 change anything?

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority. Mar 16 2018, 5:05 PM
ema updated the task description.

Change 419705 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: move gethdr_extrachance to runtime_params

https://gerrit.wikimedia.org/r/419705

Change 419705 merged by Ema:
[operations/puppet@production] varnish: move gethdr_extrachance to runtime_params

https://gerrit.wikimedia.org/r/419705

Further investigation today showed that the cause of this is VCLs staying in the auto/busy state: all those VCLs' probes keep on running. At a certain point, text@esams had a total of 65 frontend VCLs in state warm or busy. With each VCL probing once every 100ms, that gives 650 backend requests per second, which matches what we see in the graphs.

The question now is why our old VCLs are stuck in auto/busy (that is, why they never transition to auto/cold). This varnish bug seems related: https://github.com/varnishcache/varnish-cache/issues/2228
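
For the record, the non-cold VCLs can be counted straight from varnishadm (a rough sketch; the exact vcl.list column layout varies between varnish versions, so the awk field may need adjusting):

varnishadm vcl.list                                   # full list with state column (auto/warm, auto/busy, auto/cold, ...)
varnishadm vcl.list | awk '$2 ~ /warm|busy/' | wc -l  # count of VCLs still warm or busy
# sanity check: 65 VCLs * 10 probes/s each = 650 backend probe requests per second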

Change 420388 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Add more sleep delay on varnish restarts

https://gerrit.wikimedia.org/r/420388

Change 420389 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] bump vcl reload delay by 3.6s

https://gerrit.wikimedia.org/r/420389

Change 420390 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Increase varnish probe interval to 1s

https://gerrit.wikimedia.org/r/420390

Change 420432 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] auto-discard vcls when reloading

https://gerrit.wikimedia.org/r/420432

Change 420432 merged by BBlack:
[operations/puppet@production] auto-discard vcls when reloading

https://gerrit.wikimedia.org/r/420432

Change 420388 merged by BBlack:
[operations/puppet@production] Add more sleep delay on varnish restarts

https://gerrit.wikimedia.org/r/420388

Change 420389 merged by BBlack:
[operations/puppet@production] bump vcl reload delay by 3.6s

https://gerrit.wikimedia.org/r/420389

Change 420390 merged by BBlack:
[operations/puppet@production] Increase varnish probe interval to 1s

https://gerrit.wikimedia.org/r/420390

Some notes from digging around in related things:

  • Varnish docs claim (near the bottom of https://varnish-cache.org/docs/5.1/users-guide/vcl-backends.html#health-checks) that duplicate probes (e.g. due to vcl reloads) are coalesced into a single probe. This appears to be BS, as we can see the multiple probes running pretty clearly.
  • Setting the probes of an old VCL to an explicit administrative health (e.g. varnishadm backend.set_health e3032122-59c9-4490-8006-8fbc6ab00198.be_cp1065_eqiad_wmnet sick) doesn't stop the probing. The explicit admin_health of sick here overrides the probe's results, but does not stop the probe from executing.
  • There doesn't appear to be a VCL equivalent of the CLI's set_health command.

Since confd-driven depool/repool is the primary cause of the excessive VCL reloads (and therefore the excessive probes), I'd suggest we could do the following (in some order that makes sense for a sane transition/deployment):

  1. Move the backend defs back into static VCL rather than confd-templated. They'll all exist all the time statically whether they're depooled or not. Adding/removing machines from the set will mean puppet commits, but this doesn't seem crazy.
  2. Have confd instead template a much simpler file with one line per backend, saying something like be_cp1066_eqiad_wmnet sick for depooled or be_cp1066_eqiad_wmnet auto for pooled. The file can be placed at some standard pathname like /etc/varnish/backend.states or similar.
  3. On update, it should re-write this templated file and, instead of executing a vcl reload, execute a new script which looks at the total set of warm VCLs from varnishadm and issues set_health commands for all the listed backends against all of them (the total count of set_health commands executed is n_warm_vcls * n_backends); see the sketch after this list.
  4. Over in our actual vcl_reload script, we could also read the same statefile and execute the same commands when a new VCL is loaded, after the load but before the use, to bring the new states forward.
  5. This leaves the case of the boot VCL. Clearly, we'll also need to execute these commands after the initial load of the boot VCL on daemon startup, and not consider the daemon live (e.g. pool-able) until they're done. This is perhaps possible with systemd ExecStartPost?
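
For illustration, the script in point 3 could look roughly like this (an untested sketch; the statefile path, its one-line-per-backend format, and the vcl.list parsing are the assumptions described above):

#!/bin/bash
# Apply the admin health states from the confd-templated statefile to the
# backends of every VCL that is still warm/busy, so depool/repool actions
# take effect without a vcl reload.
set -e
statefile=/etc/varnish/backend.states
# vcl.list column layout varies a bit between varnish versions
for vcl in $(varnishadm vcl.list | awk '$2 ~ /warm|busy/ {print $NF}'); do
    while read -r backend state; do
        # e.g. backend.set_health <vclname>.be_cp1066_eqiad_wmnet sick
        varnishadm backend.set_health "${vcl}.${backend}" "${state}"
    done < "$statefile"
done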

Even with all of that in place, we might still want ~weekly frontend restarts to get rid of accumulated warm VCLs over time.

Ideally, all of this would matter less, and we wouldn't need the various workarounds mentioned or already implemented here (slow probe intervals, avoiding reloads on pool-action, weekly fe restarts), if we could fix a couple of design flaws via backporting or local work:

  1. Fix the leaks of VCL references, however they're happening, so that we can be reasonably assured that in the usual case only the active VCL stays alive long-term, and the others cool off and get discarded in reasonable time. Even the varnish devs don't seem to understand this clearly in https://github.com/varnishcache/varnish-cache/issues/2228 , so I'm not sure how high our odds are of successfully patching it with minimal effort.
  2. Fix probing to either not require Connection: close-like behavior, or to not share connection pools with live traffic. The combination of the two is devastating with enough probes happening due to the accumulated leaked old VCLs, as the probes are constantly churning our supposedly-persistent inter-varnish connections, at rates which probably create TIME_WAIT issues and socket namespace exhaustion, thus leading to connection/request failures/queueing.

Change 420927 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] VCL: fixed 20m grace value regardless of health

https://gerrit.wikimedia.org/r/420927

Change 420928 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] VCL: only keep objects worth keep-ing

https://gerrit.wikimedia.org/r/420928

Change 420927 merged by BBlack:
[operations/puppet@production] VCL: fixed 20m grace value regardless of health

https://gerrit.wikimedia.org/r/420927

Change 420928 merged by BBlack:
[operations/puppet@production] VCL: only keep objects worth keep-ing

https://gerrit.wikimedia.org/r/420928