
varnish-be: rate of accepted sessions keeps on increasing
Closed, Duplicate · Public

Description

The rate of sessions accepted (MAIN.sess_conn) keeps increasing on varnish backends since the upgrade to varnish 5, without any significant change in frontend traffic patterns:

rate(varnish_main_sessions{layer="$layer", type=~"conn", job="varnish-$cache_type",instance=~"($server):.*"}[5m])

esams-backend-accepted-session-rate.png

The "steps" there happen when a varnish backend is restarted by cron.

The increasing rate does not seem to correlate with the number of established connections on the host. There is, however, a significant increase in the number of connections in time-wait. See for example cp3040:

cp3040-tcp-stats.png
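
For a quick look at the TIME_WAIT build-up on a given cache host, something along these lines works (a sketch using standard iproute2 tooling, not an existing check of ours):

ss -s                                  # overall socket summary, including timewait counts
ss -tan state time-wait | wc -l        # sockets currently in TIME_WAIT (plus one header line)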

As an immediate mitigation in text-esams, I've rebooted all hosts (which had to be done anyway as part of T188092). That resulted in a decrease in the fe<->be session rate as well as an increase in the fe<->be session reuse rate.

session-rate-with-reboots.png

session-reuse-rate.png

Open questions:

  • What is causing the increase?
  • Is this related to T181315 and T174932?
  • Is the varnish-be-restart switch from weekly to twice a week going to cause harm in this context?
  • Would upgrading to varnish 5.1.3-wm4 change anything?

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority. Mar 16 2018, 5:05 PM
ema updated the task description.

Change 419705 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: move gethdr_extrachance to runtime_params

https://gerrit.wikimedia.org/r/419705

Change 419705 merged by Ema:
[operations/puppet@production] varnish: move gethdr_extrachance to runtime_params

https://gerrit.wikimedia.org/r/419705

Further investigation today showed that the cause of this is VCLs staying in the auto/busy state: all those VCLs' probes keep on running. At a certain point, text@esams had a total of 65 frontend VCLs in state warm or busy. With each VCL probing once every 100ms, that gives 650 backend requests per second, which matches what we see in the graphs.

The question now is why our old VCLs are stuck in auto/busy (that is, why they never transition to auto/cold). This varnish bug seems related: https://github.com/varnishcache/varnish-cache/issues/2228
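
For the record, the non-cold VCLs can be counted straight from varnishadm (a rough sketch; the exact vcl.list column layout varies between varnish versions, so the awk field may need adjusting):

varnishadm vcl.list                                   # full list with state column (auto/warm, auto/busy, auto/cold, ...)
varnishadm vcl.list | awk '$2 ~ /warm|busy/' | wc -l  # count of VCLs still warm or busy
# sanity check: 65 VCLs * 10 probes/s each = 650 backend probe requests per second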

Change 420388 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Add more sleep delay on varnish restarts

https://gerrit.wikimedia.org/r/420388

Change 420389 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] bump vcl reload delay by 3.6s

https://gerrit.wikimedia.org/r/420389

Change 420390 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Increase varnish probe interval to 1s

https://gerrit.wikimedia.org/r/420390

Change 420432 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] auto-discard vcls when reloading

https://gerrit.wikimedia.org/r/420432

Change 420432 merged by BBlack:
[operations/puppet@production] auto-discard vcls when reloading

https://gerrit.wikimedia.org/r/420432

Change 420388 merged by BBlack:
[operations/puppet@production] Add more sleep delay on varnish restarts

https://gerrit.wikimedia.org/r/420388

Change 420389 merged by BBlack:
[operations/puppet@production] bump vcl reload delay by 3.6s

https://gerrit.wikimedia.org/r/420389

Change 420390 merged by BBlack:
[operations/puppet@production] Increase varnish probe interval to 1s

https://gerrit.wikimedia.org/r/420390

Some notes from digging around in related things:

  • Varnish docs claim (near the bottom of https://varnish-cache.org/docs/5.1/users-guide/vcl-backends.html#health-checks) that duplicate probes (e.g. due to vcl reloads) are coalesced into a single probe. This appears to be BS, as we can see the multiple probes running pretty clearly.
  • Setting the probes of an old VCL to an explicit administrative health (e.g. varnishadm backend.set_health e3032122-59c9-4490-8006-8fbc6ab00198.be_cp1065_eqiad_wmnet sick) doesn't stop the probing. The explicit admin_health of sick here overrides the probe's results, but does not stop the probe from executing.
  • There doesn't appear to be a VCL equivalent of the CLI's set_health command.

Since confd-driven depool/repool is the primary cause of the excessive VCL reloads (and therefore the excessive probes), I'd suggest we could do the following (in some order that makes sense for a sane transition/deployment):

  1. Move the backend defs back into static VCL rather than confd-templated. They'll all exist all the time statically whether they're depooled or not. Adding/removing machines from the set will mean puppet commits, but this doesn't seem crazy.
  2. Have confd instead template a much simpler file with one line per backend, saying something like be_cp1066_eqiad_wmnet sick for depooled or be_cp1066_eqiad_wmnet auto for pooled. The file can be placed at some standard pathname like /etc/varnish/backend.states or similar.
  3. On update, it should re-write this templated file and, instead of executing a vcl reload, execute a new script which looks at the total set of warm VCLs from varnishadm and issues set_health commands for all the listed backends against all of them (the total count of set_health commands executed is n_warm_vcls * n_backends); see the sketch after this list.
  4. Over in our actual vcl_reload script, we could also read the same statefile and execute the same commands when a new VCL is loaded, after the load but before the use, to bring the new states forward.
  5. This leaves the case of the boot VCL. Clearly, we'll also need to execute these commands after the initial load of the boot VCL on daemon startup, and not consider the daemon live (e.g. pool-able) until they're done. This is perhaps possible with systemd ExecStartPost?
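
For illustration, the script in point 3 could look roughly like this (an untested sketch; the statefile path, its one-line-per-backend format, and the vcl.list parsing are the assumptions described above):

#!/bin/bash
# Apply the admin health states from the confd-templated statefile to the
# backends of every VCL that is still warm/busy, so depool/repool actions
# take effect without a vcl reload.
set -e
statefile=/etc/varnish/backend.states
# vcl.list column layout varies a bit between varnish versions
for vcl in $(varnishadm vcl.list | awk '$2 ~ /warm|busy/ {print $NF}'); do
    while read -r backend state; do
        # e.g. backend.set_health <vclname>.be_cp1066_eqiad_wmnet sick
        varnishadm backend.set_health "${vcl}.${backend}" "${state}"
    done < "$statefile"
done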

Even with all of that in place, we might still want ~weekly frontend restarts to get rid of accumulated warm VCLs over time.

Ideally, all of this would matter less, and we wouldn't need the various workarounds mentioned or already implemented here (slow probe intervals, avoiding reloads on pool-action, weekly fe restarts), if we could fix a couple of design flaws via backporting or local work:

  1. Fix the leaks of VCL references, however they're happening, so that we can be reasonably assured that in the usual case only the active VCL stays alive long-term, and the others cool off and get discarded in reasonable time. Even the varnish devs don't seem to understand this clearly in https://github.com/varnishcache/varnish-cache/issues/2228 , so I'm not sure how high our odds are of successfully patching it with minimal effort.
  2. Fix probing to either not require Connection: close-like behavior, or to not share connection pools with live traffic. The combination of the two is devastating with enough probes happening due to the accumulated leaked old VCLs, as the probes are constantly churning our supposedly-persistent inter-varnish connections, at rates which probably create TIME_WAIT issues and socket namespace exhaustion, thus leading to connection/request failures/queueing.

Change 420927 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] VCL: fixed 20m grace value regardless of health

https://gerrit.wikimedia.org/r/420927

Change 420928 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] VCL: only keep objects worth keep-ing

https://gerrit.wikimedia.org/r/420928

Change 420927 merged by BBlack:
[operations/puppet@production] VCL: fixed 20m grace value regardless of health

https://gerrit.wikimedia.org/r/420927

Change 420928 merged by BBlack:
[operations/puppet@production] VCL: only keep objects worth keep-ing

https://gerrit.wikimedia.org/r/420928