
cache_upload varnish-fe exhausting transient memory
Closed, Resolved · Public

Description

Beginning on 2020-04-07 at around 15:22 UTC, various cache_upload varnish frontends started showing issues related to transient memory usage:

Screenshot from 2020-04-09 11-29-16.png (207 KB)

This causes fetch failures in esams, eqsin and eqiad.
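One way to spot these failures on an affected frontend is to look for storage allocation errors in the shared memory log; this is a sketch, assuming the stock "Could not get storage" FetchError that varnishd emits when an allocation fails:

# Show in-flight requests whose backend fetch failed to allocate storage:
sudo varnishlog -n frontend -g request -q 'FetchError ~ "Could not get storage"'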

While cache_text has had a limit on transient storage usage for a long time, this was not the case for cache_upload. We started limiting transient storage on cache_upload on 2020-03-17 to address T185968.
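For reference, the cap itself is just a bounded malloc stevedore named Transient passed to varnishd. Below is a minimal sketch of how such a cap is expressed and checked; the real daemon arguments, sizes, and VCL path on the cache nodes are managed by Puppet, so everything here is illustrative only:

# Illustrative varnishd invocation: defining Transient as a bounded malloc
# stevedore caps it (left undefined, Transient is unbounded malloc storage).
varnishd -n frontend -a :80 -f /etc/varnish/frontend.vcl \
    -s malloc,100G \
    -s Transient=malloc,5G

# Confirm the configured storage backends on a running instance:
sudo varnishadm -n frontend storage.list

# Watch transient usage and allocation failures:
sudo varnishstat -n frontend -1 -f SMA.Transient.g_bytes -f SMA.Transient.c_fail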

To see whether a larger amount of transient storage would help, we doubled the limit from 5G to 10G yesterday, 2020-04-08, at around 16:00. Unfortunately, transient storage quickly filled up on the affected hosts after restarting varnish:

Screenshot from 2020-04-09 11-43-36.png (281 KB)

The following theory has been ruled out, as the problem persists on cp3051 after downgrading varnish:
On 2020-04-06 around 13:05 we deployed a patched varnish version addressing a bug that was causing varnishd crashes: T249344. Our patch was merged upstream and we backported it to 5.x. The patch may or may not have something to do with the issue described here.

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority. Apr 9 2020, 9:41 AM
ema renamed this task from varnish-fe exhausting transient memory to cache_upload varnish-fe exhausting transient memory. Apr 9 2020, 9:43 AM
ema updated the task description.

Mentioned in SAL (#wikimedia-operations) [2020-04-09T09:46:07Z] <ema> cp3051: disable transient storage limit and restart varnish-fe T249809

Mentioned in SAL (#wikimedia-operations) [2020-04-09T10:30:17Z] <ema> cp3051: re-enable transient storage limit, downgrade varnish to 5.1.3-1wm12 (no 0035-vbf_stp_condfetch_crash.patch) and restart varnish-fe T249809

Mentioned in SAL (#wikimedia-operations) [2020-04-09T12:31:43Z] <ema> cp3051: upgrade varnish to 5.1.3-1wm13 once again, restart varnish-fe T249809

Change 588650 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: remove n-hit-wonder admission policy

https://gerrit.wikimedia.org/r/588650

Change 588650 merged by Ema:
[operations/puppet@production] vcl: remove n-hit-wonder admission policy

https://gerrit.wikimedia.org/r/588650

Change 588945 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: introduce wm_admission_policies

https://gerrit.wikimedia.org/r/588945

Change 589341 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: move 'exp' admission policy to wm_admission_policies

https://gerrit.wikimedia.org/r/589341

Change 589342 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: 10M cutoff for the 'exp' admission policy

https://gerrit.wikimedia.org/r/589342

Change 588945 merged by Ema:
[operations/puppet@production] vcl: introduce wm_admission_policies

https://gerrit.wikimedia.org/r/588945

Change 589341 merged by Ema:
[operations/puppet@production] vcl: move 'exp' admission policy to wm_admission_policies

https://gerrit.wikimedia.org/r/589341

Change 589342 merged by Ema:
[operations/puppet@production] vcl: 10M cutoff for the 'exp' admission policy

https://gerrit.wikimedia.org/r/589342

Change 594126 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: pass fe_mem_gb to vcl_config

https://gerrit.wikimedia.org/r/594126

Change 594144 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: test 'exp' admission policy on two nodes

https://gerrit.wikimedia.org/r/594144

Change 594126 merged by Ema:
[operations/puppet@production] vcl: pass fe_mem_gb to vcl_config

https://gerrit.wikimedia.org/r/594126

Change 594144 merged by Ema:
[operations/puppet@production] vcl: test 'exp' admission policy on two nodes

https://gerrit.wikimedia.org/r/594144

Mentioned in SAL (#wikimedia-operations) [2020-05-05T08:19:12Z] <ema> cp2027 and cp2029 (both text): varnish-fe restart to clear cache and evaluate 'exp' admission policy T144187 T249809

Mentioned in SAL (#wikimedia-operations) [2020-05-05T08:27:32Z] <ema> cp2028 and cp2030 (both upload): varnish-fe restart to clear cache and evaluate 'exp' admission policy T144187 T249809


The 'exp' admission policy experiment is still ongoing; I'll wait for tomorrow before drawing conclusions. What I have noticed just now, however, is that the median Content-Length of responses for which we use transient storage on upload@esams isn't incredibly high:

15:13:57 ema@cp3051.esams.wmnet:~
$ timeout --foreground 60 varnishncsa -g request -n frontend -q 'Storage eq "malloc Transient"' -F '%{Content-Length}o'| sort -n | awk '{v[NR]=$1} END { print "med", v[int(NR*.5)] ; print "p75", v[int(NR*.75)] ; print "p95", v[int(NR*.95)] }'
med 370661
p75 912115
p95 5036568

It might be worth playing a little with the current pass size cutoff for the simple size-based admission policy that we're using by default. Right now the cutoff is 256K; maybe increasing it a bit would help with transient usage without having a negative impact on hit rate?
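A possible way to put a number on that before changing anything, reusing the same varnishncsa filter as above (384K below is just an example candidate cutoff, not a decision):

# Share of responses currently served via Transient that would fit under a
# hypothetical 384K cutoff (393216 bytes), i.e. would no longer be passed:
timeout --foreground 60 varnishncsa -g request -n frontend \
    -q 'Storage eq "malloc Transient"' -F '%{Content-Length}o' \
    | awk '$1 ~ /^[0-9]+$/ { n++; if ($1 + 0 <= 393216) small++ }
           END { if (n) printf "%.1f%% of %d responses <= 384K\n", 100 * small / n, n }'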

Change 594896 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: set large_objects_cutoff in hiera

https://gerrit.wikimedia.org/r/594896

Change 594923 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "vcl: test 'exp' admission policy on two nodes"

https://gerrit.wikimedia.org/r/594923

Change 594923 merged by Ema:
[operations/puppet@production] Revert "vcl: test 'exp' admission policy on two nodes"

https://gerrit.wikimedia.org/r/594923

Change 594896 merged by Ema:
[operations/puppet@production] vcl: set large_objects_cutoff in hiera

https://gerrit.wikimedia.org/r/594896

Change 594971 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cp3051: set large_objects_cutoff to 384K

https://gerrit.wikimedia.org/r/594971

Change 594971 merged by Ema:
[operations/puppet@production] cp3051: set large_objects_cutoff to 384K

https://gerrit.wikimedia.org/r/594971

Change 596430 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cp3051: lower large_objects_cutoff to 288K

https://gerrit.wikimedia.org/r/596430

Change 596430 merged by Ema:
[operations/puppet@production] cp3051: lower large_objects_cutoff to 288K

https://gerrit.wikimedia.org/r/596430

It might be worth experimenting with enabling request coalescing for large files. That could help reduce pressure on transient storage, I think; worth giving it a shot next week.
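If we do try that, a simple before/after comparison of transient pressure could be done with varnishstat; a sketch, sampling once per minute (counter names assume the usual SMA.Transient stevedore):

# Sample transient bytes in use and allocation failures once per minute:
while true; do
    date -u +%FT%TZ
    sudo varnishstat -n frontend -1 -f SMA.Transient.g_bytes -f SMA.Transient.c_fail
    sleep 60
done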

Change 597008 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cp3051: end large_objects_cutoff experiment

https://gerrit.wikimedia.org/r/597008

Change 597008 merged by Ema:
[operations/puppet@production] cp3051: end large_objects_cutoff experiment

https://gerrit.wikimedia.org/r/597008

BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

BCornwall changed the task status from Open to Stalled. Mar 16 2023, 7:41 PM
BCornwall subscribed.

@Vgutierrez, @BBlack: What's the status of this? It's three years old, but I took a look at the transient memory metrics and there are only a few instances where it maxed out, all of which were freed up shortly afterwards.

BCornwall claimed this task.

I haven't been able to see any indication that this has been an issue across the entire history of our metrics. @ema's great work has likely fixed this, so I'll mark this as closed. Thanks, Ema!