
cache_upload varnish-fe exhausting transient memory
Closed, Resolved · Public

Description

Beginning on 2020-04-07 at around 15:22 UTC, various cache_upload varnish frontends started showing issues related to transient memory usage:

Screenshot from 2020-04-09 11-29-16.png (207 KB)

This causes fetch failures in esams, eqsin and eqiad.
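One way to spot these failures on an affected frontend is to look for storage allocation errors in the shared memory log; this is a sketch, assuming the stock "Could not get storage" FetchError that varnishd emits when an allocation fails:

# Show in-flight requests whose backend fetch failed to allocate storage:
sudo varnishlog -n frontend -g request -q 'FetchError ~ "Could not get storage"'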

While cache_text has had a limit on transient storage usage for a long time, this was not the case for cache_upload. We started limiting transient storage on cache_upload on 2020-03-17 to address T185968.
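For reference, the cap itself is just a bounded malloc stevedore named Transient passed to varnishd. Below is a minimal sketch of how such a cap is expressed and checked; the real daemon arguments, sizes, and VCL path on the cache nodes are managed by Puppet, so everything here is illustrative only:

# Illustrative varnishd invocation: defining Transient as a bounded malloc
# stevedore caps it (left undefined, Transient is unbounded malloc storage).
varnishd -n frontend -a :80 -f /etc/varnish/frontend.vcl \
    -s malloc,100G \
    -s Transient=malloc,5G

# Confirm the configured storage backends on a running instance:
sudo varnishadm -n frontend storage.list

# Watch transient usage and allocation failures:
sudo varnishstat -n frontend -1 -f SMA.Transient.g_bytes -f SMA.Transient.c_fail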

To see whether a larger amount of transient storage would help, we doubled the limit from 5G to 10G yesterday, 2020-04-08, at around 16:00. Unfortunately, transient storage quickly filled up on the affected hosts after restarting varnish:

Screenshot from 2020-04-09 11-43-36.png (281 KB)

The following theory has been ruled out, as the problem persists on cp3051 after downgrading varnish:
On 2020-04-06 around 13:05 we deployed a patched varnish version addressing a bug that was causing varnishd crashes: T249344. Our patch was merged upstream and we backported it to 5.x. The patch may or may not have something to do with the issue described here.

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as High priority. Apr 9 2020, 9:41 AM
ema renamed this task from varnish-fe exhausting transient memory to cache_upload varnish-fe exhausting transient memory. Apr 9 2020, 9:43 AM
ema updated the task description.

Mentioned in SAL (#wikimedia-operations) [2020-04-09T09:46:07Z] <ema> cp3051: disable transient storage limit and restart varnish-fe T249809

Mentioned in SAL (#wikimedia-operations) [2020-04-09T10:30:17Z] <ema> cp3051: re-enable transient storage limit, downgrade varnish to 5.1.3-1wm12 (no 0035-vbf_stp_condfetch_crash.patch) and restart varnish-fe T249809

Mentioned in SAL (#wikimedia-operations) [2020-04-09T12:31:43Z] <ema> cp3051: upgrade varnish to 5.1.3-1wm13 once again, restart varnish-fe T249809

Change 588650 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: remove n-hit-wonder admission policy

https://gerrit.wikimedia.org/r/588650

Change 588650 merged by Ema:
[operations/puppet@production] vcl: remove n-hit-wonder admission policy

https://gerrit.wikimedia.org/r/588650

Change 588945 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: introduce wm_admission_policies

https://gerrit.wikimedia.org/r/588945

Change 589341 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: move 'exp' admission policy to wm_admission_policies

https://gerrit.wikimedia.org/r/589341

Change 589342 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: 10M cutoff for the 'exp' admission policy

https://gerrit.wikimedia.org/r/589342

Change 588945 merged by Ema:
[operations/puppet@production] vcl: introduce wm_admission_policies

https://gerrit.wikimedia.org/r/588945

Change 589341 merged by Ema:
[operations/puppet@production] vcl: move 'exp' admission policy to wm_admission_policies

https://gerrit.wikimedia.org/r/589341

Change 589342 merged by Ema:
[operations/puppet@production] vcl: 10M cutoff for the 'exp' admission policy

https://gerrit.wikimedia.org/r/589342

Change 594126 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: pass fe_mem_gb to vcl_config

https://gerrit.wikimedia.org/r/594126

Change 594144 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: test 'exp' admission policy on two nodes

https://gerrit.wikimedia.org/r/594144

Change 594126 merged by Ema:
[operations/puppet@production] vcl: pass fe_mem_gb to vcl_config

https://gerrit.wikimedia.org/r/594126

Change 594144 merged by Ema:
[operations/puppet@production] vcl: test 'exp' admission policy on two nodes

https://gerrit.wikimedia.org/r/594144

Mentioned in SAL (#wikimedia-operations) [2020-05-05T08:19:12Z] <ema> cp2027 and cp2029 (both text): varnish-fe restart to clear cache and evaluate 'exp' admission policy T144187 T249809

Mentioned in SAL (#wikimedia-operations) [2020-05-05T08:27:32Z] <ema> cp2028 and cp2030 (both upload): varnish-fe restart to clear cache and evaluate 'exp' admission policy T144187 T249809


The 'exp' admission policy experiment is still ongoing; I'll wait for tomorrow before drawing conclusions. What I have noticed just now, however, is that the median Content-Length of responses for which we use transient storage on upload@esams isn't incredibly high:

15:13:57 ema@cp3051.esams.wmnet:~
$ timeout --foreground 60 varnishncsa -g request -n frontend -q 'Storage eq "malloc Transient"' -F '%{Content-Length}o'| sort -n | awk '{v[NR]=$1} END { print "med", v[int(NR*.5)] ; print "p75", v[int(NR*.75)] ; print "p95", v[int(NR*.95)] }'
med 370661
p75 912115
p95 5036568

It might be worth playing a little with the current pass size cutoff for the simple size-based admission policy that we're using by default. Right now the cutoff is 256K; maybe increasing it a bit would help with transient usage without having a negative impact on hit rate?
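A possible way to put a number on that before changing anything, reusing the same varnishncsa filter as above (384K below is just an example candidate cutoff, not a decision):

# Share of responses currently served via Transient that would fit under a
# hypothetical 384K cutoff (393216 bytes), i.e. would no longer be passed:
timeout --foreground 60 varnishncsa -g request -n frontend \
    -q 'Storage eq "malloc Transient"' -F '%{Content-Length}o' \
    | awk '$1 ~ /^[0-9]+$/ { n++; if ($1 + 0 <= 393216) small++ }
           END { if (n) printf "%.1f%% of %d responses <= 384K\n", 100 * small / n, n }'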

Change 594896 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: set large_objects_cutoff in hiera

https://gerrit.wikimedia.org/r/594896

Change 594923 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "vcl: test 'exp' admission policy on two nodes"

https://gerrit.wikimedia.org/r/594923

Change 594923 merged by Ema:
[operations/puppet@production] Revert "vcl: test 'exp' admission policy on two nodes"

https://gerrit.wikimedia.org/r/594923

Change 594896 merged by Ema:
[operations/puppet@production] vcl: set large_objects_cutoff in hiera

https://gerrit.wikimedia.org/r/594896

Change 594971 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cp3051: set large_objects_cutoff to 384K

https://gerrit.wikimedia.org/r/594971

Change 594971 merged by Ema:
[operations/puppet@production] cp3051: set large_objects_cutoff to 384K

https://gerrit.wikimedia.org/r/594971

Change 596430 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cp3051: lower large_objects_cutoff to 288K

https://gerrit.wikimedia.org/r/596430

Change 596430 merged by Ema:
[operations/puppet@production] cp3051: lower large_objects_cutoff to 288K

https://gerrit.wikimedia.org/r/596430

It might be worth experimenting with enabling request coalescing for large files. That could help reduce pressure on transient storage, I think; worth giving it a shot next week.
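If we do try that, a simple before/after comparison of transient pressure could be done with varnishstat; a sketch, sampling once per minute (counter names assume the usual SMA.Transient stevedore):

# Sample transient bytes in use and allocation failures once per minute:
while true; do
    date -u +%FT%TZ
    sudo varnishstat -n frontend -1 -f SMA.Transient.g_bytes -f SMA.Transient.c_fail
    sleep 60
done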

Change 597008 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cp3051: end large_objects_cutoff experiment

https://gerrit.wikimedia.org/r/597008

Change 597008 merged by Ema:
[operations/puppet@production] cp3051: end large_objects_cutoff experiment

https://gerrit.wikimedia.org/r/597008

BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

BCornwall changed the task status from Open to Stalled. Mar 16 2023, 7:41 PM
BCornwall subscribed.

@Vgutierrez, @BBlack: What's the status of this? It's three years old, but I took a look at the transient memory metrics and there are only a few instances where it maxed out, all of which were freed up shortly afterwards.

BCornwall claimed this task.

I haven't been able to see any indication that this has been an issue across the entire history of our metrics. @ema's great work has likely fixed this, so I'll mark this as closed. Thanks, Ema!