
Explicitly limit varnishd transient storage
Closed, Resolved · Public

Description

We've got data on the Transient SMA's g_bytes allocations in prometheus now. Let's look at the data and set some healthy upper bounds (possibly per-layer or per-cluster if they differ substantially) that leaves a little room for growth/spikes? Unbounded transient allocation has caused us problems in the past.

Event Timeline

Restricted Application added a subscriber: Aklapper.
ema triaged this task as Medium priority. May 9 2017, 8:35 AM
ema moved this task from Triage to Caching on the Traffic board.
Qse24h closed this task as a duplicate of T164723: New git repository: <repo name>.

Note that the limit cannot be set with a configuration parameter; it has to be set by defining a storage backend named "Transient".

For example: -s Transient=malloc,1G.

See https://www.varnish-cache.org/docs/4.1/users-guide/storage-backends.html#transient-storage.
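For concreteness, a minimal sketch of what an explicitly bounded Transient store looks like on the varnishd command line (the sizes, VCL path and instance name here are illustrative, not the values we would deploy):

varnishd -n frontend \
    -a :80 \
    -f /etc/varnish/frontend.vcl \
    -s malloc,100G \
    -s Transient=malloc,1G

Without an explicit -s Transient=... definition, varnishd creates an unbounded malloc Transient store by default, which is exactly the situation we want to avoid.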

The following graph plots max by (job, layer) (varnish_sma_g_bytes{type="Transient"}): https://grafana.wikimedia.org/dashboard/db/varnish-transient-storage-usage
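To cross-check the graph against a live host, the same gauge can be read locally with varnishstat; a sketch, using -n frontend for the frontend instance (as in the varnishlog examples later in this task) and the default instance for the backend:

varnishstat -1 -n frontend -f 'SMA.Transient.*'
varnishstat -1 -f 'SMA.Transient.*'

SMA.Transient.g_bytes is the counter behind varnish_sma_g_bytes{type="Transient"} in prometheus.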

Change 353274 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: limit varnishd transient storage

https://gerrit.wikimedia.org/r/353274

Change 353567 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] VCL: be careful about grace/keep on 0-TTL objects...

https://gerrit.wikimedia.org/r/353567

I've queried prometheus as follows to find the maximum transient storage usage per cache_type/layer over the past 30 days:

max by (job,layer) (max_over_time(varnish_sma_g_bytes{type="Transient"}[30d])) / 1000000

Methodology

ssh -L 8000:localhost:80 prometheus1003.eqiad.wmnet
ssh -L 8001:localhost:80 prometheus2003.codfw.wmnet
ssh -L 8002:localhost:80 bast3002.wikimedia.org
ssh -L 8003:localhost:80 bast4001.wikimedia.org

Open: http://localhost:8000/ops/graph?g0.range_input=1h&g0.expr=max+by+(job%2Clayer)+(max_over_time(varnish_sma_g_bytes%7Btype%3D%22Transient%22%7D%5B30d%5D))+%2F+1000000&g0.tab=1

Port 8000 is eqiad; change the port to get data for the other DCs.
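With the tunnels above in place, the same query can also be issued against the prometheus HTTP API rather than the graph UI; a sketch for eqiad (port 8000), assuming the /ops path prefix shown in the URL above:

curl -sG 'http://localhost:8000/ops/api/v1/query' \
    --data-urlencode 'query=max by (job,layer) (max_over_time(varnish_sma_g_bytes{type="Transient"}[30d])) / 1000000'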

Results

All values in megabytes.

Text

layer    | eqiad | codfw | esams | ulsfo
frontend |  1433 |   922 |  2459 |  1542
backend  |   850 |   733 |   702 |   541

Upload

layer    | eqiad | codfw | esams | ulsfo
frontend | 34763 | 14145 | 52407 | 36174
backend  | 21967 |  9474 | 42525 | 30856

Misc

layer    | eqiad | codfw | esams | ulsfo
frontend |  3757 |   843 |  1369 |  1537
backend  | 27658 |  1106 | 14495 |  1210

Proposed caps

On text, transient storage usage seems pretty reasonable; we could cap as follows, leaving plenty of room for spikes:

cache_type | layer    | cap
text       | frontend | 5G
text       | backend  | 2G

When it comes to upload and misc, instead, we should try to find out why transient storage usage grows up to > 50G in certain circumstances before setting a cap.


I've tweaked the varnish-transient-storage-usage dashboard adding some templating to choose a specific DC/cache_type/layer.

Taking esams/upload/frontend as an example, cp3035 stands out: it used ~25G of transient memory yesterday for some 10 minutes. Except for an increase in network activity around 23:44, I don't see other stats correlating with the sharp increase in transient memory usage, which makes me think we might be missing some important varnish metrics in varnish-machine-stats.

On text, transient storage usage seems pretty reasonable; we could cap as follows, leaving plenty of room for spikes:

cache_type | layer    | cap
text       | frontend | 5G
text       | backend  | 2G

+1 :)

When it comes to upload and misc, instead, we should try to find out why transient storage usage grows up to > 50G in certain circumstances before setting a cap.

Agreed. We do need to get transient total size under control on all the clusters in the long run, so we can make efficient frontend memory allocations without worrying about spikes leading to OOMs on the machine and crashing/killing varnishd. When I've looked in detail before, these peaks tend to be very sharp and short-lived. I've suspected it's some form of buffering on large and/or uncacheable content, and/or related to the few cases where we turn off do_stream in misc and upload. It needs a lot more digging. Some salient points of guesswork/reference that might help us approach the problem from the right angles:

  1. cache_misc still has a do_stream = false case on the backend-most, perhaps this should be limited to cacheable responses?
  2. We know the most-common/documented causes of transient usage are hit-for-pass objects and shortlived responses (under 10s in our config I believe?). For the HFP case: are there cases where we're generating excessive amounts of these, perhaps by Varying on cookies while creating hit-for-pass objects? For the shortlived case: perhaps this happens when we fetch large cacheable objects which happen to be < 10s away from their natural Age expiry?
  3. In general, how does input buffering speed work in V4? We know that for cacheable content it buffers to the actual cache storage (to create an eventual long-term object) and no longer limits parallel coalesced clients to the slowest client's speed, thanks to the V4 be/fe-splitting work within the architecture of one daemon. In these cases, do you think it's fetching from the next backend at the speed of the fastest coalesced client, or at full possible speed independent of all client download speeds?
  4. What happens in the case of an uncacheable response (which is necessarily only streaming through to a single client connection)? Does a daemon fetch from its backend at full speed into transient storage as a buffer, then free the transient space once the client has consumed it? (And then does it delete the whole object once at the end, or delete progressively as the client fetches?) Or does it fetch at the client's speed and keep the buffering minimal?
  5. Can bad coalesce queueing on uncacheable responses exacerbate anything above? We may have cases where the response headers indicate uncacheability but we're failing to either be in explicit pass mode or immediately create a hit-for-pass object, resulting in multiple clients queueing and serially fetching their personal variants of the uncacheable object. Does it recognize this case as the headers for each response come through (thus serializing the header-fetch part but mostly parallelizing the response-body fetches for the queue), resulting in many parallel transient allocations for the coalescing clients here?

Change 353274 merged by Ema:
[operations/puppet@production] varnish: limit varnishd transient storage size

https://gerrit.wikimedia.org/r/353274

Just a few (partial) answers so far, but here we go!

  1. cache_misc still has a do_stream = false case on the backend-most, perhaps this should be limited to cacheable responses?

Yep. Streaming uncacheable responses seems like a good idea regardless of transient storage.

  2. We know the most-common/documented causes of transient usage are hit-for-pass objects and shortlived responses (under 10s in our config I believe?). For the HFP case: are there cases where we're generating excessive amounts of these, perhaps by Varying on cookies while creating hit-for-pass objects?

I was about to say that there doesn't seem to be a correlation between hfp creation rate and transient storage usage spikes. However, we're plotting cache_hitpass, which is the number of hfp hits. There doesn't seem to be a varnish-counter tracking hfp creation, so this is certainly something to look into.

For the shortlived case: perhaps this happens when we fetch large cacheable objects which happen to be < 10s away from their natural Age expiry?

With our current configuration, an object is considered shortlived when ttl+keep+grace is < 10s. Given our keep/grace settings I'd imagine this basically never happens? That said, there's no varnish-counter tracking the creation of shortlived objects either, and it would be great to have one to confirm this.
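For reference, that 10s threshold is varnishd's shortlived runtime parameter, which can be inspected (and, in principle, tuned) with varnishadm; a sketch, using the frontend instance name as elsewhere in this task:

varnishadm -n frontend param.show shortlived
# varnishadm -n frontend param.set shortlived 5.000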

Side note: I've run the following to observe live transient storage usage. RespStatus 200 avoids seeing all the redirects, and pybal/varnish checks are excluded as they all cause synth responses:
varnishlog [-n frontend] -q 'Storage ~ "Transient" and ReqMethod eq "GET" and RespStatus eq 200 and ReqURL ne "/check" and ReqURL ne "/wikimedia-monitoring-test" and ReqURL ne "/from/pybal"'

Surprisingly, after a couple of minutes on upload-esams, 0 requests have matched. Perhaps uncacheable responses are served from transient storage without that information being logged to VSM? Another thing to further look into.
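Perhaps the Storage record is only attached to the backend side of the transaction; if so, repeating the experiment against backend transactions might be more fruitful, e.g.:

varnishlog [-n frontend] -b -q 'Storage ~ "Transient" and BereqMethod eq "GET" and BerespStatus eq 200'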

Change 361845 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] 4.1.6-1wm2: new varnish-counters for transient storage

https://gerrit.wikimedia.org/r/361845

Change 361845 merged by Ema:
[operations/debs/varnish4@debian-wmf] 4.1.7-1wm1: new upstream, new counters

https://gerrit.wikimedia.org/r/361845

Mentioned in SAL (#wikimedia-operations) [2017-06-29T14:30:19Z] <ema> varnish 4.1.7-1wm1 uploaded to apt.w.o, cp1008 upgraded T164768

As of yesterday, varnish 4.1.7-1wm1 is deployed on all cache hosts. It includes our patch adding two counters, one for shortlived object creation and another for uncacheable objects. I've added both counters to the varnish-transient-storage-usage dashboard.

We're still missing caps for the upload cluster, right? (well and misc, but that case isn't all that important here). I'm a little concerned about the interplay of unbounded transient mem spikes and NUMA in the new cp4's, although I think so long as they're happening on backends we're probably fine (even better than the non-NUMA case). A big upload frontend transient spike will likely oomkill the NUMA-isolation nodes much more easily than before...

We're still missing caps for the upload cluster, right? (well and misc, but that case isn't all that important here).

Correct, only text is capped for now; see the hiera attributes cache::{fe,be}_transient_gb.
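To see where those attributes end up, a quick grep of the puppet tree works; a sketch, run from an operations/puppet checkout:

git grep -n transient_gb hieradata/ modules/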

ema closed this task as Resolved. Jul 22 2020, 3:02 PM
ema claimed this task.

Both text and upload now have limited transient:

hieradata/role/common/cache/text.yaml:profile::cache::varnish::frontend::transient_gb: 5
hieradata/role/common/cache/upload.yaml:profile::cache::varnish::frontend::transient_gb: 10
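A quick sanity check on a cache host is to look for the Transient definition in the running varnishd arguments; a sketch (output shape is illustrative):

pgrep -a varnishd | grep -o 'Transient=[^ ]*'

On a text frontend this should show something like Transient=malloc,5G, and Transient=malloc,10G on an upload frontend.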

Closing.