
cache_upload cache policy + large_objects_cutoff concerns
Open, Medium, Public

Description

See parent task for context. Copying some bits of text from there:

The current value of large_objects_cutoff is too small for commonly-linked images. This is not easily analyzed or fixed without looking at some other complex considerations about the total cache size, object size distributions, and varnish's eviction behavior and the ways it can fail, so we'll come back to this in more detail elsewhere in the ticket, I think. The current temporary mitigation touches on this factor, but in a conservative and limited way.
[...]
large_objects_cutoff and all related things - the key thing that makes tuning this hard is varnish's nuke_limit, which caps the number of objects it will evict to make room for one new object (without any regard to relative sizes). If it can't make room within nuke_limit evictions, the request fails. If there's a large disparity in the minimum and maximum object sizes stored in the cache, large objects could commonly fail because there are too many small objects needing eviction to make room. Setting nuke_limit arbitrarily-large can also hurt (thrashing of cache contents for one-hit-wonder large objects, slow responses), and setting large_objects_cutoff much larger without a nuke_limit increase causes failures. The current eqsin-only mitigation is working, but it will fail to work for slightly-larger images than the last incident.

We've also had some more-dynamic policies in the past based on T144187, where instead of having a fixed cutoff, the probability of the object entering the cache decreases as the size increases. This allows us to say something like "don't normally cache rare large objects, but do eventually cache them if they're popular enough". However, the tuning of the parameters for that is now quite outdated, and was also based purely on improving object and/or byte hit-rates, not surviving overwhelming events. The previous tuning gave near-zero probability of cache entry for the sizes in question in our recent scenarios. However, we could use a similar solution that's tuned for this problem rather than for hitrates, maybe (same exponential curve, but with admission probabilities that suit the cases we care about here...). I'm still digging into this.
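
To make the nuke_limit interaction concrete, here's a toy model (an illustrative Python sketch, not varnish's actual eviction code): an LRU cache full of small objects can need far more than nuke_limit evictions to admit one large object, at which point the insertion (and the request) fails.

from collections import OrderedDict

def try_insert(cache, capacity, used, name, size, nuke_limit):
    """Evict LRU entries until `size` fits, giving up after nuke_limit evictions."""
    evictions = 0
    while used + size > capacity:
        if evictions == nuke_limit or not cache:
            return used, False                   # analogous to an LRU nuke failure
        _, evicted_size = cache.popitem(last=False)
        used -= evicted_size
        evictions += 1
    cache[name] = size
    return used + size, True

cache, capacity, used = OrderedDict(), 256 * 1024 * 1024, 0      # 256MB toy cache
for i in range(10000):                                           # fill it with 32KB objects
    used, _ = try_insert(cache, capacity, used, f"small{i}", 32 * 1024, nuke_limit=50)

used, ok = try_insert(cache, capacity, used, "big1", 4 * 1024 * 1024, nuke_limit=50)
print(ok)   # False: needs ~128 evictions of 32KB objects, gives up at 50
used, ok = try_insert(cache, capacity, used, "big2", 4 * 1024 * 1024, nuke_limit=1000)
print(ok)   # True: 1000 evictions is plenty of headroom for this size disparity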

Event Timeline

So, to expand a little bit on the text quoted at the top, here are some initial insights about the cutoff vs nuke_limit tradeoffs and some of my current thinking and/or assumptions:

  • Turnilo webrequest-128 data seems to confirm that we do, as expected, have a large volume of upload-cluster traffic for small-sized files (something like ~80% of requests return objects <= 64KB in size, from what I'm seeing). This isn't exactly the data we want for comparison. We'd rather know the distribution of object sizes inside the cache itself over time, but this is harder to see!
  • Ideally, if we want frontend coverage for all "reasonable" images, a good absolute cutoff would probably be closer to the ~64MB we're using on cache_text, vs the 4MB temporary fix currently in eqsin and the historical setting of 256KB.
  • With 195GB of storage in the frontends (all nodes in 2021 are at ~that size), there's only room for a grand total of ~3120 objects at that 64MB limit. Or, put another way, one 64MB object will consume ~0.03% of the cache storage (a quick arithmetic check follows this list). This sounds small-ish, but it's really not. 1000 such objects could consume ~1/3 of the cache, which normally holds probably hundreds of thousands of objects, or maybe low millions. If we cache all such objects, even when they're not hot, it will cause a lot of churn and pointless eviction of hot smaller objects for rare cold objects that happen to be, say, 50MB in size. This will exacerbate pressure on varnish CPU and lock contention, and also give us a lower object hitrate.
  • The constraints above seem to push us into an unwinnable corner on the surface. We can tune the parameters to fix one problem, but it will cause some other problem later. (this is also the case for the related pass_random stuff - we've turned it on for upload before, and it solves the one-node-focus problem for misses/passes to the backend layer, but it also cuts our total cache storage dramatically...).
  • One way out of this is to look at a dynamic strategy, described below (right after a quick check of the storage arithmetic above).
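
As a quick check of the storage arithmetic in the list above (a minimal sketch; the 195GB and 64MB figures come from the bullets, the rest is just arithmetic):

cache_bytes = 195 * 1024**3                      # ~195GB of frontend storage
obj_bytes = 64 * 1024**2                         # one 64MB object

print(cache_bytes // obj_bytes)                  # 3120 objects if everything were 64MB
print(100.0 * obj_bytes / cache_bytes)           # one 64MB object is ~0.032% of storage
print(100.0 * 1000 * obj_bytes / cache_bytes)    # 1000 such objects eat ~32% (~1/3) of it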

We have such a strategy in our puppetized VCL, currently unused, called "exp" - https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/wikimedia-frontend.vcl.erb#914 (and see also the definition of adm_param near the top of the file). This policy admits files into cache storage based on a probabilistic function with an exponential response. This gives us a probability-based solution that is very likely to cache smaller objects the first time they're seen, but less-likely to cache larger objects unless they're seen enough times to make it past a probability filter (e.g. a 64MB file might have an 8% chance to enter cache each time we see it, or whatever).

The tuning parameters for the function, as presently puppetized, auto-tune themselves to maximize object hitrate given our total cache storage size as a parameter. Some of the other fixed parameters that contribute to the calculation of the adm_param were based on research from our webrequest data years ago to figure out size and popularity distributions, and might still be "close", but are no longer current.

In practice, with the parameters that are puppetized there today, it skews heavily in favor of avoiding the caching of large objects in the ranges we care about for our failure scenario, again because it wasn't driven by defending against this "attack", but instead by maximizing hitrates. I wrote a small C program to reproduce the math accurately to see the numbers. The current admission probabilities (1.0 being "always cache on first hit" and 0.0 being "never cache this") for various sizes with these parameters and 195GB of storage look like:

bblack@haliax:~$ cat testme.c
/* Reproduces the "exp" policy admission math (constants from the puppetized adm_param calc).
 * Build with: gcc -o testme testme.c -lm
 * Usage: ./testme <object-size-in-bytes> */
#include <stdlib.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
    const double fe_mem_gb = 195.0;
    /* adm_param as auto-tuned from the frontend storage size */
    const double adm_param = pow(fe_mem_gb / 1024.0, 0.9064392) / pow(2.0, -18.16135);
    const double size = strtod(argv[1], NULL);
    const double clen_neg = -1.0 * size;
    /* admission probability: exp(-size / adm_param) */
    const double prob = exp(clen_neg/adm_param);
    printf("Size: %f Prob: %f\n", size, prob);
    return 0;
}
bblack@haliax:~$ for x in 1024 16384 65536 262144 1048576; do ./testme $x; done
Size: 1024.000000 Prob: 0.984417
Size: 16384.000000 Prob: 0.777792
Size: 65536.000000 Prob: 0.365977
Size: 262144.000000 Prob: 0.017940
Size: 1048576.000000 Prob: 0.000000

So even at 256KB it's only giving a ~2% chance of cache entry, and by 1MB the odds are virtually zero.

For our defensive purposes here, though, I think we could propose (a) Setting the large_objects_cutoff at 64MB just as a safety valve (in case of bugs in the exp policy, or needing to disable it later) + (b) Picking a fixed value for adm_param that results in reasonable probabilities in the ranges we care about. Something in the ballpark of ~5% odds around the 64MB mark, and approaching 100% by at least the time we get down to somewhere in the 128K-256K-ish range, which gets us close to the fixed behavior we had before unless the object has some level of popularity. From there we can probably tune the nuke_limit to where LRU nuke failures are rare enough that we're ok with the tradeoff.
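
To put rough numbers on (b) (a back-of-the-envelope sketch with made-up target values, not a tuned proposal): solving exp(-size/adm_param) = 0.05 at 64MB gives adm_param ≈ 22.4e6, and that single fixed value already lands roughly in the ranges described above:

import math

target_size = 64 * 1024**2                        # want ~5% admission at 64MB
adm_param = target_size / math.log(1 / 0.05)      # ~22.4e6

for kb in (128, 256, 1024, 4096, 16384, 65536):
    print(f"{kb:>6} KB -> {math.exp(-kb * 1024 / adm_param):.1%}")
# 128 KB -> 99.4%, 256 KB -> 98.8%, 1024 KB -> 95.4%,
# 4096 KB -> 82.9%, 16384 KB -> 47.3%, 65536 KB -> 5.0%

Whether ~83% at 4MB and ~47% at 16MB are acceptable is exactly the kind of hitrate-vs-defense tradeoff that tuning adm_param would need to settle.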

jbond triaged this task as Medium priority. Feb 26 2021, 1:43 PM
jbond subscribed.

Following up a bit on other paths through this problem:

From there we can probably tune the nuke_limit to where LRU nuke failures are rare enough that we're ok with the tradeoff

Obviously the ideal would be to have a solution that will not knowingly, even occasionally, create LRU nuke failures (which present as user-facing 5xx). It's actually possible to achieve that goal using varnish's default file/malloc storage, and we've done it before, for file storage, back when we had varnish as the backend cache daemon as well. The secret is you have to split up your storage into size-based bins to ensure that you can cover a broad range of object sizes without ever running afoul of nuke_limit. For example, you can choose that nuke_limit has a fixed value of 256, and then ensure that for every given storage pool, there are minimum and maximum sizes which are no further apart than a factor of 256, like:

pool     min       max
pool1    1         256
pool2    257       64K
pool3    64K+1     16MB
pool4    16MB+1    4GB

In each pool above, even if it were full of only minimum-sized objects, it could never take more than 256 evictions to make room for a maximally-sized one, thus LRU nuke failure should be a non-issue.

If we're ok with nuke_limit of 1000, it gets even easier to reach the 1GB mark (which is the ultimate case beyond which we don't cache at all, even in the backend layer), with only 3 pools:

pool     min       max
pool1    1         1000
pool2    1001      1MB
pool3    1MB+1     1GB
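
A tiny checker for the two layouts above (a hypothetical sketch that just encodes the worst-case arithmetic: a pool full of minimum-sized objects needs at most ceil(max/min) evictions to fit one maximum-sized object):

import math

def worst_case_evictions(pools):
    """For each (min, max) pool, worst-case number of min-sized objects evicted
    to make room for one max-sized object."""
    return [math.ceil(hi / lo) for lo, hi in pools]

K, MB, GB = 1024, 1024**2, 1024**3

print(worst_case_evictions([(1, 256), (257, 64*K), (64*K + 1, 16*MB), (16*MB + 1, 4*GB)]))
# -> [256, 256, 256, 256]: all within a nuke_limit of 256

print(worst_case_evictions([(1, 1000), (1001, 1*MB), (1*MB + 1, 1*GB)]))
# -> [1000, 1048, 1024]: with binary MB/GB the last two are slightly over 1000,
#    so those boundaries (or nuke_limit) would want a little slack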

Here it's maybe worth diving into an aside about tiny objects:

You could see a reasonable person looking at the table above and arguing that this could reasonably be just two pools, because surely there aren't so many tiny outputs to worry about that the first pool needs to exist separately. However, turnilo says that in terms of status=200 responses served from upload, ~10% of responses are under 256 bytes, and ~29% are under 1K (think minimized PNGs or even SVGs of simple diagrams/icons, etc). Again, we'd rather know the distribution of objects in cache than of responses served, but that's a harder number to come by. Still, I think the response data is enough evidence to show that the small objects are not an edge case we can ignore.
Personally, I think this raises design issues beyond the scope of this ticket for the long term. The text/upload split we have today (which exists as the cache cluster distinction, but also critically as the upload.wikimedia.org DNS-level distinction) always seems to make intuitive sense at first glance in terms of "oh, little text outputs vs big media files", but we have fairly-large wikitext outputs and very tiny images in play as well. It would probably make more architectural sense to have all the small stuff from cache_upload actually be on a hostname served from cache_text, and split out only the truly-large cases for a different architectural treatment (which might not even be based on the generic caching stack we have today!) - a lot of the conflicts we're dealing with here come down to how broad a range of output sizes cache_upload has to deal with!

The problem that arises with the pooled solution, though, is that we then have to pick fixed storage sizes for the pools to carve up the available ~195GB into (these are set at daemon start; any change to these parameters restarts the daemon and wipes the cache). You can study the data and try to empirically come up with a split that represents the "average" case. Basically, your split could maximize long-term average hitrate for last week or last month's data. However, this won't be responsive to ever-changing patterns, or even intra-day patterns as different parts of the world sleep, the way a more general solution with a single pool is. It also implicitly creates biases that are difficult to reason about, in terms of byte-hitrate vs object-hitrate, if our metric is merely the overall hitrate. Even then, the math of the split for optimal hitrates may leave you with some questionable bin sizing in practice (in terms of how many total objects are likely to fit in a given bin). The questions quickly get pretty complex!

It also creates all kinds of dynamic inefficiencies, where some pools sit underutilized while others needlessly suffer eviction pressure on truly-hot objects. And it seems to go down the road of "create even more inexplicable complexity and future difficulties" vs something that's more intuitive to understand, like a size-based probability of insertion. Especially with a smaller memory-only pool like the frontends have, it doesn't seem like the optimal road to go down, at least to me at the moment in response to this current scenario. It's a thought worth exploring, though, and maybe I'm just not thinking right about this and/or this helps someone else think of something else novel.

We have such a strategy in our puppetized VCL, currently unused, called "exp" - https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/wikimedia-frontend.vcl.erb#914 (and see also the definition of adm_param near the top of the file). This policy admits files into cache storage based on a probabilistic function with an exponential response. This gives us a probability-based solution that is very likely to cache smaller objects the first time they're seen, but less-likely to cache larger objects unless they're seen enough times to make it past a probability filter (e.g. a 64MB file might have an 8% chance to enter cache each time we see it, or whatever).

The tuning parameters for the function, as presently puppetized, auto-tune themselves to maximize object hitrate given our total cache storage size as a parameter. Some of the other fixed parameters that contribute to the calculation of the adm_param were based on research from our webrequest data years ago to figure out size and popularity distributions, and might still be "close", but are no longer current.

In practice, with the parameters that are puppetized there today, it skews heavily in favor of avoiding the caching of large objects in the ranges we care about for our failure scenario, again because it wasn't driven by defending against this "attack", but instead by maximizing hitrates.

Last time we looked into the "exp" policy, our experiment was focused on evaluating the parameters provided by simulation data in terms of hitrate: T144187#6116181. The conclusion was that with the values under consideration the "static" policy was performing better. The idea was then to continue those experiments in order to find the magic values that would give hitrate figures similar to the static policy (something we never did BTW).

Among the various goals of a CDN there's (a) reducing user-perceived latency and (b) shielding origin servers from excessive load. Assuming all requests are equal in terms of the cost of generating them for the origins, trying to increase the hitrate is a good approach to achieve both goals. However, in practice some requests are more costly for the origins than others. If a given policy change increases hitrate from, say, 97% to 99%, but makes a class of very expensive requests uncacheable, then the change improves (a) at the expense of (b) -- it improves performance at the expense of stability.

There's another cost to be taken into account, and that's how expensive it is for the cache to cache the object, in terms of the number of evictions necessary to make room for it. The beauty of the problem we had (T274888) is that both the "origin cost" and "cache cost" were due to the same factor, namely object size. Using the "exp" caching policy (which takes size into account) might hence work, if tuned appropriately, to protect the origins at the expense of hitrate. In the general case, however, there might be no indication in the response about how much strain generating it imposed on the lower layers. A better approach to (b) going forward might need to involve an assessment of origin cost at the lower layers, to be somehow taken into account by the caching policy -- together with the caching cost (let's call it nuke cost or something).
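
Just to sketch the shape of that idea (entirely hypothetical, including the origin-cost signal, which doesn't exist today): the admission decision could combine the size-based "nuke cost" term with an origin-reported cost term, for example:

import math

def admission_probability(size_bytes, origin_cost_s, size_param, cost_param):
    """Hypothetical combined policy: less likely to admit as the object gets
    bigger (nuke cost), more likely as the origin reports the response was
    expensive to regenerate (origin cost, e.g. render time in seconds)."""
    p_size = math.exp(-size_bytes / size_param)            # same shape as the "exp" policy
    p_cost = 1.0 - math.exp(-origin_cost_s / cost_param)   # favor expensive-to-regenerate objects
    return max(p_size, p_cost)                             # one possible way to combine the two

# A 32MB thumbnail that took 4s to render vs one served cheaply (all numbers made up):
print(admission_probability(32 * 1024**2, origin_cost_s=4.0, size_param=22.4e6, cost_param=2.0))   # ~0.86
print(admission_probability(32 * 1024**2, origin_cost_s=0.05, size_param=22.4e6, cost_param=2.0))  # ~0.22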

For our defensive purposes here, though, I think we could propose (a) Setting the large_objects_cutoff at 64MB just as a safety valve (in case of bugs in the exp policy, or needing to disable it later) + (b) Picking a fixed value for adm_param that results in reasonable probabilities in the ranges we care about. Something in the ballpark of ~5% odds around the 64MB mark, and approaching 100% by at least the time we get down to somewhere in the 128K-256K-ish range, which gets us close to the fixed behavior we had before unless the object has some level of popularity. From there we can probably tune the nuke_limit to where LRU nuke failures are rare enough that we're ok with the tradeoff.

This sounds reasonable, though I think we need to define the acceptable hitrate penalty we're willing to take. Also, do we know why reaching nuke_limit results in a 503? Couldn't varnish just pass instead in that case, assuming enough space in transient memory to handle the request?

Change 677814 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: add script to test exp policy offline

https://gerrit.wikimedia.org/r/677814

Change 677814 merged by Ema:

[operations/puppet@production] varnish: add script to test exp policy offline

https://gerrit.wikimedia.org/r/677814

Change 677820 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: enable exp caching policy on cp5001

https://gerrit.wikimedia.org/r/677820

Change 677820 merged by Ema:

[operations/puppet@production] cache: enable exp caching policy on cp5001

https://gerrit.wikimedia.org/r/677820

Mentioned in SAL (#wikimedia-operations) [2021-04-08T09:20:30Z] <ema> cp5001: varnish-frontend-restart to test exp policy settings starting from a empty cache T275809

Change 677718 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: enable exp caching policy on cp5001

https://gerrit.wikimedia.org/r/677718

Change 677718 merged by Ema:

[operations/puppet@production] cache: enable exp caching policy on cp5001

https://gerrit.wikimedia.org/r/677718

Mentioned in SAL (#wikimedia-operations) [2021-04-08T12:49:38Z] <ema> cp5001: varnish-frontend-restart to test exp policy settings starting from a empty cache T275809

Today I've added exp_policy.py, a trivial script to show the probability of caching objects of various sizes with the "exp" policy. Based on that script, I found that with rate=0.1 and base=-20.3 we get the following probabilities for a 384G cache:

obj size     admission
1.0 KB       99.9%
2.0 KB       99.8%
4.0 KB       99.7%
8.0 KB       99.3%
16.0 KB      98.6%
32.0 KB      97.2%
64.0 KB      94.6%
128.0 KB     89.4%
256.0 KB     79.9%
512.0 KB     63.9%
1024.0 KB    40.8%
2048.0 KB    16.7%
4096.0 KB    2.78%

The values looked reasonable, in that small-ish objects are cached almost certainly, while large objects (~4M) still have a fair chance to make it if they're popular (~2.8% admission probability per request). After emptying the cache and applying the policy changes, cp5001 is now back at about 87% frontend hitrate, which is in line with all other upload@eqsin nodes still using the static "cache all below size threshold" policy.
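
For reference, the table above can be reproduced with the same formula as the C program earlier in this task (a standalone sketch, assuming rate and base simply replace the hard-coded 0.9064392 and -18.16135 exponents; this is not necessarily exp_policy.py's actual interface):

import math

def admission_prob(size_bytes, cache_gb, rate, base):
    # Same shape as testme.c above: adm_param = (cache_gb/1024)^rate / 2^base
    adm_param = math.pow(cache_gb / 1024.0, rate) / math.pow(2.0, base)
    return math.exp(-size_bytes / adm_param)

for kb in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096):
    print(f"{kb}.0 KB  {admission_prob(kb * 1024, cache_gb=384, rate=0.1, base=-20.3):.1%}")
# e.g. 64.0 KB -> 94.6%, 1024.0 KB -> 40.8%, 4096.0 KB -> 2.8%, matching the table above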

Screenshot from 2021-04-08 18-07-58.png

Interestingly, the shape of the fill-up curve with "exp" looks more gradual -- compare the growth between 9:30 and 12:00 (static policy) with the growth between 13:00 and 16:00 (exp).

Screenshot from 2021-04-08 18-14-00.png

The total number of objects in cache is also so far higher than on all other upload@eqsin caches.

Screenshot from 2021-04-08 18-16-36.png

Change 678788 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: explicitly list object sizes in exp_policy.py

https://gerrit.wikimedia.org/r/678788

Change 678790 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: enable exp caching policy on upload@eqsin

https://gerrit.wikimedia.org/r/678790

Change 678788 merged by Ema:

[operations/puppet@production] cache: explicitly list object sizes in exp_policy.py

https://gerrit.wikimedia.org/r/678788

Change 678790 merged by Ema:

[operations/puppet@production] cache: enable exp caching policy on upload@eqsin

https://gerrit.wikimedia.org/r/678790

Mentioned in SAL (#wikimedia-operations) [2021-04-13T09:41:39Z] <ema> cp[5002-5006]: rolling varnish-frontend-restart to apply exp policy settings changes starting from empty caches T275809

cp5001 has been running with the exp policy for 5 days now: compared to other upload nodes in eqsin, its hitrate is higher (~3%) and it is storing significantly more objects (~5.8M vs ~3.9M). Extending the policy change to the rest of upload@eqsin.

Apparently we do occasionally reach nuke_limit on cache_upload nodes in DCs other than eqsin -- this is unrelated to the exp policy, since we haven't moved away from the static "size threshold" policy anywhere other than Singapore yet.

Screenshot from 2021-04-14 17-24-42.png

When that happens, affected requests are only partially sent to the client, see T266373. We should raise nuke_limit, currently 50 (default) on cache_upload.

Change 679364 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache_upload: set nuke_limit to 1000

https://gerrit.wikimedia.org/r/679364

Change 679364 merged by Ema:

[operations/puppet@production] cache_upload: set nuke_limit to 1000

https://gerrit.wikimedia.org/r/679364

I've added a dashboard called Varnish Anomalies, currently plotting when nuke_limit is reached, as well as Varnish fetch failures. On non-eqsin upload there's basically a one-to-one match between the two at the moment:

Screenshot from 2021-04-15 11-02-46.png

Raised nuke_limit to 1K on all cache_upload to fix that.

Mentioned in SAL (#wikimedia-operations) [2021-04-15T09:16:05Z] <ema> cp-upload: varnishadm -n frontend param.set nuke_limit 1000 T275809

Change 680197 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: enable exp caching policy on upload@ulsfo

https://gerrit.wikimedia.org/r/680197

Change 680197 merged by Ema:

[operations/puppet@production] cache: enable exp caching policy on upload@ulsfo

https://gerrit.wikimedia.org/r/680197

Mentioned in SAL (#wikimedia-operations) [2021-04-16T07:41:09Z] <ema> cp-upload_ulsfo: rolling varnish-frontend-restart to apply exp policy settings changes starting from empty caches T275809

Change 680995 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: enable exp caching policy on cp3051

https://gerrit.wikimedia.org/r/680995

Change 680995 merged by Ema:

[operations/puppet@production] cache: enable exp caching policy on cp3051

https://gerrit.wikimedia.org/r/680995

Mentioned in SAL (#wikimedia-operations) [2021-04-19T09:56:54Z] <ema> cp3051: varnish-frontend-restart to apply exp policy settings changes starting from empty cache T275809

Change 681064 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: decrease caching likelihood on cp3051

https://gerrit.wikimedia.org/r/681064

Change 681064 merged by Ema:

[operations/puppet@production] cache: decrease caching likelihood on cp3051

https://gerrit.wikimedia.org/r/681064

Change 694438 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: print exp_policy_base and exp_policy_rate

https://gerrit.wikimedia.org/r/694438

Change 694440 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: add command line parameters to exp_policy.py

https://gerrit.wikimedia.org/r/694440

Change 694438 merged by Ema:

[operations/puppet@production] cache: print exp_policy_base and exp_policy_rate

https://gerrit.wikimedia.org/r/694438

Change 694440 merged by Ema:

[operations/puppet@production] cache: add command line parameters to exp_policy.py

https://gerrit.wikimedia.org/r/694440

Change 697538 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: use 'exp' admission policy on cache_upload

https://gerrit.wikimedia.org/r/697538

Change 697538 merged by Ema:

[operations/puppet@production] cache: use 'exp' admission policy on cache_upload

https://gerrit.wikimedia.org/r/697538