Page MenuHomePhabricator

Better handling for one-hit-wonder objects
Open, MediumPublic

Description

One of the sources of cache hit-rate inefficiency is one-hit-wonder objects. These are objects which are cacheable and get cached on usage, but tend to only get hit once in a blue moon (or once ever). They're still going to evict some other object based on LRU in a tight cache (especially our frontends), and that's usually the wrong choice. The hard part is "predicting" that a missed object will end up being a one-hit-wonder.

The first and most-general approach to this, that we've discussed (mostly offline from phab) in the past is to employ a Bloom Filter. The filter itself would be a fairly large memory object, but small compared to the actual dataset, and would allow us to tell the difference between "first miss ever" and "a miss on an object we've missed before", and effectively prevent caching an object the first time it's ever accessed (waiting for the second access to cache it). I first stumbled on this idea in: https://people.cs.umass.edu/~ramesh/Site/HOME_files/CCRpaper_1.pdf . @ema found this that's probably relevant too on Cuckoo Bloom filters being a better implementation choice: https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf .

I had a new idea today, though:

While the Bloom type approach is more general and works for a single cache layer, we already have additional metadata available to us in our frontends (the smallest caches, where this matters most) because of our layered caches. All requests passing through our frontends inevitably go through one or more backend cache layers with much larger storage size. We communicate the miss/hit status from the backend to the frontend on response in the X-Cache-Int header. Therefore we could potentially implement logic in our frontend VCL of the form "If the miss-fetch from the backend indicates the backend also missed, do not cache the object in the frontend". On the second fetch, it would be a hit in the backend (from the earlier miss->fetch->store there) and then also get cached in the frontend as a twice-seen object. This has an upside versus a local bloom filter in each frontend, too: the frontends are effectively sharing (at no cost) the twice-seen dataset (so if it's seen for the first time ever in Frontend1, but Frontend2 had seen it recently, it wouldn't get counted as one-hit-wonder, like it would with local Bloom filters).

There are pragmatic bits and pieces to sort out for this idea:

  1. Ideally, we should hook in the notion of varnishd uptime, so that the backend-miss-based one-hit-wonder filter doesn't kick in until a given cache daemon has been online and serving requests for longer than X minutes, to avoid doubling the request volume to the backend when warming up a cold cache. Uptime alone doesn't cover all such cases (e.g. depooled frontends...), but it would cover a lot of the common ones.
  2. How exactly in v3 or v4 do we make a given one-hit-wonder request act in every way like a normal cacheable response but not touch the cache storage? The answer is probably obj.ttl=0s + beresp.uncacheable and similar, but we need to be sure.

This would probably be superior to local Bloom for frontends, and about the same as local Bloom for intermediate backends where chashing is already in effect. This approach doesn't work for backend-most backends which talk directly to the applayer; they'd need a real filter to solve their own one-hit-wonder problem.

Details

ProjectBranchLines +/-Subject
operations/puppetproduction+13 -0
operations/puppetproduction+8 -8
operations/puppetproduction+3 -9
operations/puppetproduction+0 -32
operations/puppetproduction+32 -0
operations/puppetproduction+13 -12
operations/puppetproduction+18 -11
operations/puppetproduction+20 -21
operations/puppetproduction+19 -26
operations/puppetproduction+4 -49
operations/puppetproduction+18 -6
operations/puppetproduction+2 -0
operations/puppetproduction+4 -1
operations/puppetproduction+0 -2
operations/puppetproduction+2 -0
operations/puppetproduction+142 -8
operations/puppetproduction+29 -0
operations/puppetproduction+5 -10
operations/puppetproduction+21 -0
Show related patches Customize query in gerrit

Event Timeline

BBlack created this task.Aug 29 2016, 3:03 PM
Restricted Application added a project: Operations. · View Herald TranscriptAug 29 2016, 3:03 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
BBlack updated the task description. (Show Details)Aug 29 2016, 3:04 PM

I have been working for some time with Ramesh on cache admission policies for Akamai workloads, and I can contribute the following takeaways

  1. frequency-based admission - admitting an object after N requests - often works much better when not only checking for one-hit-wonders but also two-hit-wonders etc. I have often seen optimal N values between 4 and 16 - though, I cannot yet say what works best for WMF's workloads. (N=1 is easy to implement when one chooses Bloom filters, but N>1 requires more work as "counting" Bloom filters are a lot harder to manage.)
  2. a new idea that simplifies frequency-based admission is to use probabilistic admission: admit with probability 1/N. In expectation (geometric distribution), this means that objects would get admitted after N requests without any need for bookkeeping (data structures etc). In our experiments, this achieved 98% of the hit ratio of frequency-based admission.
  3. for many CDN caching hierarchies, it makes sense to focus the front/memory cache on maximizing the object hit ratio (OHR) whereas the back/disk cache focuses on the byte hit ratio (BHR). This essentially means that the front/memory cache needs to cache mostly small objects, which significantly decreases the eviction volume so that it serves a higher fraction of requests (but possibly fewer bytes). For the back/disk cache, this has the advantage that there are fewer random reads and that the average requests is larger, which additionally increases sequentiality of disk/ssd reads and improves throughput.

Recently, we have been combining ideas 2. and 3. using an admission probability that is larger for small objects and (much) smaller for large objects. This turns out to be very effective but requires tricky parameterization. We've been experimenting with automatically and continuously adapting the parameter to the request traffic using some math models.
(This worked very well on Akamai production workloads, see a tech report here http://reports-archive.adm.cs.cmu.edu/anon/2016/CMU-CS-16-120.pdf)


I'm happy to run a few simulations to compare some alternatives once T128132 is ready.

I'd really like to include BBlack's new idea of exploiting the backend X-Cache, but I feel like this would require a huge simulation including all of WMF's traffic for a week or so (to simulate all of the backend). Therefore, I'm not sure how to incorporate this into a comparison - but I'd really like to.
Maybe the only way to do this is to prototype a few variants, deploy and do A/B testing? Other ideas?

(With regard to the second question in this phabricator item's description: yes (obj.ttl=0s + beresp.uncacheable) is correct. We've been experimenting a lot with Varnish + admission control, and that's what worked very well for us.)


OHR#reqs served by cache / total #reqs
BHR#bytes server by cache / total #bytes requested
BBlack added a comment.EditedAug 31 2016, 12:10 PM

frequency-based admission - admitting an object after N requests - often works much better when not only checking for one-hit-wonders but also two-hit-wonders etc. I have often seen optimal N values between 4 and 16 - though, I cannot yet say what works best for WMF's workloads. (N=1 is easy to implement when one chooses Bloom filters, but N>1 requires more work as "counting" Bloom filters are a lot harder to manage.)

Interesting! I wouldn't have expected such high numbers. In any case, our X-Cache data happens to also count hits, so we could extend it to "only cache in the frontend when the backend reports hit/4 or higher" and similar, experimentally.

  1. for many CDN caching hierarchies, it makes sense to focus the front/memory cache on maximizing the object hit ratio (OHR) whereas the back/disk cache focuses on the byte hit ratio (BHR). This essentially means that the front/memory cache needs to cache mostly small objects, which significantly decreases the eviction volume so that it serves a higher fraction of requests (but possibly fewer bytes). For the back/disk cache, this has the advantage that there are fewer random reads and that the average requests is larger, which additionally increases sequentiality of disk/ssd reads and improves throughput.

We already have a very naive and static version of this on the upload (image) cluster - we just have a conditional in place that refuses to cache objects in the frontend if they're >= X size. X was traditionally 32MB. I raised it to 50MB a while back, and we experimentally observed that it didn't help OHR at all (as I guess we should've expected, but we have so little data about our dataset, it's sometimes nice to verify!). @ema and I were just talking yesterday about going back the other direction now and experimentally lowering it until we see a positive impact on frontend OHR, but we're in the midst of other changes right now, so it's put off for a better window to see results clearer.

mark added a subscriber: mark.Aug 31 2016, 12:18 PM

Tying the X-Cache idea together with the tuneable hitcount and size from the paper, we could look at something of the form if (hc / (os * X) < Y) { uncacheable; }, where hc is backend hitcount, os is object size, and then X and Y are our tunables. Probably slightly more magic than that on the object-size front, but that would be the general idea.

Change 308995 had a related patch set uploaded (by BBlack):
cache_upload: one-hit-wonder experiment, hit/2

https://gerrit.wikimedia.org/r/308995

fgiunchedi removed a subscriber: fgiunchedi.

Change 308995 merged by BBlack:
cache_upload: two-hit-wonder experiment, hit/2

https://gerrit.wikimedia.org/r/308995

Change 311726 had a related patch set uploaded (by BBlack):
N-hit-wonder: 4-hit, and improve filtering

https://gerrit.wikimedia.org/r/311726

Change 311726 merged by BBlack:
N-hit-wonder: 4-hit, and improve filtering

https://gerrit.wikimedia.org/r/311726

I finally got some simulation results.

I'm comparing the following seven caching policies:

Name of PolicyDescription of Policy
LRUPure least-recently-used eviction (w/o any admission policy) - an approximation of what Varnish does by default
FilterLRU + frequency-based admission, i.e., "admit after N requests"
ThresholdLRU + size-threshold admission, i.e., "admit if size below threshold
Exp-sizeLRU+ probability-size admission, i.e., "admit with probability exp{-size/c}" (mixtures of frequency and size-threshold admission)
GDSGreedy-dual-size eviction (complicated eviction policy, which serves as a baseline)
GDSFGreedy-dual-size-frequency eviction (even more complicated eviction policy, which serves as a baseline)
Infinite CacheCache with infinite capacity

My simulation assumes:

  • 64 GB cache size (I know that's a bit small compared to what's deployed)
  • cp4006's front memory cache request traffic from T128132
  • best parameters for Filter, Threshold, and Exp-size (see below)

Results:

Table:

Name of Policyhit ratioimprovement over LRU
LRU0.74486940%
Filter(N=32)0.80658818.3%
Threshold (threshold=64KB)0.892795219.9%
Exp-size (c=32KB)0.913446622.6%
GDS0.912129922.5%
GDSF0.927099924.5%
Infinite Cache0.979538831.5%

Best Filter parameter: N = 32


Best Size Threshold parameter: Threshold = 64 KB


Best Exp-size parameter: c = 32 KB

Happy to take input what to simulate next. Each simulation run takes about 48h.

  • 128 GB front mem cache? what's the actual size of cp4006?
  • disk cache (750 GB)?
  • more caching policies (I have a ton more, and I'm willing to implement more)?

I can share the simulator code, if there's interest in it. It's just a bit messy and I'd like to clean that up before putting it up on github. I'm happy to mail an archive with the current version, though.

I'm happy to share a vmod for Filter and Exp-size (for Varnish 4.1). Simple implementation (using probabilities instead of bloom filters, no memory overhead, CPU overhead is a few instructions).

We also have a self-tuning variant of the Exp-size policy (so every cache finds the optimal c by its own and adapts it over time). This has some overhead, though, and my current implementation makes a few changes to Varnish code. If there is interest, I'm happy to try it.

ema moved this task from Triage to Caching on the Traffic board.Sep 30 2016, 2:29 PM
BBlack added a comment.Nov 1 2016, 6:26 PM

@Danielsberger Somehow I missed these updates a month ago!

In the meantime, what we did put in place for cache_upload a while back (based on rough guesswork and tuning) was a combination of:

  • N=5 (admit after 5 requests, using our backends' headers to find the limit, so the parameter is shared between all frontends in the same DC)
  • Size=256KB (keep in mind the bulk of our caches have ~100GB of frontend storage currently)

Both of these helped, and we're currently seeing an average frontend hitrate of ~85.5%, so that roughly jives with the ballpark of your numbers. We're combining the good effects of two filters here, but also using slightly more permissive filters than what the graphs show as optimal, and there's more memory available.

The exp(-size/c) filter sounds really interesting, as it could be fairly low-complexity way to see some good gains. We might be able to crop out some of its edge cases too, by combining it with some fairly permissive pre-filtering (e.g. N=3 + Size=4MB) and then letting it take over the complex tradeoffs in the middle ground.

Could you run your same simulation as above, but at 100G and 200G for total storage size? I'm curious what the shape of the change is (both in terms of achievable hitrate, and in how the optimal parameters vary).

Ok, here are the new results for cache sizes between 50GB and 400GB. For now, I only looked at the Filter and Exp admission policies.

Disclaimer: The two week memory trace might be too short to trust the large-cache results (>256GB) results.

Filter (N-request admission)

Hit ratio vs Filter parameter N (color: different cache sizes):

Optimal Filter parameter N over cache size (log-size scale):

Exp (admission probability exponentially decreasing with size)

Hit ratio vs Filter parameter c (color: different cache sizes):

Optimal Filter parameter c over cache size (log-log scale):

A simple linear fit on this log-log scale

OPT_c_param = size^rate/2^base

gives us the following parameters

rate = 0.9064392 
base = -18.16135

My simulator code is now at github/dasebe/webcachesim.
To reproduce these plots

  • get and compile the simulator
  • compile the wmf trace parser and parse a few days from T128132
g++ -o rewrite_trace_wmf -std=gnu++11 helpers/rewrite_trace_wmf.cc
./rewrite_trace trace.txt cache.07.01.txt cache.07.02.txt ...
  • run the simulator (e.g., using gnu parallel and using this script )
  • use your favorite plot tool to reproduce the plots (e.g., R, libraries "data.table" and "ggplot2", using this script )
  • if anything goes wrong, feel free to create an issue on github
Joe added a subscriber: Joe.Nov 7 2016, 11:28 AM

Here's some VCL/inline c that implements the exp-size amission policy for Varnish.

Here's a graph of the hit ratio for various cache sizes.

It seems to me that this tells us: investing in a better admission policy pays off more than accurately sizing the mallocs to squeeze out capacity.

BBlack added a comment.Nov 8 2016, 6:07 PM

@Danielsberger - Fascinating stuff, thanks so much for running all of this data! We're kind of swamped in various things right now (varnish4 transition and various nginx/openssl changes), but I think maybe next week we can start rolling out some changes that will let us do some testing on these policies and tunables.

elukey added a subscriber: elukey.Dec 6 2016, 3:29 PM

Change 379512 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: Exp cache admission policy for varnish-be

https://gerrit.wikimedia.org/r/379512

Change 386192 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: Exp cache admission policy for varnish-fe

https://gerrit.wikimedia.org/r/386192

Change 379512 abandoned by Ema:
VCL: Exp cache admission policy for varnish-be

https://gerrit.wikimedia.org/r/379512

Change 386192 merged by Ema:
[operations/puppet@production] VCL: Exp cache admission policy for varnish-fe

https://gerrit.wikimedia.org/r/386192

Nuria added a subscriber: Nuria.Nov 7 2017, 3:16 AM
Nuria added a comment.Nov 7 2017, 3:19 AM

This ticket should be a talk, really.

Legoktm added a subscriber: Legoktm.Nov 7 2017, 3:25 AM

Change 393227 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] WIP: cache: size-based cutoff for exp caching policy

https://gerrit.wikimedia.org/r/393227

Change 476311 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: stop using nhw admission policy

https://gerrit.wikimedia.org/r/476311

Change 476311 merged by Ema:
[operations/puppet@production] cache: stop using nhw admission policy

https://gerrit.wikimedia.org/r/476311

Change 477424 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Revert "cache: stop using nhw admission policy"

https://gerrit.wikimedia.org/r/477424

Change 477424 merged by BBlack:
[operations/puppet@production] Revert "cache: stop using nhw admission policy"

https://gerrit.wikimedia.org/r/477424

Change 477573 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_upload: hfp on frontends for large objects except for exp

https://gerrit.wikimedia.org/r/477573

Change 477574 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: stop using nhw admission policy

https://gerrit.wikimedia.org/r/477574

Change 477573 merged by Ema:
[operations/puppet@production] cache_upload: hfp on frontends for large objects except for exp

https://gerrit.wikimedia.org/r/477573

Change 477574 merged by Ema:
[operations/puppet@production] cache: stop using nhw admission policy

https://gerrit.wikimedia.org/r/477574

1a1a11a added a subscriber: 1a1a11a.Dec 5 2019, 9:43 PM

Change 393227 abandoned by Ema:
WIP: vcl: size-based cutoff for exp caching policy

Reason:
CR expired. Partial implementation, we never got around to seriously work on the exp admission policy.

https://gerrit.wikimedia.org/r/393227

Change 588650 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: remove n-hit-wonder admission policy

https://gerrit.wikimedia.org/r/588650

Change 588650 merged by Ema:
[operations/puppet@production] vcl: remove n-hit-wonder admission policy

https://gerrit.wikimedia.org/r/588650

Change 588945 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: introduce wm_admission_policies

https://gerrit.wikimedia.org/r/588945

Change 589341 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: move 'exp' admission policy to wm_admission_policies

https://gerrit.wikimedia.org/r/589341

Change 589342 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: 10M cutoff for the 'exp' admission policy

https://gerrit.wikimedia.org/r/589342

Change 588945 merged by Ema:
[operations/puppet@production] vcl: introduce wm_admission_policies

https://gerrit.wikimedia.org/r/588945

Change 589341 merged by Ema:
[operations/puppet@production] vcl: move 'exp' admission policy to wm_admission_policies

https://gerrit.wikimedia.org/r/589341

Change 589342 merged by Ema:
[operations/puppet@production] vcl: 10M cutoff for the 'exp' admission policy

https://gerrit.wikimedia.org/r/589342

Change 594126 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: pass fe_mem_gb to vcl_config

https://gerrit.wikimedia.org/r/594126

Change 594144 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: test 'exp' admission policy on two nodes

https://gerrit.wikimedia.org/r/594144

Change 594126 merged by Ema:
[operations/puppet@production] vcl: pass fe_mem_gb to vcl_config

https://gerrit.wikimedia.org/r/594126

Change 594144 merged by Ema:
[operations/puppet@production] vcl: test 'exp' admission policy on two nodes

https://gerrit.wikimedia.org/r/594144

Mentioned in SAL (#wikimedia-operations) [2020-05-05T08:19:12Z] <ema> cp2027 and cp2029 (both text): varnish-fe restart to clear cache and evaluate 'exp' admission policy T144187 T249809

Mentioned in SAL (#wikimedia-operations) [2020-05-05T08:27:32Z] <ema> cp2028 and cp2030 (both upload): varnish-fe restart to clear cache and evaluate 'exp' admission policy T144187 T249809

SnorriN removed a subscriber: SnorriN.Tue, May 5, 8:38 AM

Change 594896 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: set large_objects_cutoff in hiera

https://gerrit.wikimedia.org/r/594896

Change 594923 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "vcl: test 'exp' admission policy on two nodes"

https://gerrit.wikimedia.org/r/594923

Change 594923 merged by Ema:
[operations/puppet@production] Revert "vcl: test 'exp' admission policy on two nodes"

https://gerrit.wikimedia.org/r/594923

ema added a comment.Thu, May 7, 2:22 PM

I have tested the exp admission policy based on probability exponentially decreasing with object size on cp2027 (cache_text) and cp2028 (cache_upload) for about 48 hours. The experiment lasted between 2020-05-05T08:20 and 2020-05-07T10:30. The values used for rate and base were 0.9064392 and -18.16135 as calculated by simulation using cache_upload data: T144187#2775914. Given the amount of memory assigned to cp2027/cp2028, that gives adm_param ~ 64894.87. I think it's interesting to look at three different angles:

  • Frontend hit: responses with cache-status: hit-front divided by all responses excluding int-front (which is generally noise like TLS redirects). Note that this is different from what we usually define as "True Hitrate", which ignores pass too.
  • Cache usage in GB.
  • Transient memory usage in MB (reflecting mostly hfp/hfm).

Here I'd like to look at how the exp policy did in comparison to our straightforward "static" policy based on object size (we cache objects smaller than N bytes, with N=262144).
To make sure we compare apples with apples I have restarted varnish-fe on the instances were the experiment was running so that we start with an empty cache, and I have done the same on another instance in the same cluster using the "static" policy to compare against. The cache-hosts-comparison dashboard helps, I hope, visualizing the differences between the two cache admission policies.

cache_text

Frontend hit

Better on cp2029 (static) compared to cp2027 (exp) during the cache warm-up period (about +4%). Once the caches started filling up both policies gave comparable results, with static still doing better (about +0.6%).

Cache usage


Lower on exp than on static, ~47G vs 50G.

Transient memory
Much higher on exp than on static due to the higher frequency of hit-for-pass, up to ~1000M vs ~40M.

cache_upload

Frontend hit


Hitrate results are very different compared to text: here we had better results on cp2028 (exp) compared to cp2030 (static) during the cache warm-up period (about 3%). Once the caches start filling up static outperforms exp (also 3%).

Cache usage


Much lower on exp, 100G vs 200G give or take.

Transient memory


Much higher on exp, ~1.5G (spikes excluded) vs ~120M due to the higher frequency of hit-for-miss.

Conclusions

With the rate and base settings used during the experiment, the exp policy caches less than static on both text and upload, explaining the differences in frontend hitrate, and the significant differences in cache/transient memory usage. We should make the parameters configurable in hiera and further tune the policy till we reach at least the same performance of static, and hopefully improving it.

Change 594896 merged by Ema:
[operations/puppet@production] vcl: set large_objects_cutoff in hiera

https://gerrit.wikimedia.org/r/594896

Change 594969 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: move exp admission policy settings to hiera

https://gerrit.wikimedia.org/r/594969

Change 596408 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: test 'exp' admission policy on cp2028

https://gerrit.wikimedia.org/r/596408