Better handling for one-hit-wonder objects
Open, MediumPublic
Actions

Assigned To

None

Authored By

	BBlack
	Aug 29 2016, 3:03 PM

Description

One of the sources of cache hit-rate inefficiency is one-hit-wonder objects. These are objects which are cacheable and get cached on usage, but tend to only get hit once in a blue moon (or once ever). They're still going to evict some other object based on LRU in a tight cache (especially our frontends), and that's usually the wrong choice. The hard part is "predicting" that a missed object will end up being a one-hit-wonder.

The first and most-general approach to this, that we've discussed (mostly offline from phab) in the past is to employ a Bloom Filter. The filter itself would be a fairly large memory object, but small compared to the actual dataset, and would allow us to tell the difference between "first miss ever" and "a miss on an object we've missed before", and effectively prevent caching an object the first time it's ever accessed (waiting for the second access to cache it). I first stumbled on this idea in: https://people.cs.umass.edu/~ramesh/Site/HOME_files/CCRpaper_1.pdf . @ema found this that's probably relevant too on Cuckoo Bloom filters being a better implementation choice: https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf .

I had a new idea today, though:

While the Bloom type approach is more general and works for a single cache layer, we already have additional metadata available to us in our frontends (the smallest caches, where this matters most) because of our layered caches. All requests passing through our frontends inevitably go through one or more backend cache layers with much larger storage size. We communicate the miss/hit status from the backend to the frontend on response in the X-Cache-Int header. Therefore we could potentially implement logic in our frontend VCL of the form "If the miss-fetch from the backend indicates the backend also missed, do not cache the object in the frontend". On the second fetch, it would be a hit in the backend (from the earlier miss->fetch->store there) and then also get cached in the frontend as a twice-seen object. This has an upside versus a local bloom filter in each frontend, too: the frontends are effectively sharing (at no cost) the twice-seen dataset (so if it's seen for the first time ever in Frontend1, but Frontend2 had seen it recently, it wouldn't get counted as one-hit-wonder, like it would with local Bloom filters).

There are pragmatic bits and pieces to sort out for this idea:

Ideally, we should hook in the notion of varnishd uptime, so that the backend-miss-based one-hit-wonder filter doesn't kick in until a given cache daemon has been online and serving requests for longer than X minutes, to avoid doubling the request volume to the backend when warming up a cold cache. Uptime alone doesn't cover all such cases (e.g. depooled frontends...), but it would cover a lot of the common ones.
How exactly in v3 or v4 do we make a given one-hit-wonder request act in every way like a normal cacheable response but not touch the cache storage? The answer is probably obj.ttl=0s + beresp.uncacheable and similar, but we need to be sure.

This would probably be superior to local Bloom for frontends, and about the same as local Bloom for intermediate backends where chashing is already in effect. This approach doesn't work for backend-most backends which talk directly to the applayer; they'd need a real filter to solve their own one-hit-wonder problem.

Details

Subject	Repo	Branch	Lines +/-
cache: use 'exp' admission policy on cache_upload	operations/puppet	production	+13 -25
cache: enable exp caching policy on cp5001	operations/puppet	production	+10 -0
cache: enable exp caching policy on cp5001	operations/puppet	production	+10 -0
varnish: add script to test exp policy offline	operations/puppet	production	+18 -0
vcl: test 'exp' admission policy on cp2028	operations/puppet	production	+13 -0
vcl: move exp admission policy settings to hiera	operations/puppet	production	+8 -8
vcl: set large_objects_cutoff in hiera	operations/puppet	production	+3 -9
Revert "vcl: test 'exp' admission policy on two nodes"	operations/puppet	production	+0 -32
vcl: test 'exp' admission policy on two nodes	operations/puppet	production	+32 -0
vcl: pass fe_mem_gb to vcl_config	operations/puppet	production	+13 -12
vcl: 10M cutoff for the 'exp' admission policy	operations/puppet	production	+18 -11
vcl: move 'exp' admission policy to wm_admission_policies	operations/puppet	production	+20 -21
vcl: introduce wm_admission_policies	operations/puppet	production	+19 -26
vcl: remove n-hit-wonder admission policy	operations/puppet	production	+4 -49
WIP: vcl: size-based cutoff for exp caching policy	operations/puppet	production	+18 -6
cache: stop using nhw admission policy	operations/puppet	production	+2 -0
cache_upload: hfp on frontends for large objects except for exp	operations/puppet	production	+4 -1
Revert "cache: stop using nhw admission policy"	operations/puppet	production	+0 -2
cache: stop using nhw admission policy	operations/puppet	production	+2 -0
VCL: Exp cache admission policy for varnish-fe	operations/puppet	production	+142 -8
VCL: Exp cache admission policy for varnish-be	operations/puppet	production	+29 -0
N-hit-wonder: 4-hit, and improve filtering	operations/puppet	production	+5 -10
cache_upload: two-hit-wonder experiment, hit/2+	operations/puppet	production	+21 -0

Related Objects

Mentioned In: T279533: Add exp cache admission policy parameters to hiera
T275809: cache_upload cache policy + large_objects_cutoff concerns
T274888: cp_upload @ eqsin cascading failures, February 2021
T249809: cache_upload varnish-fe exhausting transient memory
T240503: Publish dataset about Memcached traffic for caching research
T225538: Request for a large request data set for caching research and tuning
T179156: 503 spikes and resulting API slowness starting 18:45 October 26
Mentioned Here: T249809: cache_upload varnish-fe exhausting transient memory
T128132: Compile a request data set for caching research and tuning

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Restricted Application added a project: SRE. · View Herald TranscriptAug 29 2016, 3:03 PM

BBlack updated the task description. (Show Details)Aug 29 2016, 3:04 PM

I have been working for some time with Ramesh on cache admission policies for Akamai workloads, and I can contribute the following takeaways

frequency-based admission - admitting an object after N requests - often works much better when not only checking for one-hit-wonders but also two-hit-wonders etc. I have often seen optimal N values between 4 and 16 - though, I cannot yet say what works best for WMF's workloads. (N=1 is easy to implement when one chooses Bloom filters, but N>1 requires more work as "counting" Bloom filters are a lot harder to manage.)
a new idea that simplifies frequency-based admission is to use probabilistic admission: admit with probability 1/N. In expectation (geometric distribution), this means that objects would get admitted after N requests without any need for bookkeeping (data structures etc). In our experiments, this achieved 98% of the hit ratio of frequency-based admission.
for many CDN caching hierarchies, it makes sense to focus the front/memory cache on maximizing the object hit ratio (OHR) whereas the back/disk cache focuses on the byte hit ratio (BHR). This essentially means that the front/memory cache needs to cache mostly small objects, which significantly decreases the eviction volume so that it serves a higher fraction of requests (but possibly fewer bytes). For the back/disk cache, this has the advantage that there are fewer random reads and that the average requests is larger, which additionally increases sequentiality of disk/ssd reads and improves throughput.

Recently, we have been combining ideas 2. and 3. using an admission probability that is larger for small objects and (much) smaller for large objects. This turns out to be very effective but requires tricky parameterization. We've been experimenting with automatically and continuously adapting the parameter to the request traffic using some math models.
(This worked very well on Akamai production workloads, see a tech report here http://reports-archive.adm.cs.cmu.edu/anon/2016/CMU-CS-16-120.pdf)

I'm happy to run a few simulations to compare some alternatives once T128132 is ready.

I'd really like to include BBlack's new idea of exploiting the backend X-Cache, but I feel like this would require a huge simulation including all of WMF's traffic for a week or so (to simulate all of the backend). Therefore, I'm not sure how to incorporate this into a comparison - but I'd really like to.
Maybe the only way to do this is to prototype a few variants, deploy and do A/B testing? Other ideas?

(With regard to the second question in this phabricator item's description: yes (obj.ttl=0s + beresp.uncacheable) is correct. We've been experimenting a lot with Varnish + admission control, and that's what worked very well for us.)

OHR	#reqs served by cache / total #reqs
BHR	#bytes server by cache / total #bytes requested

ArielGlenn subscribed.Aug 31 2016, 8:45 AM

In T144187#2597553, @Danielsberger wrote:

frequency-based admission - admitting an object after N requests - often works much better when not only checking for one-hit-wonders but also two-hit-wonders etc. I have often seen optimal N values between 4 and 16 - though, I cannot yet say what works best for WMF's workloads. (N=1 is easy to implement when one chooses Bloom filters, but N>1 requires more work as "counting" Bloom filters are a lot harder to manage.)

Interesting! I wouldn't have expected such high numbers. In any case, our X-Cache data happens to also count hits, so we could extend it to "only cache in the frontend when the backend reports hit/4 or higher" and similar, experimentally.

for many CDN caching hierarchies, it makes sense to focus the front/memory cache on maximizing the object hit ratio (OHR) whereas the back/disk cache focuses on the byte hit ratio (BHR). This essentially means that the front/memory cache needs to cache mostly small objects, which significantly decreases the eviction volume so that it serves a higher fraction of requests (but possibly fewer bytes). For the back/disk cache, this has the advantage that there are fewer random reads and that the average requests is larger, which additionally increases sequentiality of disk/ssd reads and improves throughput.

We already have a very naive and static version of this on the upload (image) cluster - we just have a conditional in place that refuses to cache objects in the frontend if they're >= X size. X was traditionally 32MB. I raised it to 50MB a while back, and we experimentally observed that it didn't help OHR at all (as I guess we should've expected, but we have so little data about our dataset, it's sometimes nice to verify!). @ema and I were just talking yesterday about going back the other direction now and experimentally lowering it until we see a positive impact on frontend OHR, but we're in the midst of other changes right now, so it's put off for a better window to see results clearer.

mark subscribed.Aug 31 2016, 12:18 PM

Tying the X-Cache idea together with the tuneable hitcount and size from the paper, we could look at something of the form if (hc / (os * X) < Y) { uncacheable; }, where hc is backend hitcount, os is object size, and then X and Y are our tunables. Probably slightly more magic than that on the object-size front, but that would be the general idea.

Change 308995 had a related patch set uploaded (by BBlack):
cache_upload: one-hit-wonder experiment, hit/2

https://gerrit.wikimedia.org/r/308995

gerritbot added a project: Patch-For-Review.Sep 7 2016, 2:11 PM

fgiunchedi subscribed.Sep 7 2016, 2:14 PM

fgiunchedi unsubscribed.

SnorriN subscribed.Sep 12 2016, 1:30 PM

Change 308995 merged by BBlack:
cache_upload: two-hit-wonder experiment, hit/2

https://gerrit.wikimedia.org/r/308995

Change 311726 had a related patch set uploaded (by BBlack):
N-hit-wonder: 4-hit, and improve filtering

https://gerrit.wikimedia.org/r/311726

Change 311726 merged by BBlack:
N-hit-wonder: 4-hit, and improve filtering

https://gerrit.wikimedia.org/r/311726

I finally got some simulation results.

I'm comparing the following seven caching policies:

Name of Policy	Description of Policy
LRU	Pure least-recently-used eviction (w/o any admission policy) - an approximation of what Varnish does by default
Filter	LRU + frequency-based admission, i.e., "admit after N requests"
Threshold	LRU + size-threshold admission, i.e., "admit if size below threshold
Exp-size	LRU+ probability-size admission, i.e., "admit with probability exp{-size/c}" (mixtures of frequency and size-threshold admission)
GDS	Greedy-dual-size eviction (complicated eviction policy, which serves as a baseline)
GDSF	Greedy-dual-size-frequency eviction (even more complicated eviction policy, which serves as a baseline)
Infinite Cache	Cache with infinite capacity

My simulation assumes:

64 GB cache size (I know that's a bit small compared to what's deployed)
cp4006's front memory cache request traffic from T128132
best parameters for Filter, Threshold, and Exp-size (see below)

Results:

Table:

Name of Policy	hit ratio	improvement over LRU
LRU	0.7448694	0%
Filter(N=32)	0.8065881	8.3%
Threshold (threshold=64KB)	0.rELMO8927952f4407	19.9%
Exp-size (c=32KB)	0.9134466	22.6%
GDS	0.9121299	22.5%
GDSF	0.9270999	24.5%
Infinite Cache	0.9795388	31.5%

Best Filter parameter: N = 32

Filter-sensitivity.png (400×500 px, 8 KB)

Best Size Threshold parameter: Threshold = 64 KB

ThLRU-sensitivity.png (400×500 px, 9 KB)

Best Exp-size parameter: c = 32 KB

ExpLRU-sensitivity.png (400×500 px, 9 KB)

Happy to take input what to simulate next. Each simulation run takes about 48h.

128 GB front mem cache? what's the actual size of cp4006?
disk cache (750 GB)?
more caching policies (I have a ton more, and I'm willing to implement more)?

I can share the simulator code, if there's interest in it. It's just a bit messy and I'd like to clean that up before putting it up on github. I'm happy to mail an archive with the current version, though.

I'm happy to share a vmod for Filter and Exp-size (for Varnish 4.1). Simple implementation (using probabilities instead of bloom filters, no memory overhead, CPU overhead is a few instructions).

We also have a self-tuning variant of the Exp-size policy (so every cache finds the optimal c by its own and adapts it over time). This has some overhead, though, and my current implementation makes a few changes to Varnish code. If there is interest, I'm happy to try it.

• ema moved this task from Backlog to Caching on the Traffic board.Sep 30 2016, 2:29 PM

@Danielsberger Somehow I missed these updates a month ago!

In the meantime, what we did put in place for cache_upload a while back (based on rough guesswork and tuning) was a combination of:

N=5 (admit after 5 requests, using our backends' headers to find the limit, so the parameter is shared between all frontends in the same DC)
Size=256KB (keep in mind the bulk of our caches have ~100GB of frontend storage currently)

Both of these helped, and we're currently seeing an average frontend hitrate of ~85.5%, so that roughly jives with the ballpark of your numbers. We're combining the good effects of two filters here, but also using slightly more permissive filters than what the graphs show as optimal, and there's more memory available.

The exp(-size/c) filter sounds really interesting, as it could be fairly low-complexity way to see some good gains. We might be able to crop out some of its edge cases too, by combining it with some fairly permissive pre-filtering (e.g. N=3 + Size=4MB) and then letting it take over the complex tradeoffs in the middle ground.

Could you run your same simulation as above, but at 100G and 200G for total storage size? I'm curious what the shape of the change is (both in terms of achievable hitrate, and in how the optimal parameters vary).

Ok, here are the new results for cache sizes between 50GB and 400GB. For now, I only looked at the Filter and Exp admission policies.

Disclaimer: The two week memory trace might be too short to trust the large-cache results (>256GB) results.

Filter (N-request admission)

Hit ratio vs Filter parameter N (color: different cache sizes):

Filter-sensitivity.png (400×500 px, 46 KB)

Optimal Filter parameter N over cache size (log-size scale):

Exp (admission probability exponentially decreasing with size)

Hit ratio vs Filter parameter c (color: different cache sizes):

ExpLRU-sensitivity.png (400×500 px, 42 KB)

Optimal Filter parameter c over cache size (log-log scale):

A simple linear fit on this log-log scale

OPT_c_param = size^rate/2^base

gives us the following parameters

rate = 0.9064392 
base = -18.16135

My simulator code is now at github/dasebe/webcachesim.
To reproduce these plots

get and compile the simulator
compile the wmf trace parser and parse a few days from T128132

g++ -o rewrite_trace_wmf -std=gnu++11 helpers/rewrite_trace_wmf.cc
./rewrite_trace trace.txt cache.07.01.txt cache.07.02.txt ...

run the simulator (e.g., using gnu parallel and using this script
runscript.sh952 BDownload
)
use your favorite plot tool to reproduce the plots (e.g., R, libraries "data.table" and "ggplot2", using this script
plotscript.R3 KBDownload
)
if anything goes wrong, feel free to create an issue on github

Joe subscribed.Nov 7 2016, 11:28 AM

Here's some VCL/inline c that implements the exp-size amission policy for Varnish.

Here's a graph of the hit ratio for various cache sizes.

It seems to me that this tells us: investing in a better admission policy pays off more than accurately sizing the mallocs to squeeze out capacity.

@Danielsberger - Fascinating stuff, thanks so much for running all of this data! We're kind of swamped in various things right now (varnish4 transition and various nginx/openssl changes), but I think maybe next week we can start rolling out some changes that will let us do some testing on these policies and tunables.

• GWicke subscribed.Nov 8 2016, 6:42 PM

elukey subscribed.Dec 6 2016, 3:29 PM

Change 379512 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: Exp cache admission policy for varnish-be

https://gerrit.wikimedia.org/r/379512

Change 386192 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: Exp cache admission policy for varnish-fe

https://gerrit.wikimedia.org/r/386192

Change 379512 abandoned by Ema:
VCL: Exp cache admission policy for varnish-be

https://gerrit.wikimedia.org/r/379512

Change 386192 merged by Ema:
[operations/puppet@production] VCL: Exp cache admission policy for varnish-fe

https://gerrit.wikimedia.org/r/386192

BBlack mentioned this in T179156: 503 spikes and resulting API slowness starting 18:45 October 26.Nov 6 2017, 9:37 PM

• Nuria subscribed.Nov 7 2017, 3:16 AM

This ticket should be a talk, really.

Legoktm subscribed.Nov 7 2017, 3:25 AM

Change 393227 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] WIP: cache: size-based cutoff for exp caching policy

https://gerrit.wikimedia.org/r/393227

Krinkle awarded a token.Nov 24 2018, 1:24 AM

Krinkle added a project: Performance-Team (Radar).

Change 476311 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: stop using nhw admission policy

https://gerrit.wikimedia.org/r/476311

Change 476311 merged by Ema:
[operations/puppet@production] cache: stop using nhw admission policy

https://gerrit.wikimedia.org/r/476311

Change 477424 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Revert "cache: stop using nhw admission policy"

https://gerrit.wikimedia.org/r/477424

Change 477424 merged by BBlack:
[operations/puppet@production] Revert "cache: stop using nhw admission policy"

https://gerrit.wikimedia.org/r/477424

Change 477573 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_upload: hfp on frontends for large objects except for exp

https://gerrit.wikimedia.org/r/477573

Change 477574 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: stop using nhw admission policy

https://gerrit.wikimedia.org/r/477574

Change 477573 merged by Ema:
[operations/puppet@production] cache_upload: hfp on frontends for large objects except for exp

https://gerrit.wikimedia.org/r/477573

Change 477574 merged by Ema:
[operations/puppet@production] cache: stop using nhw admission policy

https://gerrit.wikimedia.org/r/477574

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.Jan 21 2019, 2:09 AM

• Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board.Jan 26 2019, 8:48 PM

Danielsberger mentioned this in T225538: Request for a large request data set for caching research and tuning.Jun 11 2019, 4:23 PM

lexnasser subscribed.Sep 30 2019, 9:19 PM

1a1a11a subscribed.Dec 5 2019, 9:43 PM

1a1a11a mentioned this in T240503: Publish dataset about Memcached traffic for caching research.Dec 11 2019, 8:46 PM

Change 393227 abandoned by Ema:
WIP: vcl: size-based cutoff for exp caching policy

Reason:
CR expired. Partial implementation, we never got around to seriously work on the exp admission policy.

https://gerrit.wikimedia.org/r/393227

Change 588650 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: remove n-hit-wonder admission policy

https://gerrit.wikimedia.org/r/588650

Change 588650 merged by Ema:
[operations/puppet@production] vcl: remove n-hit-wonder admission policy

https://gerrit.wikimedia.org/r/588650

Change 588945 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: introduce wm_admission_policies

https://gerrit.wikimedia.org/r/588945

Change 589341 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: move 'exp' admission policy to wm_admission_policies

https://gerrit.wikimedia.org/r/589341

Change 589342 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: 10M cutoff for the 'exp' admission policy

https://gerrit.wikimedia.org/r/589342

Change 588945 merged by Ema:
[operations/puppet@production] vcl: introduce wm_admission_policies

https://gerrit.wikimedia.org/r/588945

Change 589341 merged by Ema:
[operations/puppet@production] vcl: move 'exp' admission policy to wm_admission_policies

https://gerrit.wikimedia.org/r/589341

Change 589342 merged by Ema:
[operations/puppet@production] vcl: 10M cutoff for the 'exp' admission policy

https://gerrit.wikimedia.org/r/589342

Change 594126 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: pass fe_mem_gb to vcl_config

https://gerrit.wikimedia.org/r/594126

Change 594144 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: test 'exp' admission policy on two nodes

https://gerrit.wikimedia.org/r/594144

Change 594126 merged by Ema:
[operations/puppet@production] vcl: pass fe_mem_gb to vcl_config

https://gerrit.wikimedia.org/r/594126

Change 594144 merged by Ema:
[operations/puppet@production] vcl: test 'exp' admission policy on two nodes

https://gerrit.wikimedia.org/r/594144

Mentioned in SAL (#wikimedia-operations) [2020-05-05T08:19:12Z] <ema> cp2027 and cp2029 (both text): varnish-fe restart to clear cache and evaluate 'exp' admission policy T144187 T249809

Mentioned in SAL (#wikimedia-operations) [2020-05-05T08:27:32Z] <ema> cp2028 and cp2030 (both upload): varnish-fe restart to clear cache and evaluate 'exp' admission policy T144187 T249809

SnorriN unsubscribed.May 5 2020, 8:38 AM

Change 594896 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: set large_objects_cutoff in hiera

https://gerrit.wikimedia.org/r/594896

Change 594923 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "vcl: test 'exp' admission policy on two nodes"

https://gerrit.wikimedia.org/r/594923

Change 594923 merged by Ema:
[operations/puppet@production] Revert "vcl: test 'exp' admission policy on two nodes"

https://gerrit.wikimedia.org/r/594923

I have tested the exp admission policy based on probability exponentially decreasing with object size on cp2027 (cache_text) and cp2028 (cache_upload) for about 48 hours. The experiment lasted between 2020-05-05T08:20 and 2020-05-07T10:30. The values used for rate and base were 0.9064392 and -18.16135 as calculated by simulation using cache_upload data: T144187#2775914. Given the amount of memory assigned to cp2027/cp2028, that gives adm_param ~ 64894.87. I think it's interesting to look at three different angles:

Frontend hit: responses with cache-status: hit-front divided by all responses excluding int-front (which is generally noise like TLS redirects). Note that this is different from what we usually define as "True Hitrate", which ignores pass too.
Cache usage in GB.
Transient memory usage in MB (reflecting mostly hfp/hfm).

Here I'd like to look at how the exp policy did in comparison to our straightforward "static" policy based on object size (we cache objects smaller than N bytes, with N=262144).
To make sure we compare apples with apples I have restarted varnish-fe on the instances were the experiment was running so that we start with an empty cache, and I have done the same on another instance in the same cluster using the "static" policy to compare against. The cache-hosts-comparison dashboard helps, I hope, visualizing the differences between the two cache admission policies.

cache_text

Frontend hit

Screenshot from 2020-05-07 15-55-35.png (1×2 px, 195 KB)

Better on cp2029 (static) compared to cp2027 (exp) during the cache warm-up period (about +4%). Once the caches started filling up both policies gave comparable results, with static still doing better (about +0.6%).

Cache usage

Screenshot from 2020-05-07 16-00-04.png (1×2 px, 215 KB)

Lower on exp than on static, ~47G vs 50G.

Transient memory
Much higher on exp than on static due to the higher frequency of hit-for-pass, up to ~1000M vs ~40M.

Screenshot from 2020-05-07 16-02-22.png (1×2 px, 234 KB)

cache_upload

Frontend hit

Screenshot from 2020-05-07 16-06-23.png (1×2 px, 183 KB)

Hitrate results are very different compared to text: here we had better results on cp2028 (exp) compared to cp2030 (static) during the cache warm-up period (about 3%). Once the caches start filling up static outperforms exp (also 3%).

Cache usage

Screenshot from 2020-05-07 16-09-34.png (1×2 px, 210 KB)

Much lower on exp, 100G vs 200G give or take.

Transient memory

Screenshot from 2020-05-07 16-10-40.png (1×2 px, 330 KB)

Much higher on exp, ~1.5G (spikes excluded) vs ~120M due to the higher frequency of hit-for-miss.

Conclusions

With the rate and base settings used during the experiment, the exp policy caches less than static on both text and upload, explaining the differences in frontend hitrate, and the significant differences in cache/transient memory usage. We should make the parameters configurable in hiera and further tune the policy till we reach at least the same performance of static, and hopefully improving it.

Change 594896 merged by Ema:
[operations/puppet@production] vcl: set large_objects_cutoff in hiera

https://gerrit.wikimedia.org/r/594896

Change 594969 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: move exp admission policy settings to hiera

https://gerrit.wikimedia.org/r/594969

Change 596408 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: test 'exp' admission policy on cp2028

https://gerrit.wikimedia.org/r/596408

CDanis subscribed.Jul 23 2020, 10:18 PM

Ladsgroup awarded a token.Feb 16 2021, 3:53 PM

Ladsgroup subscribed.

jbond subscribed.Feb 16 2021, 3:54 PM

BBlack mentioned this in T274888: cp_upload @ eqsin cascading failures, February 2021.Feb 25 2021, 4:14 PM

BBlack mentioned this in T275809: cache_upload cache policy + large_objects_cutoff concerns.Feb 25 2021, 7:04 PM

• ema mentioned this in T279533: Add exp cache admission policy parameters to hiera.Apr 7 2021, 12:46 PM

Change 677814 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] varnish: add script to test exp policy offline

https://gerrit.wikimedia.org/r/677814

Change 677814 merged by Ema:

[operations/puppet@production] varnish: add script to test exp policy offline

https://gerrit.wikimedia.org/r/677814

Change 677820 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: enable exp caching policy on cp5001

https://gerrit.wikimedia.org/r/677820

Change 677820 merged by Ema:

[operations/puppet@production] cache: enable exp caching policy on cp5001

https://gerrit.wikimedia.org/r/677820

Change 677718 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: enable exp caching policy on cp5001

https://gerrit.wikimedia.org/r/677718

Change 677718 merged by Ema:

[operations/puppet@production] cache: enable exp caching policy on cp5001

https://gerrit.wikimedia.org/r/677718

Change 697538 had a related patch set uploaded (by Ema; author: Ema):

[operations/puppet@production] cache: use 'exp' admission policy on cache_upload

https://gerrit.wikimedia.org/r/697538

Change 697538 merged by Ema:

[operations/puppet@production] cache: use 'exp' admission policy on cache_upload

https://gerrit.wikimedia.org/r/697538

BBlack moved this task from Caching to Icebox-Temp on the Traffic board.Oct 8 2021, 5:29 PM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that aren't are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox . Thank you!

awight awarded a token.Nov 26 2021, 1:54 PM

Fabfur subscribed.May 18 2023, 8:13 AM

Krinkle edited projects, added MediaWiki-Platform-Team (Radar); removed Performance-Team (Radar).Aug 6 2023, 10:18 PM

There are two open patches left: https://gerrit.wikimedia.org/r/c/operations/puppet/+/594969 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/594969

	F31806479: Screenshot from 2020-05-07 16-06-23.png
	May 7 2020, 2:22 PM

	F31806490: Screenshot from 2020-05-07 16-10-40.png
	May 7 2020, 2:22 PM

	F31806484: Screenshot from 2020-05-07 16-09-34.png
	May 7 2020, 2:22 PM

	F31806474: Screenshot from 2020-05-07 16-00-04.png
	May 7 2020, 2:22 PM

	F31806470: Screenshot from 2020-05-07 15-55-35.png
	May 7 2020, 2:22 PM

	F31806476: Screenshot from 2020-05-07 16-02-22.png
	May 7 2020, 2:22 PM

	F4707605: hitratio-cap.png
	Nov 8 2016, 5:59 PM

Better handling for one-hit-wonder objectsOpen, MediumPublicActions

Description

Details

Related Objects

Event Timeline

cache_text

cache_upload

Conclusions

Better handling for one-hit-wonder objects
Open, MediumPublic
Actions