
Analyze impact of memcached extstore experiment (MediaWiki Platform Team)
Closed, Resolved (Public)

Description

In T352885#9955405, @Krinkle wrote (5 July 2024):

We discussed this in the MwEng-SvcOps meeting (13 June 2024). Extstore was enabled on a few hosts in the DC. Some stats differed, but we weren't able to come up with a testing strategy to prove or disprove an observable benefit from the MW application in this state, since keys are sharded across all hosts and traffic generally involves many different keys.

We noticed that on the given hosts, contrary to my own expectations, there is a continuous non-zero trickle of evictions of tiny keys (e.g. < 200 bytes). This is surprising, because during research for T278392 and T336004, @tstarling and I found that these slabs were not under pressure, and empirical testing showed that these values reliably persisted for at least a minute in practice. This is worrying, because we've since migrated to mcrouter-primary-dc for short-lived auth/nonce tokens, Rdbms-ChronologyProtector positions, and rate limiter counters.

The good news, and the reason we noticed, is that with extstore enabled, much more space is given to the tiny-value "slab 4", and since then there has been a flat line of zero undue evictions.

Aside from this little happy accident, we found no per-host stats that are reason for concern. I agreed with Effie that rolling it out fully to the primary DC, with the secondary as an A/B-esque control, would be a fine next step. We expect the worst outcome to be "no effect", and hope that there is enough similarity between the DCs' traffic, yet enough separation in our data collection, to notice an improvement here (e.g. via MW statsd->prometheus stats that we have per-DC, using WANCache as a way to measure cache hit ratio on meaningful keys; as well as Apache-level latency numbers such as the Appserver RED dashboards).

In T352885#9986523, @jijiki wrote (16 Jul 2024):

@Krinkle Extstore is fully enabled on eqiad

Event Timeline

Krinkle updated the task description.
Krinkle added a project: MediaWiki-Engineering.
Krinkle triaged this task as Medium priority. Jul 17 2024, 1:41 PM

I'll try to do this next week.

Background reading: https://github.com/memcached/memcached/wiki/Extstore.

Firstly, I've marked in Grafana the start and end of the extstore rollout (it was done over the course of a few hours, since it requires a restart of each Memcached server in the Eqiad datacenter, and restarting clears the cache; that's something we don't want to do everywhere at once). Tagging the range with operations, mediawiki, and performance makes it show up on most of the dashboards I'm checking, as those query at least one of those tags (we generally use match=any).
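
For reproducibility, here's a rough sketch of how such a ranged annotation can be created through Grafana's annotations HTTP API. The URL, token, and timestamps below are placeholders; only the tags mirror the ones mentioned above.

```
# Sketch: create a ranged Grafana annotation covering the rollout window.
# The endpoint and token are placeholders; adjust before use.
import requests

GRAFANA_URL = "https://grafana.wikimedia.org"  # placeholder
API_TOKEN = "..."                              # placeholder service-account token

annotation = {
    "time": 1721127600000,     # rollout start in epoch milliseconds (example value)
    "timeEnd": 1721142000000,  # rollout end in epoch milliseconds (example value)
    "tags": ["operations", "mediawiki", "performance"],
    "text": "memcached extstore rollout in eqiad (T352885)",
}

resp = requests.post(
    f"{GRAFANA_URL}/api/annotations",
    json=annotation,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # the response includes the id of the new annotation
```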

Memcached

On the Memcached dashboard itself, we can see at a more meta level what extstore does. There are far fewer evicted keys (i.e. removals of unpopular keys), as they're getting a second life in extstore.

Screenshot 2024-07-26 at 13.06.16.png (644×1 px, 159 KB)

On the Grafana: Memcached-Slabs dashboard we can see that this corresponds directly with values being written to extstore. Memcached is definitely serving a good portion of cache hits from extstore. Looking at mc1053 as an example, about 6K objects/s are read from extstore, out of the ~60K/s hits.

The question is: what are those keys, does this benefit MediaWiki, and if so, how? Note that while nearly 10% of hits are served from extstore (which sounds pretty amazing!), the overall hit rate has not visibly increased. Thus there must be major increases in some slabs' hit rates, but also decreases in others.
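
As a cross-check on the dashboard numbers, here's a rough sketch that reads the raw counters over memcached's text protocol and computes what share of hits involved an extstore read. The counter names (get_hits, get_extstore) are as I recall them from builds with extstore enabled and may differ per version; they're also cumulative since restart, so for a rate you'd diff two samples.

```
# Sketch: read memcached's plain "stats" output and estimate the fraction of
# cache hits served via extstore. Counter names may vary between versions.
import socket

def memcached_stats(host: str, port: int = 11211) -> dict:
    """Return the `stats` counters as a name -> value dict."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    stats = {}
    for line in buf.decode().splitlines():
        if line.startswith("STAT "):
            _, name, value = line.split(" ", 2)
            stats[name] = value
    return stats

stats = memcached_stats("mc1053.eqiad.wmnet")  # same host as in the example above
hits = int(stats.get("get_hits", 0))
ext_hits = int(stats.get("get_extstore", 0))   # hits that required an extstore read
print(f"extstore share of hits: {ext_hits / hits:.1%}" if hits else "no hits yet")
```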

Screenshot 2024-07-26 at 21.55.20.png (528×952 px, 235 KB) Screenshot 2024-07-26 at 21.55.05.png (532×948 px, 433 KB) Screenshot 2024-07-26 at 21.55.13.png (530×940 px, 203 KB) Screenshot 2024-07-26 at 21.55.29.png (610×2 px, 357 KB)

WANObjectCache

The first thing I'm looking for is whether any WANObjectCache keygroups get notably fewer cache misses after the change. Note that a temporary increase in misses is expected, as the rollout has implicitly wiped existing caches. We're looking for a lasting decrease.

https://grafana.wikimedia.org/d/2Zx07tGZz/wanobjectcache?orgId=1&from=1718798400000&to=1721995200000

We can flip through every keygroup easily by toggling the items in the legend.
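
For a less click-heavy comparison, the same per-keygroup data could also be pulled from the Graphite render API and compared week-over-week. The metric path below is a placeholder assumption; the real target should be copied from the dashboard's panel JSON.

```
# Sketch: compare a keygroup's mean miss rate in the week before and after the
# rollout. TARGET is a placeholder; copy the real metric path from the panel JSON.
import statistics
import requests

GRAPHITE = "https://graphite.wikimedia.org"            # placeholder endpoint
TARGET = "MediaWiki.wanobjectcache.{group}.miss.rate"  # placeholder metric path
ROLLOUT = "20240716"                                   # extstore fully enabled in eqiad

def weekly_mean(group: str, from_: str, until: str) -> float:
    resp = requests.get(
        f"{GRAPHITE}/render",
        params={
            "target": TARGET.format(group=group),
            "from": from_,
            "until": until,
            "format": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    points = [v for v, _ts in resp.json()[0]["datapoints"] if v is not None]
    return statistics.mean(points)

for group in ["CirrusSearchParserOutputPageProperties", "DPLQuery", "backlinks", "numlinks"]:
    before = weekly_mean(group, "20240709", ROLLOUT)
    after = weekly_mean(group, ROLLOUT, "20240723")
    print(f"{group}: miss rate {before:.0f}/s -> {after:.0f}/s ({after / before - 1:+.0%})")
```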

Screenshot 2024-07-26 at 13.11.14.png (764×2 px, 311 KB)

Screenshot 2024-07-26 at 13.11.52.png (780×2 px, 226 KB)

The first keygroup that seems to have a sharp decrease is CirrusSearchParserOutputPageProperties. We'll look at these more closely in a minute. Any other keygroups?

DPLQuery
Screenshot 2024-07-26 at 13.14.35.png (750×2 px, 205 KB)
backlinks
Screenshot 2024-07-26 at 13.15.35.png (764×2 px, 200 KB)
numlinks
Screenshot 2024-07-26 at 13.15.42.png (756×2 px, 196 KB)
ores_threshold_statistic*
Screenshot 2024-07-26 at 13.17.09.png (764×2 px, 236 KB)

Optimistically, these keys might be toward the lower end of the popularity spectrum. Previously, they would get removed to make space for other data. Now, perhaps, some of them are moved out to extstore instead of being evicted, and the ones we need a second time after a while can come back from extstore as a cache hit. Thus, fewer cache misses.

Generally speaking, if a keygroup has a high number of cache hits ("popular"), it will have relatively few cache misses. But this is not always the case. A keygroup could be very popular in aggregate and at the same time consist of almost only unpopular keys. If there is a very large number of unique keys under a keygroup, it may be that none of the individual keys is popular enough to stay in the cache. For example, data that varies by user_id or page_id, but is not used during pageviews, might struggle to get cache hits because the group is too spread out.
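
A contrived illustration of that point (all numbers invented):

```
# Contrived illustration (all numbers invented): two keygroups with the same
# aggregate call rate but very different per-key popularity.
calls_per_day = 1_000_000

groups = {
    "concentrated": 100,    # e.g. a handful of sitewide keys
    "spread_out": 500_000,  # e.g. keys that vary by user_id or page_id
}

for name, unique_keys in groups.items():
    per_key = calls_per_day / unique_keys
    print(f"{name}: ~{per_key:,.0f} calls per key per day")

# concentrated: ~10,000 calls per key per day -> easily stays resident in memory.
# spread_out:   ~2 calls per key per day      -> likely evicted before the second
#               call, so the group sees many misses despite being "popular" in aggregate.
```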

Having a nice drop in cache misses doesn't per se mean that extstore is at work there. Let's look at the details on the dedicated WANObjectCache Key group dashboard, with the backlinks keygroup from above selected:

Screenshot 2024-07-26 at 13.33.28.png (1×2 px, 230 KB)

This is a false positive. The demand for this key changed entirely, probably unrelated to the extstore rollout, since the rollout doesn't change how often MediaWiki "gets" a key. If both hits and misses drop, that means the key is simply requested less often in the first place.
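
The rule of thumb I'm applying here, as a sketch (the threshold is invented):

```
# Sketch: extstore can only turn misses into hits; it cannot change how often
# MediaWiki asks for a key. So compare the call rate (hits + misses) and the
# hit ratio before and after the rollout.
def classify(hits_before, misses_before, hits_after, misses_after, tolerance=0.10):
    calls_before = hits_before + misses_before
    calls_after = hits_after + misses_after
    demand_change = calls_after / calls_before - 1
    ratio_before = hits_before / calls_before
    ratio_after = hits_after / calls_after

    if abs(demand_change) > tolerance:
        # Hits and misses moved together: the callers changed, not the cache.
        return f"demand changed ({demand_change:+.0%}); likely unrelated to extstore"
    if ratio_after > ratio_before:
        return f"hit ratio up {ratio_before:.1%} -> {ratio_after:.1%}; candidate improvement"
    return "no meaningful change"

# The backlinks case above: hits and misses both dropped, so "demand changed".
print(classify(hits_before=900, misses_before=100, hits_after=450, misses_after=50))
```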

WANObjectCache Key group: CirrusSearchParserOutputPageProperties

Screenshot 2024-07-26 at 13.36.48.png (1×1 px, 315 KB)

The CirrusSearchParserOutputPageProperties keygroup looks like it has a clean drop in misses, possibly due to extstore. It also appears to have a slight increase in p75, which could be explained by additional hits from extstore increasing the average Memcached latency very slightly, by 0.1 or 0.2 ms. But there is no increase in cache hits here, so I think this is another false positive.

Screenshot 2024-07-26 at 13.46.29.png (1×2 px, 368 KB)

The ores_threshold_statistics keygroup has a sufficiently low call rate that it feels like the kind of thing extstore should improve. Its misses became more spotty, and its cache hit ratio looks a bit more uniform. But it's all within the standard deviation, so: inconclusive, but plausible?

Other confounding factors:

  • MediaWiki train release (Thursday 11 July, group2 1.43.0-wmf.12->1.43.0-wmf.13) went out during the same time window, and with it numerous code changes, as well as a restart of MediaWiki servers and server-local "apcu" caches, which muddy things a bit. https://www.mediawiki.org/wiki/MediaWiki_1.43/wmf.13
  • Weekly seasonality. Wednesday != Friday. This is accounted for by reviewing the three previous and two following weeks as well.

API latency

Grafana: MediaWiki Action API summary

I added a new "Cumulative load time" panel to this dashboard (executionTiming.sum with scaleToSeconds), which measures how much time is spent in requests during a given second, added up across the cluster. This is similar to CPU time or CPU seconds, except that we measure wall-clock time instead. In practice, this is equal to request rate * mean latency. I added it with the idea that it would be easier to spot small changes that the mean latency might not reflect very prominently. But it seems pointless in the context of this task, because it (obviously) also makes the panel very sensitive to the exact request rate, which changed.
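
As a toy illustration of why the panel is so sensitive to the request rate:

```
# Toy arithmetic for the "Cumulative load time" panel: summing the wall-clock
# time spent in requests per second across the cluster equals rate * mean latency.
request_rate = 2000    # requests/s (example value)
mean_latency = 0.150   # seconds per request (example value)

cumulative_load = request_rate * mean_latency
print(f"{cumulative_load:.0f} request-seconds spent per wall-clock second")
# -> 300, i.e. the equivalent of ~300 requests permanently in flight.
# A 10% drop in mean latency and a 10% drop in request rate look identical here,
# which is why the panel is hard to read across a change in traffic.
```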

Screenshot 2024-07-26 at 13.55.07.png (685×2 px, 371 KB)

Screenshot 2024-07-26 at 13.55.03.png (656×2 px, 346 KB)

The biggest drop in cumulative load came from the cirrusbuilddoc API, and that's largely due to a drop in request rate, which seems unrelated to extstore, unless there's an indirection here where perhaps Cirrus uses the Action API to compute values for a memcached key?

Screenshot 2024-07-26 at 14.14.02.png (1×1 px, 287 KB)

Appserver latency

Grafana: Apache Backend-Timing (eqiad)

No obvious increase in low-latency buckets or decrease in high-latency buckets. If anything, the high-latency buckets seem to have increased slightly.

Screenshot 2024-07-26 at 14.30.47.png (1×2 px, 641 KB)

Screenshot 2024-07-26 at 14.54.44.png (1×2 px, 388 KB)

I also checked:

MicroStash reliability

Memcached's internal metrics show that it evicts significantly fewer cache keys, especially keys with a small value, which now have literally zero evictions. The subset of these small keys that is stored via MediaWiki's MicroStash was meant to have zero evictions already (per T278392 and T336004), so there was potentially an issue here that is now solved. But that's not established yet. For example, the previously evicted keys may have been other keys that happen to be small and unpopular, or MicroStash keys near the end of their short-lived expiry that would not have been accessed again anyway.

To see if extstore improved MicroStash reliability, we can look at fetch errors relating to nonce tokens, rate limiters, and anything else we store via MicroStash in MediaWiki (aka mcrouter-primary-dc).

In Logstash: mediawiki-warnings, we can find warnings about Rdbms ChronologyProtector being unable to find the stored position index:

expected but failed to find position index {cpPosIndex}

The majority are from (seemingly badly-behaving) API clients to kube-mw-api-ext. Ignoring those, and focussing on eqiad only, we're left with the below. Not a decrease in misses. There does appear to be a (potentially unrelated) increase that started a week earlier.

Screenshot 2024-07-26 at 16.54.16.png (1×2 px, 255 KB)
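
For reference, roughly the same filter expressed as a query against the Logstash (OpenSearch) search API. The index pattern and field names are assumptions about the log schema and would need adjusting; the message phrase is the one quoted above.

```
# Sketch: count the ChronologyProtector warnings per day, excluding the noisy
# kube-mw-api-ext clients and limiting to eqiad. Index pattern and field names
# are assumptions; authentication is omitted.
import requests

LOGSTASH = "https://logstash.example.internal:9200"  # placeholder endpoint
query = {
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {"match_phrase": {"message": "expected but failed to find position index"}},
            ],
            "must_not": [
                {"term": {"labels.servergroup": "kube-mw-api-ext"}},  # assumed field name
            ],
            "filter": [
                {"term": {"labels.datacenter": "eqiad"}},             # assumed field name
                {"range": {"@timestamp": {"gte": "now-4w"}}},
            ],
        }
    },
    "aggs": {
        "per_day": {"date_histogram": {"field": "@timestamp", "calendar_interval": "day"}}
    },
}

resp = requests.post(f"{LOGSTASH}/logstash-mediawiki-*/_search", json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["per_day"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```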

In Logstash: mediawiki-warnings, we can also find warnings about CentralAuth nonce tokens:

Expected key {key} not found due to timeout.

Screenshot 2024-07-26 at 16.51.10.png (1×2 px, 253 KB)

These too have not decreased but increased.

Conclusion

In terms of high-level latency or CPU cost, or in terms of mid-level cache performance, I could not find anything conclusive that likely improved due to extstore.

What is readily apparent is that Memcached is factually doing fewer cache evictions, especially for keys with a small value. However, I'm not able to find where MediaWiki makes use of (and benefits from) cache keys served from extstore.

Thank you for the assessment, Timo! It seems like I need to do some more reading and see whether there are other knobs to turn, or whether we should just conclude the experiment here. I will close this task for now so as not to have it lingering, and reopen it if needed.