
Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached)
Closed, Resolved · Public

Description

We've seen over and over that a spike in memcached requests causes higher latencies on the application servers.

Some keys are super hot - take for instance WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P244 which gets read about 4k times per second (!!!) - this is the Wikidata item for the Library of Congress.

Mcrouter specifically lets us define a WarmUpRoute that does exactly what we want (at least on paper):

  • Read from the local memcached instance
  • On a miss, read from the shared pool
  • If the data was present in the shared pool, set it back to the local memcached

Of course, we'll have to keep a short TTL (like 10 seconds) on the local instance, but this should reduce the network traffic by a lot, as some of the hottest keys would go from being requested 7k times per second from the remote server down to roughly A * N_servers / TTL requests per second (where A is a factor accounting for cache expunges and misses).
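
As a rough illustration (not the actual puppet-managed config), the route handle would look something like the sketch below, with "onhost" and "mw-shared" as placeholder pool names and exptime being the short local TTL applied when warming the on-host pool:

  {
    "type": "WarmUpRoute",
    "cold": "PoolRoute|onhost",
    "warm": "PoolRoute|mw-shared",
    "exptime": 10
  }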

  • Create a configuration that supports on-host memcached and puppetise it
  • Provide metrics/dashboard for on-host memcached
  • Test on-host memcached functionality and performance
  • Deploy in 10% of each mw* cluster (app, api, jobrunners, parsoid)
  • Deploy to 100%

Event Timeline

jijiki triaged this task as Medium priority. Feb 5 2020, 6:50 PM

The idea is obviously sensible. I do have some concerns about how this will perform with our loaded mw* servers. We could wait to test this after we migrate mw* to buster, and use the newer LRU implementation that @elukey is so excited about.

Change 594760 had a related patch set uploaded (by RLazarus; owner: RLazarus):
[operations/puppet@production] mcrouter_wancache: Add mcrouter support for a machine-local memcached instance

https://gerrit.wikimedia.org/r/594760

The idea is obviously sensible. I do have some concerns about how this will perform with our loaded mw* servers. […]

I'm also worried about the latency it might add unconditionally. I'd be inclined to pursue something more explicit from the application side. But, let's see how this goes.

Related:

  • MediaWiki already has support for tiered caching, which we use strategically in certain places to store something in both APCu+Memc, or Memc+SQL, etc.
  • The use cases for this task seem very similar to that of T248005, which is about having a second server-local store (in addition to APCu) for a longer tail of values that we can't fit in APCu and/or that should survive restarts. We haven't yet decided whether to do it or what backend it would use on the app servers. But wiring it up to a local Memcached seems worth considering at least.

The idea is obviously sensible. I do have some concerns about how this will perform with our loaded mw* servers. […]

I'm also worried about the latency it might add unconditionally. I'd be inclined to pursue something more explicit from the application side. But, let's see how this goes.

{{cn}} (both the comments I quoted above).

I expect exactly the opposite to happen, if any effect on performance can be seen at all. As for the influence on performance in case of overload, I doubt one smallish memcached instance can be responsible for creating issues under load.

Back to the discussion, we're explicitly setting this up as a way to ease the pressure on the memcached servers unconditionally. This is both very fair (LRU doesn't lie, only the hottest keys will remain cached locally) and very dumb (we don't necessarily cache locally only stuff we care about).

If we want to do something smarter in terms of local caching, and let the application control it, it will need constant, accurate tuning so that we don't have cases where a very hot key doesn't get cached locally. It seems to require more effort than our basic approach, and I think it should be considered as an alternative to APCu for most uses in that case (e.g. I would keep etcdconfig in APCu, but local caching of data would be moved to the local memcached instead).

From IRC
<AaronSchulz> I wonder how the backfill logic would work...would it be get+add()?

<AaronSchulz> if it's get()/set(), then it could be a little racey
<AaronSchulz> (in terms of tombstones)

<AaronSchulz> will tombstones go to local servers too?
<AaronSchulz> would this be for all keys?

Tombstones

The DELETE events from WANCache take the form of broadcast SET mw-wan/*/ commands that place a short-lived tombstone. (Note that mw-wan is only used for broadcasting "tombstone" purges; all other traffic is dc-local and requires no coordination.) These are expected to not live longer than ~10s.

If the local memcached instances are lazily populated by storing a copy of incoming reads, then this might break the contract of WANObjectCache's TTLs, as it would mean tombstones can live twice as long if they are read close to expiry and then given another 10-second TTL locally. We would need to somehow exempt these key patterns.

Or if they are populated based on SET traffic, then exempting mw-wan/*/ would suffice since that is only used for these tombstones, nothing else.
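
If we go the config route, the exemption could be expressed with mcrouter's route-prefix aliases: only the default dc-local prefix gets the warm-up tier, while the mw-wan prefix keeps going straight to the shared pool. A rough sketch with placeholder pool names and server addresses, and a deliberately simplified mw-wan route (the real one broadcasts purges to every datacenter):

  {
    "pools": {
      "onhost": { "servers": ["127.0.0.1:11210"] },
      "mw-shared": { "servers": ["10.0.0.1:11211", "10.0.0.2:11211"] }
    },
    "routes": [
      {
        "aliases": ["/eqiad/mw/"],
        "route": {
          "type": "WarmUpRoute",
          "cold": "PoolRoute|onhost",
          "warm": "PoolRoute|mw-shared",
          "exptime": 10
        }
      },
      {
        "aliases": ["/eqiad/mw-wan/"],
        "route": "PoolRoute|mw-shared"
      }
    ]
  }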

Purges

So long as the max TTL of local memcached is in the same ballpark as the purge tombstones and MW's general replag tolerance (around 10 seconds), we don't need to worry about having to replay mw-wan traffic from other servers to each of the local memcaches.

I suspect that the keys that cause trouble are big text/JSON blobs and ParserOutput objects, all of which don't directly get purged. READ_VERIFIED is already used by MultiWriteBagOStuff when deciding whether it is safe to do cache-aside/backfill into lower cache tiers. This could probably be integrated with mcrouter by having such calls use a route prefix that uses a warmup route. A similar flag could be added/used by WANObjectCache for other blob keys that don't receive delete()/touchCheckKey() calls. This would add a lot of I/O resistance without making hard-to-reason-about changes to cache invalidation.

In any case, non-value "sister" keys of WANCache should never use any local cache. One could use mcrouter config-based prefix routing, though that couples the key scheme to the puppet config a lot more. Ideally, WANCache would control that (via route keys based on the sister key type being VALUE or by a value key having READ_VERIFIED).

Doing this from the puppet config alone, for all keys, seems too hard to reason about.

Some keys are super hot - take for instance WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P244 which gets read about 4k times per second (!!!)

This doesn't make sense. This particular usage pattern from Wikibase has caused outages in the past and was given an APCu layer on top (T97368, T209343). Given that we have fewer than 4000 servers, it sounds like this has stopped working? I've filed T254536 to look into this.

Seconding Krinkle. I'll investigate this more. I have lots of ideas on how to improve the memcached read pressure (for example, we sometimes load lots of items from memcached), but I need more metrics on which keys are being accessed too many times, what large data is being sent often, which requests cause lots of reads, etc. If that's not possible, it's fine, just let me know.

The problem we're trying to solve here is not an individual cache key read, or even multiple cache key reads, but more generally to smooth any read swarm we have. For now this is an experiment, and I don't think (based on our past experience) that caching keys with a TTL lower than the max DB lag could really cause consistency issues. I will go further: if this could cause consistency issues, then we're using memcached as a database with some guarantees of consistency and durability, neither of which is true, and we would need to go back and question that.

So I see two possible use-cases for a local memcached:

  • The application-transparent approach proposed here, that lets mcrouter make its own decisions
  • Make mediawiki aware of the availability of a local cache, and use it as a second layer of caching controlled by the application. This would more or less replace what we currently do with APC and would have some advantages and disadvantages over the current situation

I want to underline that the second option would not help us deal with the fact that we send too much traffic to memcached, especially to a few hot keys at a time, a problem that has been going on for years now.

If the local memcached's blind TTL is around the time we tolerate purges to be delayed for, and if it explicitly excludes mw-wan broadcasts and other WANCache-internal keys, then this is probably fine to proceed as an experiment without further input.

  • WANCache:v:* - Fine, these are the bulk of values stored in Memcached.
  • WANCache:[^v]:* - Not fine, but should be minor; these are misc unusual/short-lived keys.
  • /*/mw-wan/ - Not fine, but should be minor; these are WANCache tombstones.
  • Everything else - Unsure, but should be minor; these are dc-local direct uses of BagOStuff without WANCache. The only example that comes to mind is MW's rate limits. Sounds unsafe to split-brain on local app servers?

@aaron Can you confirm the above, and maybe fill in more about the non-WANCache keys?

@Joe Purge delay tolerance is indeed in the ballpark of 10s. When that is exceeded, MW automatically shortens lots of things, e.g. it refuses to store results in higher-level caches or uses drastically shorter TTLs for them, shortens CDN maxage, and eventually throttles edits / goes read-only, etc.

See also https://wikitech.wikimedia.org/wiki/Memcached#WANObjectCache, which I wrote yesterday.

The problem we're trying to solve here is not an individual cache key read, or even multiple cache key reads, but more generally to smooth any read swarm we have.

My comment came across as the exact opposite of what I intended. I just wanted to say that this particular case is indeed a problem (I already have a patch to fix it).

I totally support having a local cache for memcached; in fact, my teammates had to drag me away from slapping an APCu cache on top of every memcached cache we had for the new term store, and the inconsistency wasn't an issue for us AFAIK. If you don't want outdated data (for example, when saving an edit), just don't use the cache or a replica.

I would also be a big fan of abstracting this away from MediaWiki: most multi-level handlers we see in CS (MLFQ, the MMU's handling of the LLC and memory, the modern LRU in memcached) are abstracted away from programmers. Plus, this seems very WMF-specific to me, and I would love to avoid complicating our cache handlers more than they already are.

One idea: internally reduce the TTL (I don't know if that's possible), basically capping it at, for example, one minute: if it's less than that, keep it; if it's more, just reduce it to one minute.
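
For what it's worth, mcrouter itself appears to have a route handle for this: if memory serves, ModifyExptimeRoute with action "min" caps the expiry of requests passing through it (the exact field names should be double-checked against the mcrouter route-handle docs). Wrapping the on-host (cold) side of the warm-up route in it would look roughly like:

  {
    "type": "ModifyExptimeRoute",
    "route": "PoolRoute|onhost",
    "exptime": 60,
    "action": "min"
  }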

Side note: if not already done, I'd double check how the WarmUpRoute behaves when the local memcached is not available for any reason (roll restart, temporary issues from weird bugs, etc.). Ideally the timeout should be very tight, a few ms, and mcrouter should behave as if the GET ended up with a miss. The thing that I am worried about is that mcrouter might use the 1s timeout that we set for the main shards, but I didn't investigate the doc/code a lot, so feel free to discard this if already discussed :)

Change 594760 merged by RLazarus:
[operations/puppet@production] mcrouter_wancache: Add mcrouter support for a machine-local memcached instance

https://gerrit.wikimedia.org/r/594760

Side note: if not already done, I'd double check how the WarmUpRoute behaves when the local memcached is not available for any reason (roll restart, temporary issues from weird bugs, etc.). Ideally the timeout should be very tight, a few ms, and mcrouter should behave as if the GET ended up with a miss. […]

Normally, I'd assume that most local-host memcached failures would result in fast fails, though I can always imagine weird cases. It would be nice to use a very low timeout. Maybe we start mcrouter with ...

--region=<eqiad|codfw> --cluster=<onhost>

Per https://github.com/facebook/mcrouter/wiki/Command-line-options#timeouts , cross-datacenter routes will still use --cross-region-timeout-ms, but the "mw" cluster will then use --cross-cluster-timeout-ms, and "onhost" will use --within-cluster-timeout-ms. We don't really have meaningful "clusters" of memcached now (basically just regions), so this could be how we use them (on-host vs merely same DC). This would require explicit routing prefixes from MW though (even for ObjectCache::getLocalClusterInstance() and non-broadcasted WANCache ops). Does that seem doable?
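
A hypothetical invocation along those lines, just to make the flag mapping concrete (timeout values and paths are illustrative, not recommendations):

  # --within-cluster-timeout-ms would apply to the "onhost" cluster (fail fast),
  # --cross-cluster-timeout-ms to the shared "mw" cluster in the same region,
  # --cross-region-timeout-ms to the remote datacenter.
  mcrouter --config file:/etc/mcrouter/config.json \
    --region=eqiad --cluster=onhost \
    --within-cluster-timeout-ms=20 \
    --cross-cluster-timeout-ms=250 \
    --cross-region-timeout-ms=1000 \
    -p 11213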

jijiki renamed this task from Reduce read pressure on memcached servers by adding a machine-local Memcache instance to Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached). Aug 20 2020, 7:39 PM
jijiki removed RLazarus as the assignee of this task.
jijiki claimed this task.
jijiki updated the task description. (Show Details)

Change 629369 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: enable onhost memcached on mwdeb1001

https://gerrit.wikimedia.org/r/629369

Change 629431 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] WIP canary_appserver: install memcached if use_onhost_memcache

https://gerrit.wikimedia.org/r/629431

Change 629431 abandoned by Effie Mouzeli:
[operations/puppet@production] WIP canary_appserver: install memcached if use_onhost_memcache

Reason:
wrong approach

https://gerrit.wikimedia.org/r/629431

Change 629830 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] WIP mcrouter: install ohhost memcached on MediaWiki servers

https://gerrit.wikimedia.org/r/629830

Change 629830 abandoned by Effie Mouzeli:
[operations/puppet@production] mcrouter: install ohhost memcached on MediaWiki servers

Reason:
merge conflicts

https://gerrit.wikimedia.org/r/629830

Change 630845 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mcrouter: install ohhost memcached on MediaWiki servers

https://gerrit.wikimedia.org/r/630845

Change 629369 abandoned by Effie Mouzeli:
[operations/puppet@production] hieradata: enable onhost memcached on mwdeb1001

Reason:
conflict hell

https://gerrit.wikimedia.org/r/629369

Mentioned in SAL (#wikimedia-operations) [2020-09-30T11:06:08Z] <effie> disable puppet on P:mediawiki::mcrouter_wancache for 630845 - T244340

Change 630845 merged by Effie Mouzeli:
[operations/puppet@production] mcrouter: install ohhost memcached on MediaWiki servers

https://gerrit.wikimedia.org/r/630845

Mentioned in SAL (#wikimedia-operations) [2020-09-30T11:33:30Z] <effie> enable puppet P:mediawiki::mcrouter_wancache for 630845 - T244340

Side note: if not already done, I'd double check how the WarmUpRoute behaves when the local memcached is not available for any reason (roll restart, temporary issues from weird bugs, etc.). […]

Tested in https://phabricator.wikimedia.org/T263958#6512539 :D

If the local memcached's blind TTL is around the time we tolerate purges to be delayed for, and if it explicitly excludes mw-wan broadcasts and other WANCache-internal keys, then this is probably fine to proceed as an experiment without further input. […]

@aaron @Krinkle Since anything mcrouter adds to the onhost cache gets a TTL of 10s (or less), and looking at our findings in T263958, do we have a very serious reason, or specific cases, where having /*/mw use the WarmUpRoute can cause serious problems? If not, or if unsure, we can go ahead and roll this out to the canaries and see what gives. It is a couple of puppet patches away.

ps. I will need to create a proper dashboard for onhost memcached, since the one we already use does not make much sense in this context.

Short summary of the IRC convo: per the docs, that's not compatible with hard requirements. Let's continue exposing the "*/mw-with-onhost-tier/" route on all servers, enable it as needed via wmf-config on the MediaWiki side, either per-wiki or per-appserver, and then gather metrics accordingly.
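
In mcrouter config terms, exposing that route presumably means keeping a second route alias alongside the plain dc-local one, so MediaWiki can opt in per caller simply by choosing the prefix. A hypothetical entry, reusing the placeholder pool names from the earlier sketches (not the actual puppet-managed config):

  {
    "aliases": ["/eqiad/mw-with-onhost-tier/"],
    "route": {
      "type": "WarmUpRoute",
      "cold": "PoolRoute|onhost",
      "warm": "PoolRoute|mw-shared",
      "exptime": 10
    }
  }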

Change 639775 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mcrouter_wancache: tune onhost memcached

https://gerrit.wikimedia.org/r/639775

Change 639775 merged by Effie Mouzeli:
[operations/puppet@production] mcrouter_wancache: tune onhost memcached

https://gerrit.wikimedia.org/r/639775

Change 640401 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: enable onhost memcached on mw1276

https://gerrit.wikimedia.org/r/640401

Mentioned in SAL (#wikimedia-operations) [2020-11-10T12:46:36Z] <effie> depool mw1276 to install onhost memcached - T244340

Change 640401 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: enable onhost memcached on mw1276

https://gerrit.wikimedia.org/r/640401

Change 640440 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: enable onhost memcached on mw1263

https://gerrit.wikimedia.org/r/640440

Change 640440 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: enable onhost memcached on mw1263

https://gerrit.wikimedia.org/r/640440

After merging 78588929801f and running onhost memcached on an API server (mw1276) and an app server (mw1263) for some time, the performance improvement on those two hosts has not been notable, and the same goes for traffic from those hosts towards the memcached cluster. Given, though, that we have seen what onhost memcached can do (https://phabricator.wikimedia.org/T263958#6510350), we can continue to roll out onhost memcached to the canaries, and if there are no issues, roll out to the API and app servers next week.

Change 641816 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: enable onhost memcached on api and app canaries

https://gerrit.wikimedia.org/r/641816

Change 641816 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: enable onhost memcached on api and app canaries

https://gerrit.wikimedia.org/r/641816

Some keys are super hot - take for instance WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P244 which gets read about 4k times per second (!!!) - this is the Wikidata item for the Library of Congress.

Keys like this are wrapped in an APC cache now; see T254536: CacheAwarePropertyInfoStore performs 4000 Memc ops/s (APC not working?).
I guess these keys would have seen some of the largest impact in general.

We have some other areas that could see an improvement from on-host memcached, though I think it would be hard to detect those changes until this is rolled out on a more significant portion of hosts.

Change 643229 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: enable onhost memcached on api and app servers

https://gerrit.wikimedia.org/r/643229

Mentioned in SAL (#wikimedia-operations) [2020-11-24T11:20:32Z] <effie> disable puppet on api and app servers to rollout onhost memcached - T244340

Change 643229 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: enable onhost memcached on api and app servers

https://gerrit.wikimedia.org/r/643229

Mentioned in SAL (#wikimedia-operations) [2020-11-24T11:38:54Z] <effie> rolling depool and pool app and api clusters - T244340

Change 643234 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: enable onhost memcached on jobrunners

https://gerrit.wikimedia.org/r/643234

Change 643235 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] hieradata: enable onhost memcached on parsoid servers

https://gerrit.wikimedia.org/r/643235

Mentioned in SAL (#wikimedia-operations) [2020-12-03T12:10:58Z] <effie> disable puppet on jobrunners and parsoid - T244340

Change 643234 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: enable onhost memcached on jobrunners

https://gerrit.wikimedia.org/r/643234

Change 643235 merged by Effie Mouzeli:
[operations/puppet@production] hieradata: enable onhost memcached on parsoid servers

https://gerrit.wikimedia.org/r/643235

jijiki updated the task description. (Show Details)