
Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached)
Open, MediumPublic

Description

We've seen over and over that a spike in memcached requests causes higher latencies on the application servers.

Some keys are super hot - take for instance WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P244 which gets read about 4k times per second (!!!) - this is the wikidata item for the Library of Congress.

Mcrouter specifically allows defining a WarmUpRoute that does exactly what we want (at least on paper):

  • Read from the local memcached instance
  • On a miss, read from the shared pool
  • If the data was present in the shared pool, set it back to the local memcached
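The steps above map onto mcrouter's WarmUpRoute fairly directly. A minimal config sketch (pool names, addresses, and the 10-second expiry here are illustrative, not our actual production config):

```json
{
  "pools": {
    "onhost": { "servers": ["127.0.0.1:11211"] },
    "shared": { "servers": ["10.64.0.1:11211", "10.64.0.2:11211"] }
  },
  "route": {
    "type": "WarmUpRoute",
    "cold": "PoolRoute|onhost",
    "warm": "PoolRoute|shared",
    "exptime": 10
  }
}
```

The "cold" route (the on-host instance) is consulted first; on a miss the "warm" shared pool is read and the result is written back to cold with the short exptime.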

Of course, we'll have to keep a short TTL (say, 10 seconds) on the local instance, but this should reduce network traffic by a lot: some of the hottest keys would go from being requested 7k times per second from the remote server down to about A * N_servers / TTL times per second (where A is a factor accounting for cache expunges/misses).
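As a back-of-the-envelope check of that formula (the fleet size and the miss factor A below are assumed numbers, purely illustrative):

```python
n_servers = 300      # assumption: rough size of the mw* fleet
local_ttl = 10       # seconds: the short on-host TTL
miss_factor = 1.5    # "A": allowance for local evictions/misses (assumed)

reads_before = 7000                                 # req/s for a hot key today
reads_after = miss_factor * n_servers / local_ttl   # each host refills ~1/TTL

print(reads_before, "->", reads_after)  # 7000 -> 45.0
```

Even with generous assumptions, the shared pool goes from thousands of reads per second for that key to a few dozen.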

  • Create a configuration that supports on-host memcached and puppetise it
  • Provide metrics/dashboard for on-host memcached
  • Test on-host memcached functionality and performance
  • Deploy in 10% of each mw* cluster (app, api, jobrunners, parsoid)
  • Deploy to 100%

Event Timeline

Joe created this task.Feb 5 2020, 10:22 AM
jijiki triaged this task as Medium priority.Feb 5 2020, 6:50 PM
jijiki added a comment.Feb 5 2020, 7:15 PM

The idea is obviously sensible. I do have some concerns about how this will perform with our loaded mwservers. We could wait to test this after we migrate mw* to buster, and use the newer LRU implementation that @elukey is so excited about

RLazarus claimed this task.May 6 2020, 5:11 PM

Change 594760 had a related patch set uploaded (by RLazarus; owner: RLazarus):
[operations/puppet@production] mcrouter_wancache: Add mcrouter support for a machine-local memcached instance

https://gerrit.wikimedia.org/r/594760

The idea is obviously sensible. I do have some concerns about how this will perform with our loaded mwservers. […]

I'm also worried about the latency it might add unconditionally. I'd be inclined to pursue something more explicit from the application side. But let's see how this goes.

Related:

  • MediaWiki already has support for tiered caching, which we use strategically in certain places to store something in both APCu+Memc, or Memc+SQL, etc.
  • The use cases for this task seem very similar to those of T248005, which is about having a second server-local store (in addition to APCu) for a longer tail of values that we can't fit in APCu and/or that should survive restarts. We haven't yet decided whether to do it or what backend it would use on the app servers. But wiring it up to a local Memcached seems worth considering at least.
Joe added a comment.May 25 2020, 5:10 AM

The idea is obviously sensible. I do have some concerns about how this will perform with our loaded mwservers. […]

I'm also worried about the latency it might add unconditionally. I'd be inclined to pursue something more explicit from the application side. But let's see how this goes.

{{cn}} (both the comments I quoted above).

I expect exactly the opposite to happen, if any effect on performance can be seen at all. As for the influence on performance under overload, I doubt one smallish memcached instance can be responsible for creating issues under load.

Back to the discussion, we're explicitly setting this up as a way to ease the pressure on the memcached servers unconditionally. This is both very fair (LRU doesn't lie, only the hottest keys will remain cached locally) and very dumb (we don't necessarily cache locally only stuff we care about).

If we want to do something smarter in terms of local caching, and let the application control it, it will need constant, accurate tuning so that we don't have cases where a very hot key doesn't get cached locally. It seems to require more effort than our basic approach, and I think it should be considered as an alternative to APCu for most uses in that case (e.g. I would keep etcdconfig in APCu, but local caching of data would be moved to the local memcached instead).

From IRC
<AaronSchulz> I wonder how the backfill logic would work...would it be get+add()?

<AaronSchulz> if it's get()/set(), then it could be a little racey
<AaronSchulz> (in terms of tombstones)

<AaronSchulz> will tombstones go to local servers too?
<AaronSchulz> would this be for all keys?
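Aaron's get()+add() question can be illustrated with a toy cache (this is a sketch of memcached's add semantics, not MediaWiki or mcrouter code): add() refuses to overwrite a value written concurrently, such as a tombstone that landed between the shared-pool get() and the local backfill, whereas set() would clobber it.

```python
class LocalCache:
    """Toy stand-in for the on-host memcached instance."""
    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value        # unconditional overwrite

    def add(self, key, value):
        if key in self.data:          # memcached "add": fail if key exists
            return False
        self.data[key] = value
        return True

local = LocalCache()
value_from_shared_pool = "rendered blob"

# A purge broadcast writes a tombstone locally before our backfill runs:
local.set("WANCache:v:somekey", "TOMBSTONE")

# A get()/set() backfill would clobber the tombstone; get()+add() does not:
local.add("WANCache:v:somekey", value_from_shared_pool)
print(local.data["WANCache:v:somekey"])  # TOMBSTONE
```

This is the race Aaron is pointing at: with set()-based backfill the tombstone would have been silently replaced by the stale value.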

Tombstones

The DELETE events from WANCache take the form of broadcasted SET mw-wan/*/ commands that place a short-lived tombstone. (Note that mw-wan is only used for broadcasting "tombstone" purges; all other traffic is dc-local and requires no coordination.) These are expected to not live longer than ~10s.

If the local memcacheds are lazily populated by storing a copy of incoming reads, then this might break the contract with WANObjectCache's TTLs: tombstones could live twice as long if they are read close to expiry and then given another local TTL of 10 seconds. We would need to somehow exempt these key patterns.

Or if they are populated based on SET traffic, then exempting mw-wan/*/ would suffice since that is only used for these tombstones, nothing else.
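If the exemption is done at the routing layer, mcrouter's per-routing-prefix config could express it; a sketch (the region name is illustrative and the pools section is elided):

```json
{
  "routes": [
    {
      "aliases": ["/eqiad/mw-wan/"],
      "route": "PoolRoute|shared"
    },
    {
      "aliases": ["/eqiad/mw/"],
      "route": {
        "type": "WarmUpRoute",
        "cold": "PoolRoute|onhost",
        "warm": "PoolRoute|shared",
        "exptime": 10
      }
    }
  ]
}
```

Tombstone broadcasts addressed to the mw-wan prefix go straight to the shared pool and never touch the on-host instance, while ordinary traffic gets the warmup behavior.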

Purges

So long as the max TTL of local memcached is in the same ballpark as the purge tombstones and MW's general replag tolerance (around 10 seconds) then we don't need to worry about having to replay mw-wan traffic from other servers to each of the local memcaches.

aaron added a subscriber: aaron.Jun 4 2020, 9:50 PM

I suspect that the keys that cause trouble are big text/JSON blobs and ParserOutput objects, all of which don't directly get purged. READ_VERIFIED is already used by MultiWriteBagOStuff when deciding whether it is safe to do cache-aside/backfill into lower cache tiers. This could probably be integrated into mcrouter by having such calls use a route prefix that uses a warmup route. A similar flag could be added/used by WANObjectCache for other blob keys that don't receive delete()/touchCheckKey() calls. This would add a lot of I/O resistance without making hard-to-reason-about changes to cache invalidation.

In any case, non-value "sister" keys of WANCache should never use any local cache. One could use mcrouter config-based prefix routing, though that couples the key scheme to the puppet config a lot more. Ideally, WANCache would control that (via route keys based on the sister key type being VALUE or by a value key having READ_VERIFIED).

Doing this from puppet config alone, for all keys, seems too hard to reason about.

Krinkle added a comment.EditedJun 4 2020, 11:37 PM

Some keys are super hot - take for instance WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P244 which gets read about 4k times per second (!!!)

This doesn't make sense. This particular usage pattern from Wikibase has caused outages in the past and was given an APCu layer on top (T97368, T209343). Given we have fewer than 4000 servers, it sounds like this has stopped working? I've filed T254536 to look into this.

Seconding Krinkle. I'll investigate this further. I have lots of ideas on how to improve memcached read pressure (for example, we sometimes load lots of items from memcached at once) but I need more metrics on which keys are being accessed too many times, what large data is being sent often, which requests caused lots of reads, etc. If that's not possible, it's fine, just let me know.

Joe added a comment.Jun 5 2020, 11:27 AM

The problem we're trying to solve here is not an individual cache key read, or even multiple cache key reads, but more generally to smooth out any read swarm we have. For now this is an experiment, and I don't think (based on our past experience) that caching keys with a TTL lower than the db max lag could really cause consistency issues. I will go further: if this could cause consistency issues, then we're using memcached as a database with some guarantees of consistency and durability, neither of which is true, and we need to go back and question that.

So I see two possible use-cases for a local memcached:

  • The application-transparent approach proposed here, that lets mcrouter make its own decisions
  • Make mediawiki aware of the availability of a local cache, and use it as a second layer of caching controlled by the application. This would more or less replace what we currently do with APC and would have some advantages and disadvantages over the current situation

I want to underline that the second option would not help us deal with the fact that we send too much traffic to memcached, especially on a few hot keys at a time, and that this problem has been going on for years now.

Krinkle added a comment.EditedJun 5 2020, 5:25 PM

If the local memcached's blind TTL is around the time we tolerate purges to be delayed for, and if it explicitly excludes mw-wan broadcasts and other WANCache-internal keys, then this is probably fine to proceed as an experiment without further input.

  • WANCache:v:* - Fine, these are the bulk of values stored in Memcached.
  • WANCache:[^v]: - Not fine, but should be minor; these are misc unusual/short-lived keys.
  • /*/mw-wan/ - Not fine, but should be minor; these are WANCache tombstones.
  • Everything else - Unsure, but should be minor; these are dc-local direct uses of BagOStuff without WANCache. The only example that comes to mind is MW's rate limits. Sounds unsafe to split-brain on local app servers?
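The classification above could be expressed as a small predicate; this is a hypothetical helper for discussion only (the function name and structure are illustrative, not MediaWiki code; the prefixes are taken from the list):

```python
import re

def safe_for_onhost_cache(key: str) -> bool:
    """Decide whether a key may be cached on-host, per the classification."""
    if re.match(r"/[^/]*/mw-wan/", key):   # WANCache tombstone broadcasts
        return False
    if key.startswith("WANCache:v:"):      # plain WANCache values: fine
        return True
    if key.startswith("WANCache:"):        # other WANCache sister keys
        return False
    return False                           # direct BagOStuff use: assume unsafe

print(safe_for_onhost_cache(
    "WANCache:v:global:CacheAwarePropertyInfoStore:wikidatawiki:P244"))  # True
print(safe_for_onhost_cache("/eqiad/mw-wan/WANCache:v:foo"))             # False
```

Whether this logic lives in mcrouter prefix routing or in WANCache itself is exactly the open question in the surrounding comments.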

@aaron Can you confirm the above, and maybe fill in more about the non-WANCache?

@Joe Purge delay tolerance is indeed in the ballpark of 10s; if exceeded, MW automatically shortens lots of things, e.g. refuses to store results in higher-level caches or uses drastically shorter TTLs for them, shortens CDN maxage, eventually throttles edits / goes read-only, etc.

See also https://wikitech.wikimedia.org/wiki/Memcached#WANObjectCache, which I wrote yesterday.

The problem we're trying to solve here is not an individual cache key read, or even multiple cache key reads, but more generally to smooth any read swarm we have.

My comment came across as the exact opposite of what I intended. I just wanted to say that that particular case is indeed a problem (I already have a patch to fix it).

I totally support having a local cache for memcached; in fact my teammates had to drag me out of slapping an APCu cache on top of every memcached cache we had for the new term store, and the inconsistency wasn't an issue for us AFAIK. If you don't want outdated data (for example, when saving an edit), just don't use the cache or a replica.

I would also be a big fan of abstracting this away from MediaWiki: most multi-level handlers we see in CS (MLFQ, the MMU's handling of LLC and memory, modern LRU in memcached) are abstracted away from programmers. Plus, this seems very WMF-specific, and I would love to avoid complicating our cache handlers more than they are now.

One idea: internally reduce the TTL (I don't know if that's possible); basically put a cap of, for example, one minute on the TTL. If it's less than that, keep it; if it's more, reduce it to one minute.
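The capping idea is just a clamp; as a sketch (the constant and function name are hypothetical, not an existing memcached or mcrouter option):

```python
ONHOST_MAX_TTL = 60  # cap suggested above: one minute

def capped_ttl(requested_ttl: int) -> int:
    """Keep short TTLs as-is; clamp anything longer to the on-host maximum."""
    return min(requested_ttl, ONHOST_MAX_TTL)

print(capped_ttl(10))     # 10 (shorter than the cap: kept)
print(capped_ttl(86400))  # 60 (a day-long TTL clamped to one minute)
```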

Side note: if not already done, I'd double-check how the WarmUpRoute behaves when the local memcached is not available for any reason (roll restart, temporary issues from weird bugs, etc.). Ideally the timeout should be very tight, a few ms, and mcrouter should behave as if the GET ended in a miss. What I'm worried about is that mcrouter might use the 1s timeout that we set for the main shards, but I didn't investigate the doc/code a lot, so feel free to discard if already discussed :)

Change 594760 merged by RLazarus:
[operations/puppet@production] mcrouter_wancache: Add mcrouter support for a machine-local memcached instance

https://gerrit.wikimedia.org/r/594760

aaron added a comment.EditedJul 23 2020, 4:31 AM

Side note: if not already done, I'd double-check how the WarmUpRoute behaves when the local memcached is not available for any reason (roll restart, temporary issues from weird bugs, etc.). Ideally the timeout should be very tight, a few ms, and mcrouter should behave as if the GET ended in a miss. What I'm worried about is that mcrouter might use the 1s timeout that we set for the main shards, but I didn't investigate the doc/code a lot, so feel free to discard if already discussed :)

Normally, I'd assume that most local-host memcached failures would result in fast-fails, though I can always imagine weird cases. It would be nice to use a very low timeout. Maybe we start mcrouter with ...

--region=<eqiad|codfw> --cluster=<onhost>

Per https://github.com/facebook/mcrouter/wiki/Command-line-options#timeouts , cross-datacenter routes will still use --cross-region-timeout-ms, but the "mw" cluster will then use --cross-cluster-timeout-ms, and "onhost" will use --within-cluster-timeout-ms. We don't really have meaningful "clusters" of memcached now (basically just regions), so this could be how we use them (on-host vs merely same-DC). This would require explicit routing prefixes from MW though (even for ObjectCache::getLocalClusterInstance() and non-broadcasted WANCache ops). Does that seem doable?
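Putting those flags together, a hedged sketch of what the invocation could look like (paths, ports, and timeout values are illustrative, not our puppetised command line):

```shell
# Treat the on-host memcached as its own "cluster" so it gets the tight
# within-cluster timeout, while the shared mc* pool keeps a looser one.
mcrouter \
  --config-file=/etc/mcrouter/config.json \
  --region=eqiad \
  --cluster=onhost \
  --within-cluster-timeout-ms=10 \
  --cross-cluster-timeout-ms=250 \
  --cross-region-timeout-ms=1000
```

With this split, a hung local instance would fast-fail in ~10 ms and fall through to the shared pool, instead of stalling for the 1s shard timeout elukey was worried about.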

jijiki moved this task from Incoming 🐫 to Unsorted on the serviceops board.Aug 17 2020, 11:46 PM
jijiki renamed this task from Reduce read pressure on memcached servers by adding a machine-local Memcache instance to Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached).Thu, Aug 20, 7:39 PM
jijiki removed RLazarus as the assignee of this task.
jijiki claimed this task.
jijiki updated the task description. (Show Details)
jijiki moved this task from Inbox 🐅 to Next up 🥌 on the User-jijiki board.Tue, Sep 8, 11:19 AM
jijiki moved this task from Next up 🥌 to In Progress on the User-jijiki board.Wed, Sep 16, 9:09 AM