
Use a multi-dc aware store for ObjectCache's MainStash if needed.
Open, Needs Triage, Public

Description

As evidenced during the investigation of T211721, we don't just write session data to the sessions redis cluster; we also write data from everything in MediaWiki that uses MediaWikiServices::getInstance()->getMainObjectStash() to the local redis cluster.
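For context, a typical MainStash interaction looks roughly like this (a minimal sketch; the component name, key, and TTL are illustrative):

```php
use MediaWiki\MediaWikiServices;

// The main stash is a BagOStuff; at WMF it is currently backed by redis.
$stash = MediaWikiServices::getInstance()->getMainObjectStash();
$key = $stash->makeKey( 'some-component', 'some-state' ); // illustrative key
$stash->set( $key, 'some value', BagOStuff::TTL_DAY ); // write goes to the local redis cluster
$value = $stash->get( $key );
```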

This breaks in a multi-dc setup for a number of reasons, first and foremost that we replicate redis data from the master DC to the others, but not the other way around, as redis doesn't support multi-master replication.

To fix the behaviour of the software in a multi-dc scenario, I see the following possibilities, depending on what type of storage guarantees we want to have:

  1. If we don't need data to be consistent cross-DC: after we migrate the session data to its own datastore, we turn off replication and leave the two clusters operating separately.
  2. If we need cross-DC consistency, but not guaranteed persistence: we can migrate the stash to mcrouter.
  3. If we need all of the above, plus persistence: we might need to migrate the stash to the same (or a similar) service as the one we will use for sessions.

I would very much prefer to be able to avoid the last option, if at all possible.

Event Timeline

Joe created this task. Dec 17 2018, 2:55 PM
Restricted Application added a subscriber: Aklapper. Dec 17 2018, 2:55 PM
Joe renamed this task from "Use a multi-dc aware store for `wgMainStash` if needed." to "Use a multi-dc aware store for ObjectCache's MainStash if needed.". Dec 18 2018, 8:32 AM
aaron added a subscriber: aaron. Dec 18 2018, 8:22 PM

We need persistence and replication. The plan is to use the same store as sessions for the rest of the object stash usage (probably Cassandra). Flags like WRITE_SYNC might be used in a few callers, and should map to appropriate backend requests (e.g. QUORUM_* settings in Cassandra). The callers of the main object stash all need persistence and replication, though (callers have already been migrated to the stash vs the WAN cache and such).
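For reference, the WRITE_SYNC flag is passed at the call site roughly like this (a sketch; the key and TTL are illustrative):

```php
use MediaWiki\MediaWikiServices;

$stash = MediaWikiServices::getInstance()->getMainObjectStash();
// WRITE_SYNC asks the backend to acknowledge the write on replicas too
// (which could map to a QUORUM-level write in a Cassandra-backed store).
$stash->set(
    $stash->makeKey( 'some-component', 'important-flag' ), // illustrative key
    1,
    BagOStuff::TTL_HOUR,
    BagOStuff::WRITE_SYNC
);
```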

Joe added a comment. Dec 18 2018, 8:38 PM

> We need persistence and replication. The plan is to use the same store as sessions for the rest of the object stash usage (probably Cassandra). Flags like WRITE_SYNC might be used in a few callers, and should map to appropriate backend requests (e.g. QUORUM_* settings in Cassandra). The callers of the main object stash all need persistence and replication, though (callers have already been migrated to the stash vs the WAN cache and such).

The fact that we need persistence is quite surprising, given that redis does LRU purging right now, has no backups, and has no disaster recovery system. Even worse, it's served through nutcracker, which doesn't even ensure consistency.

So using cassandra for MainStash objects will mean a severe performance penalty, and I'm not convinced it's justifiable, either in terms of cost or latency.

Moreover, it seems to me that the usage of this datastore is unbounded and uncontrolled, while if it is intended to be persisted long-term it should be treated like data in the databases, and we should discuss any new use before it's permitted in production.

So even if we decide to move this to a storage service similar to the sessions one, it will need multitenancy and client whitelisting per use case, rather than being random storage for unstructured data, as it seems to be now.

This needs a thorough discussion ASAP.

Eevans added a subscriber: Eevans. Dec 18 2018, 9:07 PM

> [ ... ]
> This needs a thorough discussion ASAP.

I was under the impression we had.

Joe added a comment. Dec 18 2018, 9:10 PM

Well, that discussion was limited to Session storage, and I stand by the idea that that service, and its datastore, shouldn't be concerned with anything other than sessions, unless we come to a further agreement.

The current callers don't assume the same level of durability as with mysql, just that the data will likely not be randomly removed (e.g. by a high eviction rate, a power outage, or network blips). The WAN cache callers, on the other hand, can handle a fair amount of that.

I think the criteria are a bit fuzzy, and I mentioned on another task that modding mcrouter to do replication for a separate logical pool (same servers) would probably be good enough (for now) for our current object stash callers. (Sessions are an exception: they use $wgSessionCacheType with ObjectCache::getInstance() rather than getMainObjectStash(), though they happen to have the same backing store at WMF.) In the interest of moving things forward, that could be an OK compromise, even if we lose disk persistence (e.g. redis rdb files).

> The current callers don't assume the same level of durability as with mysql, just that the data will likely not be randomly removed (e.g. by a high eviction rate, a power outage, or network blips).

But that actually happens. Data IS randomly removed. Looking at the graphs at https://redis.io/topics/lru-cache, we can see that in the case of 2.8 (which we ran for many years), randomness in key removal was expected for a big percentage of the data. So that assumption at the level of the callers is wrong (but it doesn't seem to hurt).

Joe added a comment. Dec 19 2018, 7:17 AM

Looking at live data, we have at least one shard that's doing evictions (150k of them) and all shards have 10M+ expired keys.

Is this behaviour expected from a storage backend for MainStash? Do we assume all data there comes with a TTL?

mark added a subscriber: mark. Dec 19 2018, 3:05 PM

I am getting the impression here that some things are being rushed and finalized without time for a proper discussion between people/teams about the different possible solutions and their impact, after this new discovery. Is that because goals are due to be posted now?

I think we might have to take a step back and allow this discussion to run its course and reach consensus, and do what we need to do to not have goal deadlines interfere in the meantime...

CDanis added a subscriber: CDanis. Dec 19 2018, 3:44 PM

> I am getting the impression here that some things are being rushed and finalized without time for a proper discussion between people/teams about the different possible solutions and their impact, after this new discovery. Is that because goals are due to be posted now?

I'm not sure who this is directed at (read: whose goals), but insofar as session storage goes, I don't personally feel any pressure with respect to goals (and I'm assuming we're on track with them either way). That said, I don't feel at all certain (yet) that I have the full picture, so maybe it does threaten a goal...

> I think we might have to take a step back and allow this discussion to run its course and reach consensus, and do what we need to do to not have goal deadlines interfere in the meantime...

Agreed, and we should also make some time for a post-mortem. What is most concerning to me is that we could get this deeply into a project (session storage), only to be surprised that most of what we've been storing this whole time isn't sessions at all.

So, we use caching in MediaWiki for a ton of different things: parser cache, revision cache, counters, rate limiting, and so on.

By default since 1.27, sessions are stored in the same object cache as anything else, but we can split sessions out to their own storage backend with the configuration variable $wgSessionCacheType.

Is a reasonable path forward to:

  1. Move sessions only to the new object store @Eevans and team are working on, while the rest of the cached objects stay on the current Redis infrastructure
  2. Make a decision whether and how to move other object types to the same or different object store later?
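For reference, the session split mentioned above is configured roughly like this (a sketch; the cache name and server address are made up):

```php
// LocalSettings.php
$wgObjectCaches['sessions-redis'] = [ // hypothetical cache name
    'class'   => 'RedisBagOStuff',
    'servers' => [ '10.0.0.1:6379' ], // placeholder address
];
// Route session data to its own backend, separate from the main stash.
$wgSessionCacheType = 'sessions-redis';
```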
Joe added a comment. Jan 16 2019, 5:37 PM

> So, we use caching in MediaWiki for a ton of different things: parser cache, revision cache, counters, rate limiting, and so on.
>
> By default since 1.27, sessions are stored in the same object cache as anything else, but we can split sessions out to their own storage backend with the configuration variable $wgSessionCacheType.
>
> Is a reasonable path forward to:
>
>   1. Move sessions only to the new object store @Eevans and team are working on, while the rest of the cached objects stay on the current Redis infrastructure
>   2. Make a decision whether and how to move other object types to the same or different object store later?

The issue is that even for the main stash we need a datastore that can be written from both datacenters; redis doesn't offer that.

The path forward could be:

  1. Move the session storage to the new datastore, and see how it performs
  2. Make a call whether to move the mainstash to another instance of the same datastore, or to something else like a multi-dc mcrouter configuration (see the sketch below).
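On the MediaWiki side, the mcrouter option would roughly amount to pointing the main stash at an mcrouter-backed memcached pool (a sketch; the cache name and port are assumptions, and the cross-DC replication itself would live in mcrouter's route configuration):

```php
// LocalSettings.php
$wgObjectCaches['mcrouter'] = [ // hypothetical cache name
    'class'      => 'MemcachedPeclBagOStuff',
    'servers'    => [ '127.0.0.1:11213' ], // assumed local mcrouter proxy
    'persistent' => true,
];
$wgMainStash = 'mcrouter'; // send MainStash traffic through mcrouter
```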

I did some analysis of how we're using ObjectCache in MW core. It seems like we've only got a few calls to the main stash in core right now:

  • site stats (# of users, edits, pages, articles, images), which seems to be persistent
  • recent (<24h) page deletion flag
  • upload status (24h TTL)

I'd spitball that the first two need multi-DC, and the last one probably doesn't (although it wouldn't hurt).

My understanding was that there's *a lot* of non-session data in the redis store, though. Like, more than the session data. Can we confirm that? Is there a way to get a random sample of the keys in redis so we can analyse where they're coming from?

Actually, it looks like we've got some extensions using the MainStash, too. I'll dig further into it.

So, re-reading https://phabricator.wikimedia.org/T211721#4818580 and looking at code, it seems like Echo notifications are the big culprit.

I've been trying to track our cache use here:

https://www.mediawiki.org/wiki/User:EvanProdromou/ObjectCache_use

I see a few ways forward, in order of difficulty:

  1. Open tickets for core code and extensions that are using the ObjectStash "wrong" so that they use the WANCache instead
  2. Open tickets for core code and extensions to have configuration options to allow config-time routing of storage requests (echo seen flags go here, TOR exit node fetch status goes there, ...)
  3. Consider using some kind of namespacing in keys (session, echo, tor, confirmedit, ...) that could be used for routing at run time

One thing we'd need to make sure of is that the Session Storage API isn't designed to be a general-purpose key-value store. Brad covered it pretty well here. I think the primary feature that we use a lot in MW core and extensions is atomic increment, which is not important for sessions but pretty important for stats, counters, toggles, etc.
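For illustration, the increment pattern looks roughly like this (a sketch; the counter key is made up):

```php
use MediaWiki\MediaWikiServices;

$stash = MediaWikiServices::getInstance()->getMainObjectStash();
$key = $stash->makeKey( 'some-component', 'hit-counter' ); // illustrative key
// Atomic server-side increment; a plain get/put session API can't
// express this without races between concurrent requests.
if ( $stash->incr( $key ) === false ) {
    // Key didn't exist yet; initialize it.
    $stash->add( $key, 1, BagOStuff::TTL_WEEK );
}
```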

elukey added a subscriber: elukey. Feb 7 2019, 5:53 PM
Joe added a comment. Feb 8 2019, 7:09 AM

> I did some analysis of how we're using ObjectCache in MW core. It seems like we've only got a few calls to the main stash in core right now:
>
>   • site stats (# of users, edits, pages, articles, images), which seems to be persistent
>   • recent (<24h) page deletion flag
>   • upload status (24h TTL)
>
> I'd spitball that the first two need multi-DC, and the last one probably doesn't (although it wouldn't hurt).
>
> My understanding was that there's *a lot* of non-session data in the redis store, though. Like, more than the session data. Can we confirm that? Is there a way to get a random sample of the keys in redis so we can analyse where they're coming from?

It's as easy as connecting to the mc* redis instances in the inactive datacenter and looking at the output of KEYS *. Someone from my team can help gather stats, but yes, session data is a *tiny fraction* of both the data and the read/write requests to those redises.

Joe added a comment. Feb 8 2019, 7:17 AM

> One thing we'd need to make sure of is that the Session Storage API isn't designed to be a general-purpose key-value store. Brad covered it pretty well here. I think the primary feature that we use a lot in MW core and extensions is atomic increment, which is not important for sessions but pretty important for stats, counters, toggles, etc.

Actually, that's the way it was designed, and it makes sense for session storage and for many other kinds of usage.

I don't think we should use it for MainStash - we should instead use an mcrouter-based, multi-dc-replicated, low-latency ephemeral storage that guarantees we keep the same sub-millisecond latencies we had with redis.

Cassandra-based storage will always have a latency that's orders of magnitude higher than anything that's purely in-memory; so while I could see a future where we want a mid-latency, multi-dc, highly consistent storage for things that need long-term persistence, that's not how MainStash works today.

One thing I find quite appalling about this whole situation is that:

  • There is no mandate to only use the MW-provided storage interfaces from extensions that run in production
  • There is no tracking of which of said interfaces each extension uses
  • There is quite a large disconnect between the guarantees that developers might expect from said interfaces and what is really available in production.

The confusion around this ticket, and in general around moving things out of the redis datastore, seems to all come from this series of deficiencies, which we should fix together.

jijiki added a subscriber: jijiki.
jijiki moved this task from Backlog/Radar to In Progress on the User-jijiki board. Feb 8 2019, 11:00 AM
jijiki added a comment. Feb 8 2019, 3:42 PM

@EvanProdromou I will try to pull some stats together for you early next week, probably on Monday :)

jijiki added a comment (edited). Feb 12 2019, 6:37 PM

@EvanProdromou After some digging in mc20* redis (codfw), some stats are:

   Count  Key pattern
  409606  $wiki:MWSession
18223197  $wiki:echo:seen:(alert/message)
22132316  global:echo:seen:(alert/message)
  959567  global:loginnotify:prevSubnet
   55002  OAUTH:metawiki:nonce (oauth_token)
  308605  global:Wikimedia\Rdbms\ChronologyProtector:mtime
45473452  Total keys

Please let me know if you need anything else :)

Joe added a comment. Feb 13 2019, 11:14 AM

@jijiki what is the total number of items stored on those redises? So that I can understand how much of that is used by the sessions. I guess less than 1%?

@Joe I updated the table above

jijiki moved this task from In Progress to St on the User-jijiki board. Feb 14 2019, 11:00 AM

Thanks @jijiki and @Joe

Assuming those stats are representative, I see some key takeaways:

  1. ~40M of the 45M keys (~88%) are for Echo. We should probably have a way to configure Echo's storage explicitly; I don't think that's possible right now. I'll open a ticket for that immediately.
  2. 2 of the top 6 types of keys aren't accounted for in the analysis I did. Looking at, say, ChronologyProtector, I still can't figure out why that's going to the main stash. So, there's probably a lot more code like that.
  3. About 3M keys are not in this table. That's a long tail of storage. It'd be interesting to see the next 20 or 100 patterns.

After that, I think @Joe has some good points. I wonder if this would do the trick for us:

  • Standard way to get a stash for a component or extension which checks if there's a configured stash for that component
  • Standard way to define the contract for the component or extension's stash needs, so if there's none configured, the best-fitting one can be applied
  • Standard way to figure out what component or extension put a key in the stash

I'm going to follow up on this further.
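The first bullet above might look roughly like this (entirely hypothetical; getComponentStash() and $wgComponentStashes don't exist today and are only a sketch of the idea):

```php
// Hypothetical helper: return the stash configured for a component,
// falling back to the shared main stash when none is configured.
function getComponentStash( $component ) {
    global $wgComponentStashes; // hypothetical map, e.g. [ 'echo' => 'mcrouter' ]
    if ( isset( $wgComponentStashes[$component] ) ) {
        return ObjectCache::getInstance( $wgComponentStashes[$component] );
    }
    return MediaWiki\MediaWikiServices::getInstance()->getMainObjectStash();
}

// Usage (illustrative):
$stash = getComponentStash( 'echo' );
```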

ChronologyProtector positions should be applied to all DCs to handle cases where the sticky DC cookie isn't enough (e.g. redirects to other WMF wiki domains). This is why it uses the main stash.