
Use a multi-dc aware store for ObjectCache's MainStash if needed.
Open, Needs Triage · Public

Description

As evidenced during the investigation of T211721, we don't just write session data to the sessions redis cluster, but we also write all data from anything in MediaWiki that uses MediaWikiServices::getInstance()->getMainObjectStash() to the local redis cluster.
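For reference, this is roughly what such a write looks like from MediaWiki code; the component name, key parts, value and TTL below are made up for illustration, while the accessor and the BagOStuff methods are the standard ones:

  use MediaWiki\MediaWikiServices;

  // Obtain the main stash (a BagOStuff instance); at WMF this is currently
  // backed by the same local Redis cluster that holds session data.
  $stash = MediaWikiServices::getInstance()->getMainObjectStash();

  // Hypothetical key and value: anything written this way lands in Redis
  // in the local datacenter today.
  $userId = 12345; // placeholder
  $key = $stash->makeKey( 'mycomponent', 'some-flag', $userId );
  $stash->set( $key, 1, 86400 ); // 24h TTL
  $flag = $stash->get( $key );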

This breaks in a multi-dc setup for a number of reasons, first and foremost that we replicate redis data from the master DC to the others, but not the other way around as redis doesn't support multi-master replication.

To fix the behaviour of the software in a multi-dc scenario, I see the following possibilities, depending on what type of storage guarantees we want to have:

  • Option 1: If we don't need data to be consistent cross-dc: After we migrate the sessions data to their own datastore, we turn off replication and leave the two clusters operating separately
  • Option 2: If we need cross-dc consistency, but we don't need the data to have a guaranteed persistence: We can migrate the stash to mcrouter.
  • Option 3: If we need all of the above, plus persistence: We might need to migrate that to the same (or a similar) service to the one we will use for sessions.

I would very much prefer to be able to avoid the last option, if at all possible.

Event Timeline

Joe created this task. Dec 17 2018, 2:55 PM
Restricted Application added a subscriber: Aklapper. Dec 17 2018, 2:55 PM
Joe renamed this task from Use a multi-dc aware store for `wgMainStash` if needed. to Use a multi-dc aware store for ObjectCache's MainStash if needed. Dec 18 2018, 8:32 AM
aaron added a subscriber: aaron. Dec 18 2018, 8:22 PM

We need persistence and replication. The plan is to use the same store as sessions for the rest of the object stash usage (probably Cassandra). Flags like WRITE_SYNC might be used in a few callers, and should map to appropriate backend requests (e.g. QUORUM_* settings in Cassandra). The callers of the main object stash all need persistence and replication, though (callers have already been migrated to stash vs WAN cache and such).
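For illustration, a hedged sketch of how a caller might pass those flags, assuming the usual BagOStuff signatures; the key and value are hypothetical, and how WRITE_SYNC/READ_LATEST map to backend settings (e.g. Cassandra QUORUM levels) would be up to the implementation:

  use MediaWiki\MediaWikiServices;

  $stash = MediaWikiServices::getInstance()->getMainObjectStash();
  $key = $stash->makeGlobalKey( 'mycomponent', 'important-flag' ); // hypothetical
  $value = 'example';

  // WRITE_SYNC asks the backend to make the write best-effort synchronous
  // across replicas/DCs; READ_LATEST asks for a non-replica read.
  $stash->set( $key, $value, $stash::TTL_WEEK, $stash::WRITE_SYNC );
  $current = $stash->get( $key, $stash::READ_LATEST );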

Joe added a comment. Dec 18 2018, 8:38 PM

We need persistence and replication. The plan is to use the same store as sessions for the rest of the object stash usage (probably Cassandra). Flags like WRITE_SYNC might be used in a few callers, and should map to appropriate backend requests (e.g. QUORUM_* settings in Cassandra). The callers of the main object stash all need persistence and replication, though (callers have already been migrated to stash vs WAN cache and such).

The fact that we need persistence is quite surprising, given that Redis does LRU purging right now, has no backups, and has no disaster-recovery system. Even worse, it's served through nutcracker, which doesn't even ensure consistency.

So using Cassandra for MainStash objects will mean a severe performance penalty, and I'm not convinced it's justifiable, either in terms of cost or latency.

Moreover, it seems to me that the usage of this datastore is unbounded and uncontrolled, whereas if it is intended to be persisted long-term it should be treated like data in the databases, and we should discuss any new use before it's permitted in production.

So even if we decide to move this to a storage service similar to the sessions one, it will need multitenancy and per-use client whitelisting, rather than being a catch-all store for unstructured data as it seems to be now.

This needs a thorough discussion ASAP.

Eevans added a subscriber: Eevans. Dec 18 2018, 9:07 PM

[ ... ]
This needs a thorough discussion ASAP.

I was under the impression we had.

Joe added a comment. Dec 18 2018, 9:10 PM

Well, that discussion was limited to session storage, and I stand by the idea that that service, and its datastore, shouldn't be concerned with anything other than sessions, unless we come to a further agreement.

The current callers don't assume the same level of durability as MySQL, just that the data will likely not be randomly removed (e.g. by a high eviction rate, power outage, or network blips). The WAN cache callers, on the other hand, can handle a fair amount of that.

I think the criteria are a bit fuzzy, and I mentioned on another task that modding mcrouter to do replication for a separate logical pool (same servers) would probably be good enough (for now) for our current object stash callers (aside from sessions, which use $wgSessionCacheType with ObjectCache::getInstance() and not getMainObjectStash(), though they happen to have the same backing store at WMF). In the interest of moving things forward, that could be an OK compromise, even if we lose disk persistence (e.g. Redis RDB files).

The current callers don't assume the same level of durability as MySQL, just that the data will likely not be randomly removed (e.g. by a high eviction rate, power outage, or network blips).

But that actually happens. Data IS randomly removed. Looking at the graphs at https://redis.io/topics/lru-cache, we can see that in the case of Redis 2.8 (which we ran for many years), randomness in key removal is expected for a large percentage of the data. So that assumption on the part of the callers is wrong (though it doesn't seem to hurt).

Joe added a comment. Dec 19 2018, 7:17 AM

Looking at live data, we have at least one shard that's doing evictions (150k of them) and all shards have 10M+ expired keys.

Is this behaviour expected from a storage backend for MainStash? Do we assume all data there comes with a TTL?

mark added a subscriber: mark. Dec 19 2018, 3:05 PM

I am getting the impression here that some things are being rushed and finalized without time for a proper discussion between people/teams about the different possible solutions and their impact, after this new discovery. Is that because goals are due to be posted now?

I think we might have to take a step back and allow for this discussion to run its course and reach consensus, and do what we need to do to not have goal deadlines interfere in the meantime...

CDanis added a subscriber: CDanis. Dec 19 2018, 3:44 PM

I am getting the impression here that some things are being rushed and finalized without time for a proper discussion between people/teams about the different possible solutions and their impact, after this new discovery. Is that because goals are due to be posted now?

I'm not sure who this is directed at (read: whose goals), but insofar as session storage goes, I don't personally feel any pressure with respect to goals (and I'm assuming we're on track with them either way). That said, I don't feel at all certain (yet) that I have the full picture, so maybe it does threaten a goal...

I think we might have to take a step back and allow for this discussion to run its course and reach consensus, and do what we need to do to not have goal deadlines interfere in the meantime...

Agreed, and we should also make some time for a post-mortem. What is most concerning to me is that we could get this deeply into a project (session storage), only to be surprised that most of what we've been storing this whole time isn't sessions at all.

So, we use caching in MediaWiki for a ton of different things: parser cache, revision cache, counters, rate limiting, and so on.

By default since 1.27, sessions are stored in the same object cache as everything else, but we can split sessions out to their own storage backend with the configuration variable $wgSessionCacheType.

Is a reasonable path forward to:

  1. Move sessions only to the new object store @Eevans and team are working on, while the rest of the cached objects stay on the current Redis infrastructure
  2. Make a decision whether and how to move other object types to the same or different object store later?
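A hedged sketch of what that split might look like in configuration terms; the cache name and server address are placeholders, not the actual wmf-config values:

  // Hypothetical LocalSettings.php fragment: route session data to a
  // dedicated backend while other stash/cache users keep their defaults.
  $wgObjectCaches['sessions-store'] = [
      'class' => 'RedisBagOStuff',
      'servers' => [ '10.0.0.1:6379' ], // placeholder address
  ];
  $wgSessionCacheType = 'sessions-store';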
Joe added a comment. Jan 16 2019, 5:37 PM

So, we use caching in MediaWiki for a ton of different things: parser cache, revision cache, counters, rate limiting, and so on.
By default since 1.27, sessions are stored in the same object cache as everything else, but we can split sessions out to their own storage backend with the configuration variable $wgSessionCacheType.
Is a reasonable path forward to:

  1. Move sessions only to the new object store @Eevans and team are working on, while the rest of the cached objects stay on the current Redis infrastructure
  2. Make a decision whether and how to move other object types to the same or different object store later?

The issue is that even for the main stash we need a datastore that can be written to from both datacenters; Redis doesn't offer that.

The path forward could be:

  1. Move the session storage to the new datastore, and see how it performs
  2. Make a call on whether to move the main stash to another instance of the same datastore, or to something else like a multi-DC mcrouter configuration.

I did some analysis of how we're using ObjectCache in MW core. It seems like we've only got a few calls to the main stash in core right now:

  • site stats (# of users, edits, pages, articles, images), which seems to be persistent
  • recent (<24h) page deletion flag
  • upload status (24h TTL)

I'd spitball that the first two need multi-DC, and the last one probably doesn't (although it wouldn't hurt).

My understanding was that there's *a lot* of non-session data in the redis store, though. Like, more than the session data. Can we confirm that? Is there a way to get a random sample of the keys in redis so we can analyse where they're coming from?

Actually, it looks like we've got some extensions using the MainStash, too. I'll dig further into it.

So, re-reading https://phabricator.wikimedia.org/T211721#4818580 and looking at code, it seems like Echo notifications are the big culprit.

I've been trying to track our cache use here:

https://www.mediawiki.org/wiki/User:EvanProdromou/ObjectCache_use

I see a few ways forward, in order of difficulty:

  1. Open tickets for core code and extensions that are using the ObjectStash "wrong" so that they use the WANCache instead
  2. Open tickets for core code and extensions to have configuration options to allow config-time routing of storage requests (echo seen flags go here, TOR exit node fetch status goes there, ...)
  3. Consider using some kind of namespacing in keys (session, echo, tor, confirmedit, ...) that could be used for routing at run time

One thing we'd need to check is whether the Session Storage API is designed to be a general-purpose key-value store. Brad covered it pretty well here. I think the primary feature that we use a lot in MW core and extensions is atomic increment, which is not important for sessions but pretty important for stats, counters, toggles, etc.
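For example, the counter-style usage leans on BagOStuff's atomic increment; a short sketch with a hypothetical key (incrWithInit() creates the key if it is missing, then increments it atomically on the backend):

  use MediaWiki\MediaWikiServices;

  $stash = MediaWikiServices::getInstance()->getMainObjectStash();
  $key = $stash->makeKey( 'mycomponent', 'edit-counter' ); // hypothetical
  $stash->incrWithInit( $key, $stash::TTL_MONTH ); // increment by 1, init to 1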

elukey added a subscriber: elukey. Feb 7 2019, 5:53 PM
Joe added a comment. Feb 8 2019, 7:09 AM

I did some analysis of how we're using ObjectCache in MW core. It seems like we've only got a few calls to the main stash in core right now:

  • site stats (# of users, edits, pages, articles, images), which seems to be persistent
  • recent (<24h) page deletion flag
  • upload status (24h TTL)

I'd spitball that the first two need multi-DC, and the last one probably doesn't (although it wouldn't hurt).
My understanding was that there's *a lot* of non-session data in the redis store, though. Like, more than the session data. Can we confirm that? Is there a way to get a random sample of the keys in redis so we can analyse where they're coming from?

It's as easy as connecting to the mc* Redis instances in the inactive datacenter and looking at the output of KEYS *. Someone from my team can help gather stats, but yes, session data is a *tiny fraction* of both the data and the read/write requests to those Redis instances.
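If someone wants to script that sampling, a hedged sketch using the phpredis extension; the host is a placeholder, authentication is omitted, and SCAN is used instead of KEYS * so it doesn't block the instance:

  // Tally keys by their first two segments (e.g. "global:echo") to see
  // which components dominate. Purely illustrative.
  $redis = new Redis();
  $redis->connect( 'mc-placeholder.codfw.wmnet', 6379 ); // placeholder host
  $counts = [];
  $it = null;
  while ( ( $keys = $redis->scan( $it, '*', 1000 ) ) !== false ) {
      foreach ( $keys as $key ) {
          $prefix = implode( ':', array_slice( explode( ':', $key ), 0, 2 ) );
          $counts[$prefix] = ( $counts[$prefix] ?? 0 ) + 1;
      }
  }
  arsort( $counts );
  print_r( array_slice( $counts, 0, 20, true ) );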

Joe added a comment. Feb 8 2019, 7:17 AM

One thing we'd need to check is whether the Session Storage API is designed to be a general-purpose key-value store. Brad covered it pretty well here. I think the primary feature that we use a lot in MW core and extensions is atomic increment, which is not important for sessions but pretty important for stats, counters, toggles, etc.

Actually, that's the way it was designed, and it makes sense for session storage and for many other kinds of usage.

I don't think we should use it for MainStash - we should instead use an mcrouter-based, multi-dc replicated, low-latency ephemeral storage which guarantees we keep the same sub-millisecond latencies we had with redis.
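Roughly, that would mean pointing the stash at a local mcrouter proxy rather than Redis; a hedged sketch of just the MediaWiki side (the cross-DC replication itself would live in mcrouter's own route configuration, not shown here), with placeholder names and ports:

  // Hypothetical: a memcached-protocol stash definition that talks to a local
  // mcrouter instance; mcrouter would be configured to replicate this pool's
  // writes to the other datacenter.
  $wgObjectCaches['mcrouter-stash'] = [
      'class' => 'MemcachedPeclBagOStuff',
      'servers' => [ '127.0.0.1:11213' ], // local mcrouter proxy, placeholder port
      'persistent' => false,
  ];
  $wgMainStash = 'mcrouter-stash';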

Cassandra-based storage will always have a latency that's orders of magnitude higher than anything that's purely in-memory; so while I could see a future where we want a mid-latency, multi-DC, highly consistent store for things that need long-term persistence, that's not how MainStash works today.

One thing I find quite appalling about this whole situation is that:

  • There is no mandate to only use the MW-provided storage interfaces from extensions that run in production
  • There is no tracking of which of said interfaces each extension uses
  • There is a quite large disconnect between the guarantees that developers might expect from said interfaces and what is really available in production.

The confusion around this ticket, and in general around moving things out of the Redis datastore, seems to all stem from this series of deficiencies, which we should fix together.

jijiki added a subscriber: jijiki.
jijiki moved this task from Backlog/Radar to In Progress on the User-jijiki board. Feb 8 2019, 11:00 AM
jijiki added a comment. Feb 8 2019, 3:42 PM

@EvanProdromou I will try to pull you some stats early next week, probably on Monday :)

jijiki added a comment. Edited Feb 12 2019, 6:37 PM

@EvanProdromou After some digging in mc20* redis (codfw), some stats are:

  Count       Key pattern
  409,606     $wiki:MWSession
  18,223,197  $wiki:echo:seen:(alert/message)
  22,132,316  global:echo:seen:(alert/message)
  959,567     global:loginnotify:prevSubnet
  55,002      OAUTH:metawiki:nonce (oauth_token)
  308,605     global:Wikimedia\Rdbms\ChronologyProtector:mtime
  45,473,452  Total keys

Please let me know if you need anything else :)

Joe added a comment. Feb 13 2019, 11:14 AM

@jijiki what is the total number of items stored on those Redis instances? I'd like to understand how much of it is used by sessions. I'd guess less than 1%?

@Joe I updated the table above

jijiki moved this task from In Progress to St on the User-jijiki board. Feb 14 2019, 11:00 AM

Thanks @jijiki and @Joe

Assuming those stats are representative, I see some key takeaways:

  1. About 40M of the 45M keys (~88%) are for Echo. We should probably have a way to configure Echo storage explicitly; I don't think that's possible right now. I'll open a ticket for that immediately.
  2. 2 of the top 6 types of keys aren't accounted for in the analysis I did. Looking at, say, ChronologyProtector, I still can't figure out why that's going to the main stash. So, there's probably a lot more code like that.
  3. About 3M keys aren't in this table; that's a long tail of storage. It'd be interesting to see the next 20 or 100 patterns.

After that, I think @Joe has some good points. I wonder if this would do the trick for us:

  • Standard way to get a stash for a component or extension which checks if there's a configured stash for that component
  • Standard way to define the contract for the component or extension's stash needs, so if there's none configured, the best-fitting one can be applied
  • Standard way to figure out what component or extension put a key in the stash

I'm going to follow up on this further.

ChronologyProtector positions should be applied to all DCs to handle cases where the sticky DC cookie isn't enough (e.g. redirects to other WMF wiki domains). This is why it uses the main stash.

I moved this task so it's a direct child of the active/active data centre discussion, rather than being tied specifically to session storage.

The official requirements of Stash are in line with the REST-storage service being created for sessions.

Expectations of MediaWiki 'Main stash' are (as documented in ObjectCache.php):

  • Stored in primary DC.
  • Replicated to other DCs.
  • Support for READ_LATEST (e.g. a master-equivalent read) exists, but such reads should only happen in POST and job context (they are discouraged on GET).
  • "Fairly strong persistence": for third parties it defaults to MySQL; at WMF we use Redis. LRU evictions may happen, but overall unpopular keys are expected to last fairly long, and the store is a good alternative to custom database tables for data that has many keys, is expensive to compute, or comes from an off-wiki source.

Any individual uses of the stash that can be changed to not need all these requirements, and/or that didn't need them in the first place, can of course be migrated whenever. If we have limited storage capacity and expect to be unable to house the current data, we may even want to require some of these to migrate first.

However, I do not think that "stash" as an abstract interface should be changed to something other than the above requirements. As such, assuming no unsupported raw use of these Redis servers, they could be decom'ed without much controversy, I expect; basically, as soon as the MW config for "stash" is switched from Redis to Cassandra.
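For reference, a hedged sketch of the knob in question; 'db-replicated' is the stock MediaWiki default (an SqlBagOStuff-backed definition in the database), while the override name below is illustrative rather than the actual wmf-config value:

  // Stock MediaWiki default: the main stash lives in the database
  // ('db-replicated' is a SqlBagOStuff-backed definition).
  $wgMainStash = 'db-replicated';

  // WMF-style override (illustrative name): point the stash at a Redis-backed
  // definition from $wgObjectCaches instead. Re-pointing this at a
  // Cassandra-backed definition is essentially the switch described above.
  // $wgMainStash = 'redis-stash';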

Krinkle updated the task description. May 20 2019, 11:36 PM

Regarding Option 1: I do not know of any users of the stash that relate to dc-local coordination; as such, not replicating these values is the same as the values randomly disappearing (due to existing sometimes and being absent at other times, e.g. between DCs). Aside from being a breach of the general contract, I don't think any of the current WMF-deployed extensions using this would be satisfied by a non-replicated Redis pool unless they would already be satisfied by simply using the "WAN cache" instead (mcrouter).

Regarding Option 2: Indeed, some consumers of the stash may have chosen it by accident and could be migrated. However, the MW software would still offer this interface and would still need a backend. I don't think operating yet another cache backend long-term is feasible or desirable, hence storing it in the same backend as sessions makes sense to me. Alternatively, we could go back to storing it in MySQL. I'm not sure we have any other Tier 1 stores in prod that would satisfy this contract.

Regarding Option 3: This seems the most straight-forward and preferable. Why is this undesirable?

Expectations of MediaWiki 'Main stash' are (as documented in ObjectCache.php):

  • Stored in primary DC.
  • Replicated to other DCs.

Does this mean that writes to the stash only happen during POST requests (and in jobs)? Is this guaranteed, or does it need to be enforced? What are the chances that we'll also need the ability to write during a GET, and then replicate from a secondary to the primary DC?

aaron added a comment. May 21 2019, 8:54 PM

They preferably would happen in the main DC via POST, but that is not a hard requirement. This is partly why there are WRITE_SYNC and READ_LATEST flags in BagOStuff. A write should work in any DC and should replicate asynchronously. If the backend supports WRITE_SYNC, then it should make the write best-effort synchronous.

I'd assume there would be a lot of counters and so on that might need to be written to even on a GET request.

Joe added a comment. May 27 2019, 6:08 AM

Regarding Option 3: This seems the most straight-forward and preferable. Why is this undesirable?

Two reasons: performance and costs.

Performance-wise, a Cassandra-backed datastore has latencies that can be 10x those of Redis. Also, storing 1 GB in Redis is relatively cheap compared to storing it in Cassandra.

Also, if we go with Cassandra, uses of the main stash that land in production will need to be vetted the way we vet the use of MySQL and the creation of new tables. The interface should also allow for multitenancy.

I'd assume there would be a lot of counters and so on that might need to be written to even on a GET request.

As far as I'm aware, counting is generally done using statsd. The notable exception is rate limits, but they only apply to things that require a POST.

Usually the reason for writes to memcached during a GET is things getting loaded from slow storage and written to a cache. These shouldn't need replication to other DCs, though.

Krinkle added a comment. Edited May 27 2019, 7:32 PM

Regarding Option 3: This seems the most straight-forward and preferable. Why is this undesirable?

[..]
Also, if we go with Cassandra, uses of the main stash that land in production will need to be vetted the way we vet the use of MySQL and the creation of new tables. The interface should also allow for multitenancy.

Yes, that seems reasonable. Its keys should be namespaced (and presumably already are) with a primary key component, like we do with other BagOStuff backends.
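In BagOStuff terms, that namespacing is what makeKey()/makeGlobalKey() already provide; a short sketch with components modelled loosely on the key patterns listed earlier (the variable values are placeholders):

  use MediaWiki\MediaWikiServices;

  $stash = MediaWikiServices::getInstance()->getMainObjectStash();
  $userId = 12345; // placeholder

  // Per-wiki key: the wiki ID (keyspace) is prepended automatically.
  $seenKey = $stash->makeKey( 'echo', 'seen-time', $userId );

  // Cross-wiki key: shared by all wikis on the cluster.
  $subnetKey = $stash->makeGlobalKey( 'loginnotify', 'prevSubnet', $userId );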

I would expect that these writes do not need to happen on GET requests generally. For cases where this is needed somehow, we can follow the same approach as for other persistent writes: Queue a job instead.

I'd assume there would be a lot of counters and so on that might need to be written to even on a GET request.

I'm not aware of any counters using the Stash. Even in the current version of the stash that uses Redis (and thus is fairly lightweight), its guarantees of persistence and replication do not align with our use cases for counters; they would be overkill. If any of those do happen to exist in fringe code we forgot about, they should be easy to migrate to dc-local Memcached or JobQueue writes, as we normally would.

For statistics in Graphite, we currently use Statsd.
For rate limits, we currently use dc-local Memcached.
For any derived data that is persisted for the MediaWiki application and accessible to end-users, we use MySQL (e.g. the user edit count). These are written to from POST requests only, or via the JobQueue. (And they do not use the stash currently.)

fsero moved this task from Backlog to Incoming on the serviceops board. Jun 20 2019, 2:22 PM
Joe edited projects, added serviceops-radar; removed serviceops. Jun 21 2019, 8:47 AM
Joe moved this task from Backlog to Watched on the serviceops-radar board. Jun 21 2019, 8:52 AM

As such, assuming no unsupported raw use of these Redis servers, they could be decom'ed without much controversy I expect.

We've got a few uses as locking servers, apparently. :-(

I'd assume there would be a lot of counters and so on that might need to be written to even on a GET request.

I'm not aware of any counters using the Stash.

AFAICT, only SiteStatsUpdate has counters that use the stash. It also uses atomic increment. There's also Echo, which does update the last-read flag on GET requests when looking at the Notifications page (not a counter, just a stash write on GET). ConfirmEdit uses the stash for captcha info; I can't tell if we use that in production, though. AbuseFilter also uses the main stash for some counters.

Even in the current version of the stash that uses Redis (and thus is fairly lightweight), its guarantees of persistence and replication do not align with our use cases for counters; they would be overkill. If any of those do happen to exist in fringe code we forgot about, they should be easy to migrate to dc-local Memcached or JobQueue writes, as we normally would.

Great. I guess the next step would be to identify those uses and open tickets to get those changes made.

@Krinkle Also, it sounds like we'd be changing the contract as loosely defined in the source code from 'Callers should be prepared for: Writes to be slower in non-"primary" (e.g. HTTP GET/HEAD only) DCs' to 'No writes on GET/HEAD'.

Yeah, I think we need to choose between making the Stash 1) a non-ephemeral dc-local storage that can also be written to from GET requests and non-primary DCs, or 2) a non-ephemeral replicated storage that should only be written to from the primary DC in POST requests.

I believe of these, option 2 would make more sense because virtually none of the current use cases for the Stash would fit option 1, and because we already have existing and more localised mechanisms for use cases that would fit option 1.

For example, when needing to persist data with high guarantees for reading by a job and later pruning, that data should be passed into the job itself (persisted as "job_params" in the JobQueue layer, currently backed by Kafka). Jobs may be queued from non-primary DCs and GET requests, and get relayed/scheduled for execution in the primary DC, where they may responsibly write to the database or other storage layers.

Any current use cases for placing data in MainStash that need to persist data from GET requests, and that cannot be accommodated in a performant/responsible manner by WANCache, SQL or session storage, could use the strategy of writing to the Stash from a deferred update or a queued job.
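A hedged sketch of that strategy with a hypothetical key and value; DeferredUpdates::addCallableUpdate() postpones the work until after the response has been sent, so the stash write no longer sits on the GET path itself (a queued job would be the heavier-weight equivalent):

  use MediaWiki\MediaWikiServices;

  $key = 'hypothetical-key';
  $value = 'hypothetical-value';

  // Schedule the write as a post-send deferred update instead of writing
  // to the stash directly while serving the GET request.
  DeferredUpdates::addCallableUpdate( function () use ( $key, $value ) {
      $stash = MediaWikiServices::getInstance()->getMainObjectStash();
      $stash->set( $key, $value, 86400 );
  }, DeferredUpdates::POSTSEND );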

aaron added a comment. Jul 10 2019, 9:08 PM

I think (1) is more useful and fills a needed gap of writes on GET/HEAD. What I've been doing in T227376 is trying to move things off the Stash that can easily enough use some other store. This narrows down the "problem space".

I made https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/521953 in line with my idea for the store.

Ultimately, it looks like it can use DC-replicated mcrouter pool at this point (e.g. things like FlaggedRevs and probably AbuseFilter).

Joe added a comment. Jul 22 2019, 9:32 AM

I think (1) is more useful and fills a needed gap of writes on GET/HEAD. What I've been doing in T227376 is trying to move things off the Stash that can easily enough use some other store. This narrows down the "problem space".
I made https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/521953 in line with my idea for the store.
Ultimately, it looks like it can use DC-replicated mcrouter pool at this point (e.g. things like FlaggedRevs and probably AbuseFilter).

A replicated mcrouter is, more or less, a multi-DC ephemeral replicated storage with no consistency guarantee. We could lose data at any time because a server is restarted, and that data would not be replicated back until someone writes it again.

That's very different from what @Krinkle was describing before.

aaron added a comment. Jul 22 2019, 8:10 PM

ObjectCache has always described getMainStashInstance() as "ephemeral global storage". It was just supposed to *try harder* to be persistent than memcached (RDB snapshots, the expectation that stuff can *probably* still be there a week or so later). The existence of Redis evictions, and of consistent re-hashing on host failure making data disappear or go stale, was well known at the time it was picked as the original "stash".
If we still need its caliber of persistence/replication, I don't mind un-abandoning my dynomite patch and testing that again... but it looks like we don't currently have a caller that cannot be made to use some existing thing ('db-replicated', a DB table, the JobQueue) or a (best-effort, not eventually consistent) replicated mcrouter (which easily enough could exist). I'm going off of the table at https://docs.google.com/document/d/1tX8ekiYb3xYgpNJsmA1SiKqzkWc0F-_E4SGx6BI72vA/edit . I can't think of anything that gets a "No" in both of the rightmost columns. If a new feature comes up, we can always explore adding other things to $wgObjectStashes and maybe a getMainPersistentObjectCache() convenience method, but I'm just thinking of the stuff that actually exists in the wild now.
If we still need it's caliber of persistence/replication, I don't mind un-abandoning my dynomite patch and testing that again...but it looks like we don't currently have a caller than cannot be made to use some existing thing ('db-replicated', DB table, JobQueue) or (best-effort, not eventually consistent) replicated mcrouter (which easily enough could exist). I'm going off of the table at https://docs.google.com/document/d/1tX8ekiYb3xYgpNJsmA1SiKqzkWc0F-_E4SGx6BI72vA/edit . I can't think of any thing that gets a "No" in both of the rightmost columns. If a new feature comes up, we can always explore adding other things to $wgObjectStashes and maybe a getMainPersistentObjectCache() convenience method, but I'm just thinking of the stuff that actually exists in the wild now.