
Move MainStash out of Redis to a simpler multi-dc aware solution
Open, Medium, Public

Description

As evidenced during the investigation of T211721, we don't just write session data to the sessions redis cluster, but we also write all data from anything in MediaWiki that uses MediaWikiServices::getInstance()->getMainObjectStash() to the local redis cluster.
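
For context, a minimal sketch of the access pattern in question; the key, value and TTL here are hypothetical, and only the service accessor named above comes from this task:

use MediaWiki\MediaWikiServices;

// Any MediaWiki code can reach the main stash like this (hypothetical key/value).
$stash = MediaWikiServices::getInstance()->getMainObjectStash();
$key = $stash->makeKey( 'example-feature', 'some-id' );
$stash->set( $key, [ 'payload' => '...' ], BagOStuff::TTL_DAY );
$value = $stash->get( $key );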

This breaks in a multi-DC setup for a number of reasons, first and foremost because we replicate Redis data from the master DC to the others, but not the other way around, as Redis doesn't support multi-master replication.

Status quo

The MediaWiki "Main stash" is backed in WMF production by a Redis cluster labelled "redis_sessions", and is co-located on the Memcached hardware. It is perceived as having the following features:

  • Fairly high persistence. (It is not a cache, and the data is not recoverable in case of loss. But the main stash is expected to lose data under pressure in hopefully rare circumstances.) Examples of user impact:
    • session - User gets logged out and loses any session-backend data (e.g. book composition).
    • echo - Notifications might be wrongly marked as read or unread.
    • watchlist - Reviewed edits might show up again as unreviewed.
    • resourceloader - Version hash churn would cause CDN and browser cache misses for a while.
  • Replication. (Data is eventually consistent and readable from both DCs)
  • Fast (low latency).

Note that:

  • "DC-local writable" is not on this list (mainstash only requires master writes), but given WMF is not yet mulit-dc we have more or less assumed that sessions are always locally writable and we need to keep performing sessions writes locally in a multi-DC world.
  • "Replication" is on the list and implemented in one direction for Redis at WMF. This would suffice if we only needed master writes, but for sessions we need local writes as well.

Future of sessions

Move Session and Echo data out of the MainStash into their own store that supports local writes and bidirectional replication. This is tracked under T206016 and T222851.

Future of mainstash

To fix the behaviour of the software in a multi-dc scenario, I see the following possibilities, depending on what type of storage guarantees we want to have:

  • Option 1: If we don't need data to be consistent cross-dc: After we migrate the sessions data to their own datastore, we turn off replication and leave the current Redis clusters operating separately in each DC.
  • Option 2: If we need cross-dc consistency, but we don't need the data to have a guaranteed persistence: We can migrate the stash to mcrouter.
  • Option 3: If we need all of the above, plus persistence: We might need to migrate that to the same (or a similar) service to the one we will use for sessions.

I would very much prefer to be able to avoid the last option, if at all possible.

Related Objects

Event Timeline


Regarding Option 1: I do not know of any users of the stash that relate to dc-local coordination; as such, not replicating these values is the same as the values randomly disappearing (present in some DCs and missing in others). Aside from being a breach of the general contract, I don't think any of the current WMF-deployed extensions using this would be satisfied by a non-replicated Redis pool unless they would already be satisfied by simply using the "WAN cache" instead (mcrouter).

Regarding Option 2: Indeed, some consumers of the stash may have chosen it by accident and could be migrated. However, the MW software would still offer this interface and would still need a backend. I don't think operating yet another cache backend long-term is feasible or desirable, hence storing it in the same backend as sessions makes sense, I think. Alternatively, we could go back to storing it in MySQL. I'm not sure we have any other Tier 1 stores in prod that would satisfy this contract.

Regarding Option 3: This seems the most straight-forward and preferable. Why is this undesirable?

Expectations of MediaWiki 'Main stash' are (as documented in ObjectCache.php):

  • Stored in primary DC.
  • Replicated to other DCs.

Does this mean that writes to the stash only happen during POST requests (and in jobs)? Is this guaranteed, or does it need to be enforced? What are the chances that we'll also need the ability to write during a GET, and then replicate from a secondary to the primary DC?

They preferably would happen in the main DC via POST, but that is not a hard requirement. This is partly why there are WRITE_SYNC and READ_LATEST flags in BagOStuff. A write should work in any DC and should replicate asynchronously. If the backend supports WRITE_SYNC, then it should make the write best-effort synchronous.
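
A minimal sketch of how a caller might use those flags, assuming roughly the BagOStuff interface of that era; the key and value are hypothetical:

use MediaWiki\MediaWikiServices;

$stash = MediaWikiServices::getInstance()->getMainObjectStash();
$key = $stash->makeKey( 'example', 'token' );

// Best-effort synchronous write, for callers that want cross-DC visibility sooner.
$stash->set( $key, 'some-value', BagOStuff::TTL_HOUR, BagOStuff::WRITE_SYNC );

// Read that asks for the latest copy rather than a possibly lagged replica.
$current = $stash->get( $key, BagOStuff::READ_LATEST );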

I'd assume there would be a lot of counters and so on that might need to be written to even on a GET request.

Regarding Option 3: This seems the most straight-forward and preferable. Why is this undesirable?

Two reasons: performance and costs.

Performance-wise, a cassandra backed datastore has latencies that can be 10x those of redis. Also storing 1 GB on redis is relatively cheap compared to storing it on cassandra.

Also, if we go with cassandra, uses of the mainstash that land in production will need to be vetted the way we vet the use of mysql and the creation of new tables. The interface should also allow for multitenancy.

I'd assume there would be a lot of counters and so on that might need to be written to even on a GET request.

As far as I'm aware, counting is generally done using statsd. The notable exception is rate limits, which only apply to things that require a POST.

Usually the reason for writes to memcached during a GET is things getting loaded from slow storage and written to a cache. These shouldn't need replication to other DCs, though.

Regarding Option 3: This seems the most straight-forward and preferable. Why is this undesirable?

[..]
Also, if we go with cassandra, uses of the mainstash that land in production will need to be vetted the way we vet the use of mysql and the creation of new tables. The interface should also allow for multitenancy.

Yes, that seems reasonable. Its keys should be namespaced (and presumably already are) with a primary key component, like we do with other BagOStuff backends.
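
As a reminder of that convention, a small hedged sketch; the 'mycomponent' names are hypothetical:

use MediaWiki\MediaWikiServices;

$stash = MediaWikiServices::getInstance()->getMainObjectStash();

// Wiki-scoped key: implicitly prefixed with the local wiki ID, e.g. "enwiki:mycomponent:123".
$localKey = $stash->makeKey( 'mycomponent', 123 );

// Cross-wiki key: carries its scope explicitly instead of the implicit wiki prefix.
$globalKey = $stash->makeGlobalKey( 'mycomponent', 'enwiki', 123 );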

I would expect that these writes do not need to happen on GET requests generally. For cases where this is needed somehow, we can follow the same approach as for other persistent writes: Queue a job instead.

I'd assume there would be a lot of counters and so on that might need to be written to even on a GET request.

I'm not aware of any counters using the Stash. Even in the current version of the stash that uses Redis (and thus is fairly lightweight), its guarantees of persistence and replication do not align with our use cases for counters; they would be overkill. If any of those do happen to exist in fringe code we forgot about, they should be easy to migrate to dc-local Memcached or JobQueue writables, as we normally would.

For statistics in Graphite, we currently use Statsd.
For rate limits, we currently use dc-local Memcached.
For any derived data that is persisted for the MediaWiki application and accessible to end-users, we use MySQL (e.g. the user edit count). These are written to from POST requests only, or via the JobQueue. (And they do not use the stash currently.)

As such, assuming no unsupported raw use of these Redis servers, they could be decom'ed without much controversy I expect.

We've got a few uses as locking servers, apparently. :-(

I'd assume there would be a lot of counters and so on that might need to be written to even on a GET request.

I'm not aware of any counters using the Stash.

AFAICT, only SiteStatsUpdate has counters that use the stash. It also uses atomic increment. There's also Echo, which does update the last-read flag on GET requests when looking at the Notifications page (not a counter, just a stash write on GET). ConfirmEdit uses the stash for captcha info; I can't tell if we use that in production, though. AbuseFilter also uses the main stash for some counters.
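
A hedged illustration of the atomic-increment pattern mentioned above; this is not the actual SiteStatsUpdate or AbuseFilter code, and the key is hypothetical:

use MediaWiki\MediaWikiServices;

$stash = MediaWikiServices::getInstance()->getMainObjectStash();
$key = $stash->makeKey( 'example-counter', 'edits' );

// Seed the counter if it does not exist yet, then bump it atomically so that
// concurrent requests do not lose updates.
$stash->add( $key, 0 );
$stash->incr( $key, 1 );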

Even in the current version of the stash that uses Redis (and thus is fairly lightweight), its guarantees of persistence and replication do not align with our use cases for counters; they would be overkill. If any of those do happen to exist in fringe code we forgot about, they should be easy to migrate to dc-local Memcached or JobQueue writables, as we normally would.

Great. I guess the next step would be to identify those uses and open tickets to get those changes made.

@Krinkle Also, it sounds like we'd be changing the contract as loosely defined in the source code from 'Callers should be prepared for: Writes to be slower in non-"primary" (e.g. HTTP GET/HEAD only) DCs' to 'No writes on GET/HEAD'.

Yeah, I think we need to choose between making the Stash 1) a non-ephemeral dc-local storage that can also be written to from GETs and non-primary DCs, or 2) a non-ephemeral replicated storage that should only be written to from the primary DC in POST requests.

Of these, I believe option 2 would make more sense, because virtually none of the current use cases for the Stash would fit option 1, and because we already have existing and more localised mechanisms for use cases that would fit option 1.

For example, when data needs to be persisted with high guarantees for reading by a job and pruned later, it should be passed into the job itself (persisted as "job_params" in the JobQueue layer, currently backed by Kafka). Jobs may be queued from non-primary DCs and GET requests, and get relayed/scheduled for execution in the primary DC, where they may responsibly write to the database or other storage layers.

Any current use cases for placing data in MainStash that need to persist data from GET requests, and that cannot be accommodated in a performant/responsible manner by WANCache, SQL or session storage, could use the strategy of writing to the Stash from a deferred update or a queued job.
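
A minimal sketch of that deferred-update strategy, with a hypothetical key and value (a queued job would work similarly where stronger guarantees are needed):

use MediaWiki\MediaWikiServices;

// Move the stash write out of the critical path of the (GET) request; it runs
// near the end of the request instead.
DeferredUpdates::addCallableUpdate( function () {
    $stash = MediaWikiServices::getInstance()->getMainObjectStash();
    $key = $stash->makeKey( 'example', 'derived-state' );
    $stash->set( $key, 'value', BagOStuff::TTL_WEEK );
} );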

I think (1) is more useful and fills a needed gap of writes on GET/HEAD. What I've been doing in T227376 is trying to move things off the Stash that can easily enough use some other store. This narrows down the "problem space".

I made https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/521953 in line with my idea for the store.

Ultimately, it looks like it can use DC-replicated mcrouter pool at this point (e.g. things like FlaggedRevs and probably AbuseFilter).

I think (1) is more useful and fills a needed gap of writes on GET/HEAD. What I've been doing in T227376 is trying to move things off the Stash that can easily enough use some other store. This narrows down the "problem space".

I made https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/521953 in line with my idea for the store.

Ultimately, it looks like it can use DC-replicated mcrouter pool at this point (e.g. things like FlaggedRevs and probably AbuseFilter).

A replicated mcrouter is a multi-dc ephemeral replicated storage with no consistency guarantee, more or less. We could lose data at any time because a server is restarted, and that data would not be replicated back until someone writes it again.

That's very different from what @Krinkle was describing before.

ObjectCache always mentioned getMainStashInstance() as "Ephemeral global storage". It was just supposed to *try harder* to be persistent than memcached (rdb snapshots, expectation that stuff can *probably* still be there a week later or so). The existence of redis evictions and consistent re-hashing on host failure making data disappear or go stale was well known at the time it was picked as the original "stash".

If we still need its caliber of persistence/replication, I don't mind un-abandoning my dynomite patch and testing that again...but it looks like we don't currently have a caller that cannot be made to use some existing thing ('db-replicated', DB table, JobQueue) or (best-effort, not eventually consistent) replicated mcrouter (which easily enough could exist). I'm going off of the table at https://docs.google.com/document/d/1tX8ekiYb3xYgpNJsmA1SiKqzkWc0F-_E4SGx6BI72vA/edit . I can't think of anything that gets a "No" in both of the rightmost columns. If a new feature comes up, we can always explore adding other things to $wgObjectStashes and maybe a getMainPersistentObjectCache() convenience method, but I'm just thinking of the stuff that actually exists in the wild now.

So, I think with the work that @aaron has done on moving data out of mainstash, plus the sessions and echo work that @Eevans has done, we're about done here. Aaron has some clean-up tasks still in CR, so I'm going to untag CPT, unassign myself, and let him close this ticket when he feels it's justified.

Krinkle renamed this task from "Use a multi-dc aware store for ObjectCache's MainStash if needed" to "Move MainStash out of Redis to a simpler multi-dc aware solution". Jun 2 2020, 5:07 PM
Krinkle updated the task description.
In the task description, @Joe wrote:

I see the following possibilities […]:

  • Option 1: If we don't need data to be consistent cross-dc: After we migrate the sessions data to their own datastore, we turn off replication and leave the current Redis clusters operating separately in each DC.
  • Option 2: If we need cross-dc consistency, but we don't need the data to have a guaranteed persistence: We can migrate the stash to mcrouter.
  • Option 3: If we need all of the above, plus persistence: We might need to migrate that to the same (or a similar) service to the one we will use for sessions.

I would very much prefer to be able to avoid the last option, if at all possible.

There are a few mainstash consumers that don't need replication or persistence, and they have been or are being moved to plain WANObjectCache/Mcrouter/Memcached instead.

However, there definitely are various use cases for MainStash that remain, i.e. ones that do need replication and persistence. I don't think these need to be migrated to something undesirable or complicated like another Cassandra/Sessionstore-like service.

I think there has been a "requirement" assumed here that is incorrect, which may've led to that conclusion, namely that we'd need dc-local writes. Afaik that was never a documented aspect of mainstash, and sessions/echo were pretty much the outliers in needing that. Also, in terms of latency, I think the remaining use cases are sufficiently low-traffic and not issued in big batches on GET requests, so they don't need ultra-low latencies either.

So that leaves us with:

  • Replication (eventual consistency from primary to secondary).
  • Writes to master.
  • Fairly good persistence (data loss is tolerable if rare; I've added examples to the task description).
  • Meh latency.

This seems quite well served by the default MediaWiki backend for MainStash, namely the db-replicated BagOStuff, which uses the objectcache table. If we want to isolate the traffic, we could configure it to use that table on one of the external DB clusters rather than the main ones. See T229062#5558235 for details.
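
For reference, a hedged sketch of what such a configuration could look like; the 'db-mainstash' name, the 'extstore' cluster and the 'objectstash' table are made-up placeholders, and whether SqlBagOStuff's 'cluster' parameter is the right knob for the external DBs is an assumption rather than something confirmed here:

// Hypothetical sketch: route the main stash to a SqlBagOStuff on a separate
// external DB cluster instead of the default 'db-replicated' objectcache table.
$wgObjectCaches['db-mainstash'] = [
    'class' => SqlBagOStuff::class,
    'cluster' => 'extstore',       // assumed: name of an external DB cluster
    'tableName' => 'objectstash',  // hypothetical dedicated table
];
$wgMainStash = 'db-mainstash';     // MediaWiki's default is 'db-replicated'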

Also, with the clarification around dc-local writes not being needed, is it at all desirable from your perspective to stick with Redis in the current one-way replication setup? That also seems to tick all our boxes and would work fine from the application perspective. I can also see the benefit in it being less work to maintain by simply re-using the external DBs for this, but I just wanted to explicitly ask and confirm whether running Redis at all is what we'd like to avoid, or whether it was mainly that it couldn't suit our needs (which has been resolved).

10 second samples from two Redis servers.

$ redis-cli -a "$AUTH" monitor > dump

  "GET" "wikidatawiki:abusefilter-profile:v3
  "SETEX" "wikidatawiki:abusefilter-profile:v3


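# Group the captured commands by key class (2nd colon-separated key component) and count: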
$ cat dump | cut -d' ' -f4-5 | cut -d':' -f2 | sort | uniq -c | sort -nr
mc1030
7523 abusefilter-profile
7046 Wikimedia\\Rdbms\\ChronologyProtector
 292 session
 195 api-token
  57 api-token-blacklist
  42 centralautologin-token
  39 watchlist-recent-updates
   8 captcha
   3 loginnotify
   3 central-login-complete-token
   3 abusefilter
   3 ResourceLoaderModule-dependencies
mc1033
9493 Wikimedia\\Rdbms\\ChronologyProtector
7260 abusefilter-profile
 227 session
 159 api-token
  65 api-token-blacklist
  24 centralautologin-token
  11 watchlist-recent-updates
   9 captcha
   9 ResourceLoaderModule-dependencies
   8 loginnotify
   3 central-login-complete-token

These are the per-10-second rates on individual Redis servers. The Redis session cluster is composed of 18 equally distributed shards. This means the overall rate per second is 1/10th of the number above, multiplied by 18.

ChronologyProtector has a rate of about 700/s per shard, which is roughly 12K/s overall.

From the Grafana "Redis" dashboard we see an overall rate of 20-25K ops/s, which supports the above estimate, with ChronologyProtector accounting for about half of that.

Krinkle added a subscriber: Marostegui.

Telemetry on current data size and read/write actions can be found under Redis: redis_sessions in Grafana.

Currently:

  • Data size: 3GB in total. The Redis cluster has 9GB of capacity (18x520MB). The data size is stable at ~3GB; things are generally added and expired at about the same rate.
  • Rate: 25-35K ops/second. (20-30K/s reads, 3K/s sets)

These figures will shrink once the last remaining sessions are moved over to Cassandra (see T243106 and Cassandra: Sessionstore in Grafana). The last switch introduced about +0.5GB of data and +13K req/s, of which +12K reads and +1K writes, so we can assume those will be deducted from the above.

Expected:

  • Data size: 2.5GB.
  • Rate: 12-22K ops/second. (8-18K/s reads, 2K/s sets)

Thanks for providing those figures.
While the data is pretty tiny, the expected read rate is quite high, so we should probably look at buying servers for this and not use x1.
Probably 3 servers per DC is enough, especially if the read volume slowly decreases over time.

This is probably the right moment to get this budgeted as we are planning what we'd need for next FY. I will talk to @mark on Thursday and let him know about these requirements.

Krinkle changed the task status from Open to Stalled. Jul 6 2020, 8:36 PM

This is pending hardware for the mini extdb cluster for mainstash/db-replicated.

The hardware has been budgeted. We are expecting to buy it in Q2; is that ok with you?
I don't think we'd have enough manpower to be able to buy+set it up during Q1.

Krinkle lowered the priority of this task from High to Medium. Jul 29 2020, 2:10 AM

@Marostegui @LSobanski is this still on track to be purchased in Q2?

Yes, and hopefully also set up during Q2 :-)

Marostegui added subtasks: Unknown Object (Task), Unknown Object (Task). Oct 5 2020, 11:11 AM
Papaul closed subtask Unknown Object (Task) as Resolved. Nov 25 2020, 5:23 PM

@Marostegui @LSobanski Where are we regarding the purchase?

@Gilles @WDoranWMF Given that we are trying to phase out the redis_sessions cluster (T267581), it would be really lovely if we could commit to getting this task completed in Q3, if possible :)

@jijiki Ok, thank you. @Gilles maybe we can chat it through? I'll try to find us a time.

Sure, @WDoranWMF, you can send me a meeting invite for next week. After that I'll be off for 3 weeks, returning on January 8.

@jijiki we've extended this OKR for Q3 on our end with the expectation that this would get unblocked soon. Aaron is off starting tonight and will return on January 4. He can start working on this as soon as the new DB cluster is set up and running.

Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Dec 12 2020, 11:08 PM

@jijiki the hosts arrived at the DC a few days ago and were racked and installed this past week (T267043).
I have been off for a week, but you can follow the progress of their setup at T269324.

jijiki changed the task status from Stalled to Open. Dec 14 2020, 1:13 PM
jijiki awarded a token.

@Gilles @Krinkle I assume we want DB backups for this database section the same way we back up s1 or x1?

We probably don't need backups and can't support them anyway (restoring outdated data would break business logic), but TODO: we should document the lack of backups as part of the main stash interface contract for future consumers.

Thank you @Krinkle
One further question: we are still deciding whether to configure ROW- or STATEMENT-based replication for this new section, and one of the things we'd like to know, if possible, is how likely it is that this new section will require schema changes?

@Marostegui Regarding schema changes, I'd say very unlikely. It is a key-value store that hasn't changed schema in well over a decade. In the unlikely event that a schema change would be required, my recommendation would be to slowly churn through the shards and truncate or drop each table and re-create it.

Per the task description, the stash is not a cache, and so upon loss there is some limited user impact since the data cannot be regenerated. However, it is strictly secondary data that is logically not required to persist indefinitely.

Thank you - that's very useful!

Krinkle removed a subtask: Unknown Object (Task). Sat, Feb 13, 11:50 PM