
Move MainStash out of Redis to a simpler multi-dc aware solution
Closed, Resolved (Public)

Description

As evidenced during the investigation of T211721, we don't just write session data to the sessions Redis cluster; we also write data from anything in MediaWiki that uses MediaWikiServices::getInstance()->getMainObjectStash() to the local Redis cluster.
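For illustration, a minimal sketch of such a write (the feature name, key components, and TTL here are made up):

use MediaWiki\MediaWikiServices;

$stash = MediaWikiServices::getInstance()->getMainObjectStash();
// Arbitrary feature state; this currently lands on the "redis_sessions"
// cluster alongside actual session data.
$stash->set(
	$stash->makeKey( 'myfeature', 'state', 123 ), // hypothetical key
	[ 'foo' => 'bar' ],
	$stash::TTL_DAY
);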

This arrangement breaks in a multi-DC setup for a number of reasons, first and foremost because we replicate Redis data from the master DC to the others but not the other way around, as Redis doesn't support multi-master replication.

Status quo

The MediaWiki "Main stash" is backed in WMF production by a Redis cluster labelled "redis_sessions", and is co-located on the Memcached hardware. It is perceived as having the following features:

  • Fairly high persistence. (It is not a cache: the data is not recoverable in case of loss. But the main stash is expected to lose data under pressure, in hopefully rare circumstances.) Examples of user impact:
    • session - User gets logged out and loses any session-backend data (e.g. book composition).
    • echo - Notifications might be wrongly marked as read or unread.
    • watchlist - Reviewed edits might show up again as unreviewed.
    • resourceloader - Version hash churn would cause CDN and browser cache misses for a while.
  • Replication. (Data is eventually consistent and readable from both DCs)
  • Fast (low latency).

Note that:

  • "DC-local writable" is not on this list (mainstash only requires master writes), but given WMF is not yet mulit-dc we have more or less assumed that sessions are always locally writable and we need to keep performing sessions writes locally in a multi-DC world.
  • "Replication" is on the list and implemented in one direction for Redis at WMF. This would suffice if we only needed master writes, but for sessions we need local writes as well.

Future of sessions

Move Session and Echo data out of the MainStash into their own store that supports local writes and bidirectional replication. This is tracked under T206016 and T222851.

Future of mainstash

To fix the behaviour of the software in a multi-dc scenario, I see the following possibilities, depending on what type of storage guarantees we want to have:

  • Option 1: If we don't need data to be consistent cross-DC: After we migrate the session data to its own datastore, we turn off replication and leave the current Redis clusters operating separately in each DC.
  • Option 2: If we need cross-dc consistency, but we don't need the data to have a guaranteed persistence: We can migrate the stash to mcrouter.
  • Option 3: If we need all of the above, plus persistence: We might need to migrate that to the same (or a similar) service to the one we will use for sessions.

I would very much prefer to be able to avoid the last option, if at all possible.
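
Note that whichever option we choose, MediaWiki resolves the main stash through $wgMainStash, so the eventual switch is a small configuration change pointing at a $wgObjectCaches entry (a sketch; the entry name here is illustrative, taken from later in this task):

// wmf-config sketch: point the main stash at a $wgObjectCaches entry
$wgMainStash = 'db-mainstash';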

Related Objects

19 resolved subtasks, assigned to: aaron (×5), Krinkle (×4), tstarling (×4), Marostegui (×2), jijiki, Eevans, Papaul, and jcrespo (×1, a production error).

Event Timeline


Next steps:

  • Decide on which database name(s) we need on the x2 cluster.
  • Create them.
  • Try connecting from MW CLI with --wiki=aawiki, and let it auto-create the objectcache table.
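
A minimal smoke test from the maintenance shell might look something like this (a sketch; the entry name 'db-mainstash' and the key are assumptions):

$ mwscript shell.php --wiki=aawiki

$cache = ObjectCache::getInstance( 'db-mainstash' ); // hypothetical entry name
// The first write should auto-create the objectcache table on x2
$cache->set( $cache->makeKey( 'x2-test', 'ping' ), 'pong', 60 );
$cache->get( $cache->makeKey( 'x2-test', 'ping' ) ); // expect "pong"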

For database names, I propose one of the following options:

  1. Named after the local wiki dbname, where each x2.##wiki database would have its own objectcache table.
    • Inspiration: Like externalstore text tables (a core per-wiki table, transparently hosted on a different cluster).
    • Inspiration: Like Echo and GrowthExperiments (on extension1, similarly uses per-wiki tables). See wikitech:MariaDB#x1.
    • Downside: This would make garbage collection significantly more complicated (requiring an additional level of indirection and iteration). It also complicates the wiki creation process (T158730), and adds general complexity from per-wiki separation. Fine if there's a good reason to, but not a good default strategy imho. Afaik all keys can be in the same table, same as we do today for parsercache, main cache (memc), and main stash (redis). Also, we'd still need a name for the x2 db where the global version of "objectcache" would reside.
  2. "mainstash".
    • Inspiration: Like parsercache, where the database is named after the logical cluster (e.g. pc1, pc2). For main stash there isn't an obvious name for the logical cluster right now. There's no numbering (yet), and historically it's not had a name, as there isn't a "db" name from the client perspective of Redis or Memcached.
    • Inspiration: Like centralauth (on s7), and flowdb and cognate (on x1), which are also cross-wiki features with a db named after the feature. See wikitech:MariaDB#x1.
  3. "wikishared".
    • Inspiration: Like extension1 and other cross-wiki concepts with MW in WMF prod.
    • Downside: Not unique. While this name is already used in multiple places, thus far for DB-related things we have (afaik) only used it for non-core tables of MW extensions, and so far they're all on the extension1 cluster. This means that for debugging purposes, there is (afaik?) only one obvious definition of what "the wikishared database" means, e.g. for sql wikishared. And there are also notable exceptions, e.g. centralauth has its own database on s7 (its tables are not under a db called "wikishared").
    • Upside: Not unique. If we choose to make "wikishared" the general name for cross-wiki tables that are fully part of the MW production schema but hosted externally, then re-using it would slightly simplify moving data from one database to another if we needed to. On the other hand, given this is not actually a concrete concept in the MW ecosystem, the db name needs to be configured per-feature either way. And it seems just as easy, if not easier, to move a whole single-table database than to move a table within a larger db. And having a unique name seems somewhat useful actually, in terms of documenting, debugging connections, etc.

I'm slightly leaning toward mainstash, with the table thus logically known as mainstash.objectcache on x2.

I would prefer either mainstash or wikishared, but I don't have any strong opinions about any of them.

I like "mainstash". If there is ever vertical sharding by extension, then "<group>stash" could be used as a DB name on separate clusters.

Change 752806 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] objectcache: add "globalKeyLbDomain" option to use with "globalKeyLB"

https://gerrit.wikimedia.org/r/752806

Change 752807 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[operations/mediawiki-config@master] Add "db-mainstash" entry to $wgObjectCaches

https://gerrit.wikimedia.org/r/752807

Change 752806 merged by jenkins-bot:

[mediawiki/core@master] objectcache: add "globalKeyLbDomain" option to use with "globalKeyLB"

https://gerrit.wikimedia.org/r/752806

Change 773657 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] objectcache: make "multiPrimaryMode" work with LB-based SqlBagOStuff instances

https://gerrit.wikimedia.org/r/773657

Change 773657 merged by jenkins-bot:

[mediawiki/core@master] objectcache: make "multiPrimaryMode" work with LB-based SqlBagOStuff

https://gerrit.wikimedia.org/r/773657

Change 779964 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] objectcache: Simplify docs of SqlBagOStuff 'purgePeriod' option

https://gerrit.wikimedia.org/r/779964

Change 779964 merged by jenkins-bot:

[mediawiki/core@master] objectcache: Simplify docs of SqlBagOStuff 'purgePeriod' option

https://gerrit.wikimedia.org/r/779964

Next: Decide on how and whether to fragment the data in mainstashdb, e.g. like parser cache, like external store, or something else. @aaron to propose some ideas for DBAs to provide feedback/guidance on.

Change 780903 had a related patch set uploaded (by Krinkle; author: Aaron Schulz):

[mediawiki/core@master] objectcache: remove "multiPrimaryMode" DB type assertion

https://gerrit.wikimedia.org/r/780903

Change 780903 merged by jenkins-bot:

[mediawiki/core@master] objectcache: remove "multiPrimaryMode" DB type assertion

https://gerrit.wikimedia.org/r/780903

> Next: Decide on how and whether to fragment the data in mainstashdb, e.g. like parser cache, like external store, or something else. @aaron to propose some ideas for DBAs to provide feedback/guidance on.

What I want to avoid is the kind of overhead that the parser cache has for purging expired blobs. The linked list used for overflow pages probably gets fragmented over time due to wildly varying blob sizes.

Change 798030 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Simplify SqlBagOStuff configuration

https://gerrit.wikimedia.org/r/798030

Change 798030 merged by jenkins-bot:

[mediawiki/core@master] objectcache: Simplify SqlBagOStuff class configuration

https://gerrit.wikimedia.org/r/798030

Here is the schema (just one table):

CREATE TABLE objectstash (
  -- Cache key, in the usual BagOStuff key format
  keyname VARBINARY(255) DEFAULT '' NOT NULL,
  -- Serialized value
  value MEDIUMBLOB DEFAULT NULL,
  -- Expiry as a 14-character MediaWiki timestamp (YYYYMMDDHHMMSS)
  exptime BINARY(14) NOT NULL,
  -- Modification token, used for last-write-wins conflict resolution
  -- when both DCs write (multiPrimaryMode)
  modtoken VARCHAR(22) DEFAULT '0000000000000000000000' NOT NULL,
  -- Serialization/compression flags
  flags INT UNSIGNED DEFAULT NULL,
  INDEX exptime (exptime),
  PRIMARY KEY(keyname)
) ENGINE=InnoDB COMMENT='MERGE_THRESHOLD=30';

Change 799433 had a related patch set uploaded (by Krinkle; author: Tim Starling):

[operations/mediawiki-config@master] Switch wgMainStash to db-mainstash

https://gerrit.wikimedia.org/r/799433

Change 752807 merged by jenkins-bot:

[operations/mediawiki-config@master] Add "db-mainstash" entry to $wgObjectCaches

https://gerrit.wikimedia.org/r/752807
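
The exact production values aren't shown in this task, but based on the option names in the patches above, the entry presumably looks something like this (a sketch; the load balancer name and purge value are assumptions, not the deployed config):

$wgObjectCaches['db-mainstash'] = [
	'class' => SqlBagOStuff::class,
	'globalKeyLB' => 'mainstash',       // assumed LB/section name
	'globalKeyLbDomain' => 'mainstash', // option added in change 752806
	'multiPrimaryMode' => true,         // change 773657: both DCs writable
	'purgePeriod' => 100,               // see change 779964; value assumed
];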

Change 802669 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Add MariaDB grants for x2

https://gerrit.wikimedia.org/r/802669

Change 802669 merged by Tim Starling:

[operations/puppet@production] Add MariaDB grants for x2

https://gerrit.wikimedia.org/r/802669

I created the database and table, applied the grants, and tested it from eval.php, testing the wikiadmin@eqiad and wikiuser2022@codfw grants. In eqiad, it seems to work. In codfw, reading works, but writing fails because the LB is configured with read-only mode. That's fine, so I'm ready to deploy the switchover commit. But I'll leave it until Monday to give others a chance to review.
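
For reference, the grant test was presumably something along these lines (a sketch; the actual eval.php commands aren't recorded in this task):

$cache = ObjectCache::getInstance( 'db-mainstash' );
$key = $cache->makeKey( 'x2-test', 'grants' );
$cache->set( $key, 'hello', 300 ); // fails in codfw, where the LB is read-only
$cache->get( $key );               // reads work from either DC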

I'm fully alone next week as the rest of the team is gone; can we enable this the following week instead of Monday?
Thanks.

OK, how about June 14, 05:00 UTC?

> OK, how about June 14, 05:00 UTC?

That would work. I will let you know for sure on Tuesday

Change 804024 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Switch wgMainStash back to Redis

https://gerrit.wikimedia.org/r/804024

Mark asked me to prepare a rollback plan which can be used to switch back to Redis if something goes wrong.

If it's immediately broken after deployment, then we can switch back without flushing Redis, but if we want to switch back some hours later, it seems best to flush Redis to avoid presenting stale data to MediaWiki.

Our use of Redis does not separate mainstash keys from those of other callers, and using a separate DB or Redis instance would create its own risks rather than being a safe fallback. Running a FLUSHALL command would wipe CentralAuth sessions, inconveniencing users: anyone without the "remember me" option would have to log in again. So I suggest only doing the flush if/when a rollback proves necessary.

The manual warns that FLUSHALL is "slow", presumably O(N), and will block all activity on the server while it completes (we are running 2.8, which has no ASYNC option). But the number of keys per server is only 0.8M to 1.3M, so I figure it couldn't take more than a few seconds.

To flush all relevant eqiad redis servers, on deploy1002 run mwscript shell.php --wiki=enwiki and paste the following into it:

$servers = [
	'10.64.0.125', '10.64.0.65', '10.64.16.21', '10.64.16.190',
	'10.64.32.153', '10.64.32.158', '10.64.48.91', '10.64.48.93'
];
foreach ( $servers as $host ) {
	$redis = new Redis();
	$redis->connect( $host );
	$redis->auth( $wmgRedisPassword ); // password is already in scope in the MW shell
	$redis->flushAll();
	// Print the INFO keyspace line, e.g. "keys=42,expires=40,avg_ttl=0"
	print $host . ": " . $redis->info()['db0'] . "\n";
}

This should show the number of remaining keys on each server after the flush, which should be a small number.

Then https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/804024 can be deployed.

I got the server list from deploy1002:/etc/nutcracker/nutcracker.yml. I tested all the lines of the script except the flushAll(), and I confirmed that the Redis config has no security measure which would prevent clients from running FLUSHALL.

Thank you Tim!
I will bring this up in our team meeting on Monday.

Change 799433 merged by jenkins-bot:

[operations/mediawiki-config@master] Switch wgMainStash to db-mainstash

https://gerrit.wikimedia.org/r/799433

Mentioned in SAL (#wikimedia-operations) [2022-06-14T05:11:30Z] <tstarling@deploy1002> Synchronized wmf-config/InitialiseSettings.php: T212129 Switch wgMainStash to db-mainstash (duration: 03m 38s)

Change 805266 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/AbuseFilter@master] Configure FilterProfiler cache separately

https://gerrit.wikimedia.org/r/805266

Change 805268 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Switch AbuseFilter profiler back to redis

https://gerrit.wikimedia.org/r/805268

Change 805160 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/AbuseFilter@wmf/1.39.0-wmf.15] Configure FilterProfiler cache separately

https://gerrit.wikimedia.org/r/805160

Change 805160 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@wmf/1.39.0-wmf.15] Configure FilterProfiler cache separately

https://gerrit.wikimedia.org/r/805160

Change 805266 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Configure FilterProfiler cache separately

https://gerrit.wikimedia.org/r/805266

Mentioned in SAL (#wikimedia-operations) [2022-06-14T06:20:26Z] <tstarling@deploy1002> Synchronized php-1.39.0-wmf.15/extensions/AbuseFilter/extension.json: T212129 (duration: 03m 32s)

Mentioned in SAL (#wikimedia-operations) [2022-06-14T06:24:19Z] <tstarling@deploy1002> Synchronized php-1.39.0-wmf.15/extensions/AbuseFilter/includes/ServiceWiring.php: T212129 (duration: 03m 33s)

Change 805268 merged by jenkins-bot:

[operations/mediawiki-config@master] Switch AbuseFilter profiler back to redis

https://gerrit.wikimedia.org/r/805268

Mentioned in SAL (#wikimedia-operations) [2022-06-14T06:28:26Z] <tstarling@deploy1002> Synchronized wmf-config/InitialiseSettings.php: T212129 (duration: 03m 31s)

Change 805361 had a related patch set uploaded (by Jforrester; author: Tim Starling):

[mediawiki/extensions/AbuseFilter@wmf/1.39.0-wmf.16] Configure FilterProfiler cache separately

https://gerrit.wikimedia.org/r/805361

Change 805361 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@wmf/1.39.0-wmf.16] Configure FilterProfiler cache separately

https://gerrit.wikimedia.org/r/805361

Metrics on db1151 look fine. Disk space usage on db1151 is growing at a rate of 5.9 GB per day, and there is 8.6 TB available, implying exhaustion in about 4 years (8.6 TB ÷ 5.9 GB/day ≈ 1,460 days) if it keeps growing at the same rate, which it is not expected to do. I think this is done.

Change 804024 abandoned by Tim Starling:

[operations/mediawiki-config@master] Switch wgMainStash back to Redis

Reason:

not needed

https://gerrit.wikimedia.org/r/804024

The amount of binlogs per day is also fine (not like parsercache, which generates an insane amount of binlogs).