
Re-architecture mainstash (x2) to allow easier maintenance
Closed, ResolvedPublic

Description

The idea is to make mainstash (currently x2) similar to PC: T373037: Make ParserCache more like a ring

Currently, for mainstash we have x2, which is one primary plus two idle replicas per datacenter. It is an extremely painful part of the infrastructure from the DBAs' perspective. So much so that currently, when we need to do reboots, we just reboot the servers live, since that causes fewer user-facing errors than doing a switchover.

The proposal is to introduce three sections: ms1, ms2 and ms3, each having a primary db only (per dc). MediaWiki would split the writes at random, but it would write each key to 2 out of the three. In case we need to do maintenance (reboots or hardware issues), we depool one section and let the keys fail over to the other sections. When doing the reads, MediaWiki would also read from two hosts, and if one copy is missing or has a lower exptime, it gets ignored. That prevents full loss of data in case we depool a section.
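A minimal sketch of this scheme (all names here are invented for illustration; the actual MediaWiki implementation differs): each key is written to 2 of the 3 sections, reads consult the same 2, and the copy with the highest exptime wins.

```python
import hashlib
import time

# Hypothetical sketch of the 2-out-of-3 scheme; names are invented and the
# real SqlBagOStuff implementation may differ.
SECTIONS = ["ms1", "ms2", "ms3"]

def pick_sections(key, pooled=None, redundancy=2):
    """Deterministically pick `redundancy` pooled sections for a key."""
    pooled = SECTIONS if pooled is None else pooled
    # Rank sections by a per-(key, section) hash so keys spread evenly.
    ranked = sorted(pooled,
                    key=lambda s: hashlib.sha1(f"{key}:{s}".encode()).hexdigest())
    return ranked[:redundancy]

class MainStash:
    def __init__(self):
        # section -> {key: (value, exptime)}
        self.stores = {s: {} for s in SECTIONS}
        self.pooled = list(SECTIONS)

    def set(self, key, value, ttl):
        exptime = time.time() + ttl
        for s in pick_sections(key, self.pooled):
            self.stores[s][key] = (value, exptime)

    def get(self, key):
        copies = [self.stores[s][key]
                  for s in pick_sections(key, self.pooled)
                  if key in self.stores[s]]
        if not copies:
            return None
        # A copy that is missing or has a lower exptime gets ignored.
        value, _ = max(copies, key=lambda c: c[1])
        return value
```

Depooling a section just shrinks the pooled list; since every key has a second copy on another section, reads keep working.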

There will still be failure scenarios, such as when the TTL of a key changes or is set to indefinite, or when we have to depool two sections, etc., but all of them seem to be well within the error budget.

Event Timeline

Marostegui moved this task from Triage to Refine on the DBA board.
  • I don't mind having 2 sections as long as they are not in the current architecture (multi-master). The difference between operating 2 or 3, if they aren't multi-master, should not be hard, so if 3 is safer, we should go for 3. We have the hardware anyway.
  • I think not using mysql for this is probably the way to go.
  • I wouldn't want to have VMs for a production service, at least not yet. I know we don't care about the data, but is our tooling ready for all this? We don't need HW for parsercache; with the current hardware, and removing the spares, we can set up pc6 and pc7. If we feel we need more parsercache, we should budget for it next FY.
  • I don't mind having 2 sections as long as they are not in the current architecture (multi-master). The difference between operating 2 or 3, if they aren't multi-master, should not be hard, so if 3 is safer, we should go for 3. We have the hardware anyway.

That's actually the reason I'm suggesting to have multiple sections. We should keep multi-master, but if we can have multiple sections, then we can depool one section altogether and do anything we want. Basically the exact same architecture as ParserCache right now. Besides the fact that it'd be simpler for us, it wouldn't be its own one-off architecture type anymore.

  • I think not using mysql for this is probably the way to go.

Maybe Cassandra?

  • I wouldn't want to have VMs for a production service, at least not yet. I know we don't care about the data, but is our tooling ready for all this?

In the long run, I really want to use VMs as much as possible for small non-critical databases. From the application's perspective (MariaDB, most of the tools), HW vs VM should be fully transparent. Do you have anything specific in mind? I agree we shouldn't start with x2 though.

We don't need HW for parsercache; with the current hardware, and removing the spares, we can set up pc6 and pc7. If we feel we need more parsercache, we should budget for it next FY.

To me, it's more of a nice-to-have. Not really a "we need it, otherwise things will go down". It'd reduce the latency bumps when we depool. It doesn't matter much here though.

  • I don't mind having 2 sections as long as they are not in the current architecture (multi-master). The difference between operating 2 or 3, if they aren't multi-master, should not be hard, so if 3 is safer, we should go for 3. We have the hardware anyway.

That's actually the reason I'm suggesting to have multiple sections. We should keep multi-master, but if we can have multiple sections, then we can depool one section altogether and do anything we want. Basically the exact same architecture as ParserCache right now. Besides the fact that it'd be simpler for us, it wouldn't be its own one-off architecture type anymore.

Excellent, I like this solution.

  • I think not using mysql for this is probably the way to go.

Maybe Cassandra?

Should we ping Eric about this and see whether it makes sense there?

  • I wouldn't want to have VMs for a production service, at least not yet. I know we don't care about the data, but is our tooling ready for all this?

In the long run, I really want to use VMs as much as possible for small non-critical databases. From the application's perspective (MariaDB, most of the tools), HW vs VM should be fully transparent. Do you have anything specific in mind? I agree we shouldn't start with x2 though.

I don't know how the current VMs are set up; my worry is mostly:

  • VLANs, do we need changes there?
  • FWs, do we need to change things?
  • Grants? (which come from the above)
  • IPs

I think going for VMs is fine, but they need to be databases that:

  • Can be completely erased without any problem
  • Can be down for days
  • Are not I/O bound
  • Don't require large memory buffers.

We don't need HW for parsercache; with the current hardware, and removing the spares, we can set up pc6 and pc7. If we feel we need more parsercache, we should budget for it next FY.

To me, it's more of a nice-to-have. Not really a "we need it, otherwise things will go down". It'd reduce the latency bumps when we depool. It doesn't matter much here though.

Yeah, if this can happen great, but it shouldn't be our motivation for it.

So fixing this is a quarterly goal for me.

Here is a brain dump.

I looked around a bit. A lot of this data is quite close to canonical data, so we can't really pull off a ParserCache-like approach there.

Here are the most common types:

cumin2024@db1151.eqiad.wmnet[mainstash]> select SUBSTRING_INDEX(SUBSTRING_INDEX(keyname, ':', 2), ':', -1) as type_, count(*) from objectstash group by type_ order by count(*) desc limit 50;
+----------------------------------------+----------+
| type_                                  | count(*) |
+----------------------------------------+----------+
| ResourceLoaderModule-dependencies      |  4055367 |
| ParsoidOutputStash                     |   517740 |
| EditResult                             |    69703 |
| metawiki                               |    50454 |
| loginnotify                            |    23620 |
| uploadstatus                           |    10629 |
| page-recent-delete                     |     7953 |
| watchlist-recent-updates               |     7033 |
| abusefilter                            |     6884 |
| GrowthExperiments                      |      554 |
| twoColConflict_yourText                |      208 |
| FileImporter\Services\SuccessCache     |      160 |
| translate-translator-activity-v4       |       76 |
| oathauth-totp                          |       18 |
| WikimediaCampaignEvents-WikiProjectIDs |       16 |
| phonos                                 |       15 |
| Translate-MessageIndex-interim         |        1 |
| BagOStuff-long-key                     |        1 |
+----------------------------------------+----------+
18 rows in set (2.020 sec)

(metawiki is a badly named key for OAuth: OAUTH:metawiki:callback:...)

I looked around for Cassandra too, but back when people looked into it, it wasn't possible, since Cassandra basically keeps the equivalent of a "binlog" and it would become too big for the cluster given the rate of writes.

What would work though is the ring replication paradigm. Basically have three clusters (each having 1 host per dc), force MediaWiki to write to all three local clusters on write and read from all three on read. If there is a disagreement, pick the value with the largest exptime (that assumes we don't change the TTL of the same key, which is not correct 100% of the time, but it's rare, and combined with the rare event of a disagreement, it becomes negligible).

The maintenance will be similar to parser cache: just depool the whole cluster. In that case, once repooled, if the key has been changed in the meantime, the old value will be ignored.
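The reconciliation rule above can be sketched as follows (a hypothetical helper, not actual MediaWiki code):

```python
def resolve(copies):
    """Pick the winning copy among replica reads: the one with the largest
    exptime wins, so a stale value from a freshly repooled section loses.
    A `None` entry stands for a depooled or missing replica."""
    live = [c for c in copies if c is not None]
    if not live:
        return None
    return max(live, key=lambda c: c["exptime"])["value"]
```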

Open questions:

  • Can we delegate this to MySQL itself? I know there are ways to do ring replication in MySQL natively, but even if we set it up, we will end up in the same situation anyway.
  • I haven't looked into the feasibility of implementing my idea in the MediaWiki code. If, for example, the MainStash code uses SqlBagOStuff, making it work would be quite "fun".
  • Horizontal scalability. Currently x2 gets 1.2K connections every second. In comparison, pc2 takes 125 per second. A codfw replica in s8 gets ~300 connections per second. This is not really sustainable and probably should be split (or moved out of main stash), but OTOH, I want to improve the system so we can add more stuff to it.
    • Impact on latency: if all user requests open a connection to x2, adding two more connections to every request will cause a non-negligible latency hit to our systems. One thing to consider is to have two sections only.
  • x2 will be so different that I suggest renaming it to ms1, ms2 and ms3 instead.

What do you think @Marostegui ?

Change #1119745 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] objectcache: Introduce strongReadAndWrite to SqlBagOStuff

https://gerrit.wikimedia.org/r/1119745

The above patch implements my suggestion. Writing tests for it took way more time than expected.

Change #1119872 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] objectcache: Introduce dataRedundancy option for SqlBagOStuff

https://gerrit.wikimedia.org/r/1119872

Open questions:
...

  • Horizontal scalability. Currently x2 gets 1.2K connections every second. In comparison, pc2 takes 125 per second. A codfw replica in s8 gets ~300 connections per second. This is not really sustainable and probably should be split (or moved out of main stash), but OTOH, I want to improve the system so we can add more stuff to it.
    • Impact on latency: if all user requests open a connection to x2, adding two more connections to every request will cause a non-negligible latency hit to our systems. One thing to consider is to have two sections only.

I was randomly thinking about this yesterday and got the idea of an n-out-of-m scheme that would fix the horizontal scalability issue and improve load balancing too. In our production, we will have 3 clusters (ms1, ms2, ms3) and then we set the data redundancy to 2. That way, main stash picks, for each key, two out of the three hosts based on consistent hashing. The hit to latency will be minimal (two connections instead of one) and the load gets distributed among the three clusters (each host will get 2/3 of the current x2 master load). Maintenance won't be any different from what was proposed, the failure scenarios shrink a bit, etc. This is again a similar architecture to other NoSQL databases, with stronger consistency guarantees.
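A quick sketch of how this kind of per-key consistent hashing spreads the load (rendezvous-style hashing is one common way to do it; the real implementation may hash differently, and all names here are invented): with redundancy 2 over 3 sections, each section ends up holding roughly 2/3 of the keys.

```python
import hashlib

SECTIONS = ["ms1", "ms2", "ms3"]

def sections_for_key(key, redundancy=2):
    # Rendezvous (highest-random-weight) hashing: rank sections by a
    # per-(key, section) hash and take the top `redundancy`.
    ranked = sorted(SECTIONS,
                    key=lambda s: hashlib.sha1(f"{key}:{s}".encode()).hexdigest())
    return ranked[:redundancy]

# Each key lands on a stable pair of sections, and across many keys every
# section holds about 2/3 of them, i.e. ~2/3 of the current x2 master load.
counts = {s: 0 for s in SECTIONS}
for i in range(30000):
    for s in sections_for_key(f"key:{i}"):
        counts[s] += 1
```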

Still, I made it a separate patch in case people don't like it.

Change #1119872 abandoned by Ladsgroup:

[mediawiki/core@master] objectcache: Introduce dataRedundancy option for SqlBagOStuff

Reason:

Merged to the parent patch

https://gerrit.wikimedia.org/r/1119872

What would be the user impact of depools?

At a quick skim, the main stash is used:

  • in core:
    • for the parser cache
    • for the Parsoid output stash
    • for the edit stash
    • for the ResourceLoader dependency store
    • for the "last seen" timestamps in the watchlist
    • to show logs on the "missing page" view when the page was deleted recently
    • for tracking chunked upload progress
  • in OAuth for storing tokens (OAuth 1 request tokens / OAuth 2 auth codes) between the steps of the authorization process and to prevent replay attacks, and also in OAuth 2 to store refresh tokens
  • in Phonos for storing error messages for the user during processing
  • in Translate for a lot of things: to cache messages and translator activity, to track progress during import, and to let batch delete jobs know when they are at the last item in the batch
  • in TwoColConflict for passing information from the page save attempt to the subsequent pageview (I think?)
  • in AbuseFilter for storing autopromotion blocks, and also for disabling filters which receive too many hits
  • in FileImporter for storing error messages for the user during processing
  • in GrowthExperiments for storing reports and pending notifications to mentors about their mentees
  • in LoginNotify to store the user's last known IP, and failed login count
  • in OATHAuth to prevent replay attacks
  • in WikimediaCampaignEvents to cache some sort of WDQS query (which seems like it will throw an exception when the stashed data is missing, but then regenerate it asynchronously)

Most of that seems like it would either automatically regenerate with minimal disruption, or non-critically degrade some operations which were happening right during the depool.
The maybe slightly more disruptive ones: watchlist notifications are sent twice; LoginNotify will generate some bogus warnings; OAuth and OATHAuth become vulnerable to replay attacks for a few seconds; AbuseFilter will fail to prevent autopromotion of blocked users; OAuth 2 apps will stop working until the user visits again and reauthenticates.
The last one seems like potentially the biggest deal but in practice I don't think many apps use OAuth 2 and refresh tokens. (That said storing refresh tokens in the main stash seems like a bad idea. See also T336113.)

Some use cases are not impossible to regenerate, but regeneration is expensive for users. For example, in the RL case, losing the data changes the hash and will lead to a lot of RL modules being evicted client-side. But as you said, none actually cause any major issues.

The data redundancy should take care of the majority of cases. There are a lot of edge cases left, but that's fine. It will also allow us to get rid of modtoken, which would simplify the code base a lot.

  • Can we delegate this to MySQL itself? I know there are ways to do ring replication in MySQL natively, but even if we set it up, we will end up in the same situation anyway.

I wouldn't want to introduce another "technology" for this: it is already a snowflake, and that would require more testing and possibly changes to our tooling. I'd much prefer the architecture we have for parsercache. If this needs to live in MariaDB, of course.

  • I haven't looked into the feasibility of implementing my idea in the MediaWiki code. If, for example, the MainStash code uses SqlBagOStuff, making it work would be quite "fun".

Can't the parsercache code be reused for this somehow?

  • Horizontal scalability. Currently x2 gets 1.2K connections every second. In comparison, pc2 takes 125 per second. A codfw replica in s8 gets ~300 connections per second. This is not really sustainable and probably should be split (or moved out of main stash), but OTOH, I want to improve the system so we can add more stuff to it.

At the moment it seems to handle it fine (keep in mind that we have 3 hosts per DC but we only use 1). Of course this wouldn't be sustainable in the long term, but I'd assume that if we split into 3, the load would be split too? Otherwise we are going to need more hardware for this in the future.

    • Impact on latency: if all user requests open a connection to x2, adding two more connections to every request will cause a non-negligible latency hit to our systems. One thing to consider is to have two sections only.
  • x2 will be so different that I suggest renaming it to ms1, ms2 and ms3 instead.

I am ok with this.

Change #1123447 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Add config needed to re-architecture mainstash away from x2

https://gerrit.wikimedia.org/r/1123447

Change #1125556 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Migrate x2 off LB config

https://gerrit.wikimedia.org/r/1125556

Change #1119745 merged by jenkins-bot:

[mediawiki/core@master] objectcache: Introduce dataRedundancy to SqlBagOStuff

https://gerrit.wikimedia.org/r/1119745

Change #1125556 merged by jenkins-bot:

[operations/mediawiki-config@master] Migrate x2 off LB config

https://gerrit.wikimedia.org/r/1125556

Mentioned in SAL (#wikimedia-operations) [2025-03-24T10:14:58Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1125556|Migrate x2 off LB config (T383327 T387654)]]

Change #1130542 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] etcd: Make Mainstash config global variable

https://gerrit.wikimedia.org/r/1130542

Change #1130542 merged by jenkins-bot:

[operations/mediawiki-config@master] etcd: Make Mainstash config global variable

https://gerrit.wikimedia.org/r/1130542

Mentioned in SAL (#wikimedia-operations) [2025-03-24T10:21:02Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1125556|Migrate x2 off LB config (T383327 T387654)]], [[gerrit:1130542|etcd: Make Mainstash config global variable (T383327 T387654)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-24T10:26:01Z] <ladsgroup@deploy1003> ladsgroup: Backport for [[gerrit:1125556|Migrate x2 off LB config (T383327 T387654)]], [[gerrit:1130542|etcd: Make Mainstash config global variable (T383327 T387654)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-24T10:38:42Z] <ladsgroup@deploy1003> Finished scap sync-world: Backport for [[gerrit:1125556|Migrate x2 off LB config (T383327 T387654)]], [[gerrit:1130542|etcd: Make Mainstash config global variable (T383327 T387654)]] (duration: 17m 39s)

Change #1130558 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Enable dataRedundancy for mainstash

https://gerrit.wikimedia.org/r/1130558

Change #1123447 abandoned by Ladsgroup:

[operations/mediawiki-config@master] Add config needed to re-architecture mainstash away from x2

Reason:

Split into several patches

https://gerrit.wikimedia.org/r/1123447

Change #1130558 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable dataRedundancy for mainstash

https://gerrit.wikimedia.org/r/1130558

Mentioned in SAL (#wikimedia-operations) [2025-03-24T11:46:45Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1130558|Enable dataRedundancy for mainstash (T383327)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-24T11:51:26Z] <ladsgroup@deploy1003> ladsgroup: Backport for [[gerrit:1130558|Enable dataRedundancy for mainstash (T383327)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-24T12:02:28Z] <ladsgroup@deploy1003> Finished scap sync-world: Backport for [[gerrit:1130558|Enable dataRedundancy for mainstash (T383327)]] (duration: 15m 43s)

Change #1130573 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Remove x2

https://gerrit.wikimedia.org/r/1130573

Change #1130573 merged by jenkins-bot:

[operations/mediawiki-config@master] Remove x2

https://gerrit.wikimedia.org/r/1130573

Mentioned in SAL (#wikimedia-operations) [2025-03-24T12:51:24Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1130573|Remove x2 (T383327)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-24T12:55:58Z] <ladsgroup@deploy1003> ladsgroup: Backport for [[gerrit:1130573|Remove x2 (T383327)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-24T13:04:31Z] <ladsgroup@deploy1003> Finished scap sync-world: Backport for [[gerrit:1130573|Remove x2 (T383327)]] (duration: 13m 07s)

Ladsgroup claimed this task.
Ladsgroup moved this task from In progress to Done on the DBA board.

This is completed I believe?

Yup!