
Move Wikidata term store to separate database cluster
Closed, ResolvedPublic

Assigned To
None
Authored By
Lucas_Werkmeister_WMDE
Nov 22 2023, 2:52 PM
Referenced Files
F61409029: grafik.png
Jun 3 2025, 10:21 PM
F59975185: grafik.png
May 14 2025, 5:08 PM
F59975169: grafik.png
May 14 2025, 5:08 PM
F59975164: grafik.png
May 14 2025, 5:08 PM
F59975124: grafik.png
May 14 2025, 5:08 PM

Description

“Currently term store is reaching 340GB in wikidata and slowly reaching the wb_terms era”, so @Ladsgroup wants to “[split] s8 into a core cluster and a dedicated cluster for term store (tentatively called x3)”.

This is the general task to achieve that; T351802: Wikibase: Introduce separate database configuration for term store covers the necessary code changes in Wikibase; the Wikimedia production / operations / DBA side can happen either in this task or in additional subtasks. (Feel free to edit this task as needed.)

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1138714 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Add support for x3 db cluster

https://gerrit.wikimedia.org/r/1138714

Change #1138714 merged by jenkins-bot:

[operations/mediawiki-config@master] Add support for x3 db cluster

https://gerrit.wikimedia.org/r/1138714

Mentioned in SAL (#wikimedia-operations) [2025-04-24T12:21:36Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1138714|Add support for x3 db cluster (T351820)]]

Mentioned in SAL (#wikimedia-operations) [2025-04-24T12:26:34Z] <ladsgroup@deploy1003> ladsgroup: Backport for [[gerrit:1138714|Add support for x3 db cluster (T351820)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-04-24T12:36:03Z] <ladsgroup@deploy1003> Finished scap sync-world: Backport for [[gerrit:1138714|Add support for x3 db cluster (T351820)]] (duration: 14m 28s)

Change #1145844 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] Move production term store traffic to x3

https://gerrit.wikimedia.org/r/1145844

Change #1145844 merged by jenkins-bot:

[operations/mediawiki-config@master] Move production term store traffic to x3

https://gerrit.wikimedia.org/r/1145844

Mentioned in SAL (#wikimedia-operations) [2025-05-14T11:21:03Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1145844|Move production term store traffic to x3 (T351820)]]

Mentioned in SAL (#wikimedia-operations) [2025-05-14T11:27:27Z] <ladsgroup@deploy1003> ladsgroup: Backport for [[gerrit:1145844|Move production term store traffic to x3 (T351820)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-05-14T11:41:51Z] <ladsgroup@deploy1003> Finished scap sync-world: Backport for [[gerrit:1145844|Move production term store traffic to x3 (T351820)]] (duration: 20m 48s)

Mentioned in SAL (#wikimedia-operations) [2025-05-14T11:47:25Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db2243 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76142 and previous config saved to /var/cache/conftool/dbconfig/20250514-114724-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-14T14:53:37Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db2181 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76153 and previous config saved to /var/cache/conftool/dbconfig/20250514-145336-ladsgroup.json

If all goes well, tomorrow I'll remove one replica from each cluster in eqiad.

This is a replica that was serving both s8 and x3 being turned into x3-only: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=2025-05-14T05:23:16.577Z&to=2025-05-14T16:57:22.981Z&timezone=utc&var-job=$__all&var-server=db2243&var-port=9104&refresh=1m

grafik.png (270×949 px, 51 KB)

InnoDB buffer pool efficiency had a pretty decent bump immediately.

This is the other way around (removing x3 from a general-purpose replica): https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=2025-05-14T14:25:30.672Z&to=2025-05-14T17:00:09.723Z&timezone=utc&var-job=$__all&var-server=db2181&var-port=9104&refresh=1m

InnoDB buffer pool efficiency actually went down (because a lot of the hot data is no longer being read from this host), but all the other metrics are much healthier now:

grafik.png (270×949 px, 45 KB)

grafik.png (270×949 px, 55 KB)

grafik.png (270×949 px, 53 KB)
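For context on the "buffer pool efficiency" numbers above: this is a sketch, assuming the dashboard derives efficiency as the standard InnoDB hit ratio from the `Innodb_buffer_pool_read_requests` and `Innodb_buffer_pool_reads` status counters. The counter values below are illustrative, not taken from these hosts.

```python
def buffer_pool_hit_ratio(read_requests: int, disk_reads: int) -> float:
    """Fraction of logical page reads served from the buffer pool.

    read_requests: Innodb_buffer_pool_read_requests (logical reads)
    disk_reads:    Innodb_buffer_pool_reads (reads that had to hit disk)
    """
    if read_requests == 0:
        return 1.0  # no reads at all: nothing missed the pool
    return 1.0 - disk_reads / read_requests

# Illustrative numbers only: a replica serving two working sets (s8 + x3)
# misses more often than one whose single working set fits in memory.
combined_replica = buffer_pool_hit_ratio(10_000_000, 250_000)  # 0.975
dedicated_replica = buffer_pool_hit_ratio(10_000_000, 50_000)  # 0.995
```

This also explains the counter-intuitive drop on db2181: once x3's hot reads stop landing there, the ratio can fall even though the host is doing less total work and every other metric improves.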

Mentioned in SAL (#wikimedia-operations) [2025-05-15T04:53:45Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db1256 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76157 and previous config saved to /var/cache/conftool/dbconfig/20250515-045345-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-15T04:56:32Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db1192 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76158 and previous config saved to /var/cache/conftool/dbconfig/20250515-045631-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-16T11:19:52Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db1214 from x3, remove db1257 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76261 and previous config saved to /var/cache/conftool/dbconfig/20250516-111952-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-16T11:23:45Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db2242 from x3, remove db2154 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76262 and previous config saved to /var/cache/conftool/dbconfig/20250516-112345-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-16T13:54:39Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db2166 and db1177 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76270 and previous config saved to /var/cache/conftool/dbconfig/20250516-135438-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-19T10:00:00Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db1172 and db2164 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76291 and previous config saved to /var/cache/conftool/dbconfig/20250519-100000-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-19T10:26:16Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db1178 and db2165 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76292 and previous config saved to /var/cache/conftool/dbconfig/20250519-102615-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-19T10:50:14Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db1226 and db2163 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76293 and previous config saved to /var/cache/conftool/dbconfig/20250519-105013-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-19T11:54:11Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db1167 and db2152 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76297 and previous config saved to /var/cache/conftool/dbconfig/20250519-115411-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-19T13:12:55Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db1203 and db2162 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76304 and previous config saved to /var/cache/conftool/dbconfig/20250519-131254-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-19T13:26:11Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db1211 from s8, move db2162 from s8 to x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76306 and previous config saved to /var/cache/conftool/dbconfig/20250519-132610-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-19T22:02:01Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db1209 and db2195 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76316 and previous config saved to /var/cache/conftool/dbconfig/20250519-220201-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-19T22:04:33Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db1255 and db2241 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76317 and previous config saved to /var/cache/conftool/dbconfig/20250519-220432-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-20T10:59:37Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db2167 from x3 (T351820)', diff saved to https://phabricator.wikimedia.org/P76333 and previous config saved to /var/cache/conftool/dbconfig/20250520-105937-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-05-20T11:02:15Z] <ladsgroup@cumin1002> dbctl commit (dc=all): 'Remove db1258 from s8 (T351820)', diff saved to https://phabricator.wikimedia.org/P76334 and previous config saved to /var/cache/conftool/dbconfig/20250520-110214-ladsgroup.json

I have now split the replicas: each replica should be in either s8 or x3, but not both. This has already made the database cache (InnoDB buffer pool) much more efficient.

Remaining todo:

  • Double check everything.
  • db2162 and db1211 must be moved under the future x3 primary; currently they serve x3 traffic only but replicate directly from the s8 primary.
  • Move the future primaries out of the replicas list.
  • Leave it for a bit to make sure the traffic can be handled properly
  • Determine a candidate master in each dc.
  • More?

There's still work pending in T393989 and T390530, which mostly depends on:

  1. New hardware
  2. Splitting the writes

We should probably go with SBR (statement-based replication) for x3 so we don't have to worry about candidate masters (T383795). I can do that in a couple of hours if that's fine too.

Regarding the new hardware: since it's codfw, I don't think it's a blocker; the other codfw replicas are quite under-utilized at the moment and have capacity. Of course, if we have to do an emergency dc switchover, that's going to be fun. I'll leave that to you. Maybe the new host is almost up and ready anyway.

Where is the best place on Grafana to see the shift in requests from the existing wikidata shard onto x3?

Can you provide me the name of the intended x3 primaries on each dc? My guesses, based on orchestrator would be db1255.eqiad.wmnet:3306 & db2241.codfw.wmnet:3306 but I would prefer a confirmation.

Yes. That's correct.

Change #1151692 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1211,db2162: Move them under x3

https://gerrit.wikimedia.org/r/1151692

Change #1151692 merged by Marostegui:

[operations/puppet@production] db1211,db2162: Move them under x3

https://gerrit.wikimedia.org/r/1151692

Mentioned in SAL (#wikimedia-operations) [2025-05-28T14:01:41Z] <marostegui> Set s8 (wikidata) as RO to split x3 from it T351820

Mentioned in SAL (#wikimedia-operations) [2025-05-28T14:04:42Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set s8 (wikidata) as RO T351820', diff saved to https://phabricator.wikimedia.org/P76616 and previous config saved to /var/cache/conftool/dbconfig/20250528-140441-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2025-05-28T14:27:45Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set s8 RW T351820', diff saved to https://phabricator.wikimedia.org/P76626 and previous config saved to /var/cache/conftool/dbconfig/20250528-142745-marostegui.json

The split has been done; x3 is now serving traffic (reads and writes).

Change #1151829 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] site.pp: Move x3 codfw hosts

https://gerrit.wikimedia.org/r/1151829

Change #1151829 merged by Marostegui:

[operations/puppet@production] site.pp: Move x3 codfw hosts

https://gerrit.wikimedia.org/r/1151829

I've dropped the term store tables on db2181 and db1172 (s8) and the non-term-store tables from db1211 and db2243 (x3). If you see errors, let me know.

Mentioned in SAL (#wikimedia-operations) [2025-06-03T14:01:33Z] <Amir1> dropping term store tables from s8 (T351820)

I dropped the term store tables in s8, except on the sanitarium master.

Mentioned in SAL (#wikimedia-operations) [2025-07-08T09:53:10Z] <Amir1> dropping term store tables on s8 sanitarium master (T351820)

Is x3 accessible via either 1) Quarry or 2) the stat / analytics clusters?
I can see that analytics-mysql on the stat hosts has a --use-x1 option, but nothing for x3.