
Create cluster32 and cluster33 in existing es6 and es7 hosts
Closed, Resolved (Public)

Description

This request is similar to T342685, but for the new clusters: split the logical (not physical) read-write active databases so that backup and, more importantly, recovery go from hours to minutes. In theory, all the automation was done for that task, so it should be much easier this time. Still, it is relatively dangerous maintenance and should not be done lightly.

Tentative checklist (feel free to correct):

Based on comments by Amir, we may want to split the existing available space into 1/3 of the total space each time (every ~1-2 years).

Currently, backups take 9h30m to run, which is close to the 12h threshold of the alert we set up in T346233. We want to do this primarily because it is needed, not because of the alert (which is there in case we forget). This is not urgent yet, so I am filing it well in advance.
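To make the margin concrete, the headroom before the alert fires can be worked out from the two figures above. This is only a sketch of the arithmetic: the function name `alert_headroom` is hypothetical and is not part of the backup tooling or the alert itself.

```python
from datetime import timedelta


def alert_headroom(duration: timedelta, threshold: timedelta) -> tuple[timedelta, float]:
    """Return (time left before the alert fires, fraction of the threshold used)."""
    return threshold - duration, duration / threshold


backup_duration = timedelta(hours=9, minutes=30)  # current run time quoted in this task
alert_threshold = timedelta(hours=12)             # alert threshold from T346233

remaining, used = alert_headroom(backup_duration, alert_threshold)
print(f"headroom: {remaining}, threshold used: {used:.0%}")
# prints "headroom: 2:30:00, threshold used: 79%"
```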

Event Timeline

Marostegui triaged this task as Medium priority. Mar 31 2026, 8:13 AM
Ladsgroup moved this task from Ready to In progress on the DBA board.

Just created the tables. The script needs some changes, which I will make.

Change #1268549 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[operations/mediawiki-config@master] ExternalStore: Start reading and writing from clusters 32 and 33

https://gerrit.wikimedia.org/r/1268549

Change #1268549 merged by jenkins-bot:

[operations/mediawiki-config@master] ExternalStore: Start reading and writing from clusters 32 and 33

https://gerrit.wikimedia.org/r/1268549

Mentioned in SAL (#wikimedia-operations) [2026-04-13T16:56:48Z] <ladsgroup@deploy1003> Started scap sync-world: Backport for [[gerrit:1268549|ExternalStore: Start reading and writing from clusters 32 and 33 (T421729)]]

Mentioned in SAL (#wikimedia-operations) [2026-04-13T16:58:24Z] <ladsgroup@deploy1003> ladsgroup: Backport for [[gerrit:1268549|ExternalStore: Start reading and writing from clusters 32 and 33 (T421729)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-04-13T17:03:31Z] <ladsgroup@deploy1003> Finished scap sync-world: Backport for [[gerrit:1268549|ExternalStore: Start reading and writing from clusters 32 and 33 (T421729)]] (duration: 06m 43s)

I will update the docs and co tomorrow.

Change #1270947 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/extensions/WikimediaMaintenance@master] make-all-blobs: Fix path

https://gerrit.wikimedia.org/r/1270947

Change #1270947 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMaintenance@master] make-all-blobs: Fix path

https://gerrit.wikimedia.org/r/1270947

@Ladsgroup could I reopen this and assign it to me for the changes needed on the backup side?

Ladsgroup reassigned this task from Ladsgroup to jcrespo.

Sure!

Change #1271728 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Perform a ro backup & start backing up only the latest 2 clusters

https://gerrit.wikimedia.org/r/1271728

Change #1271730 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Backup only regularly clusters 32 & 33, the read-write ones

https://gerrit.wikimedia.org/r/1271730

Change #1271728 merged by Jcrespo:

[operations/puppet@production] dbbackups: Perform a ro backup & start backing up only the latest 2 clusters

https://gerrit.wikimedia.org/r/1271728

New backups worked: they went from almost 10h and 2.2TB to 8 minutes and 23GB.
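As a rough illustration of the scale of that improvement, the reduction factors can be computed from the figures above. This sketch rounds "almost 10h" up to 10h and assumes 1 TB = 1000 GB; `reduction_factor` is a hypothetical helper name.

```python
def reduction_factor(before: float, after: float) -> float:
    """How many times smaller `after` is than `before` (same units for both)."""
    return before / after


# Durations in minutes: "almost 10h" is rounded up to 10h (assumption).
duration_factor = reduction_factor(10 * 60, 8)

# Sizes in GB: assumes 1 TB = 1000 GB.
size_factor = reduction_factor(2.2 * 1000, 23)

print(f"duration: ~{duration_factor:.0f}x shorter, size: ~{size_factor:.0f}x smaller")
```

Under these assumptions, both the run time and the backup size shrink by nearly two orders of magnitude.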

I will merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1271730 and archive the old backups into the read only section before closing this.

Change #1271730 merged by Jcrespo:

[operations/puppet@production] dbbackups: Backup only regularly clusters 32 & 33, the read-write ones

https://gerrit.wikimedia.org/r/1271730

jcrespo reassigned this task from jcrespo to Ladsgroup.

This is now tested. I will do the archiving in another task, as part of decommissioning backup1003/backup2003, to prevent duplicate work.