This is a similar request to T342685, but for the new clusters: split the logical (not physical) read-write active databases so that backup and (more importantly) recovery go from hours to minutes. In theory, all the automation was done there, so it should be much easier this time. Still, it is a relatively dangerous maintenance that should not be done lightly.
Tentative checklist (feel free to correct):
- Create all the new tables
- Double-check that MediaWiki can read and write to the new tables (i.e. that application grants are correct)
- Deploy config to make them the default https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/963720
- Update grants for backups / move the (now new) read-only backups to the read-only location instead of the read-write one
Based on comments by Amir, we may want to split the existing available space into thirds of the total space (roughly every 1-2 years).
Currently, backups are taking 9h30m to run, which is close to the 12h threshold of the alert we set up in T346233. We want to do this primarily because it is needed, not because of the alert (which is there in case we forget). This is not urgent, so I am filing it well in advance of it becoming urgent.
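For reference, a quick sketch of how much headroom is left before the alert fires, using only the two figures quoted above (9h30m run time, 12h alert threshold). The variable names are illustrative and not taken from any existing tooling.

```python
from datetime import timedelta

# Figures quoted in this task; names are illustrative only.
backup_duration = timedelta(hours=9, minutes=30)  # current backup run time
alert_threshold = timedelta(hours=12)             # alert threshold from T346233

# Time left before a backup run would trip the alert.
headroom = alert_threshold - backup_duration

# Dividing one timedelta by another yields a plain float ratio.
usage = backup_duration / alert_threshold

print(f"headroom: {headroom}, threshold used: {usage:.0%}")
# prints "headroom: 2:30:00, threshold used: 79%"
```

So the backups currently use roughly 79% of the alert window, leaving 2h30m of margin.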