The idea is to make mainstash (currently x2) similar to PC: {T373037}
Currently, for mainstash we have x2, which is one primary plus two idle replicas per datacenter. It is an extremely painful part of the infrastructure to maintain from the DBAs' perspective. So much so that currently, when we need to do reboots, we just reboot the servers live, since that causes fewer user-facing errors than doing a switchover.
The proposal is to introduce three sections: ms1, ms2, ms3, each having a primary db only (per dc). MediaWiki would split the writes at random, but it would write each key to two out of the three sections. In case we need to do maintenance (reboots or hardware issues), we depool one section and let the keys fail over to the other sections. When doing reads, MediaWiki would also read from two hosts, and if one copy is missing or has a lower exptime, it gets ignored. That prevents full loss of data in case we depool a section.

The two-section alternative is to introduce x4 and move one replica to be the primary of x4 (without any replicas), with MediaWiki splitting the writes at random. Depooling a section would then mean that from time to time 50% of the keys are lost temporarily until the host is back in rotation, which might take a couple of minutes (for a reboot) or days (for hardware issues). Also, given that we now have [[https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1101205|objectcache: Automatically fallback to the second db in line]], if we set up x4, any hardware issue would automatically fail over to the other section.
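To make the "write to two out of three, read from two, ignore the missing or lower-exptime copy" behaviour concrete, here is a minimal in-memory sketch in Python. It is not the MediaWiki implementation: the class `MainStashSketch`, its `write`/`read` methods and the `pooled` set are hypothetical names, and picking the two sections with a plain random sample is an assumption about what "at random" means.

```python
import random
import time
from dataclasses import dataclass, field

SECTIONS = ("ms1", "ms2", "ms3")  # section names taken from the proposal


@dataclass
class Entry:
    value: str
    exptime: float  # absolute expiry timestamp


@dataclass
class MainStashSketch:
    """Hypothetical model: one primary-only store per section."""
    stores: dict = field(default_factory=lambda: {s: {} for s in SECTIONS})
    pooled: set = field(default_factory=lambda: set(SECTIONS))
    rng: random.Random = field(default_factory=random.Random)

    def _pick_two(self):
        # Assumption: choose up to two pooled sections uniformly at random.
        candidates = sorted(self.pooled)
        return self.rng.sample(candidates, min(2, len(candidates)))

    def write(self, key, value, ttl, now=None):
        now = time.time() if now is None else now
        entry = Entry(value, now + ttl)
        for section in self._pick_two():
            self.stores[section][key] = entry

    def read(self, key, now=None):
        now = time.time() if now is None else now
        best = None
        for section in self._pick_two():
            entry = self.stores[section].get(key)
            if entry is None or entry.exptime <= now:
                continue  # missing or expired copy is ignored
            if best is None or entry.exptime > best.exptime:
                best = entry  # the copy with the higher exptime wins
        return None if best is None else best.value
```

Because any two pairs drawn from three sections share at least one section, a read in this model always touches a section that received the latest write. The exptime comparison only picks the newest copy while the TTL stays constant: a newer write with a shorter TTL would lose to an older copy that expires later.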
Open questions:
- What would be the user impact of depools?
- Should we have two sections or three? If we go with three, I suggest coming up with a new cluster name, like ms1, ms2, ms3. If we go with two, we can move the extra replica to ParserCache, but what if two hosts end up with hardware issues at the same time?
- Maybe we should simply stop using MySQL for this?
- Another idea would be to set up mainstash as VMs: since x2 databases are tiny, we can even go with four VMs and then move the three bare metal hosts to PC.

There will still be failure scenarios, such as when the TTL changes or is set to indefinite, or when we have to depool two sections, etc., but all seem to be well under the error budget.
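As a rough check on the depool scenarios, the share of keys that become unreachable can be counted by enumerating every placement of a key's copies. This is a sketch under the assumption that each key's copies land on a uniformly random set of sections. The helper name `lost_fraction` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def lost_fraction(sections, copies, depooled):
    """Share of keys whose copies all sit on depooled sections."""
    placements = list(combinations(sections, copies))
    lost = [p for p in placements if set(p) <= set(depooled)]
    return Fraction(len(lost), len(placements))


# Two sections, one copy per key, one section depooled.
print(lost_fraction(["x2", "x4"], 1, ["x4"]))                   # 1/2
# Three sections, two copies per key, one section depooled.
print(lost_fraction(["ms1", "ms2", "ms3"], 2, ["ms1"]))         # 0
# Three sections, two copies per key, two sections depooled.
print(lost_fraction(["ms1", "ms2", "ms3"], 2, ["ms1", "ms2"]))  # 1/3
```

Under that assumption, the two-section layout loses half the keys on any depool, while the three-section layout loses none when one section is depooled and a third of the keys when two are depooled at once.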