Update October 2023
We have delayed completing this ticket for some time, but it would now be beneficial to move forward with it.
- an-coord100[1-2] are ready to be refreshed with an-coord100[3-4]
- The new servers for the analytics-meta database are: an-mariadb100[1-2]
We still don't have a planned method for managed failover/failback of the MariaDB servers.
Prior status
- an-coord1001 runs the 'analytics meta' MariaDB master instance. This instance has several databases for Analytics Cluster operations.
- an-coord1002 runs a standby replica of this instance, but in the case of a failure, switching to an-coord1002 is an error-prone, manual process.
- matomo1002 runs a MariaDB instance for the 'piwik' database.
- db1208 runs backup replicas of the analytics-meta and matomo MariaDB instances, and Bacula is used to keep historical backups.
- Relevant MariaDB configs do not necessarily match between masters and replicas.
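
One way to make the config mismatch above visible is to compare the output of SHOW GLOBAL VARIABLES from a master and its replica. The following is only a rough sketch: the function name config_drift and the ignore list are hypothetical, and it assumes the variables have already been fetched into plain dicts by whatever client is in use.

```python
from typing import Dict, Optional, Tuple

# Variables that are expected to differ between a master and its replica.
# This list is an assumption for illustration, not an agreed-upon policy.
EXPECTED_TO_DIFFER = frozenset({"server_id", "hostname", "read_only", "report_host"})


def config_drift(
    master: Dict[str, str],
    replica: Dict[str, str],
    ignore: frozenset = EXPECTED_TO_DIFFER,
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {variable: (master_value, replica_value)} for every mismatch.

    A value of None means the variable is missing on that side.
    """
    drift = {}
    for name in sorted(set(master) | set(replica)):
        if name in ignore:
            continue
        m_val, r_val = master.get(name), replica.get(name)
        if m_val != r_val:
            drift[name] = (m_val, r_val)
    return drift


if __name__ == "__main__":
    # Made-up values, only to show the shape of the result.
    master_vars = {"server_id": "1", "max_connections": "500"}
    replica_vars = {"server_id": "2", "max_connections": "250"}
    for name, (m_val, r_val) in config_drift(master_vars, replica_vars).items():
        print(f"{name}: master={m_val} replica={r_val}")
```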
Desired status
- All existing analytics_meta databases running from an-mariadb100[12] instead of an-coord100[12]
- We have confidence in the veracity of both the failover replica (an-mariadb1002) and the backup replica (db1208)
- Regular and comprehensive backups are running from db1208
- The failover method from an-mariadb1001 to an-mariadb1002 has been well-defined and tested
- The restore method from db1208 has been well defined
Implementation steps
- Dedicated DB hardware to be ordered in Q1 FY2021-2022 to replace an-coord100[12]: an-db100[12].
- an-coord1002 fully in sync with an-coord1001 and ready for failover.
- db1208 fully recreated from snapshot of an-coord1001 and performing regular backups.
- an-mariadb1001 instantiated as a replica of an-coord1001
- an-mariadb1002 instantiated as a replica of an-mariadb1001
- an-mariadb1002 switched to replicate from an-mariadb1001
- db1208 switched to replicate from an-mariadb1001
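
The "switched to replicate from" steps above could be sketched as follows. This is not a tested procedure: it assumes GTID-based replication is in use, that replication credentials are already configured on the replica, and that the short hostname resolves as-is. The helper name repoint_replica_sql is hypothetical. It only builds the statements so they can be reviewed before anyone runs them.

```python
from typing import List


def repoint_replica_sql(new_master: str, port: int = 3306) -> List[str]:
    """Return the statements to run on a replica, in order, to follow new_master."""
    if not new_master or "'" in new_master:
        raise ValueError("new_master must be a non-empty hostname without quotes")
    return [
        # Stop both replication threads before changing the source.
        "STOP SLAVE;",
        # Assumption: GTIDs are enabled, so no binlog file/position is needed.
        f"CHANGE MASTER TO MASTER_HOST='{new_master}', "
        f"MASTER_PORT={port}, MASTER_USE_GTID=slave_pos;",
        "START SLAVE;",
        # Check the IO and SQL threads and the lag by hand afterwards.
        "SHOW SLAVE STATUS;",
    ]


if __name__ == "__main__":
    for statement in repoint_replica_sql("an-mariadb1001"):
        print(statement)
```

The same statements would apply to both an-mariadb1002 and db1208, each run on the replica being repointed.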
Switch-over time
- an-coord1001 switched to read-only
- an-mariadb1001 promoted to master
- All applications switched to use an-mariadb1001 instead of an-coord1001
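
As a starting point for the still-undefined failover method, the database side of the switch-over above might look roughly like this. It is a sketch under assumptions, not an agreed runbook: it assumes GTID replication, and the names Step and switchover_plan are hypothetical. Switching the applications themselves is not covered.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    host: str  # where the statement is run
    sql: str   # the statement itself
    note: str  # why it is there


def switchover_plan(old_master: str, new_master: str) -> List[Step]:
    """Return an ordered, reviewable list of steps for a planned switch-over."""
    return [
        Step(old_master, "SET GLOBAL read_only = 1;",
             "stop new writes on the old master"),
        Step(old_master, "SELECT @@gtid_binlog_pos;",
             "record the position of the last write"),
        Step(new_master, "SELECT @@gtid_slave_pos;",
             "repeat until it matches the position recorded above"),
        Step(new_master, "STOP SLAVE;",
             "stop replicating from the old master"),
        Step(new_master, "RESET SLAVE ALL;",
             "forget the old master's connection details"),
        Step(new_master, "SET GLOBAL read_only = 0;",
             "start accepting writes on the new master"),
    ]


if __name__ == "__main__":
    for number, step in enumerate(switchover_plan("an-coord1001", "an-mariadb1001"), 1):
        print(f"{number}. [{step.host}] {step.sql}  # {step.note}")
```

Note that read_only does not block accounts with the SUPER privilege, so any application connecting with such an account would need to be stopped separately before the promotion.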
Post Switch-over time
- Ensure backups are running on the right host(s) and with the latest data (e.g. attempt a test recovery)
- MariaDB instances removed from an-coord100[12]
Notes and Migration Plan here:
https://etherpad.wikimedia.org/p/analytics-meta
Originally, this ticket was about setting up multiple master instances so that individual MariaDB databases could be failed over independently. However, it was discovered that Data Persistence does not really support MariaDB multi-instance master setups, and our reasons for wanting one aren't that compelling. Most of the time, failovers will be manual and done for hardware reasons, meaning all DBs would have to be failed over anyway. Having many masters means more replicas and binlogs to manage, which makes that kind of maintenance harder, not easier. Ideally each app's DB would be totally isolated from the others, but we will have to wait until perhaps one day we get persistent volumes in k8s to do this really properly.
For now we are going with a single analytics-meta instance for all databases.
TBD - The matomo server is in the private1 vlan. Do we want to move its database to an-mariadb100[12] and require a new hardware firewall rule for this?