Current status
- an-coord1001 runs the 'analytics meta' MariaDB master instance. This instance has several databases for Analytics Cluster operations.
- an-coord1002 runs a standby replica of this instance, but if an-coord1001 fails, switching to an-coord1002 is a manual and error-prone process.
- matomo1002 runs a MariaDB instance for the 'piwik' database.
- db1108 runs backup replicas of the analytics-meta and matomo MariaDB instances, and Bacula is used to keep historical backups.
- As described in T279440: Data drifts between superset_production on an-coord1001 and db1108, the replicas do not match the masters.
- Relevant MariaDB configs do not necessarily match between masters and replicas.
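One way to make the drift and config mismatches above visible is to diff key/value snapshots taken from a master and its replica. The following is a minimal sketch, not an existing tool: `diff_settings` is a hypothetical helper, and it assumes the two mappings were collected separately, for example from the output of `SHOW GLOBAL VARIABLES` (config) or `CHECKSUM TABLE` (data) on each host. The sample values are illustrative only and are not taken from the real hosts.

```python
from typing import Dict, List, Tuple

MISSING = "<missing>"  # placeholder for a key that exists on only one side


def diff_settings(master: Dict[str, str], replica: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (name, master_value, replica_value) for every key whose values differ."""
    rows = []
    for name in sorted(set(master) | set(replica)):
        m = master.get(name, MISSING)
        r = replica.get(name, MISSING)
        if m != r:
            rows.append((name, m, r))
    return rows


# Illustrative values only (assumption): not the actual settings of any host.
master_vars = {"binlog_format": "ROW", "max_connections": "500", "read_only": "OFF"}
replica_vars = {"binlog_format": "MIXED", "max_connections": "500"}

for name, m, r in diff_settings(master_vars, replica_vars):
    print(f"{name}: master={m} replica={r}")
```

The same function works for table checksums if the keys are table names and the values are the checksums reported by each host.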
Desired status
- All existing analytics_meta databases running from an-db100[12] instead of an-coord100[12]
- We have confidence in the veracity of both the failover replica (an-db1002) and the backup replica (db1108)
- Regular and comprehensive backups are running from db1108
- The failover method from an-db1001 to an-db1002 has been well-defined and tested
- The restore method from db1108 has been well defined
Implementation steps
- Dedicated DB hardware to be ordered in Q1 FY2021-2022 to replace an-coord100[12]: an-db100[12].
- an-coord1002 fully in sync with an-coord1001 and ready for failover.
- db1108 fully recreated from snapshot of an-coord1001 and performing regular backups.
- an-db1001 instantiated as a replica of an-coord1001
- an-db1002 instantiated as a replica of an-coord1001
Switch-over time
- an-coord1001 switched to read-only
- an-db1001 promoted to master
- All applications switched to use an-db1001 instead of an-coord1001
- an-db1002 replicating from an-db1001
- db1108 replicating from an-db1001
- MariaDB instances removed from an-coord100[12]
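The database side of the switch-over steps above could be sketched as an ordered list of statements per host. This is a rough sketch under stated assumptions, not the agreed procedure: `switchover_plan` is a hypothetical helper, it assumes GTID-based replication is already in use on all replicas, that the new master has fully caught up before it is detached, and that the short hostnames resolve as written. It only builds the plan and does not connect to any server. Switching the applications and removing the old instances are not covered.

```python
from typing import List, Tuple


def switchover_plan(old_master: str, new_master: str, replicas: List[str]) -> List[Tuple[str, str]]:
    """Return ordered (host, SQL statement) pairs for a planned master switch."""
    plan = [
        # Stop writes on the old master first.
        (old_master, "SET GLOBAL read_only = 1;"),
        # Assumption: the new master has caught up before it is detached.
        (new_master, "STOP SLAVE;"),
        (new_master, "RESET SLAVE ALL;"),
        (new_master, "SET GLOBAL read_only = 0;"),
    ]
    for host in replicas:
        plan += [
            (host, "STOP SLAVE;"),
            # Assumption: GTID replication, so no binlog file/position is needed.
            (host, f"CHANGE MASTER TO MASTER_HOST='{new_master}', MASTER_USE_GTID=slave_pos;"),
            (host, "START SLAVE;"),
        ]
    return plan


for host, statement in switchover_plan("an-coord1001", "an-db1001", ["an-db1002", "db1108"]):
    print(f"{host}: {statement}")
```

Keeping the plan as data makes it easy to review in the ticket or etherpad before anyone runs a statement by hand.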
Notes and Migration Plan here:
https://etherpad.wikimedia.org/p/analytics-meta
Originally, this ticket was about setting up multiple master instances and being able to fail over individual MariaDB database instances. However, it was discovered that Data Persistence does not really support MariaDB multi-instance master setups, and our reasons for wanting one are not that compelling. Most of the time, failovers will be manual and done for hardware reasons, meaning all DBs would have to be failed over anyway. Having many master instances means more replicas and binlogs to manage, which makes that kind of maintenance harder, not easier. Ideally each app's DB would be totally isolated from the others, but doing this properly will have to wait until we perhaps one day get persistent volumes in k8s.
For now we are going with a single analytics-meta instance for all databases.
TBD - The matomo server is in the private1 vlan. Do we want to move its database to an-db100[12] and require a new hardware firewall rule for this?