In T231208 some issues were highlighted:
- the Analytics databases (matomo, superset, oozie, etc.) all run on single db hosts without any replication, and taking backups following SRE best practices leads to issues such as excessive lock contention between applications and backup software.
- the current backups of the Analytics databases have diverged considerably from the best practices that SRE follows, and might lead to inconsistent snapshots in some cases.
In T231858 some issues were highlighted:
- the log database on db1107 differs substantially from the one on db1108, so the two hosts cannot be swapped without users noticing it in their query results.
- the log database contains historical data that would be nice to keep available (read-only) for a while longer before relying completely on HDFS data. The past 1.5y of Eventlogging data is already on HDFS, and we are sunsetting the MySQL support.
- moving the log database to one of the dbstore nodes would require a lot of engineering time and probably not be the best solution in terms of availability and resource usage of the dbstore cluster.
Given the above points, I have a proposal for db1108:
- after the deprecation of mysql-eventlogging, remove all eventlogging-related replication code.
- repurpose it as a generic analytics database replica: keep the log database as it is, and replicate matomo, superset, etc. from the Analytics db hosts (an-coord1001, matomo1001)
- add mariadb-bacula backups configuration for db1108
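
To make the replica idea more concrete, here is a minimal sketch of how db1108 could follow both source hosts using MariaDB multi-source replication (one named connection per source). It only generates the SQL statements and does not connect to anything. The connection names, the `repl` user, the use of GTID, and the database-to-host mapping are all assumptions for illustration, not the actual setup; credentials and per-connection replication filters are deliberately left out.

```python
# Hypothetical mapping of replication connection name -> source host.
# Which databases live on which host is an assumption made for this sketch.
SOURCES = {
    "an_coord1001": {"host": "an-coord1001", "databases": ["superset", "oozie"]},
    "matomo1001": {"host": "matomo1001", "databases": ["matomo"]},
}


def replication_statements(sources, user="repl"):
    """Build the MariaDB multi-source replication statements to run on the replica.

    One named connection is created per source host. The replication user
    name is a placeholder; the password and any replication filters are
    intentionally omitted from this sketch.
    """
    statements = []
    for name, cfg in sources.items():
        # Named connection: CHANGE MASTER 'name' TO ... (MariaDB syntax).
        statements.append(
            f"CHANGE MASTER '{name}' TO MASTER_HOST='{cfg['host']}', "
            f"MASTER_USER='{user}', MASTER_USE_GTID=slave_pos;"
        )
        # Each connection is started independently.
        statements.append(f"START SLAVE '{name}';")
    return statements


if __name__ == "__main__":
    for stmt in replication_statements(SOURCES):
        print(stmt)
```

Keeping one connection per source host means each replication stream can be stopped, started, or monitored on its own, which fits the goal of running backups against the replica without touching the primaries.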
Important note about the log database: the plan is to take a full snapshot of the db and archive it in HDFS before starting any procedure. The log database will remain available read-only, with the caveat that any maintenance or hardware event on the host will require downtime. The Analytics team will maintain it on a best-effort basis, and this will be made clear to users.
How does the proposal sound?