We have an alert on mysql_slave_status_last_errno{job="toolsdb-mariadb",project="tools"} != 0, but it does not cover every way in which replication can break.
Due to T349695, replication stopped with this error, but no alert triggered:
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595
There are a few other Prometheus metrics that we can check:
- mysql_slave_status_slave_io_running
- mysql_slave_status_slave_sql_running
- mysql_slave_status_last_io_errno
- mysql_slave_status_last_sql_errno
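
As a rough sketch of how the metrics listed above could be combined into a single alert expression: the label selectors are copied from the existing alert, and the expression assumes the exporter reports the two "running" metrics as 1 when the thread is running and 0 otherwise, which should be verified against the live values before use. The failure in the log above was reported by the I/O thread, so the last_io_errno and slave_io_running conditions are the ones most likely to have caught it.

```promql
# Sketch only: fire if either replication thread is stopped
# or either thread has recorded an error.
# Assumption: *_running metrics are 1 when running and 0 otherwise.
mysql_slave_status_slave_io_running{job="toolsdb-mariadb",project="tools"} == 0
or
mysql_slave_status_slave_sql_running{job="toolsdb-mariadb",project="tools"} == 0
or
mysql_slave_status_last_io_errno{job="toolsdb-mariadb",project="tools"} != 0
or
mysql_slave_status_last_sql_errno{job="toolsdb-mariadb",project="tools"} != 0
```

Because `or` matches series on their labels and ignores the metric name, an instance that trips several conditions at once yields a single result series rather than one per condition.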