
[toolsdb] no alert if replication stops because of IO error
Closed, Resolved · Public

Description

We have an alert on mysql_slave_status_last_errno{job="toolsdb-mariadb",project="tools"} != 0, but that doesn't cover all the ways in which replication can break.

Due to T349695, replication stopped with this error, but no alert triggered:

Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595

There are a few other Prometheus metrics that we could check (a sketch combining them follows the list):

  • mysql_slave_status_slave_io_running
  • mysql_slave_status_slave_sql_running
  • mysql_slave_status_last_io_errno
  • mysql_slave_status_last_sql_errno
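
As a hedged illustration, these could be combined into a single PromQL expression along these lines (metric names from the list above; the project="tools" label matches the existing alert, and it's assumed the *_running metrics are exported as 0/1 gauges):

# fires if either replication thread reports an error or has stopped
# (*_errno metrics are 0 when healthy; *_running metrics are 1 when running)
mysql_slave_status_last_io_errno{project="tools"} != 0
  or mysql_slave_status_last_sql_errno{project="tools"} != 0
  or mysql_slave_status_slave_io_running{project="tools"} == 0
  or mysql_slave_status_slave_sql_running{project="tools"} == 0

The alerts eventually created further down in this task split this into separate error and missing-replication checks rather than one combined expression.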

Event Timeline

All the values in Prometheus map to fields in SHOW SLAVE STATUS. Last_Errno, which we're currently monitoring, is an alias for Last_SQL_Errno, so it doesn't cover Last_IO_Errno. I'm not sure if we should monitor Slave_IO_Running and Slave_SQL_Running as well.
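
For reference, a sketch of that mapping (SHOW SLAVE STATUS fields on the left; the corresponding Prometheus metric names, as used in this task, on the right):

Last_Errno (alias of Last_SQL_Errno)  ->  mysql_slave_status_last_errno / mysql_slave_status_last_sql_errno
Last_IO_Errno                         ->  mysql_slave_status_last_io_errno
Slave_IO_Running                      ->  mysql_slave_status_slave_io_running
Slave_SQL_Running                     ->  mysql_slave_status_slave_sql_running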

fnegri renamed this task from [toolsdb] no alert if replication stops to [toolsdb] no alert if replication stops because of IO error. Nov 10 2023, 1:49 PM

I've added a few suggestions for alerts that could be helpful. I can also refer you to the conversation in T315866. From what I've read comparing our version to the Prometheus one, it seems realistic to either use the off-the-shelf version or fork it and add the few nuances that might be missing.

Marostegui edited projects, added Data-Persistence; removed DBA.
fnegri changed the task status from Open to In Progress. Nov 10 2023, 5:18 PM

Thanks @ABran-WMF, I copied one of your suggestions to create a new "ReplicationMissing" alert. I also tweaked our existing alert to cover both last_sql_errno and last_io_errno.

Our alerts now look like this:

MariaDB [prometheusconfig]> select * from alerts where id in (8,9,23)\G
*************************** 1. row ***************************
         id: 8
 project_id: 12
       name: ToolsDBReplicationError
       expr: mysql_slave_status_last_sql_errno{project="tools"} + mysql_slave_status_last_io_errno{project="tools"} != 0
   duration: 5m
   severity: critical
annotations: {"summary": "ToolsDB replication is broken on {{ $labels.instance }} (errno {{ $value }})", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication"}
*************************** 2. row ***************************
         id: 9
 project_id: 12
       name: ToolsDBReplicationLagIsTooHigh
       expr: mysql_slave_status_seconds_behind_master{project="tools"} > 3600
   duration: 1m
   severity: warning
annotations: {"summary": "ToolsDB replication on {{ $labels.instance }} is lagging behind the primary, the current lag is {{ $value }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication"}
*************************** 3. row ***************************
         id: 23
 project_id: 12
       name: ToolsDBReplicationMissing
       expr: mysql_global_status_slaves_running{project="tools"}+mysql_global_status_slaves_connected{project="tools"} == 0
   duration: 5m
   severity: critical
annotations: {"summary": "ToolsDB replication is not running on {{ $labels.instance }} (errno {{ $value }})", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication"}
3 rows in set (0.002 sec)

I also renamed the existing runbook for ReplicationLag to cover all 3 alerts: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication

fnegri triaged this task as Medium priority. Nov 14 2023, 5:26 PM

@taavi @ABran-WMF could you please review the alerts in my previous comment and let me know if you would add or change anything?

I checked that both ToolsDBReplicationMissing and ToolsDBReplicationError would have fired last week, as expected.
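
(A hedged aside: one way to back-test an expression like this, assuming last week's data is still within Prometheus retention, is to shift each selector back in time with offset and check whether the query returns any result, e.g.:

mysql_global_status_slaves_running{project="tools"} offset 1w
  + mysql_global_status_slaves_connected{project="tools"} offset 1w == 0
)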

An interesting aside is that both mysql_global_status_slaves_running and mysql_global_status_slave_running (without the s) exist, but they have different meanings and behaviours: they map to Slaves_running and Slave_running as described here. We could check both (see the sketch below), and optionally also check slave_io_running and slave_sql_running, but I think what I added so far should cover most (or hopefully all!) of the edge cases.
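
If we did want to check both, a minimal sketch could look like this (assuming the exporter exposes Slave_running as a 0/1 gauge, which I have not verified):

# the two metrics have different semantics, see the MariaDB docs referenced above
mysql_global_status_slaves_running{project="tools"} == 0
  or mysql_global_status_slave_running{project="tools"} == 0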

I think creating a synthetic indicator of slave_io_running + slave_sql_running and checking whether the value is < 2 could be good; see the sketch below. Other than that, the alerts you're suggesting look great!
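
As a sketch, that synthetic indicator could be expressed as (metric and label names taken from the alerts above):

# each *_running metric should be 1 when its thread is running, so a
# healthy replica sums to 2; anything lower means at least one thread stopped
mysql_slave_status_slave_io_running{project="tools"}
  + mysql_slave_status_slave_sql_running{project="tools"} < 2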

@ABran-WMF I thought of adding something similar (slave_io_running + slave_sql_running < 2), but I suspect I would then see duplicate alerts every time there is a replication error, because both that alert and the existing one checking mysql_slave_status_last_sql_errno + mysql_slave_status_last_io_errno would fire. If you cannot think of a situation where one alert would trigger but the other would not, I'm inclined to keep only the existing alert for now.

Resolving for now; we can open a new task if we find edge cases where the current alerts are not enough.