
[toolsdb] no alert if replication stops because of IO error
Closed, Resolved · Public

Description

We have an alert on mysql_slave_status_last_errno{job="toolsdb-mariadb",project="tools"} != 0, but that doesn't cover all the ways in which replication can break.

Due to T349695, replication stopped with this error, but no alert triggered:

Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [ERROR] Read invalid event from master: 'Found invalid event in binary log', master could be corrupt but a more likely cause of this is a bug
Oct 28 11:31:41 tools-db-2 mysqld[648831]: 2023-10-28 11:31:41 11 [ERROR] Slave I/O: Relay log write failure: could not queue event from master, Internal MariaDB error code: 1595

There are a few other Prometheus metrics that we could check (a sketch combining them follows the list):

  • mysql_slave_status_slave_io_running
  • mysql_slave_status_slave_sql_running
  • mysql_slave_status_last_io_errno
  • mysql_slave_status_last_sql_errno
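
As a hedged illustration, these could be combined into a single PromQL expression along these lines (metric names from the list above; the project="tools" label matches the existing alert, and it's assumed the *_running metrics are exported as 0/1 gauges):

# fires if either replication thread reports an error or has stopped
# (*_errno metrics are 0 when healthy; *_running metrics are 1 when running)
mysql_slave_status_last_io_errno{project="tools"} != 0
  or mysql_slave_status_last_sql_errno{project="tools"} != 0
  or mysql_slave_status_slave_io_running{project="tools"} == 0
  or mysql_slave_status_slave_sql_running{project="tools"} == 0

The alerts eventually created further down in this task split this into separate error and missing-replication checks rather than one combined expression.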

Event Timeline

All the values in Prometheus map to fields in SHOW SLAVE STATUS. Last_Errno, which we're currently monitoring, is an alias for Last_SQL_Errno, so it doesn't cover Last_IO_Errno. I'm not sure if we should monitor Slave_IO_Running and Slave_SQL_Running as well.
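
For reference, a sketch of that mapping (SHOW SLAVE STATUS fields on the left; the corresponding Prometheus metric names, as used in this task, on the right):

Last_Errno (alias of Last_SQL_Errno)  ->  mysql_slave_status_last_errno / mysql_slave_status_last_sql_errno
Last_IO_Errno                         ->  mysql_slave_status_last_io_errno
Slave_IO_Running                      ->  mysql_slave_status_slave_io_running
Slave_SQL_Running                     ->  mysql_slave_status_slave_sql_running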

fnegri renamed this task from [toolsdb] no alert if replication stops to [toolsdb] no alert if replication stops because of IO error. Nov 10 2023, 1:49 PM

I've added a few suggestions for alerts that could be helpful. I can also refer you to the conversation in T315866. From what I've read comparing our version to the Prometheus one, it seems realistic to either use the off-the-shelf version or fork it and add the few nuances that might be missing.

Marostegui edited projects, added Data-Persistence; removed DBA.
fnegri changed the task status from Open to In Progress. Nov 10 2023, 5:18 PM

Thanks @ABran-WMF, I copied one of your suggestions to create a new "ReplicationMissing" alert. I also tweaked our existing alert to cover both last_sql_errno and last_io_errno.

Our alerts now look like this:

MariaDB [prometheusconfig]> select * from alerts where id in (8,9,23)\G
*************************** 1. row ***************************
         id: 8
 project_id: 12
       name: ToolsDBReplicationError
       expr: mysql_slave_status_last_sql_errno{project="tools"} + mysql_slave_status_last_io_errno{project="tools"} != 0
   duration: 5m
   severity: critical
annotations: {"summary": "ToolsDB replication is broken on {{ $labels.instance }} (errno {{ $value }})", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication"}
*************************** 2. row ***************************
         id: 9
 project_id: 12
       name: ToolsDBReplicationLagIsTooHigh
       expr: mysql_slave_status_seconds_behind_master{project="tools"} > 3600
   duration: 1m
   severity: warning
annotations: {"summary": "ToolsDB replication on {{ $labels.instance }} is lagging behind the primary, the current lag is {{ $value }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication"}
*************************** 3. row ***************************
         id: 23
 project_id: 12
       name: ToolsDBReplicationMissing
       expr: mysql_global_status_slaves_running{project="tools"}+mysql_global_status_slaves_connected{project="tools"} == 0
   duration: 5m
   severity: critical
annotations: {"summary": "ToolsDB replication is not running on {{ $labels.instance }} (errno {{ $value }})", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication"}
3 rows in set (0.002 sec)

I also renamed the existing runbook for ReplicationLag to cover all 3 alerts: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/ToolsDBReplication

fnegri triaged this task as Medium priority. Nov 14 2023, 5:26 PM

@taavi @ABran-WMF could you please review the alerts in my previous comment and let me know if you would add or change anything?

I checked that both ToolsDBReplicationMissing and ToolsDBReplicationError would have fired last week, as expected.
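
(A hedged aside: one way to back-test an expression like this, assuming last week's data is still within Prometheus retention, is to shift each selector back in time with offset and check whether the query returns any result, e.g.:

mysql_global_status_slaves_running{project="tools"} offset 1w
  + mysql_global_status_slaves_connected{project="tools"} offset 1w == 0
)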

An interesting aside is that both mysql_global_status_slaves_running and mysql_global_status_slave_running (without the s) exist, but they have different meanings and behaviours: they map to Slaves_running and Slave_running as described here. We could check both (see the sketch below), and optionally also check slave_io_running and slave_sql_running, but I think what I added so far should cover most (or hopefully all!) of the edge cases.
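
If we did want to check both, a minimal sketch could look like this (assuming the exporter exposes Slave_running as a 0/1 gauge, which I have not verified):

# the two metrics have different semantics, see the MariaDB docs referenced above
mysql_global_status_slaves_running{project="tools"} == 0
  or mysql_global_status_slave_running{project="tools"} == 0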

I think creating a synthetic indicator of slave_io_running + slave_sql_running and checking whether the value is < 2 could be good; see the sketch below. Other than that, the alerts you're suggesting look great!
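
As a sketch, that synthetic indicator could be expressed as (metric and label names taken from the alerts above):

# each *_running metric should be 1 when its thread is running, so a
# healthy replica sums to 2; anything lower means at least one thread stopped
mysql_slave_status_slave_io_running{project="tools"}
  + mysql_slave_status_slave_sql_running{project="tools"} < 2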

@ABran-WMF I thought of adding something similar (slave_io_running + slave_sql_running < 2), but I suspect I would then see duplicate alerts every time there is a replication error, because both that alert and the existing one checking mysql_slave_status_last_sql_errno + mysql_slave_status_last_io_errno would fire. If you cannot think of a situation where one alert would trigger but the other would not, I'm inclined to keep only the existing alert for now.

Resolving for now; we can open a new task if we find edge cases where the current alerts are not enough.