Maniphest T301994

[toolsdb] Add replication alerting
Closed, Resolved, Public

Assigned To

	fnegri

Authored By

	dcaro
	Feb 17 2022, 3:22 PM

Description

The last time replication stopped working, around two days passed before we noticed.

This task is to create alerting (email and/or a Prometheus alert) that notifies us when replication stops working, so that we can handle it.

Related Objects

Status	Assigned	Task
Resolved	dcaro	T301951 toolsdb: full disk on clouddb1001 broke clouddb1002 (secondary) replication
Open	None	T306453 toolsdb: review alerting
Resolved	fnegri	T301994 [toolsdb] Add replication alerting

Event Timeline

dcaro triaged this task as High priority. Feb 17 2022, 3:22 PM

dcaro created this task.

RhinosF1 subscribed. Feb 17 2022, 4:29 PM

fnegri edited projects, added cloud-services-team; removed cloud-services-team (Kanban). Jan 18 2023, 6:39 PM

fnegri moved this task from Kanban to Inbox on the cloud-services-team board.

I added an alert called ToolsDBReplicationLagIsTooHigh that will trigger if the replication lag is higher than 1 hour (3600 seconds).

To add it, I ssh-ed into metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud and added the following row to the prometheusconfig database (credentials are in /etc/prometheus-manager/config.yaml):

MariaDB [prometheusconfig]> INSERT INTO alerts VALUES (9, 12, 'ToolsDBReplicationLagIsTooHigh', 'mysql_slave_status_seconds_behind_master{project="tools"} > 3600', "1m", "warn", '{"summary": "ToolsDB replication on {{ $labels.instance }} is lagging behind the primary, the current lag is {{ $value }}"}');
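
For readers more used to Prometheus rule files, here is a rough sketch of how the values in the INSERT statement above would map onto the fields of a standard Prometheus alerting rule. The statement does not name its columns, so the meaning assigned to each position (in particular the first two integers, and "1m" as the rule's "for" duration) is an assumption. `row_to_rule` is a hypothetical helper for illustration, not part of prometheus-manager.

```python
import json

# Values copied from the INSERT statement; the field meanings are assumed,
# because the statement does not list column names.
ROW = (
    9,   # assumed: row ID
    12,  # assumed: reference to the project the alert belongs to
    "ToolsDBReplicationLagIsTooHigh",
    'mysql_slave_status_seconds_behind_master{project="tools"} > 3600',
    "1m",
    "warn",
    '{"summary": "ToolsDB replication on {{ $labels.instance }} is lagging '
    'behind the primary, the current lag is {{ $value }}"}',
)


def row_to_rule(row):
    """Map the positional row onto the fields of a Prometheus alerting rule."""
    _row_id, _project_ref, name, expr, duration, severity, annotations = row
    return {
        "alert": name,
        "expr": expr,
        "for": duration,  # how long expr must hold before the alert fires
        "labels": {"severity": severity},
        "annotations": json.loads(annotations),
    }


if __name__ == "__main__":
    print(json.dumps(row_to_rule(ROW), indent=2))
```

Read this way, the alert would fire once the reported lag has stayed above 3600 seconds for one minute, with severity "warn".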

fnegri edited projects, added cloud-services-team (FY2022/2023-Q4); removed cloud-services-team. Apr 17 2023, 1:04 PM

fnegri moved this task from Backlog to Done on the cloud-services-team (FY2022/2023-Q4) board.

fnegri mentioned this in T334925: ToolsDB: setup pt-heartbeat replication monitor. Apr 18 2023, 10:30 AM

fnegri added a parent task: T306453: toolsdb: review alerting. Apr 28 2023, 4:39 PM

fnegri mentioned this in T306453: toolsdb: review alerting.

fnegri mentioned this in T326332: No alert when ToolsDB replication lag is too high. Aug 8 2023, 9:18 AM

fnegri merged a task: T326332: No alert when ToolsDB replication lag is too high.

fnegri added a subscriber: Aklapper.