
[toolsdb] Add replication alerting
Closed, Resolved, Public

Description

The last time replication broke, it took us around two days to notice.

This task is to create alerting (email and/or a Prometheus alert) that notifies us when replication stops working, so we can handle it.

Event Timeline

dcaro triaged this task as High priority. Feb 17 2022, 3:22 PM
dcaro created this task.
fnegri closed this task as Resolved. Apr 17 2023, 1:02 PM
fnegri claimed this task.
fnegri added subscribers: taavi, fnegri.

I added an alert called ToolsDBReplicationLagIsTooHigh that will trigger if the replication lag is higher than 1 hour (3600 seconds).

To add it, I SSHed into metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud and inserted the following row into the prometheusconfig database (credentials are in /etc/prometheus-manager/config.yaml):

MariaDB [prometheusconfig]> INSERT INTO alerts VALUES (9, 12, 'ToolsDBReplicationLagIsTooHigh', 'mysql_slave_status_seconds_behind_master{project="tools"} > 3600', "1m", "warn", '{"summary": "ToolsDB replication on {{ $labels.instance }} is lagging behind the primary, the current lag is {{ $value }}"}');
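
For anyone who wants to sanity-check the threshold by hand, here is a minimal Python sketch of the comparison the alert expression makes. It assumes the input is the JSON body returned by a Prometheus instant query for mysql_slave_status_seconds_behind_master{project="tools"}. The helper name lagging_instances and the instance names in the sample data are hypothetical, made up for illustration, and this is not how the alert itself is evaluated (Prometheus does that from the expression in the row above).

```python
import json

# Same threshold as the alert expression: 1 hour, in seconds.
THRESHOLD_SECONDS = 3600


def lagging_instances(response, threshold=THRESHOLD_SECONDS):
    """Return {instance: lag_seconds} for samples whose lag exceeds threshold.

    `response` is expected to be the decoded JSON of a Prometheus
    instant-query result (resultType "vector").
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")

    lagging = {}
    for sample in response["data"]["result"]:
        instance = sample["metric"].get("instance", "unknown")
        # Instant-vector samples are [timestamp, "value-as-string"].
        _timestamp, raw_value = sample["value"]
        lag = float(raw_value)
        if lag > threshold:
            lagging[instance] = lag
    return lagging


# Hypothetical sample response; the instance names are invented.
sample_response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"instance": "replica-example:9104", "project": "tools"},
       "value": [1681736520, "7200"]},
      {"metric": {"instance": "other-example:9104", "project": "tools"},
       "value": [1681736520, "12"]}
    ]
  }
}
""")

if __name__ == "__main__":
    for instance, lag in lagging_instances(sample_response).items():
        print(f"{instance} is lagging by {lag:.0f} seconds")
```

Note that this only mirrors the "> 3600" comparison. It does not model the "1m" value from the row, which presumably is how long the condition has to hold before the alert fires.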