Last time, replication was broken for around two days before we noticed.
This task is to create alerting (email and/or a Prometheus alert) that notifies us when replication stops working, so we
can handle it.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | dcaro | T301951 toolsdb: full disk on clouddb1001 broke clouddb1002 (secondary) replication
Open | | None | T306453 toolsdb: review alerting
Resolved | | fnegri | T301994 [toolsdb] Add replication alerting
I added an alert called ToolsDBReplicationLagIsTooHigh that will trigger if the replication lag is higher than 1 hour (3600 seconds).
To add it, I ssh-ed into metricsinfra-controller-2.metricsinfra.eqiad1.wikimedia.cloud and added the following row to the prometheusconfig database (credentials are in /etc/prometheus-manager/config.yaml):
MariaDB [prometheusconfig]> INSERT INTO alerts VALUES (9, 12, 'ToolsDBReplicationLagIsTooHigh', 'mysql_slave_status_seconds_behind_master{project="tools"} > 3600', " 1m", "warn", '{"summary": "ToolsDB replication on {{ $labels.instance }} is lagging behind the primary, the current lag is {{ $value }}"}');
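For readers more familiar with plain Prometheus rule files, the row above corresponds roughly to the alerting rule sketched below. This is an illustration only: the group name `toolsdb` is hypothetical, and the mapping of the row's fields to `for`, the `severity` label, and `annotations` is an assumption based on their values, not a description of how prometheus-manager actually renders the `alerts` table.

```yaml
groups:
  - name: toolsdb  # hypothetical group name, not taken from the task
    rules:
      - alert: ToolsDBReplicationLagIsTooHigh
        # Fires when the reported replication lag exceeds 1 hour (3600 seconds).
        expr: mysql_slave_status_seconds_behind_master{project="tools"} > 3600
        # Assumed mapping of the "1m" value: the condition must hold for 1 minute.
        for: 1m
        labels:
          # Assumed mapping of the "warn" value to a severity label.
          severity: warn
        annotations:
          summary: "ToolsDB replication on {{ $labels.instance }} is lagging behind the primary, the current lag is {{ $value }}"
```

The expression, alert name, and summary text are copied from the inserted row; everything else in the sketch is standard Prometheus alerting-rule structure.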