
Create prometheus alert to detect lag spikes
Closed, ResolvedPublic

Description

The root cause of T252952: Wikidata dispatching slow and maxlag high on Wikidata due to db1101 replication lag was a host that was lagging behind, just enough to cause MW issues, but not enough to get pages sent out.
This is the graph:

Captura de pantalla 2020-05-19 a las 15.04.20.png (821×1 px, 126 KB)

This is not the first time this has happened. It tends to happen when big maintenance scripts run, especially on wikidatawiki, causing 5-10 seconds of sustained lag.

As discussed, this is hard to detect, but maybe a start can be to create a prometheus alert that would notify us when a host has such a pattern of spikes over a period of time.
This will need tuning, as it is prone to causing lots of false positives. But it is a start towards preventing this from happening again.
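
As a rough illustration of the idea (not the check that was eventually deployed), sustained lag can be detected by averaging lag samples over a sliding window, so that a brief spike is ignored while a steady 5-10 second lag is flagged. The class name, window size, and threshold below are hypothetical values chosen for the sketch.

```python
from collections import deque
from statistics import mean


class SustainedLagDetector:
    """Flags a replica whose average lag stays high over a sliding window."""

    def __init__(self, window_size=10, avg_threshold_s=2.0):
        # Both values are illustrative assumptions, not production settings.
        self.samples = deque(maxlen=window_size)
        self.avg_threshold_s = avg_threshold_s

    def observe(self, lag_seconds):
        """Record one lag sample; return True if the window average is too high."""
        self.samples.append(lag_seconds)
        # Wait for a full window so the average is not judged on too few samples.
        if len(self.samples) < self.samples.maxlen:
            return False
        return mean(self.samples) >= self.avg_threshold_s


if __name__ == "__main__":
    brief = SustainedLagDetector(window_size=5, avg_threshold_s=2.0)
    print([brief.observe(lag) for lag in [0, 0, 8, 0, 0]])

    sustained = SustainedLagDetector(window_size=5, avg_threshold_s=2.0)
    print([sustained.observe(lag) for lag in [5, 6, 7, 5, 6]])
```

Averaging trades detection speed for fewer false positives: a longer window or a higher threshold makes the check quieter but slower to react, which is the tuning mentioned above.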

Event Timeline

Kormat moved this task from Backlog to In progress on the DBA board.

Change 605188 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Detect lag spikes

https://gerrit.wikimedia.org/r/605188

Change 606441 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Add monitoring for lag spikes.

https://gerrit.wikimedia.org/r/606441

Change 606441 merged by Kormat:
[operations/puppet@production] mariadb: Add monitoring for lag spikes.

https://gerrit.wikimedia.org/r/606441

Change 607014 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] nagios_common: Add data persistence irc bot config

https://gerrit.wikimedia.org/r/607014

Change 607014 merged by Kormat:
[operations/puppet@production] nagios_common: Add data persistence irc bot config

https://gerrit.wikimedia.org/r/607014

Change 607039 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Add monitoring for lag spikes (v2)

https://gerrit.wikimedia.org/r/607039

Change 607039 merged by Kormat:
[operations/puppet@production] mariadb: Add monitoring for lag spikes (v2)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/607039

Kormat changed the task status from Open to Stalled. Jun 29 2020, 7:47 AM

The alert is now active, stalling this until we have some actionable feedback about how to tune it.

Also: we need to update an existing section, or create a new one, on how to proceed if this alert fires.

Change 608271 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Disable prolonged-lag check for non-replication cases.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608271

Change 608271 merged by Kormat:
[operations/puppet@production] mariadb: Disable prolonged-lag check for non-replication cases.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608271

Change 619729 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Remove replication lag avg check from backup sources

https://gerrit.wikimedia.org/r/619729

Marostegui changed the task status from Stalled to Open. Aug 12 2020, 11:21 AM

Unstalling it... I think we can actually close this, no? It's been working without many false positives (apart from backup sources, which are addressed by Jaime's commit above).

FYI, the previous patch and an earlier comment are there because the Prometheus-based alert doesn't work well when replication is stopped (we get unknowns). I sent the patch as an immediate workaround, but I'm not sure how to approach the overall limitation.

> It's been working without many false positives

Could we do a last test keeping a replica on codfw with 2 seconds of constant lag just to be sure? I can arrange it if necessary.

I think in general, the workflow for any host that will get replication stopped is to downtime it first.
In that sense, the backup sources get excluded from this workflow, but I believe those are the only hosts with that particular workflow - so if you make your patch work I think we'd be good.

> > It's been working without many false positives
>
> Could we do a last test keeping a replica on codfw with 2 seconds of constant lag just to be sure? I can arrange it if necessary.

Go for it!

Mentioned in SAL (#wikimedia-operations) [2020-08-12T11:49:58Z] <jynus> creating artificial low replication lag on db2130 to test icinga alerts T253120

Did:

STOP SLAVE; CHANGE MASTER TO MASTER_DELAY=2; START SLAVE;

And the alert happened nicely.
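
For repeating this kind of test, the statements above can be generated for any delay. This is a minimal sketch: the helper name is hypothetical, and using a delay of 0 to remove the artificial lag afterwards is an assumption about cleanup that is not shown in this thread.

```python
def delay_statements(delay_seconds):
    """Return the SQL statements that set a fixed replication delay.

    Hypothetical helper for illustration; it only builds strings and
    does not connect to any database.
    """
    if not isinstance(delay_seconds, int) or delay_seconds < 0:
        raise ValueError("delay_seconds must be a non-negative integer")
    return [
        "STOP SLAVE;",
        f"CHANGE MASTER TO MASTER_DELAY={delay_seconds};",
        "START SLAVE;",
    ]


if __name__ == "__main__":
    # Introduce 2 seconds of artificial lag, as in the test above.
    print(" ".join(delay_statements(2)))
    # Assumed cleanup step: a delay of 0 removes the artificial lag.
    print(" ".join(delay_statements(0)))
```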

The only edge case, other than the stopped slaves on dbstores, is a potential replication breakage/stop, but that would be caught by the other alerts, so no worries. OK to close after the last patch is merged.

Change 619729 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Remove replication lag avg check from backup sources

https://gerrit.wikimedia.org/r/619729

Thank you @Kormat for working on this, we've wanted to have this alert for a long time!