Create prometheus alert to detect lag spikes
Closed, Resolved · Public

Assigned To

Authored By

	• Marostegui
	May 19 2020, 1:07 PM

Description

The root cause of T252952 (Wikidata dispatching slow and maxlag high on Wikidata due to db1101 replication lag) was a host that was lagging behind: just enough to cause MediaWiki issues, but not enough to get pages sent out.
This is the graph:

Captura de pantalla 2020-05-19 a las 15.04.20.png (821×1 px, 126 KB)

This is not the first time this has happened. It tends to happen when big maintenance scripts run, especially on wikidatawiki, and cause 5-10 seconds of sustained lag.

As discussed, this is hard to detect, but a start could be to create a prometheus alert that notifies us when a host shows such a pattern of spikes over a period of time.
This will need tuning, as it is prone to causing lots of false positives, but it is a start towards preventing this from happening again.
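
To make the idea concrete, here is a minimal sketch of one way to tell "sustained low lag" apart from "a single brief spike": average the lag samples over a window and compare against a threshold. The function name, the 1-second threshold, and the 30-sample window are assumptions for illustration only, not the values used in the actual puppet change.

```python
from statistics import mean


def sustained_lag(samples, threshold_s=1.0, min_samples=10):
    """Return True if the average lag over the window exceeds threshold_s.

    samples: replication lag readings in seconds, oldest first.
    threshold_s and min_samples are illustrative assumptions.
    """
    if len(samples) < min_samples:
        # Not enough data to judge a pattern; do not alert.
        return False
    return mean(samples) > threshold_s


# One 20-second spike in an otherwise healthy window averages out below the threshold.
brief_spike = [0.0] * 29 + [20.0]
# A host stuck at roughly 6 seconds of lag stays above the threshold for the whole window.
sustained = [6.0] * 30

print(sustained_lag(brief_spike))  # False
print(sustained_lag(sustained))    # True
```

Averaging is only one option. Counting how many samples in the window exceed the threshold would react differently to short, very tall spikes, which is part of the tuning mentioned above.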

Details

Subject	Repo	Branch	Lines +/-
mariadb-backups: Remove replication lag avg check from backup sources	operations/puppet	production	+0 -10
mariadb: Disable prolonged-lag check for non-replication cases.	operations/puppet	production	+16 -9
mariadb: Add monitoring for lag spikes (v2)	operations/puppet	production	+47 -0
nagios_common: Add data persistence irc bot config	operations/puppet	production	+18 -0
mariadb: Add monitoring for lag spikes.	operations/puppet	production	+65 -0


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		None	T172492 Database alerting
		Resolved		• Kormat	T253120 Create prometheus alert to detect lag spikes

Event Timeline

• Marostegui created this task. May 19 2020, 1:07 PM

Restricted Application added a subscriber: Aklapper. May 19 2020, 1:07 PM

• Marostegui triaged this task as Medium priority. May 19 2020, 1:07 PM

• Marostegui moved this task from Triage to Backlog on the DBA board.

• Marostegui added a parent task: T172492: Database alerting.

• Marostegui mentioned this in T112473: Better mysql monitoring for number of connections and processlist strange patterns.

ArielGlenn subscribed. May 19 2020, 2:31 PM

Addshore mentioned this in T253746: high dispatch lag in Wikidata (27 May 2020). May 28 2020, 8:40 AM

Addshore awarded a token. May 28 2020, 8:44 AM

Addshore subscribed.

This happened again on Sunday: https://grafana.wikimedia.org/d/000000273/mysql?panelId=6&fullscreen&orgId=1&from=1591486181606&to=1591537428665&var-dc=eqiad%20prometheus%2Fops&var-server=db1092&var-port=9104

• Kormat claimed this task. Jun 11 2020, 12:55 PM

• Kormat moved this task from Backlog to In progress on the DBA board.

Change 605188 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Detect lag spikes

https://gerrit.wikimedia.org/r/605188

gerritbot added a project: Patch-For-Review. Jun 12 2020, 10:53 AM

Change 606441 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Add monitoring for lag spikes.

https://gerrit.wikimedia.org/r/606441

Change 606441 merged by Kormat:
[operations/puppet@production] mariadb: Add monitoring for lag spikes.

https://gerrit.wikimedia.org/r/606441

Change 607014 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] nagios_common: Add data persistence irc bot config

https://gerrit.wikimedia.org/r/607014

Change 607014 merged by Kormat:
[operations/puppet@production] nagios_common: Add data persistence irc bot config

https://gerrit.wikimedia.org/r/607014

Change 607039 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Add monitoring for lag spikes (v2)

https://gerrit.wikimedia.org/r/607039

• Kormat added a project: User-Kormat. Jun 26 2020, 10:28 AM

• Kormat moved this task from Unsorted 💣 to Active 🚁 on the User-Kormat board.

• Kormat moved this task from Active 🚁 to Patch for Review on the User-Kormat board. Jun 26 2020, 1:29 PM

Change 607039 merged by Kormat:
[operations/puppet@production] mariadb: Add monitoring for lag spikes (v2)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/607039

The alert is now active. Stalling this until we have some actionable feedback on how to tune it.

Also: we need to update, or create, a section on how to proceed if this alert fires.

• Kormat moved this task from Patch for Review to Blocked 🚧 on the User-Kormat board. Jun 29 2020, 7:48 AM

Change 608271 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Disable prolonged-lag check for non-replication cases.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608271

Change 608271 merged by Kormat:
[operations/puppet@production] mariadb: Disable prolonged-lag check for non-replication cases.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608271

• Kormat moved this task from In progress to Pending comment on the DBA board. Jul 7 2020, 6:04 AM

• Marostegui moved this task from Pending comment to In progress on the DBA board. Aug 11 2020, 6:17 AM

Change 619729 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb-backups: Remove replication lag avg check from backup sources

https://gerrit.wikimedia.org/r/619729

Unstalling it... I think we can actually close this, no? It's been working without many false positives (apart from backup sources, which are addressed by Jaime's commit above).

FYI, the previous patch and my earlier comment are there because the prometheus-based alert doesn't work well when replication is stopped (we get unknowns). I sent the patch as an immediate workaround, but I'm not sure how to approach the overall limitation.
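
As a rough illustration of the "unknowns" limitation: if lag samples are simply absent while replication is stopped, an average-based check has nothing to average and can only report that it does not know. The sketch below assumes missing samples are represented as `None`; the `Status` enum, the function name, and both thresholds are hypothetical and are not taken from the actual check.

```python
from enum import Enum
from statistics import mean


class Status(Enum):
    OK = "ok"
    WARNING = "warning"
    UNKNOWN = "unknown"


def evaluate_window(samples, threshold_s=1.0, max_missing_ratio=0.5):
    """Classify a window of lag samples; None marks a missing sample.

    threshold_s and max_missing_ratio are illustrative assumptions.
    """
    if not samples:
        return Status.UNKNOWN
    present = [s for s in samples if s is not None]
    missing_ratio = (len(samples) - len(present)) / len(samples)
    if not present or missing_ratio > max_missing_ratio:
        # Too little data to compute a meaningful average.
        return Status.UNKNOWN
    return Status.WARNING if mean(present) > threshold_s else Status.OK


print(evaluate_window([2.0] * 30))   # Status.WARNING
print(evaluate_window([None] * 30))  # Status.UNKNOWN
```

Under this assumption, hosts whose replication is stopped on purpose would keep reporting unknown until the check is disabled for them or they are downtimed first.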

It's been working without many false positives

Could we do a last test keeping a replica on codfw with 2 seconds of constant lag just to be sure? I can arrange it if necessary.

I think in general, the workflow for any host that will get replication stopped is to downtime it first.
In that sense, the backup sources get excluded from this workflow, but I believe those are the only hosts with that particular workflow - so if you make your patch work I think we'd be good.

In T253120#6379200, @jcrespo wrote:

It's been working without many false positives

Could we do a last test keeping a replica on codfw with 2 seconds of constant lag just to be sure? I can arrange it if necessary.

Go for it!

Mentioned in SAL (#wikimedia-operations) [2020-08-12T11:49:58Z] <jynus> creating artificial low replication lag on db2130 to test icinga alerts T253120

Did:

STOP SLAVE; CHANGE MASTER TO MASTER_DELAY=2; START SLAVE;

And the alert happened nicely.

The only edge case, other than the stop slaves on dbstores, is a potential replication breakage or stop, but that would be caught by the other alerts, so no worries. OK to close after the last patch is merged.

Change 619729 merged by Jcrespo:
[operations/puppet@production] mariadb-backups: Remove replication lag avg check from backup sources

https://gerrit.wikimedia.org/r/619729

• jcrespo awarded a token. Aug 12 2020, 12:16 PM

Thank you @Kormat for working on this, we've wanted to have this alert for a long time!

• jcrespo mentioned this in T288056: Rename the "databases-testing" Icinga contact group. Aug 4 2021, 1:02 PM

• jcrespo mentioned this in T367278: Migrate mysql icinga alerts to alert manager - pt-heartbeat + scaffolding. Jun 19 2024, 8:41 PM

Maintenance_bot moved this task from In progress to Done on the DBA board. Jun 19 2024, 9:29 PM
