
Alerts for mjolnir daemons
Closed, ResolvedPublic

Description

We need some alerts so we know if mjolnir starts misbehaving.

Event Timeline

EBernhardson moved this task from needs triage to Up Next on the Discovery-Search board.

Change 495693 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] prometheus::alerts: mjolnir bulk update lag

https://gerrit.wikimedia.org/r/495693

Change 495693 merged by Gehel:
[operations/puppet@production] elasticsearch: mjolnir bulk update lag

https://gerrit.wikimedia.org/r/495693

Problems with the mjolnir update for es6 meant many docs failed this week, but no alert fired. Assuming this part of the ticket wasn't completed:

document update result="failed" should stay at 0, if it doesn't then something is wrong: https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-30d&to=now

The Icinga alert email came through just now, about 2.5 hours after the failures started. Perhaps it is simply misconfigured? The lag check should allow a couple days of lag, but the bulk failures should trigger an alert pretty quickly.

Change 499150 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] icinga: increase check_interval for bulk update failure

https://gerrit.wikimedia.org/r/499150

Change 499150 merged by Gehel:
[operations/puppet@production] icinga: increase mjolnir bulk update check frequency

https://gerrit.wikimedia.org/r/499150

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Mjolnir+bulk+update+failure+check+-+codfw

Current Status: CRITICAL
(for 0d 6h 0m 16s)
Status Information: 121 gt 2

It's unclear what "121 gt 2" refers to.

@Dzahn This check alerts if scalar(sum(increase(mjolnir_bulk_action_total{result="failed"}[24h]))) is greater than 0. That is, we should not have any mjolnir failures. Normally the value for both warning and critical would be 1, but check_prometheus does not allow that, which makes sense.

Zero is perhaps too low a threshold. While failures are not expected, if 1 or 100 updates fail in a short blip it's not a big deal. What we really want to know is whether updates start failing at some significant rate that requires intervention. Perhaps if >1% of updates fail?
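A ratio-based check along those lines could be expressed like this (a sketch only, assuming the same mjolnir_bulk_action_total metric discussed above; the exact threshold and window are open for discussion):

```
# Fraction of bulk actions that failed over the last 24h;
# fire when more than 1% of all actions are failures.
  sum(increase(mjolnir_bulk_action_total{result="failed"}[24h]))
/
  sum(increase(mjolnir_bulk_action_total[24h]))
> 0.01
```

Dividing by the total rather than comparing an absolute count means a short blip of a few failures during heavy indexing stays below the threshold, while a sustained failure mode still alerts.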

Also, I don't know if we have the distinction, but since this is a Kafka consumer and we can replay old events, there is no urgency to these alerts. Someone needs to know and do something, but nothing needs to be done at that instant, except perhaps stopping the daemons if they're causing issues downstream.

Change 516526 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] A more flexible approach for mjolnir update lag

https://gerrit.wikimedia.org/r/516526

Change 516526 merged by Gehel:
[operations/puppet@production] A more flexible approach for mjolnir update lag

https://gerrit.wikimedia.org/r/516526