
Alerts for mjolnir daemons
Closed, ResolvedPublic

Description

We need some alerts so we know if mjolnir starts misbehaving.

Event Timeline

EBernhardson moved this task from needs triage to Up Next on the Discovery-Search board.

Change 495693 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] prometheus::alerts: mjolnir bulk update lag

https://gerrit.wikimedia.org/r/495693

Change 495693 merged by Gehel:
[operations/puppet@production] elasticsearch: mjolnir bulk update lag

https://gerrit.wikimedia.org/r/495693

Problems with the mjolnir update for es6 meant many docs failed this week, but no alert fired. Assuming this part of the ticket wasn't completed:

document update result="failed" should stay at 0, if it doesn't then something is wrong: https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-30d&to=now

The Icinga alert email came through just now, about 2.5 hours after the failures started. Perhaps it is simply misconfigured? The lag check should allow a couple days of lag, but the bulk failures should trigger an alert pretty quickly.

Change 499150 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] icinga: increase check_interval for bulk update failure

https://gerrit.wikimedia.org/r/499150

Change 499150 merged by Gehel:
[operations/puppet@production] icinga: increase mjolnir bulk update check frequency

https://gerrit.wikimedia.org/r/499150

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=icinga1001&service=Mjolnir+bulk+update+failure+check+-+codfw

Current Status: CRITICAL
(for 0d 6h 0m 16s)
Status Information: 121 gt 2

It's unclear what "121 gt 2" refers to.

@Dzahn This check alerts if scalar(sum(increase(mjolnir_bulk_action_total{result="failed"}[24h]))) is greater than 0. That is, we should not have any mjolnir failures. Normally the value for both warning and critical would be 1, but check_prometheus does not allow that, which makes sense.

Zero is perhaps too low a threshold. While failures are not expected, if 1 or 100 updates fail in a short blip it's not a big deal. What we really want to know is whether updates start failing at some significant rate that requires intervention. Perhaps if >1% of updates fail?
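A ratio-based check along those lines could be expressed like this (a sketch only, assuming the same mjolnir_bulk_action_total metric discussed above; the exact threshold and window are open for discussion):

```
# Fraction of bulk actions that failed over the last 24h;
# fire when more than 1% of all actions are failures.
  sum(increase(mjolnir_bulk_action_total{result="failed"}[24h]))
/
  sum(increase(mjolnir_bulk_action_total[24h]))
> 0.01
```

Dividing by the total rather than comparing an absolute count means a short blip of a few failures during heavy indexing stays below the threshold, while a sustained failure mode still alerts.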

Also, I don't know if we have the distinction, but since this is a Kafka consumer and we can replay old events, there is no urgency to these alerts. Someone needs to know and do something, but nothing needs to be done at that instant, except perhaps stopping the daemons if they're causing issues downstream.

Change 516526 had a related patch set uploaded (by Mathew.onipe; owner: Mathew.onipe):
[operations/puppet@production] A more flexible approach for mjolnir update lag

https://gerrit.wikimedia.org/r/516526

Change 516526 merged by Gehel:
[operations/puppet@production] A more flexible approach for mjolnir update lag

https://gerrit.wikimedia.org/r/516526