Clean up failure ratio monitoring and set up an alarm when it exceeds a certain threshold
Closed, Resolved · Public

Description

See Incident report

We want to alert when the proportion of revisions that cannot be scored for a wiki rises above a certain threshold.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Halfak triaged this task as High priority. Feb 16 2017, 3:58 PM
Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

I added a notification group for the scoring platform team: https://grafana-admin.wikimedia.org/alerting/notification/5/edit
Then I added an alert that fires when the five-minute average is above 10%: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1
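
For illustration only, the rule described here (alert when the five-minute average of the failure ratio is above 10%) can be sketched in Python. This is not how Grafana evaluates the alert; Grafana does this itself from the dashboard query. The names `failure_ratio` and `FailureRatioAlert` are hypothetical, and the sketch assumes samples arrive as (timestamp in seconds, ratio) pairs.

```python
from collections import deque

THRESHOLD = 0.10       # 10%, the threshold used in the alert above
WINDOW_SECONDS = 300   # five minutes


def failure_ratio(failed: int, total: int) -> float:
    """Proportion of revisions that could not be scored (0.0 if nothing was requested)."""
    return failed / total if total else 0.0


class FailureRatioAlert:
    """Hypothetical helper: keeps recent ratio samples and checks their average."""

    def __init__(self, threshold: float = THRESHOLD, window: float = WINDOW_SECONDS):
        self.threshold = threshold
        self.window = window
        self.samples = deque()  # (timestamp, ratio) pairs, oldest first

    def add(self, timestamp: float, ratio: float) -> bool:
        """Record a sample and return True if the windowed average exceeds the threshold."""
        self.samples.append((timestamp, ratio))
        # Drop samples that have fallen out of the five-minute window.
        cutoff = timestamp - self.window
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()
        average = sum(r for _, r in self.samples) / len(self.samples)
        return average > self.threshold
```

Averaging over a window, rather than alerting on a single sample, keeps one short burst of failures from triggering the alarm on its own.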

This should be enough for now, but I'm not sure whether the Grafana alert system as a whole actually works. @Krinkle, does it work? The "Send Test" button gives me the error "SMTP not set".

@Ladsgroup Grafana does not have outgoing E-mail configured. Instead of maintaining a separate list of contact groups and protocols for Grafana, it was decided to re-use the existing Icinga infrastructure for this.

See https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Alerts_(with_notifications_via_Icinga) for more information.

Short story: Alerts can be fully configured and maintained within Grafana. The only thing needed elsewhere is a one-line configuration change (in Puppet) to enable Icinga alerts for a particular dashboard. Only the dashboard name and the Icinga contact group name need to be specified. The rest remains dynamic and within Grafana only (including the individual alert names, their underlying queries, etc.).

As for whether it works, I'd say yes. The Performance Team regularly gets alert e-mails from its various dashboards, and I assume it worked after @Halfak (thank you!) wrote the above docs.

Thanks for the help. I will make the patches for that.

Change 399109 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] icinga: Add scoring-team for alerts of ores-extension

https://gerrit.wikimedia.org/r/399109

Change 399109 merged by Dzahn:
[operations/puppet@production] icinga: Add scoring-team for alerts of ores-extension

https://gerrit.wikimedia.org/r/399109

Everything seems fine now. I wish we could build a similar screaming system for the beta cluster as well, but all metrics are dead there: https://grafana-labs.wikimedia.org/dashboard/db/ores-extension?orgId=1 https://grafana-labs.wikimedia.org/dashboard/db/ores-beta-cluster?orgId=1&from=now-7d&to=now
I will look into this later on.