Clean up failure ratio monitoring and set up an alarm when it goes more than a certain threshold
Closed, Resolved · Public

Assigned To

Authored By

	Ladsgroup
	Dec 27 2016, 6:15 AM

Description

See Incident report

We want to alert when the proportion of revisions that cannot be scored for a wiki rises above a certain threshold.

Details

	Subject	Repo	Branch	Lines +/-
	icinga: Add scoring-team for alerts of ores-extension	operations/puppet	production	+5 -0


Related Objects

Mentioned In: Blog Post: Status Update (January 30, 2018)

Event Timeline

Ladsgroup created this task. Dec 27 2016, 6:15 AM

Restricted Application added a project: User-Ladsgroup. Dec 27 2016, 6:15 AM

Restricted Application added a subscriber: Aklapper.

Peachey88 added a project: observability. Dec 27 2016, 9:18 AM

Peachey88 added a project: Wikimedia-Incident. Jan 3 2017, 12:28 AM

Peachey88 moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board.

Halfak edited projects, added Machine-Learning-Team; removed Machine-Learning-Team (Active Tasks). Feb 9 2017, 3:52 PM

Halfak triaged this task as High priority. Feb 16 2017, 3:58 PM

Halfak moved this task from Unsorted to Maintenance/cleanup on the Machine-Learning-Team board.

Halfak updated the task description. Jul 20 2017, 2:47 PM

Ladsgroup edited projects, added Machine-Learning-Team (Active Tasks); removed Machine-Learning-Team. Dec 3 2017, 12:37 PM

Added a notification group for the Scoring Platform team: https://grafana-admin.wikimedia.org/alerting/notification/5/edit
Then added an alert that fires when the five-minute average is above 10%: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1

This should be enough for now, but I'm not sure whether the Grafana alert system works at all. @Krinkle, does it work? The "Send Test" button gives me the error "SMTP not set".
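For readers who want the rule spelled out, here is a minimal sketch of the check described above: average the failure ratio over the last five minutes and alert when it exceeds 10%. It is not the actual Grafana alert definition. The function names, the `(timestamp, ratio)` sample format, and the choice to treat an empty window as "no alert" are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

WINDOW_SECONDS = 300   # five minutes, as in the alert described above
THRESHOLD = 0.10       # 10%, as in the alert described above


def failure_ratio(failed: int, total: int) -> float:
    """Proportion of revisions that could not be scored (0.0 when idle)."""
    return failed / total if total else 0.0


def should_alert(
    samples: Iterable[Tuple[float, float]],
    now: float,
    window: float = WINDOW_SECONDS,
    threshold: float = THRESHOLD,
) -> bool:
    """Return True when the mean ratio inside the window exceeds the threshold.

    `samples` is a hypothetical list of (unix_timestamp, ratio) pairs.
    Assumption: with no samples in the window we do not alert.
    """
    recent = [ratio for ts, ratio in samples if now - window < ts <= now]
    if not recent:
        return False
    return sum(recent) / len(recent) > threshold
```

A single bad sample does not trip the alert on its own; the mean over the whole window has to cross the threshold, which is what keeps short spikes from paging anyone.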

Ladsgroup moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board. Dec 9 2017, 5:56 PM

@Ladsgroup Grafana does not have outgoing E-mail configured. Instead of maintaining a separate list of contact groups and protocols for Grafana, it was decided to re-use the existing Icinga infrastructure for this.

See https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Alerts_(with_notifications_via_Icinga) for more information.

In short: alerts can be fully configured and maintained within Grafana. The only thing needed elsewhere is a one-line configuration change (in Puppet) to enable Icinga alerts for a particular dashboard. Only the dashboard name and the Icinga contact group name need to be specified. The rest remains dynamic and within Grafana only (including the individual alert names, their underlying queries, etc.).

As for whether it works, I'd say yes. The Performance Team regularly gets alert e-mails from its various dashboards, and I assume it worked when @Halfak (thank you!) wrote the above docs.
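To make the division of labour concrete, here is a rough sketch of what an Icinga-side check for one dashboard could boil down to: take the alerts Grafana reports for that dashboard and collapse them into a single Nagios-style status. This is not the actual Wikimedia check. The `summarize_alerts` name, the dict shape with `name` and `state` keys, and the `"alerting"` state string are assumptions for illustration; only the exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
from typing import Iterable, Mapping, Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def summarize_alerts(alerts: Iterable[Mapping[str, str]]) -> Tuple[int, str]:
    """Collapse a dashboard's alerts into one (exit_code, message) pair.

    `alerts` is a hypothetical, already-fetched list of dicts such as
    {"name": "Failure ratio", "state": "alerting"}.
    """
    alerts = list(alerts)
    if not alerts:
        # Nothing to evaluate: report UNKNOWN rather than a false OK.
        return UNKNOWN, "UNKNOWN: no alerts found for dashboard"
    firing = sorted(a["name"] for a in alerts if a.get("state") == "alerting")
    if firing:
        return CRITICAL, "CRITICAL: " + ", ".join(firing)
    return OK, "OK: %d alert(s), none firing" % len(alerts)
```

Because the check only looks at whatever alerts exist at poll time, adding or renaming an alert in Grafana needs no further change on the Icinga side, which matches the "rest remains dynamic" point above.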

Krinkle added a subscriber: Halfak. Dec 18 2017, 5:33 PM

Thanks for the help. I will make the patches for that.

Change 399109 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] icinga: Add scoring-team for alerts of ores-extension

https://gerrit.wikimedia.org/r/399109

gerritbot added a project: Patch-For-Review. Dec 18 2017, 10:51 PM

Change 399109 merged by Dzahn:
[operations/puppet@production] icinga: Add scoring-team for alerts of ores-extension

https://gerrit.wikimedia.org/r/399109

Everything seems fine now. I wish we could build a similar alerting system for the beta cluster as well, but all metrics are dead there: https://grafana-labs.wikimedia.org/dashboard/db/ores-extension?orgId=1 https://grafana-labs.wikimedia.org/dashboard/db/ores-beta-cluster?orgId=1&from=now-7d&to=now
I will look into this later.

Ladsgroup moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board. Dec 20 2017, 1:16 PM

Krinkle unsubscribed. Dec 22 2017, 6:18 PM

Ladsgroup moved this task from Incoming to Done on the User-Ladsgroup board. Dec 28 2017, 11:08 PM

awight mentioned this in Blog Post: Status Update (January 30, 2018). Jan 30 2018, 7:03 PM

Halfak closed this task as Resolved. Jan 30 2018, 8:32 PM

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Apr 28 2020, 9:50 PM

Maintenance_bot removed a project: Patch-For-Review. Apr 28 2020, 10:15 PM
