Create Nagios Grafana alert checks
Closed, Resolved, Public

Description

Goal: Make it possible for Icinga to get alerts from Grafana.

We will start to use it for alerts for WebPageTest and Navigation Timing (and related) metrics.

@faidon pointed me to the current Graphite check as a starting point to understand how it works:
https://github.com/wikimedia/operations-puppet/blob/production/modules/nagios_common/files/check_commands/check_graphite

The alerts are available through the JSON API in Grafana: https://grafana.wikimedia.org/api/alerts
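A rough sketch of what such a check could look like, written as a Nagios-style plugin in Python. Several things here are assumptions rather than anything confirmed in this task: that the API returns a JSON list of alert objects carrying "name" and "state" fields, that the state "alerting" marks a firing alert, and the helper names (evaluate_alerts, fetch_alerts, main) and message wording, which are purely illustrative. Only the exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
import json
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_alerts(alerts, dashboard_url):
    """Map a list of alert dicts to a (exit_code, message) pair.

    Assumption: each alert is a dict with "name" and "state" keys,
    and state == "alerting" means the alert is currently firing.
    """
    firing = [a.get("name", "unnamed") for a in alerts
              if a.get("state") == "alerting"]
    if firing:
        return CRITICAL, "CRITICAL: %s is alerting: %s." % (
            dashboard_url, ", ".join(firing))
    return OK, "OK: %s is not alerting." % dashboard_url


def fetch_alerts(api_url, timeout=10):
    """Fetch and decode the JSON alert list from the given URL."""
    with urllib.request.urlopen(api_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def main(api_url, dashboard_url):
    """Print a one-line status and return the Nagios exit code."""
    try:
        alerts = fetch_alerts(api_url)
    except Exception as exc:
        # Any fetch or parse failure is reported as UNKNOWN, not CRITICAL.
        print("UNKNOWN: could not query %s: %s" % (api_url, exc))
        return UNKNOWN
    code, message = evaluate_alerts(alerts, dashboard_url)
    print(message)
    return code
```

A wrapper script would call something like sys.exit(main(api_url, dashboard_url)) so that Icinga sees the exit code. Keeping evaluate_alerts free of network access makes the state mapping easy to test on its own.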

Event Timeline

Gilles triaged this task as High priority. Mar 9 2017, 1:13 PM
Gilles lowered the priority of this task from High to Low. Mar 9 2017, 1:32 PM

First, we should create the alerts with the existing Graphite adapter by copying the thresholds we set up in Grafana into Puppet, and then see how often we need to update them.

Gilles raised the priority of this task from Low to High. Mar 13 2017, 9:31 AM
Gilles moved this task from Backlog: Maintenance to Doing (old) on the Performance-Team board.

Reproducing the Grafana alerts as Graphite checks failed, so now I'm working on this alert "forwarding" from Grafana to Nagios.

This is set up now. The IRC component works on #wikimedia-perf-bots:

[15:39:35] <icinga-wm> [2017-03-24 17:21:45] PROBLEM - https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts is alerting: Difference in size authenticated.

[15:39:35] <icinga-wm> [2017-03-25 00:25:44] RECOVERY - https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts is not alerting.

We're supposed to receive alert emails as well on our team mailing list, but I haven't seen that first alert make it through.

Looking at https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts?refresh=5m&panelId=49&fullscreen&orgId=1&from=now-7d&to=now, we can see that the dashboard was alerting when the Puppet change to forward alerts was merged, and that it recovered exactly when the recovery message reached IRC. So it works! I left a comment on the changeset about the mailing list not working: https://gerrit.wikimedia.org/r/#/c/342431/