Currently we use Grafana to view metrics and set alert rules and alert intervals.
When Grafana detects an issue, it notifies no one, because we have not configured its notification system. The alerts are only visible when actively looking for them in the web interface, or via the API.
The Icinga system is then configured to poll the Grafana API at an interval. When alerts-firing is non-empty, the check enters state CRITICAL for the host it runs on ("einsteinium.eqiad.wmnet"), and Icinga sends notifications by IRC to #wikimedia-perf-bots and by e-mail to the perf-team list.
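The mapping from "alerts firing" to an Icinga check state can be sketched roughly as follows. This is a minimal illustration, not the check we actually run: the function name `evaluate_alerts`, the `firing_state` parameter, and the assumption that each alert is a dict with "name" and "state" keys (with "alerting" meaning firing) are all hypothetical. The exit codes are the standard Nagios/Icinga plugin ones.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_alerts(alerts, firing_state=CRITICAL):
    """Map a list of Grafana alerts to (exit_code, status_line).

    `alerts` is assumed to be the decoded JSON list returned by polling
    Grafana, where each entry has "name" and "state" keys. That payload
    shape is an assumption for this sketch.
    """
    # Collect the names of alerts that are currently firing.
    firing = [a["name"] for a in alerts if a.get("state") == "alerting"]
    if not firing:
        return OK, "OK: no Grafana alerts firing"
    # Any firing alert puts the whole check into `firing_state`.
    line = "%s: %d alert(s) firing: %s" % (
        LABELS[firing_state], len(firing), ", ".join(firing))
    return firing_state, line
```

A plugin wrapper would print the status line and exit with the returned code. Passing `firing_state=WARNING` shows what mapping alerts to WARN instead of CRIT would look like.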
So far so good, except that when Opsen review the Icinga global dashboard, any metrics that are currently in a bad state show up as "unhandled". This is because Icinga uses a concept of acknowledgement internally, which isn't really part of our workflow. We just respond to the e-mail on our list with our investigation and, based on that, either:
- decide to file a task (and work on it, if high prio), then archive the mail.
- explain by e-mail why it's fine as-is, then wait to confirm it recovers, and then archive the mail.
Requirements:
- We should still have IRC and e-mail notifications for alerts from Grafana dashboards.
- The alerts should not show up as "un-acknowledged" for Opsen on the Icinga global dashboard.
Options:
- Change the Icinga-Grafana layer to map alert state to WARN instead of CRIT. This would bypass the need for ACK, but according to @Dzahn only CRITs get IRC notifications. I don't know if that is configurable.
- Some way to exclude Grafana alerts from the Icinga dashboard.
- Some way to auto-acknowledge Grafana alerts.
- Just get in the habit of acknowledging them via the Icinga web UI? I personally don't mind, but the issue is that we don't look at it regularly, so it's quite easy to forget, and we'd likely still get the occasional ping from Ops.
- Stop using Icinga and notify by E-mail and IRC directly from Grafana?
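To illustrate the auto-acknowledge option, here is a rough sketch that builds a Nagios-style `ACKNOWLEDGE_SVC_PROBLEM` external command line. It assumes our Icinga accepts Nagios-compatible external commands, which I have not verified for our setup. The helper name `format_ack_command` and the service description and author used in the usage note are hypothetical, and the meaning of the `sticky`, `notify` and `persistent` values should be checked against the Icinga documentation for the version in use.

```python
import time


def format_ack_command(host, service, author, comment,
                       now=None, sticky=2, notify=0, persistent=1):
    """Build one ACKNOWLEDGE_SVC_PROBLEM line for the external command file.

    Field order follows the Nagios external command format:
    host;service;sticky;notify;persistent;author;comment.
    Fields are joined with ';', so none of them may contain a semicolon.
    """
    for field in (host, service, author, comment):
        if ";" in field:
            raise ValueError("fields must not contain ';': %r" % field)
    # Commands are prefixed with the current Unix timestamp in brackets.
    ts = int(time.time() if now is None else now)
    return "[%d] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;%d;%d;%d;%s;%s" % (
        ts, host, service, sticky, notify, persistent, author, comment)
```

For example, `format_ack_command("einsteinium", "grafana-alerts", "perf-bot", "handled on perf-team list")` yields a line that a small script could append to the Icinga command file whenever the Grafana check goes CRITICAL.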
Personally, I favour the fifth option (stop using Icinga for this and notify directly from Grafana). I think we should be careful to do this in the way that is least confusing for everyone involved. We should also still allow other users of Grafana to use Icinga as they do today for service-critical notifications that are clearly actionable and that benefit from the phone/paging integration Icinga offers.
But for us, Grafana's built-in notifications for metrics-only alerts could be quite useful. It would mean we get graphs inside the e-mails (as Grafana does), among other things, which seems nice.