
Revisit Grafana/Icinga notification strategy
Open, Medium, Public

Description

Currently we use Grafana to view metrics and set alert rules and alert intervals.

When Grafana detects an issue, it does nothing, because we've not configured its notification system. The alerts are only seen when someone actively looks for them in the web interface or via the API.

The Icinga system is then configured to poll the Grafana API at an interval; when the set of firing alerts is non-empty, the check enters state CRITICAL on the host it runs on ("einsteinium.eqiad.wmnet") and sends notifications by IRC to #wikimedia-perf-bots and by e-mail to the perf-team list.
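
For illustration, the bridge boils down to a standard Nagios-style plugin: poll the Grafana alert API and translate "any alerts firing" into a CRITICAL exit code. Below is a minimal sketch of that idea, not the actual Wikimedia check script; the endpoint parameters, dashboard id, and lack of authentication are assumptions.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios/Icinga check that polls Grafana's (legacy) alerting
API and goes CRITICAL when any alert on a given dashboard is firing."""
import sys
import requests

GRAFANA_URL = "https://grafana.wikimedia.org"  # assumed base URL
DASHBOARD_ID = 123                             # hypothetical dashboard id

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def main():
    try:
        # Legacy Grafana alerting lists alerts via /api/alerts; filtering by
        # dashboardId and state=alerting is an assumption for this sketch.
        resp = requests.get(
            f"{GRAFANA_URL}/api/alerts",
            params={"dashboardId": DASHBOARD_ID, "state": "alerting"},
            timeout=10,
        )
        resp.raise_for_status()
        firing = resp.json()
    except requests.RequestException as exc:
        print(f"UNKNOWN: could not query Grafana: {exc}")
        return UNKNOWN

    if firing:
        names = ", ".join(alert.get("name", "?") for alert in firing)
        print(f"CRITICAL: {len(firing)} alert(s) firing: {names}")
        return CRITICAL
    print("OK: no alerts firing")
    return OK


if __name__ == "__main__":
    sys.exit(main())
```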

So far so good, except that when Opsen review the Icinga global dashboard, any metrics currently in a bad state show up as "unhandled". This is because Icinga internally uses a concept of acknowledgement, which isn't really part of our workflow. We just respond to the e-mail on our list with our investigation and, based on that, either:

  • decide to file a task and work on it, then archive the mail (if high priority), or
  • explain by e-mail why it's fine as-is, wait to confirm it recovers, and then archive the mail.

Outcome

  • We should still have IRC and E-mail notifications for alerts from Grafana dashboards.
  • The alerts should not show up as "un-acknowledged" for Opsen on the Icinga global dashboard.

Ideas:

  1. Change the Icinga-Grafana layer to map alert state to WARN instead of CRIT. This would bypass the need for ACK, but according to @Dzahn only CRITs get IRC notifications. I don't know if that is configurable.
  2. Some way to exclude Grafana alerts from the Icinga dashboard.
  3. Some way to auto-acknowledge Grafana alerts.
  4. Just get in the habit of acknowledging them via the Icinga web UI? I personally don't mind, but the issue is, we never look at it regularly, so it's quite easy to forget and we'd likely still get the occasional ping from Ops.
  5. Stop using Icinga and notify by E-mail and IRC directly from Grafana?

Personally, I'm most in favour of option 5. I think we should be careful to do this in a way that is the least confusing for everyone involved. And we should still allow other users of Grafana to use Icinga as they do today for service-critical notifications that are clearly actionable and also benefit from the phone/paging integration that Icinga offers.

But for us, the built-in integration for metrics-only alerts could be quite useful. It would also mean we get graphs inside the e-mails (as Grafana provides) and other such things, which seems nice.
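
To make option 5 concrete, here is a rough sketch of how an e-mail notification channel could be created through Grafana's legacy alert-notification HTTP API. The endpoint and payload shape may differ by Grafana version, and the API key, channel name, and address are placeholders, not our current configuration; IRC delivery would additionally need a webhook-to-IRC relay or similar.

```python
#!/usr/bin/env python3
"""Sketch: create an e-mail notification channel directly in Grafana,
so dashboard alerts can notify a team without going through Icinga."""
import requests

GRAFANA_URL = "https://grafana.wikimedia.org"  # assumed base URL
API_KEY = "REPLACE_ME"                         # placeholder Grafana API key

channel = {
    "name": "perf-team e-mail",                # hypothetical channel name
    "type": "email",
    "isDefault": False,
    "settings": {"addresses": "performance-team@example.org"},  # placeholder
}

resp = requests.post(
    f"{GRAFANA_URL}/api/alert-notifications",  # legacy alerting endpoint
    json=channel,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
resp.raise_for_status()
print("Created notification channel id:", resp.json().get("id"))
```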


Event Timeline

Krinkle created this task. Sep 4 2018, 5:08 PM
Restricted Application added a subscriber: Aklapper. Sep 4 2018, 5:08 PM
Krinkle renamed this task from Revisit Grafana/Icigna notification strategy to Revisit Grafana/Icinga notification strategy. Sep 4 2018, 5:09 PM
Gilles added a comment. Sep 4 2018, 6:23 PM

Ops previously rejected the idea of using the built-in Grafana notification system, as they don't want to maintain another alert/notification/paging system, which is why we ended up building the Icinga bridge.

Have there been any recent complaints from Ops about these alerts being unacknowledged? I think they've learned by now to ignore them.

Have there been any recent complaints from Ops about these alerts being unacknowledged?

Indeed. The conversation in #wikimedia-operations led to me filing this ticket, and can be summarised as:

Task description:

The alerts should not show up as "un-acknowledged" for Opsen on the Icinga global dashboard.

Right now they appear as einsteinium | PROBLEM | CRITICAL | ..

The last one to be unaware of this was me while checking un-handled criticals in Icinga :)

Since they were all about timings spiking over the weekend, I thought to ping people but not to alert anybody; I was not aware that these alerts are not meant to be handled by Ops. From my point of view, having criticals in Icinga that rely on people knowing they are not actionable/important is a recipe for failure: there will always be people unaware of this "rule" trying to see if anything is happening or should be done. I do get that this is a compromise reached in the past to avoid maintaining alerts in Grafana itself (without the Icinga bridge), but in my opinion this choice should be re-examined by the Infrastructure team, or another solution found.

according to @Dzahn only CRITs get IRC notifications. I don't know if that is configurable.

But wouldn't that be a feature? I mean, don't you want to disable IRC notifications anyways?

Change 459864 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga::performance: remind users to ignore checks using notes_url

https://gerrit.wikimedia.org/r/459864

Change 459862 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] monitoring: enable using notes_url with grafana_alert

https://gerrit.wikimedia.org/r/459862

according to @Dzahn only CRITs get IRC notifications. I don't know if that is configurable.

But wouldn't that be a feature? I mean, don't you want to disable IRC notifications anyways?

Quite the opposite, actually. We configure alerts in Grafana so that we are notified when a metric crosses its threshold, and we'd like those notifications via IRC (#wikimedia-perf-bots) and via e-mail (team alias). That's currently working as intended.

Ok, gotcha.

I think the best solution is probably to use event_handlers to let Icinga auto-ACK these checks. That would keep the info on IRC (and in e-mail) as-is, but clearly mark them as "handled" in the web UI. It would make the situation obvious for SRE, and you still wouldn't have to ACK manually.

This would be like https://serverfault.com/questions/543199/how-to-make-a-persistent-acknowledgment-in-icinga-nagios

Second best, but easier, is adding a link using notes_url (also: T197873) as in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/459864/, but that first needs this parameter in the grafana_alert class; I'm adding that in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/459862/
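
For illustration, a minimal sketch of the auto-ACK event handler approach described above. It assumes Icinga 1's external command file and the standard Nagios macros passed as arguments; the command-file path, author name, and comment text are placeholders, not an existing script.

```python
#!/usr/bin/env python3
"""Sketch of an Icinga event handler that acknowledges a Grafana-bridge
check once it reaches a HARD CRITICAL state, so it shows as "handled"
in the web UI while IRC/e-mail notifications still go out.

Hypothetical usage in the service definition:
    auto_ack.py $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$ $SERVICESTATETYPE$
"""
import sys
import time

CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"  # assumed external command file


def main(host, service, state, state_type):
    # Only acknowledge once the problem is confirmed (HARD CRITICAL).
    if state != "CRITICAL" or state_type != "HARD":
        return 0
    now = int(time.time())
    # ACKNOWLEDGE_SVC_PROBLEM;<host>;<service>;<sticky>;<notify>;<persistent>;<author>;<comment>
    cmd = (
        f"[{now}] ACKNOWLEDGE_SVC_PROBLEM;{host};{service};"
        f"1;0;1;auto-ack;Grafana alert, handled by the owning team\n"
    )
    with open(CMD_FILE, "w") as f:  # the command file is a named pipe
        f.write(cmd)
    return 0


if __name__ == "__main__":
    if len(sys.argv) < 5:
        print("usage: auto_ack.py HOST SERVICE STATE STATETYPE")
        sys.exit(3)
    sys.exit(main(*sys.argv[1:5]))
```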

fgiunchedi added a subscriber: fgiunchedi.

Adding observability for visibility/discussion. This is unfortunately one of the cases where Icinga's "multi-tenancy" model breaks down, namely that the list of alerts SRE looks at in the UI isn't SRE-specific but lists all alerts for all teams.

Dzahn added a comment. Sep 12 2018, 2:13 PM

That's partially just our configuration, though, since we make all SREs a contact for all services. We could just as well add proper host groups and service groups and only look at those pages. Also, we wouldn't be looking at them if they were marked as handled. I don't think it's particularly Icinga's fault.

Change 459862 merged by Dzahn:
[operations/puppet@production] monitoring: enable using notes_url with grafana_alert

https://gerrit.wikimedia.org/r/459862

Change 459864 merged by Dzahn:
[operations/puppet@production] icinga::performance: remind users to ignore checks using notes_url

https://gerrit.wikimedia.org/r/459864

I added a link to this ticket next to the performance related grafana checks in the Icinga web UI, using notes_url. This was possible after adding the notes_url parameter to the grafana_alert class (second merge above).

This can now be seen at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=alerts

(searching for the string "alerts" is an easy way to find them grouped together)

See the new yellow folder icon next to the alerts.

MoritzMuehlenhoff triaged this task as Medium priority. Sep 26 2018, 11:49 AM
Krinkle added a comment (edited). Jul 12 2019, 5:47 PM

In recent weeks there have been three or four occasions where an SRE kindly informed us about an "on-going alert". I'm not sure if something changed in Icinga, or whether SRE is paying attention more closely, or something else entirely, but it's clear that the current situation is causing confusion and is wasting SRE's and our team's time.

Perhaps it's time to revisit this issue and find a suitable solution in which:

  • SRE does not get distracted by seemingly "critical" alerts that are, in fact, not even remotely critical.
  • Perf gets emails to their team list and IRC pings in wikimedia-perf-bots when Grafana perf monitoring finds potential regressions.

Thanks for the feedback, @Krinkle! I agree it is suboptimal and distracting at the moment. I'm not sure what the right answer is yet. However, the issue of multiple systems issuing alerts (and, in general, true multi-tenancy for alerting) has come up in the past and will be part of this quarter's goals for observability, namely the alerting roadmap at T228379: Improve our alerting capabilities (Q1 goal FY19-20). HTH!

o/
From the WMDE side, we would love to be able to set up more alerts for more things, and Grafana could be a great place for this.

We actually just set up an Icinga check in labs for monitoring our termbox SSR service in beta.

Today we would have been helped by having monitoring on the response times of the Wikidata APIs, with alerts for anything too high, or even for a drop in Wikidata edits below 100 per minute.

Has there been much progress on this topic in the past year? A summary here would be great. It looks like things have happened, but reading through the tickets, I'm not sure what.

This task is a placeholder for improving the current situation. It is not a blocker. Various teams at WMF use Grafana for alerting and I encourage WMDE to do this as well.

The current methodology is:

  1. Create a "<something> Alerts" dashboard.
  2. Create an Icinga notifier ("contact group") in Puppet, e.g. a simple e-mail contact that just e-mails your team in some way, and/or a notifier to an IRC channel.
  3. Add a line to a file like this in Puppet that connects that Grafana dashboard to your Icinga notifier.
herron added a subscriber: herron. Oct 30 2019, 8:03 PM

Thanks!
I'm going to do this right now

Change 547404 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/puppet@production] Setup wikidata alerts from grafana dashboard

https://gerrit.wikimedia.org/r/547404

Change 547404 merged by Alexandros Kosiaris:
[operations/puppet@production] Setup wikidata alerts from grafana dashboard

https://gerrit.wikimedia.org/r/547404

Thanks for reaching out, @Addshore! It looks like the immediate issue is being solved (thanks @Krinkle, @akosiaris).

I'll give a high-level update/summary on what's happening in Observability around notification/escalation/etc. We've worked on the alerting infrastructure roadmap (feedback welcome!) to communicate the work ahead and provide a common vision for the future of alerting.

Ultimately the goal is to have Grafana alerts be another source of notifications (alongside e.g. Icinga, Prometheus, LibreNMS, etc.). In Q2 of FY19/20 we're focusing our efforts on alert escalation, which will help us, among other things, make sure notifications reach whoever is interested in them.

HTH!

fgiunchedi moved this task from Inbox to Backlog on the observability board. Jul 6 2020, 2:08 PM
ema added a subscriber: ema. Jul 10 2020, 1:47 PM

This came up again today. Due to my very short memory, I forgot all about the performance-team alerts and started complaining that we shouldn't have critical entries in Icinga for things that aren't directly actionable and operationally important. @Krinkle patiently explained the story once again and pointed me here.

@fgiunchedi: any updates/new ideas on how we can make this better? My entirely uninformed opinion is that having Grafana send email/irc notifications would be a viable option.

My entirely uninformed opinion is that having Grafana send email/irc notifications would be a viable option.

The issue with that is that we have no sane way of managing user+e-mail destinations in Grafana right now, but we do in Puppet+Icinga. To give an idea, just last week I noticed a former colleague (who hasn't been around for years) as a recipient of alerts. Those alerts have probably been bouncing for years, as their destination e-mail was deleted by OIT. The reason it's a one-off is exactly because we don't support this pattern. That does not mean that the current situation is particularly great, just that we need to figure out a proper way of managing alerts+e-mails in Grafana before we allow that to happen. Alternatively (and vastly preferable to me), we could find some other way of defining alerts+e-mails in code in order to maintain them more easily, e.g. using Prometheus Alertmanager.