Set up grafana alerting for services
Closed, Resolved · Public

Description

Grafana now has alerting capabilities, which are significantly more user-friendly than defining icinga alerts on graphite data via puppet. However, per T153167, email notifications in grafana are not operational, so the only way to use grafana alerts is via a per-dashboard indirection in nagios.

As a first step, we would like to be alerted for the following dashboards:

  • restbase
  • api-summary
  • services-alerts
  • eventbus

For reference, the original commit to set up alerting for performance is this one:

https://github.com/wikimedia/puppet/commit/401973ab7f79fd4567749fe074ccce1d47446581

The alerts should hit 'team-services' in nagios.
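To illustrate what a per-dashboard indirection could look like, here is a minimal sketch of a nagios-style check that collapses all alert states of one dashboard into a single result. It is not the check from the commit linked above. The function names, the `name`/`state` dict shape, and the `/api/alerts?dashboardId=` endpoint are assumptions modeled on Grafana's legacy alerting HTTP API and should be verified against the installed Grafana version.

```python
import json
import urllib.parse
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def summarize_alerts(dashboard, alerts):
    """Collapse a dashboard's alert states into one Nagios result.

    `alerts` is a list of dicts with "name" and "state" keys; that
    shape is an assumption modeled on Grafana's legacy alert list.
    """
    if not alerts:
        return UNKNOWN, f"UNKNOWN: no alerts defined on dashboard {dashboard}"
    # Any panel in the "alerting" state makes the whole dashboard critical.
    firing = sorted(a["name"] for a in alerts if a.get("state") == "alerting")
    if firing:
        return CRITICAL, f"CRITICAL: {dashboard} is alerting: {', '.join(firing)}"
    return OK, f"OK: {dashboard} has no alerting panels"


def fetch_alerts(base_url, dashboard_id):
    """Fetch alert states for one dashboard (assumed endpoint, hypothetical helper)."""
    query = urllib.parse.urlencode({"dashboardId": dashboard_id})
    with urllib.request.urlopen(f"{base_url}/api/alerts?{query}", timeout=10) as resp:
        return json.load(resp)
```

Because the result is per dashboard rather than per panel, a single nagios service (and therefore a single contact group such as 'team-services') covers every alert defined on that dashboard.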

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 12 2017, 3:00 AM
GWicke updated the task description. · Apr 12 2017, 3:00 AM
GWicke updated the task description. · Apr 12 2017, 3:05 AM
GWicke updated the task description. · Apr 12 2017, 3:22 AM
fgiunchedi triaged this task as Medium priority. · Apr 12 2017, 7:58 AM

@Halfak pointed out that using this to check the retry topic rate for ORES would help them.

Our task: T167830: Extend icinga check to catch 500 errors like those of the 20170613 incident

Essentially, intermittent 500s were not getting caught by the icinga check quickly enough, so we'd like to catch them via ChangeProp, since it hits ORES really fast :)

@Pchelolo, @Halfak: If you want to go the grafana route for ease of modification, I would recommend setting up a separate dashboard for ORES retries. You can then make that dashboard alert the right ORES group (only one group per dashboard), and tweak / add alerts as needed in that dashboard.

Change 362567 had a related patch set uploaded (by GWicke; owner: GWicke):
[operations/puppet@production] Set up grafana dashboard monitoring for services

https://gerrit.wikimedia.org/r/362567

Change 362567 merged by Alexandros Kosiaris:
[operations/puppet@production] Set up grafana dashboard monitoring for services

https://gerrit.wikimedia.org/r/362567

GWicke closed this task as Resolved. · Jul 5 2017, 6:34 PM
GWicke claimed this task.

I just verified that this is working by temporarily lowering the alert threshold in one of the dashboards.

The services team will now receive all alerts for the dashboards listed in the task description.

mobrovac added a subscriber: mobrovac.

Really, really nice to have this finally. Thank you, @GWicke!