Set up Grafana alert for job error rate
Closed, Resolved, Public

Description

There should be an alert so that people can notice things like a 30x increase in the JobRunner error rate. The logs are server-local and thus not in Logstash. An alert would at least tell people that there is an issue and that they should check the logs.

Event Timeline

aaron triaged this task as Low priority. Aug 16 2017, 7:14 PM
aaron moved this task from Inbox to Next: Goal-oriented on the Performance-Team board.
aaron removed aaron as the assignee of this task. Oct 2 2017, 6:08 PM

I suppose we can use jobrunner.runner-status.error.rate, sumSeries(jobrunner.pop.*.failed.*.rate), and sumSeries(jobrunner.pop.*.ok.*.rate) to make alerts in a Grafana dashboard.
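As a rough illustration of how the failed and ok series could feed an alert condition, here is a minimal Python sketch. The function names and the 5% threshold are hypothetical and not taken from any existing dashboard; the summing step only mimics what sumSeries() does across the matched series.

```python
def failure_ratio(failed_rates, ok_rates):
    """Return failed / (failed + ok) after summing each group of series.

    Summing the inputs mirrors sumSeries(jobrunner.pop.*.failed.*.rate)
    and sumSeries(jobrunner.pop.*.ok.*.rate) for a single point in time.
    """
    failed = sum(failed_rates)
    ok = sum(ok_rates)
    total = failed + ok
    # With no jobs at all there is nothing to alert on.
    return failed / total if total > 0 else 0.0


def should_alert(failed_rates, ok_rates, threshold=0.05):
    """Hypothetical alert rule: fire when the failure ratio exceeds threshold."""
    return failure_ratio(failed_rates, ok_rates) > threshold
```

A ratio is used here rather than the raw failure rate so that the condition does not depend on whether the inputs are per second or per minute; an alert on the absolute rate would need a threshold chosen for the unit in use.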

@aaron Nice. I recall there being an open issue about the absence of these logs (T172479), but I realise now that that is only about error details (which we could send to Logstash at some point). The counts are already available in Graphite indeed.

For the record, the Job Queue Health dashboard uses the same metric for its error graph:

https://grafana.wikimedia.org/dashboard/db/job-queue-health?panelId=17&fullscreen

I started a quick dashboard at https://grafana.wikimedia.org/dashboard/db/job-queue-alerts?orgId=1&from=now-12h&to=now with some alerts.

The job failure rate shows as 0/min in Graphite, which is suspect, though I can't see an obvious reason for that.

Compared to https://grafana-admin.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&panelId=17&fullscreen&edit, the query needs scale(60) to show an accurate rate per minute. Right now it shows an average rate per second. (It could also use aliasByNode(2) for clarity.)

See https://wikitech.wikimedia.org/wiki/Graphite#Extended_properties for more about .rate. Our smallest aggregation window in Graphite (for the last 7 days) is 1 minute. It would have been nice for consistency if statsd's rate were also per-minute, but alas, statsd standardises on per-second instead.
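The unit conversion that scale(60) performs can be sketched in Python as follows. This is only an illustration of the arithmetic, not Graphite's implementation; the function name is hypothetical, and missing datapoints are assumed to arrive as None.

```python
SECONDS_PER_MINUTE = 60


def per_minute(per_second_values):
    """Convert per-second rates to per-minute rates, like scale(60).

    Missing datapoints (None) are passed through unchanged.
    """
    return [
        None if value is None else value * SECONDS_PER_MINUTE
        for value in per_second_values
    ]
```

For example, a datapoint of 0.5 per second corresponds to 30 per minute, so a graph labelled "/min" that plots the unscaled value understates the rate by a factor of 60.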

Krinkle assigned this task to aaron.