There should be an alert so that people notice things like a 30x increase in the error rate from JobRunner. The logs are server-local and thus not in Logstash. An alert would at least tell people that there is an issue and that they should check the logs.
@aaron Nice. I recall there being an open issue about the absence of these logs (T172479), but I realise now that that is only about error details (which we could send to Logstash at some point). The counts are already available in Graphite indeed.
For the record, the Job Queue Health dashboard uses the same metric for its error graph:
I started a quick dashboard at https://grafana.wikimedia.org/dashboard/db/job-queue-alerts?orgId=1&from=now-12h&to=now with some alerts.
The job failure rate shows as 0/min in Graphite, which is suspect, though I can't see an obvious reason for that.
Compared to https://grafana-admin.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&panelId=17&fullscreen&edit, it needs scale(60) to be an accurate rate per minute. Right now it is an average rate per second. (Could also use aliasByNode(2) for clarity.)
See https://wikitech.wikimedia.org/wiki/Graphite#Extended_properties for more about .rate. Our smallest aggregation window in Graphite (for last 7 days) is 1 minute. It would've been nice for consistency if statsd's rate were also per-minute, but alas, statsd standardises on per-second instead.
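To make the per-second vs. per-minute point concrete, here is a rough sketch of the arithmetic in Python. It is not how Graphite or statsd is implemented here; the function names, the 60-second flush interval, and the sample numbers are all assumptions for illustration. It only mimics what a statsd-style per-second rate looks like and what multiplying by 60 (as Graphite's scale(60) does) gives you.

```python
from typing import List, Optional, Sequence


def rate_per_second(count: int, flush_interval_s: float) -> float:
    """Per-second rate for a counter, statsd-style: count / interval length.

    flush_interval_s is an assumed value; use whatever the real setup flushes at.
    """
    return count / flush_interval_s


def scale(series: Sequence[Optional[float]], factor: float) -> List[Optional[float]]:
    """Multiply every datapoint by a constant, keeping gaps (None) as gaps.

    This mirrors the effect of Graphite's scale() on a single series.
    """
    return [None if v is None else v * factor for v in series]


# Hypothetical example: 30 failed jobs counted during one 60-second flush.
per_second = rate_per_second(30, flush_interval_s=60.0)  # 0.5 failures/sec

# A short hypothetical series of per-second rates, with one missing datapoint.
per_second_series = [per_second, None, 0.25]

# Scaling by 60 turns "average per second" into "per minute".
per_minute_series = scale(per_second_series, 60)
print(per_minute_series)  # [30.0, None, 15.0]
```

So a graph labelled "per minute" that plots the unscaled .rate value would understate the failure rate by a factor of 60.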