Add alerts for Logstash rates in production
Closed, Resolved, Public


Scap monitors Logstash error rates from canary servers during a deployment. However, problems are not always triggered by a deployment. They may be triggered by an external factor, or a cron job, or may only reveal themselves after a certain cache is purged or expired, etc.

As such, we should have an Icinga alert (based on Graphite, Prometheus or Grafana?) that triggers when the WARNING or ERROR rate of MediaWiki logs rises above a certain threshold for a prolonged period of time.
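
To make "above a certain threshold for a prolonged period" concrete, here is a minimal sketch of the check, assuming per-minute rate samples have already been fetched from the metrics backend. The function name, the threshold and the window length are all hypothetical and not taken from any existing Icinga check.

```python
from typing import Sequence


def should_alert(
    rates_per_minute: Sequence[float],
    threshold: float,
    sustained_minutes: int,
) -> bool:
    """Return True if the most recent `sustained_minutes` samples all exceed `threshold`.

    All names and values here are illustrative assumptions, not an existing check.
    """
    if sustained_minutes <= 0:
        raise ValueError("sustained_minutes must be positive")
    # Not enough history yet to say the breach has been sustained.
    if len(rates_per_minute) < sustained_minutes:
        return False
    recent = rates_per_minute[-sustained_minutes:]
    # A single sample at or below the threshold resets the condition,
    # so short spikes do not trigger the alert.
    return all(rate > threshold for rate in recent)
```

For example, with a threshold of 100 errors per minute and a 3-minute window, `should_alert([5, 120, 130, 140], 100, 3)` fires, while `should_alert([5, 120, 90, 140], 100, 3)` does not because the rate dipped below the threshold inside the window.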

This would be similar to the alerts that we already have for MediaWiki exceptions.

This is actionable from

Event Timeline

Fjalapeno added a subscriber: Fjalapeno.

Adding Operations, but not sure where this task should actually go

herron triaged this task as Medium priority. Jul 17 2018, 3:28 PM
herron added a subscriber: fgiunchedi.

It looks like Logstash rates are in Graphite at the moment, so the options are either a Grafana alert with its counterpart in Puppet, or a Graphite alert. Both would work fine, I think; the Grafana alert has the advantage that tuning thresholds is self-service (i.e. no Puppet merge required).

fgiunchedi claimed this task.

We now have Icinga alerts for MediaWiki error rates, based on Prometheus metrics (via Logstash -> statsd -> Prometheus):


12:49 -icinga-wm:#wikimedia-operations- PROBLEM - MediaWiki exceptions and fatals per 
          minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter 
          level=ERROR site=eqiad