Add alerts for Logstash rates in production
Closed, Resolved · Public

Description

Scap monitors Logstash error rates from canary servers during a deployment. However, problems are not always triggered by a deployment. They may be triggered by an external factor, or a cron job, or may only reveal themselves after a certain cache is purged or expired, etc.

As such, we should have an Icinga alert (based on Graphite, Prometheus or Grafana?) that triggers when the WARNING or ERROR rate of MediaWiki logs rises above a certain threshold for a prolonged period of time.

This would be similar to the alerts that we already have for MediaWiki exceptions.
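
To make the "above a threshold for a prolonged period" condition concrete, here is a minimal sketch in Python. It is not the actual Icinga, Graphite or Grafana check; the class name `SustainedThreshold`, the per-minute sampling and the numbers used are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SustainedThreshold:
    """Hypothetical rule: fire only when the rate stays high for a while."""

    threshold: float  # assumed unit: log events per minute
    min_samples: int  # consecutive samples that must exceed the threshold

    def __post_init__(self) -> None:
        if self.min_samples < 1:
            raise ValueError("min_samples must be at least 1")

    def is_firing(self, rates: Sequence[float]) -> bool:
        """Return True if the latest min_samples samples all exceed threshold."""
        if len(rates) < self.min_samples:
            # Not enough history yet to call the increase "prolonged".
            return False
        recent = rates[-self.min_samples:]
        return all(rate > self.threshold for rate in recent)


# Illustrative values only: 100 events/minute sustained for 3 samples.
rule = SustainedThreshold(threshold=100.0, min_samples=3)
spike = [10.0, 500.0, 20.0, 30.0]        # one-off spike, then back to normal
sustained = [10.0, 150.0, 200.0, 180.0]  # stays elevated
```

With these inputs, `rule.is_firing(spike)` is false and `rule.is_firing(sustained)` is true, which is the distinction the task asks for: a short burst should not page anyone, a lasting increase should.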


This is an actionable from https://wikitech.wikimedia.org/wiki/Incident_documentation/20180710-MediaWiki.

Event Timeline

Krinkle created this task. Jul 12 2018, 9:37 PM
Restricted Application added a subscriber: Aklapper. Jul 12 2018, 9:37 PM
Fjalapeno added a subscriber: Fjalapeno.

Adding Operations, but not sure where this task should actually go.

herron triaged this task as Medium priority. Jul 17 2018, 3:28 PM
herron added a subscriber: fgiunchedi.

Looks like Logstash rates are in Graphite ATM, so this could be either a Grafana alert (with its counterpart in Puppet) or a Graphite alert. Both would work fine I think; the Grafana alert has the advantage that tuning thresholds is self-service (i.e. no Puppet merge required).

fgiunchedi closed this task as Resolved. Jul 6 2020, 2:11 PM
fgiunchedi claimed this task.

We have Icinga alerts for MediaWiki error rates nowadays, based on Prometheus metrics (via logstash -> statsd -> prometheus).

e.g.

12:49 -icinga-wm:#wikimedia-operations- PROBLEM - MediaWiki exceptions and fatals per 
          minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter 
          level=ERROR site=eqiad 
          https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
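
For readers unfamiliar with the logstash -> statsd -> prometheus path mentioned above, the statsd hop boils down to emitting one counter increment per log event, which an exporter can then expose to Prometheus. The sketch below only formats such a counter line using the standard statsd `name:value|c` syntax. The metric prefix `example.log_rate` and the function name are hypothetical, not the names used in production.

```python
import re


def statsd_counter_line(channel: str, level: str, count: int = 1) -> str:
    """Format a statsd counter increment for one log channel and level.

    The "example.log_rate" prefix is a made-up name for illustration.
    """
    # statsd metric names are dot-separated, so replace anything that is
    # not alphanumeric, underscore or hyphen to avoid creating extra
    # path segments by accident.
    safe_channel = re.sub(r"[^A-Za-z0-9_-]", "_", channel)
    return f"example.log_rate.{safe_channel}.{level.upper()}:{count}|c"


line = statsd_counter_line("mediawiki", "error")
```

Here `line` is `example.log_rate.mediawiki.ERROR:1|c`. An alert like the one quoted above would then be defined on the rate of the resulting Prometheus counter, filtered by a label such as `level=ERROR`.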