We have an Icinga check that looks at the rate of MediaWiki fatals and exceptions in Graphite and alerts whenever a threshold is passed.
I found out today that the alert on IRC seems to have kicked off 30 minutes after the spike seen in Logstash. From the IRC logs at http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20160728.txt:
```
lang=irc
[12:46:28] <icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[12:50:18] <icinga-wm> RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[12:58:18] <icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[13:02:27] <icinga-wm> RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
```
The spike of exceptions happened at 12:15, as can be seen in Logstash:
{F4314919 size=full}
Or in the Grafana Production Logging dashboard, also centered at 12:15:
{F4314924 size=full}
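For context on why such a check can fire later than the underlying spike: a check_graphite-style "percent above threshold" check only goes CRITICAL once enough of the datapoints in its trailing query window exceed the threshold, so the alert time depends on the window length and check cadence, not on when the spike itself occurred. Here is a minimal sketch of that mechanism; the window length, cadence, and spike shape below are illustrative assumptions, not the production configuration, and this alone would not account for a full 30-minute delay:

```python
# Sketch of a check_graphite-style "percent above threshold" check.
# Assumptions (NOT production values): 10-minute trailing window,
# one check per minute, CRITICAL when >20% of the window's datapoints
# exceed 50 fatals+exceptions per minute.

CRITICAL_THRESHOLD = 50.0   # fatals+exceptions per minute
CRITICAL_PERCENT = 20.0     # % of window datapoints that must exceed it
WINDOW = 10                 # minutes of data each check run inspects

def check(window_datapoints):
    """Return 'CRITICAL' or 'OK' for one check run over a window."""
    above = sum(1 for v in window_datapoints if v > CRITICAL_THRESHOLD)
    pct = 100.0 * above / len(window_datapoints)
    return "CRITICAL" if pct > CRITICAL_PERCENT else "OK"

# Hypothetical per-minute series: baseline noise with a 3-minute spike
# starting at minute 15 (standing in for the 12:15 spike).
series = [5.0] * 60
series[15:18] = [400.0, 380.0, 120.0]

# Run the check once per minute over the trailing window and record
# when it first reports CRITICAL.
first_critical = None
for minute in range(WINDOW, len(series)):
    window = series[minute - WINDOW:minute]
    if check(window) == "CRITICAL":
        first_critical = minute
        break

print(f"spike began at minute 15; first CRITICAL at minute {first_critical}")
```

Even in this toy setup the alert trails the spike by a few minutes, since two datapoints above threshold (20% of a 10-point window) are not strictly greater than the 20% cutoff; longer windows, slower check intervals, or Graphite aggregation delay would stretch that gap further.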