We have an Icinga check that looks at the rate of MediaWiki fatals and exceptions in Graphite and alerts whenever a threshold is passed.
I found out today that the alert on IRC seems to have kicked off 30 minutes after the spike seen in Logstash. From the IRC logs at http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20160728.txt:
```
lang=irc
[12:46:28] <icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[12:50:18] <icinga-wm> RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[12:58:18] <icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[13:02:27] <icinga-wm> RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
```
The spike of exceptions happened at 12:15, as can be seen in Logstash:
{F4314919 size=full}
Or in the Grafana Production Logging dashboard, also centered at 12:15:
{F4314924 size=full}
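For context on why such a check can fire later than the underlying spike: a check_graphite-style "percent above threshold" check only goes CRITICAL once enough of the datapoints in its trailing query window exceed the threshold, so the alert time depends on the window length and check cadence, not on when the spike itself occurred. Here is a minimal sketch of that mechanism; the window length, cadence, and spike shape below are illustrative assumptions, not the production configuration, and this alone would not account for a full 30-minute delay:

```python
# Sketch of a check_graphite-style "percent above threshold" check.
# Assumptions (NOT production values): 10-minute trailing window,
# one check per minute, CRITICAL when >20% of the window's datapoints
# exceed 50 fatals+exceptions per minute.

CRITICAL_THRESHOLD = 50.0   # fatals+exceptions per minute
CRITICAL_PERCENT = 20.0     # % of window datapoints that must exceed it
WINDOW = 10                 # minutes of data each check run inspects

def check(window_datapoints):
    """Return 'CRITICAL' or 'OK' for one check run over a window."""
    above = sum(1 for v in window_datapoints if v > CRITICAL_THRESHOLD)
    pct = 100.0 * above / len(window_datapoints)
    return "CRITICAL" if pct > CRITICAL_PERCENT else "OK"

# Hypothetical per-minute series: baseline noise with a 3-minute spike
# starting at minute 15 (standing in for the 12:15 spike).
series = [5.0] * 60
series[15:18] = [400.0, 380.0, 120.0]

# Run the check once per minute over the trailing window and record
# when it first reports CRITICAL.
first_critical = None
for minute in range(WINDOW, len(series)):
    window = series[minute - WINDOW:minute]
    if check(window) == "CRITICAL":
        first_critical = minute
        break

print(f"spike began at minute 15; first CRITICAL at minute {first_critical}")
```

Even in this toy setup the alert trails the spike by a few minutes, since two datapoints above threshold (20% of a 10-point window) are not strictly greater than the 20% cutoff; longer windows, slower check intervals, or Graphite aggregation delay would stretch that gap further.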