
"MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!)
Open, Medium, Public

Description

We have an Icinga check that looks at the rate of MediaWiki fatals and exceptions in Graphite and alarms whenever a threshold is passed.

I found out today that the alarm on IRC seems to have kicked in 30 minutes after the spike seen in logstash. From the IRC logs at http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20160728.txt:

[12:46:28] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[12:50:18] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[12:58:18] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[13:02:27] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]

The spike of exceptions happened at 12:15, as can be seen in logstash and in the Grafana Production Logging dashboard centered on 12:15.


The Graphite metric being checked is transformNull(sumSeries(logstash.rate.mediawiki.fatal.ERROR.sum, logstash.rate.mediawiki.exception.ERROR.sum), 0), with the warning threshold at 25 and the critical threshold at 50. It is rendered with the OK state in green, Warning in orange and Critical in red, from 12:00 to 13:00.

Graphite link for last six hours.
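For reference, the check boils down to fetching that target over a short window and comparing the percentage of datapoints above each threshold. Below is a minimal Python sketch of that logic; the URL shape, the window length and the 20% datapoint-percentage parameter are assumptions for illustration and do not come from the actual check_graphite plugin or its configuration.

```python
# Illustrative sketch only, not the actual check_graphite plugin.
import json
import urllib.parse
import urllib.request

TARGET = ("transformNull(sumSeries(logstash.rate.mediawiki.fatal.ERROR.sum,"
          "logstash.rate.mediawiki.exception.ERROR.sum),0)")

def percent_above(datapoints, threshold):
    """Percentage of non-null datapoints whose value exceeds the threshold."""
    values = [v for v, _ts in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(v > threshold for v in values) / len(values)

def check(graphite_url, target=TARGET, minutes=10,
          warning=25.0, critical=50.0, percentage=20.0):
    """Return OK/WARNING/CRITICAL depending on how many datapoints in the
    last `minutes` minutes exceed the thresholds."""
    url = (f"{graphite_url}/render?target={urllib.parse.quote(target)}"
           f"&from=-{minutes}min&format=json")
    series = json.load(urllib.request.urlopen(url))
    points = series[0]["datapoints"] if series else []
    if percent_above(points, critical) >= percentage:
        return "CRITICAL"
    if percent_above(points, warning) >= percentage:
        return "WARNING"
    return "OK"
```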

The Icinga logs

Time                | Level    | State | State count | Message                                            | Note
2016-07-28 12:08:58 | WARNING  | SOFT  | 1           | 20.00% of data above the warning threshold [25.0]  |
2016-07-28 12:10:57 | OK       | SOFT  | 2           | Less than 1.00% above the threshold [25.0]         |
2016-07-28 12:22:38 | CRITICAL | SOFT  | 1           | 40.00% of data above the critical threshold [50.0] | Spike detected after 7 minutes
2016-07-28 12:24:38 | OK       | SOFT  | 2           | Less than 1.00% above the threshold [25.0]         | Erroneously cleared?
2016-07-28 12:36:38 | WARNING  | SOFT  | 1           | 20.00% of data above the warning threshold [25.0]  |
2016-07-28 12:38:38 | OK       | SOFT  | 2           | Less than 1.00% above the threshold [25.0]         |
2016-07-28 12:42:28 | CRITICAL | SOFT  | 1           | 20.00% of data above the critical threshold [50.0] | First notice
2016-07-28 12:44:28 | CRITICAL | SOFT  | 2           | 40.00% of data above the critical threshold [50.0] | Second
2016-07-28 12:46:27 | CRITICAL | HARD  | 3           | 20.00% of data above the critical threshold [50.0] | Third -> HARD == 1st IRC notification (PROBLEM)
2016-07-28 12:50:18 | OK       | HARD  | 3           | Less than 1.00% above the threshold [25.0]         | 2nd IRC notification (RECOVERY)

Time                | Level    | State | State count | Message                                            | Note
2016-07-28 12:54:18 | CRITICAL | SOFT  | 1           | 40.00% of data above the critical threshold [50.0] |
2016-07-28 12:56:18 | CRITICAL | SOFT  | 2           | 40.00% of data above the critical threshold [50.0] |
2016-07-28 12:58:18 | CRITICAL | HARD  | 3           | 60.00% of data above the critical threshold [50.0] | Third -> HARD = 3rd IRC notification (PROBLEM)
2016-07-28 13:00:18 | WARNING  | HARD  | 3           | 20.00% of data above the warning threshold [25.0]  |
2016-07-28 13:02:27 | OK       | HARD  | 3           | Less than 1.00% above the threshold [25.0]         | 4th IRC notification (RECOVERY)

Event Timeline

hashar created this task. Jul 28 2016, 1:15 PM
Restricted Application added a subscriber: Aklapper. Jul 28 2016, 1:15 PM
hashar updated the task description. Jul 28 2016, 1:27 PM
hashar updated the task description. Jul 28 2016, 1:32 PM
hashar updated the task description. Jul 28 2016, 1:57 PM
hashar updated the task description.
greg added subscribers: ori, greg. Jul 28 2016, 4:45 PM

Ori is out for family reasons right now, but since he helped craft this alert I'm adding him here for his thoughts.

The alarm is not delayed by half an hour. It is delayed, but not by as much as I thought.

Looking at the green/orange/red graph, one can see the huge spike at 12:15, which correlates with this Icinga entry:

2016-07-28 12:22:38 | CRITICAL | SOFT | 1 | 40.00% of data above the critical threshold [50.0]

There are two issues with this one:

  1. It is 7 minutes late.
  2. It is in SOFT state, indicating that Icinga is set to retry the check a number of times (3 attempts in total) before the state becomes HARD and a notification is triggered.

The state got cleared exactly 2 minutes later when the spike ended.

The 1st IRC notification we received (at 12:46:28) is unrelated to the spike at 12:15: it is for a second spike that occurred at 12:40. It prompted me to look at the logs, and I thought it was for the previous spike, hence my complaint about more than half an hour of delay.

From Icinga we can see there are 3 check attempts, each 2 minutes apart, before the notification is emitted:

2016-07-28 12:42:28 | CRITICAL | SOFT | 1
2016-07-28 12:44:28 | CRITICAL | SOFT | 2
2016-07-28 12:46:27 | CRITICAL | HARD | 3
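The extra delay added by the retries can be worked out from those timestamps alone. A quick Python sketch, with the interval and attempt count inferred from the log above rather than read from the actual Icinga configuration:

```python
# Back-of-the-envelope delay arithmetic, values inferred from the log above.
from datetime import datetime, timedelta

first_soft_critical = datetime(2016, 7, 28, 12, 42, 28)  # CRITICAL SOFT 1
retry_interval = timedelta(minutes=2)                    # gap between attempts
max_check_attempts = 3                                   # SOFT 1, SOFT 2, HARD 3

hard_state = first_soft_critical + (max_check_attempts - 1) * retry_interval
print(hard_state)                        # 2016-07-28 12:46:28, ~the 12:46:27 HARD entry
print(hard_state - first_soft_critical)  # 0:04:00 added on top of the detection lag
```

So even once the check finally sees the spike, the SOFT -> HARD escalation adds another four minutes before anything reaches IRC.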

The things to look at are:

  • reduce the number of retries and/or the delay between them
  • figure out why even the first state change lags behind the spike (7 minutes in this case)

Of the three spikes, one triggered no notification at all and the other two were notified roughly 10 minutes late.

hashar triaged this task as Medium priority. Jul 28 2016, 7:47 PM
hashar updated the task description. Jul 29 2016, 9:15 PM
fgiunchedi closed this task as Resolved. Dec 10 2019, 2:14 PM
fgiunchedi claimed this task.
fgiunchedi added a subscriber: fgiunchedi.

I believe the alert is nowadays based on metrics from logstash and appears on IRC in a timely fashion. Resolving, but please do reopen if this occurs again.

hashar reopened this task as Open. Dec 10 2019, 8:05 PM

As I explained in the task description, the issue is in the Icinga configuration and is still occurring.

On the first critical result, Icinga flags the service as SOFT and sets a counter to 1. In our configuration, Icinga then retries 2 times with a 2 minute delay between retries, each retry incrementing the counter. When the counter reaches 3, the service is escalated to the HARD state, which triggers the notification.

That is fine to avoid flapping alarms, but it often causes an unnecessary delay in the notification and extra check executions on the Icinga host (both due to the retries).
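Here is a minimal model of that escalation in Python. It only mirrors the behaviour described above (it is not Icinga's implementation), and the replay helper and its output format are made up for the example:

```python
# Toy model of Icinga's SOFT/HARD escalation as described above.
# Not Icinga's actual code; it only mirrors the behaviour for illustration.
def replay(results, max_check_attempts=3):
    """Replay a sequence of check results ('OK'/'CRITICAL') and show when
    a PROBLEM/RECOVERY notification would be emitted."""
    attempt = 0        # current attempt counter while the service is failing
    notified = False   # whether a PROBLEM notification went out
    for i, result in enumerate(results, start=1):
        if result == "OK":
            if notified:
                print(f"check {i}: OK HARD {max_check_attempts} -> RECOVERY notification")
            # A recovery before reaching HARD clears the SOFT state silently.
            attempt, notified = 0, False
            continue
        attempt = min(attempt + 1, max_check_attempts)
        state_type = "HARD" if attempt == max_check_attempts else "SOFT"
        print(f"check {i}: {result} {state_type} {attempt}")
        if state_type == "HARD" and not notified:
            print(f"check {i}: PROBLEM notification")
            notified = True

# The DPKG example below: three CRITICAL results, then an OK.
replay(["CRITICAL", "CRITICAL", "CRITICAL", "OK"])
```

Replaying the 12:22/12:24 sequence from the description (a single CRITICAL followed by an OK) yields no notification at all, which is why the 12:15 spike never reached IRC.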

A random example from today:

History for analytics1064 > DPKG

Service Critical[2019-12-10 15:11:33] SERVICE ALERT: analytics1064;DPKG;CRITICAL;SOFT;1;DPKG CRITICAL dpkg reports broken packages
Service Critical[2019-12-10 15:13:19] SERVICE ALERT: analytics1064;DPKG;CRITICAL;SOFT;2;DPKG CRITICAL dpkg reports broken packages
Service Critical[2019-12-10 15:15:05] SERVICE ALERT: analytics1064;DPKG;CRITICAL;HARD;3;DPKG CRITICAL dpkg reports broken packages
Service Ok [2019-12-10 15:16:51] SERVICE ALERT: analytics1064;DPKG;OK;HARD;3;All packages OK

Notifications for analytics1064 > DPKG:

analytics1064 DPKG CRITICAL 2019-12-10 15:15:05 irc notify-service-by-irc DPKG CRITICAL dpkg reports broken packages
analytics1064 DPKG OK 2019-12-10 15:16:51 irc notify-service-by-irc All packages OK

The issue was detected at 15:11:33 and reached the HARD state at 15:15:05, which triggered the notification. That is a delay of about three and a half minutes, which is what this task is about.

fgiunchedi removed fgiunchedi as the assignee of this task. Jul 20 2020, 2:00 PM

Unassigning myself since I'm not working directly on this. Anecdotally, the "mediawiki exceptions" alert now works as intended (including the soft -> hard transition).

fgiunchedi moved this task from Inbox to Backlog on the observability board. Jul 20 2020, 2:01 PM

Indeed, there is a bit of delay due to the retries and the default retry_interval of 1 (minute), which seems appropriate for most cases.

@hashar does this still need to be looked at or addressed? If so, how much delay do you think is acceptable or appropriate?