
"MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!)
Closed, ResolvedPublic

Description

We have an Icinga check that looks at the rate of MediaWiki fatals and exceptions in Graphite and alarms whenever a threshold is passed.

I found out today that the alarm on IRC seems to have kicked in 30 minutes after the spike seen in logstash. From the IRC logs at http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20160728.txt

[12:46:28] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0]
[12:50:18] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]
[12:58:18] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [50.0]
[13:02:27] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0]

The spike of exceptions happened at 12:15 as can be seen in logstash:

Or in the Grafana Production Logging dashboard, also centered at 12:15:


The Graphite metric being checked is transformNull(sumSeries(logstash.rate.mediawiki.fatal.ERROR.sum, logstash.rate.mediawiki.exception.ERROR.sum), 0), with the warning threshold at 25 and critical at 50. It is represented below with the OK state in green, Warning in orange, and Critical in red, from 12:00 to 13:00:
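As an aside, the check's logic can be sketched roughly as follows. This is an illustrative reimplementation, not the actual check_graphite plugin: the host name, time window, and the "% of data points" trigger criterion (inferred from the "Less than 1.00% above the threshold" OK messages) are assumptions.

```python
# Sketch: fetch recent values of the combined fatal+exception rate from
# Graphite and compute what percentage of data points sit above each
# threshold. Host, window, and trigger percentage are hypothetical.
import json
import urllib.parse
import urllib.request

TARGET = ("transformNull(sumSeries("
          "logstash.rate.mediawiki.fatal.ERROR.sum,"
          "logstash.rate.mediawiki.exception.ERROR.sum),0)")

def percent_above(datapoints, threshold):
    """Percentage of non-null [value, timestamp] points above threshold."""
    values = [v for v, _ts in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(v > threshold for v in values) / len(values)

def check(graphite="http://graphite1001", warning=25.0, critical=50.0):
    url = (graphite + "/render?format=json&from=-10min&target="
           + urllib.parse.quote(TARGET))
    datapoints = json.load(urllib.request.urlopen(url))[0]["datapoints"]
    # "Less than 1.00% above the threshold" reads as OK, so assume the
    # alarm fires when at least 1% of the points exceed a threshold.
    if percent_above(datapoints, critical) >= 1.0:
        return "CRITICAL"
    if percent_above(datapoints, warning) >= 1.0:
        return "WARNING"
    return "OK"
```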

Graphite link for last six hours.

The Icinga logs:

Time                 Level     State  Count  Message
2016-07-28 12:08:58  WARNING   SOFT   1      20.00% of data above the warning threshold [25.0]
2016-07-28 12:10:57  OK        SOFT   2      Less than 1.00% above the threshold [25.0]
2016-07-28 12:22:38  CRITICAL  SOFT   1      40.00% of data above the critical threshold [50.0]  (spike detected after 7 minutes)
2016-07-28 12:24:38  OK        SOFT   2      Less than 1.00% above the threshold [25.0]  (erroneously cleared?)
2016-07-28 12:36:38  WARNING   SOFT   1      20.00% of data above the warning threshold [25.0]
2016-07-28 12:38:38  OK        SOFT   2      Less than 1.00% above the threshold [25.0]
2016-07-28 12:42:28  CRITICAL  SOFT   1      20.00% of data above the critical threshold [50.0]  (first notice)
2016-07-28 12:44:28  CRITICAL  SOFT   2      40.00% of data above the critical threshold [50.0]  (second)
2016-07-28 12:46:27  CRITICAL  HARD   3      20.00% of data above the critical threshold [50.0]  (third -> HARD == 1st IRC notification, PROBLEM)
2016-07-28 12:50:18  OK        HARD   3      Less than 1.00% above the threshold [25.0]  (2nd IRC notification, RECOVERY)
2016-07-28 12:54:18  CRITICAL  SOFT   1      40.00% of data above the critical threshold [50.0]
2016-07-28 12:56:18  CRITICAL  SOFT   2      40.00% of data above the critical threshold [50.0]
2016-07-28 12:58:18  CRITICAL  HARD   3      60.00% of data above the critical threshold [50.0]  (third -> HARD == 3rd IRC notification, PROBLEM)
2016-07-28 13:00:18  WARNING   HARD   3      20.00% of data above the warning threshold [25.0]
2016-07-28 13:02:27  OK        HARD   3      Less than 1.00% above the threshold [25.0]  (4th IRC notification, RECOVERY)

Event Timeline

hashar updated the task description.

Ori is out for family reasons right now, but since he helped craft this alert I'm adding him here for his thoughts.

The alarm is not delayed by half an hour. It is delayed, but not by as much as I thought.

Looking at the green/orange/red graph, one can see the huge spike at 12:15, which correlates with Icinga:

2016-07-28 12:22:38  CRITICAL  SOFT  1  40.00% of data above the critical threshold [50.0]

There are two issues with this one:

  1. it is 7 minutes late
  2. it is in SOFT state, indicating that Icinga is set to retry the check X (actually 3) times before the state becomes HARD and a notification is triggered

The state got cleared exactly 2 minutes later when the spike ended.

The 1st IRC notification we received (at 12:50:18) is unrelated to the spike at 12:15. It is for a second spike that occurred at 12:40. It prompted me to look at the log, and I thought it was for the previous spike, hence my complaint about more than half an hour of delay.

From Icinga we can see there are 3 checks, each 2 minutes apart, before the notification is emitted:

2016-07-28 12:42:28  CRITICAL  SOFT  1
2016-07-28 12:44:28  CRITICAL  SOFT  2
2016-07-28 12:46:27  CRITICAL  HARD  3
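This cadence is consistent with an Icinga service definition along these lines. The values below are inferred from the timestamps above; the actual WMF configuration may differ, and the directive names follow standard Icinga 1.x object syntax:

```
define service {
    use                  generic-service
    service_description  MediaWiki exceptions and fatals per minute
    max_check_attempts   3   ; 1 initial check + 2 retries before HARD
    check_interval       1   ; minutes between checks while OK
    retry_interval       2   ; minutes between retries while SOFT
}
```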

The things to look at are:

  • reduce the number and/or the delay of retries
  • figure out why even the first state change lags

Of the three spikes, one went unnotified and the other two were delayed by about 10 minutes.

hashar triaged this task as Medium priority.Jul 28 2016, 7:47 PM
fgiunchedi claimed this task.
fgiunchedi added a subscriber: fgiunchedi.

I believe nowadays the alert is based on metrics from logstash and appears on IRC in a timely fashion. Resolving, but please do reopen if it occurs again.

As I explained in the task, the issue is in the Icinga configuration and is still occurring.

On the first critical, Icinga flags the service as SOFT and raises a counter to 1. In our configuration Icinga retries 2 times with a 2-minute delay between retries (each retry raising the counter). When the counter reaches 3, the service is escalated to the HARD state, which triggers the notification.

That is fine for avoiding flapping alarms, but it often causes an unnecessary delay in the notification and extra checks on the Icinga host (both due to the retries).
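The escalation delay described above can be modeled in a few lines. This is a simplified sketch, assuming a fixed retry interval; the parameter names mirror Icinga's but the model ignores check latency and scheduling jitter.

```python
# Minimal model of the SOFT -> HARD escalation: the first failure sets
# the attempt counter to 1, and each retry (one retry_interval later)
# increments it until max_check_attempts is reached and the state goes
# HARD, which is when the notification fires.

def minutes_until_hard(max_check_attempts=3, retry_interval=2.0):
    """Minutes from the first failing check (SOFT;1) to the HARD state."""
    return (max_check_attempts - 1) * retry_interval
```

With the ~2-minute retries seen in the task description this gives a 4-minute notification lag; with the ~1m45s spacing in the DPKG example below it comes out to roughly the 3m30s observed.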

A random example from today:

History for analytics1064 > DPKG

Service Critical[2019-12-10 15:11:33] SERVICE ALERT: analytics1064;DPKG;CRITICAL;SOFT;1;DPKG CRITICAL dpkg reports broken packages
Service Critical[2019-12-10 15:13:19] SERVICE ALERT: analytics1064;DPKG;CRITICAL;SOFT;2;DPKG CRITICAL dpkg reports broken packages
Service Critical[2019-12-10 15:15:05] SERVICE ALERT: analytics1064;DPKG;CRITICAL;HARD;3;DPKG CRITICAL dpkg reports broken packages
Service Ok [2019-12-10 15:16:51] SERVICE ALERT: analytics1064;DPKG;OK;HARD;3;All packages OK

Notifications for analytics1064 > DPKG:

analytics1064 DPKG CRITICAL 2019-12-10 15:15:05 irc notify-service-by-irc DPKG CRITICAL dpkg reports broken packages
analytics1064 DPKG OK 2019-12-10 15:16:51 irc notify-service-by-irc All packages OK

The issue was detected at 15:11:33 and reached the HARD state at 15:15:05, which triggered the notification. That is a 3m30s delay, which is what this task is about.

Unassigning from me since I'm not working directly on this; anecdotally, the "mediawiki exceptions" alert now works as intended (including the soft -> hard transition).

Indeed, there is a bit of delay due to retries and the default retry_interval of 1 (minute) which seems appropriate for most cases.

@hashar does this still need to be looked at/addressed? If so, how much delay do you think is acceptable or appropriate?

lmata claimed this task.
lmata added a subscriber: lmata.

Hello,

A 3-minute delay seems like a short but acceptable window for alerting. If there is a need to shorten it, we can discuss. Closing this ticket; please reopen if you'd like to revisit the conversation.

In a nutshell, the first issue is that the spike of errors at 12:15 was large in number of events but happened over a very short window. The statsd check was made over a larger window, and fewer than 50% of the points were failing, so it did not trigger an error.

Assuming a spike that spans 3 units of time, and a statsd check that spans 7 units:

errors
|          x
|          x
|         xxx
|         xxx
+---->0000111<

There are 3 points flagged as errors (1 above) and 4 which are fine (0 above); that is less than 50% of the data points in an error state, so the spike goes unnoticed.
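The diagram above can be checked with a short sketch. The values are made up; the point is only that a 3-point spike in a 7-point window stays under a "half the points failing" criterion.

```python
# A short spike occupies fewer than 50% of the data points in the check
# window, so a >= 50% failing criterion never fires despite the spike.

def fraction_failing(points, threshold):
    """Fraction of data points strictly above the threshold."""
    return sum(p > threshold for p in points) / len(points)

window = [0, 0, 0, 0, 80, 90, 80]  # 4 quiet points, then a 3-point spike
spike_seen = fraction_failing(window, 50) >= 0.5  # False: 3/7 is ~43%
```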

The second issue is similar: the spike is noticed a bit late, since it has to occupy the statsd check window until 50% of the data points are erroring out. On top of that, the Icinga retries further delay the SOFT > HARD transition that eventually triggers the notification. There were 3 checks 2 minutes apart, so the notification is sent 6 minutes after the statsd window is marked in error.

I guess the point of this task was to have Icinga issue the notification immediately (0 retries), since the statsd window already acts as a buffer. But all of that was more than 4 years ago, so it should probably be revisited entirely; indeed, we can just ignore the issue unless it becomes a concern again.