Detect high server load earlier – prometheus alert?
Closed, Resolved · Public

Authored By

	Lucas_Werkmeister_WMDE
	Feb 26 2018, 8:37 PM

Description

The 2018-02-26 WikibaseQualityConstraints incident resulted in a strong server load spike on replica servers within minutes of deployment of the relevant config change, but incident response only started almost three hours later. There must be something to improve here, but I’m afraid I’m not familiar enough with the operations environment to say what exactly.

@jcrespo said on IRC:

a prometheus alert would be nice

Event Timeline

Lucas_Werkmeister_WMDE created this task. · Feb 26 2018, 8:37 PM

Restricted Application added a subscriber: Aklapper. · Feb 26 2018, 8:37 PM

Lucas_Werkmeister_WMDE updated the task description. · Feb 26 2018, 10:54 PM

Lucas_Werkmeister_WMDE updated the task description.

Lucas_Werkmeister_WMDE moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board. · Feb 27 2018, 10:19 AM

Would be nice indeed. My preference would be for an alert based on latency and/or the error ratio, i.e. (number of errors) / (number of successes + number of errors).
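For illustration, the error ratio mentioned above could be computed as in the minimal sketch below. The function names and the 5% threshold are hypothetical and not taken from any existing Wikimedia alerting configuration; a real Prometheus alert would express the same ratio over request counters in a rule file rather than in Python.

```python
def error_ratio(errors: int, successes: int) -> float:
    """Return errors / (successes + errors), or 0.0 when there was no traffic."""
    total = errors + successes
    if total == 0:
        # No requests in the window: treat as healthy rather than divide by zero.
        return 0.0
    return errors / total


def should_alert(errors: int, successes: int, threshold: float = 0.05) -> bool:
    """Fire when the error ratio strictly exceeds the (hypothetical) threshold."""
    return error_ratio(errors, successes) > threshold
```

A ratio like this stays comparable across traffic levels, which is one reason it may be a more useful signal than raw server load.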

MoritzMuehlenhoff triaged this task as Medium priority. · Mar 2 2018, 8:35 AM

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. · Jan 26 2019, 9:49 PM

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. · Apr 28 2020, 9:50 PM

High server load isn't a good metric in itself; it is just a symptom that something is wrong. We now have considerably better alerts overall, and this task hasn't seen a comment in 4+ years, so I am going to tentatively resolve it. Feel free to reopen.
