
Detect high server load earlier – prometheus alert?
Closed, Resolved (Public)

Description

The 2018-02-26 WikibaseQualityConstraints incident resulted in a strong server load spike on replica servers within minutes of deployment of the relevant config change, but incident response only started almost three hours later. There must be something to improve here, but I’m afraid I’m not familiar enough with the operations environment to say what exactly.

@jcrespo said on IRC:

a prometheus alert would be nice

Event Timeline

Would be nice indeed. My preference would be for an alert based on latency and/or the error ratio, i.e. (number of errors) / (number of successes + number of errors).
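
For illustration, the error ratio described in this comment could be computed roughly as below. This is only a sketch: the function names `error_ratio` and `should_alert` and the 5% threshold are hypothetical, not anything agreed on in this task, and a real alert would presumably evaluate the ratio over a time window in the monitoring system rather than in application code.

```python
def error_ratio(errors: int, successes: int) -> float:
    """Return errors / (successes + errors), or 0.0 when there was no traffic."""
    total = successes + errors
    if total == 0:
        # No requests at all: treat as "no errors" rather than dividing by zero.
        return 0.0
    return errors / total


def should_alert(errors: int, successes: int, threshold: float = 0.05) -> bool:
    """Hypothetical alert condition: fire when the error ratio exceeds the threshold.

    The 0.05 default is an assumed value chosen only for this example.
    """
    return error_ratio(errors, successes) > threshold
```

Using a ratio rather than a raw error count keeps the condition meaningful at both low and high request volumes.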

akosiaris claimed this task.
akosiaris subscribed.

High server load isn't a good metric of anything, just a symptom/indication that something is wrong. We now have considerably better alerts overall, and this hasn't seen a comment in 4+ years, so I am gonna tentatively resolve it. Feel free to reopen.