Detect high server load earlier – prometheus alert?
Open, Medium, Public

Description

The 2018-02-26 WikibaseQualityConstraints incident resulted in a strong server load spike on replica servers within minutes of deployment of the relevant config change, but incident response only started almost three hours later. There must be something to improve here, but I’m afraid I’m not familiar enough with the operations environment to say what exactly.

@jcrespo said on IRC:

a prometheus alert would be nice