
alert on too many close-to-saturated appservers / apiservers
Closed, Resolved, Public

Description

Today we had a latency issue that was at least partly caused by too many apiservers being close to running out of idle workers (and it got much worse once several of them ran out entirely):


https://w.wiki/jzT

We should have an alert when >20% (?) of the cluster has <25% (?) of its workers free. (Numbers to be tuned based on past saturation events.)
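
To make the proposed condition concrete, here is a minimal sketch in Python of the two-threshold logic. It is an illustration only, not the Prometheus query or the check that was eventually deployed; the names `ServerWorkers`, `saturated_fraction` and `should_alert` are hypothetical, and the 25% and 20% defaults are just the tentative numbers from above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ServerWorkers:
    """Hypothetical per-server sample: idle and total worker counts."""
    host: str
    idle: int
    total: int


def saturated_fraction(servers: Iterable[ServerWorkers],
                       min_idle_ratio: float = 0.25) -> float:
    """Fraction of servers whose idle workers fall below min_idle_ratio."""
    # Skip servers reporting no workers at all, to avoid dividing by zero.
    counted = [s for s in servers if s.total > 0]
    if not counted:
        return 0.0
    saturated = sum(1 for s in counted if s.idle / s.total < min_idle_ratio)
    return saturated / len(counted)


def should_alert(servers: Iterable[ServerWorkers],
                 max_saturated_fraction: float = 0.20,
                 min_idle_ratio: float = 0.25) -> bool:
    """True when more than max_saturated_fraction of the cluster is saturated."""
    return saturated_fraction(servers, min_idle_ratio) > max_saturated_fraction
```

Both thresholds are plain parameters so they can be replayed against past saturation events while tuning. Note that the comparison is strict: a cluster sitting at exactly 20% saturated servers would not alert in this sketch.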

Event Timeline

jijiki added a subscriber: jijiki.

I agree that is a good idea!

As discussed, here's a start on the query: https://w.wiki/k6F
Both thresholds in there need some tuning, but it's a start.

This should wind up being just a single check_prometheus rule defined alongside the existing check_prometheus rules for the appservers: https://gerrit.wikimedia.org/g/operations/puppet/+/970baabd6f8b319c138ec0ed670852851150b6d5/modules/profile/manifests/mediawiki/alerts.pp#34

Change 664319 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] (WIP) mediawiki::alerts add alert when 20% of servers is saturated

https://gerrit.wikimedia.org/r/664319

Change 664319 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki::alerts add alert when 20% of servers is saturated

https://gerrit.wikimedia.org/r/664319