Tweak widespread puppet failures for small sites
Closed, Resolved (Public)

Description

At the moment, small sites like eqsin can trigger the alert when just a single host fails:

07:49  <moritzm> godog: we should adapt thresholds per DC for the "widespread puppet failure" alert, 
                 cp5001 had an alert, which causes 5.5556% puppet failures for all of eqsin (which has 18 
                 servers), which is above the threshold
08:16  <godog> moritzm: indeed! I'm filing a task
08:17  <moritzm> or maybe adapt the logic and have something like ">= x % and >= y hosts"
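To make the arithmetic and the suggested ">= x % and >= y hosts" logic concrete, here is a minimal Python sketch. It is not the actual alert rule: the function names and the thresholds (`min_percent=5.0`, `min_hosts=3`) are assumptions picked for illustration, and only the 1-in-18 eqsin figure comes from the chat above.

```python
def failure_percent(failed_hosts: int, total_hosts: int) -> float:
    """Return the share of hosts with failed Puppet runs, as a percentage."""
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    return 100.0 * failed_hosts / total_hosts


def widespread_failure(failed_hosts: int, total_hosts: int,
                       min_percent: float, min_hosts: int) -> bool:
    """Fire only if BOTH the percentage and the absolute host count are reached."""
    return (failure_percent(failed_hosts, total_hosts) >= min_percent
            and failed_hosts >= min_hosts)


# eqsin case from the chat: 1 failed host out of 18 servers.
print(round(failure_percent(1, 18), 4))  # 5.5556

# Hypothetical thresholds (assumed, not taken from the real rule).
MIN_PERCENT = 5.0
MIN_HOSTS = 3

# A percentage-only check would fire here; the combined check does not.
print(widespread_failure(1, 18, MIN_PERCENT, MIN_HOSTS))  # False
print(widespread_failure(3, 18, MIN_PERCENT, MIN_HOSTS))  # True
```

The absolute host floor is what keeps a single failing host at a small site from counting as "widespread", while the percentage condition still guards large sites against firing on a handful of failures.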

Event Timeline

Change 536591 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: alert on overall puppet failures

https://gerrit.wikimedia.org/r/536591

Change 536591 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: alert on overall puppet failures

https://gerrit.wikimedia.org/r/536591

fgiunchedi claimed this task.

This is resolved, in the sense that the per-site "widespread puppet failure" alerts are gone, replaced by an alert on truly global (i.e. all sites) puppet failures.

Change 538836 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: tweak widespread puppet failures thresholds

https://gerrit.wikimedia.org/r/538836

Change 538836 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: tweak widespread puppet failures thresholds

https://gerrit.wikimedia.org/r/538836