
De-noise puppet failed runs (Reduce Icinga alert noise goal)
Closed, Resolved (Public)

Description

For puppet failed runs, AFAICT one of the failure modes currently causing the most noise is the master throwing 500s, sometimes for legitimate reasons (e.g. it can't compile catalogs) and sometimes because of brief unavailability (e.g. agents can't PUT reports).

(To get an overview on the puppet master frontend: zgrep -v -F -e /200 -e /404 -e /400 /var/log/apache2/puppetmaster.puppet.log* )

On widespread unavailability (e.g. catalog compilation fails for many hosts, the puppetmaster is down, etc.) we get a lot of puppet failed run spam, especially on IRC. The idea is thus to:

  • Alert on aggregate puppet failures (e.g. CRITICAL if more than 1% of puppet runs fail in any given cluster)
  • Relax the current per-host failed run check so that it goes CRITICAL only if puppet has been failing for longer than X and/or for the last N runs
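
A minimal sketch of the two rules above, written in plain Python for illustration only. It is not how the patches in the timeline below implement the alerts (those work on Prometheus metrics and the existing Icinga check). The function names, the input shapes, and the value chosen for N are all hypothetical; only the 1% figure comes from the example above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed values: the task gives ">1%" only as an example and leaves N open.
CLUSTER_FAILURE_THRESHOLD = 0.01
MAX_CONSECUTIVE_FAILURES = 3


def cluster_failure_ratios(hosts: Iterable[Tuple[str, str, bool]]) -> Dict[str, float]:
    """Return, per cluster, the fraction of hosts whose last puppet run failed.

    Each item is (hostname, cluster, last_run_failed).
    """
    total: Dict[str, int] = defaultdict(int)
    failed: Dict[str, int] = defaultdict(int)
    for _hostname, cluster, last_run_failed in hosts:
        total[cluster] += 1
        if last_run_failed:
            failed[cluster] += 1
    return {cluster: failed[cluster] / total[cluster] for cluster in total}


def critical_clusters(
    hosts: Iterable[Tuple[str, str, bool]],
    threshold: float = CLUSTER_FAILURE_THRESHOLD,
) -> List[str]:
    """Aggregate rule: clusters whose failure ratio is strictly above the threshold."""
    ratios = cluster_failure_ratios(hosts)
    return sorted(c for c, ratio in ratios.items() if ratio > threshold)


def host_is_critical(
    run_history: List[bool],
    max_failures: int = MAX_CONSECUTIVE_FAILURES,
) -> bool:
    """Per-host rule: CRITICAL only if the last N runs all failed.

    run_history is ordered oldest first; True means the run failed.
    """
    if len(run_history) < max_failures:
        return False
    return all(run_history[-max_failures:])
```

With these definitions, a cluster of 100 hosts with two failed runs is reported by critical_clusters, while a single host with one transient failure in its last three runs is not flagged by host_is_critical. That is the intended effect: one aggregate alert during a master outage instead of one alert per host.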

Event Timeline

Change 526662 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: track total number of puppet resources

https://gerrit.wikimedia.org/r/526662

fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board.

Change 526662 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: track total number of puppet resources

https://gerrit.wikimedia.org/r/526662

Change 528084 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: aggregate puppet zero resources reported

https://gerrit.wikimedia.org/r/528084

Change 528087 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: stop per-host puppet critical when master has issues

https://gerrit.wikimedia.org/r/528087

Change 528084 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: aggregate puppet zero resources reported

https://gerrit.wikimedia.org/r/528084

Change 526431 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: alert on widespread puppet failures

https://gerrit.wikimedia.org/r/526431

Change 528143 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] kubernetes: expand alert description

https://gerrit.wikimedia.org/r/528143

Change 526431 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: alert on widespread puppet failures

https://gerrit.wikimedia.org/r/526431

Change 528143 merged by Alexandros Kosiaris:
[operations/puppet@production] kubernetes: expand alert description

https://gerrit.wikimedia.org/r/528143

Change 528087 merged by Filippo Giunchedi:
[operations/puppet@production] base: stop per-host puppet critical when master has issues

https://gerrit.wikimedia.org/r/528087

All changes for this are in now. Let's keep an eye on the puppet alerts in the following weeks.

Change 528719 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: don't CRITICAL on per-host puppet failures

https://gerrit.wikimedia.org/r/528719

Change 528719 merged by Filippo Giunchedi:
[operations/puppet@production] base: don't CRITICAL on per-host puppet failures

https://gerrit.wikimedia.org/r/528719

fgiunchedi claimed this task.

I'm going to boldly resolve this task now; it can be reopened if more puppet alert noise is observed.