On the puppet failed runs: AFAICT one of the failure modes currently causing the most noise is the master throwing 500s, sometimes for legitimate reasons (it can't compile a catalog) and sometimes because of brief unavailability (e.g. the agent can't PUT its report).
(To get an overview on the puppet master frontend: zgrep -v -F -e /200 -e /404 -e /400 /var/log/apache2/puppetmaster.puppet.log*) During widespread unavailability (e.g. catalog compilation failing for many hosts, puppetmaster down, etc.) we get a lot of puppet failed run spam, especially on IRC. The idea is thus to:
- Alert on aggregate puppet failures (e.g. CRITICAL if >1% of hosts in any given cluster have a failed puppet run)
- Relax the current per-host failed run check to go CRITICAL only if puppet has been failing for longer than X and/or for the last N runs
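
To make the two rules above concrete, here is a rough Python sketch. It is not tied to any existing check or monitoring API: the function names, the input shapes (a host-to-failed mapping, a host-to-cluster mapping, a list of (timestamp, succeeded) runs) and the default values for X (3600 seconds) and N (3 runs) are all assumptions made for illustration. The "and/or" is read here as "or", which is also an assumption.

```python
from collections import defaultdict

OK, CRITICAL = "OK", "CRITICAL"


def aggregate_status(failed_by_host, cluster_of, threshold=0.01):
    """Return {cluster: status}.

    failed_by_host: {hostname: True if the last puppet run failed} (hypothetical input)
    cluster_of:     {hostname: cluster name} (hypothetical input)
    A cluster is CRITICAL when strictly more than `threshold` of its hosts failed.
    """
    total = defaultdict(int)
    failed = defaultdict(int)
    for host, has_failed in failed_by_host.items():
        cluster = cluster_of[host]
        total[cluster] += 1
        if has_failed:
            failed[cluster] += 1
    return {
        cluster: CRITICAL if failed[cluster] / total[cluster] > threshold else OK
        for cluster in total
    }


def host_status(runs, now, max_failing_seconds=3600, max_failed_runs=3):
    """Relaxed per-host check.

    runs: list of (unix_timestamp, succeeded) tuples, oldest first.
    CRITICAL only if the trailing streak of failures is at least
    `max_failed_runs` long (N) or started more than `max_failing_seconds` ago (X).
    Both defaults are placeholders, not agreed values.
    """
    streak = 0
    failing_since = None
    # Walk backwards from the most recent run until the first success.
    for timestamp, succeeded in reversed(runs):
        if succeeded:
            break
        streak += 1
        failing_since = timestamp
    if streak == 0:
        return OK
    if streak >= max_failed_runs or now - failing_since > max_failing_seconds:
        return CRITICAL
    return OK
```

With this shape, a single failed run caused by a brief master outage stays quiet per host, while a cluster-wide problem still surfaces once through the aggregate check.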