It has been observed multiple times that Icinga alerts can be noisy, especially on IRC, and during incidents this can be very distracting. In particular, the following should help improve the signal to noise ratio:

===== Replace host-level IRC alerts with equivalent service-level alerts

Especially on IRC there's often no need for notifications about single hosts, e.g. CPU alerts, dpkg broken, etc. In some cases these host-level alerts make more sense aggregated (e.g. per cluster) and/or shown on the Icinga UI only rather than sent to IRC.

===== Alerts that page should say so

ATM it is impossible to tell whether a given alert has paged folks. A paging alert indicates a serious issue, and a certain level of response is expected. Explicitly marking alerts that page will thus help pick out serious issues (e.g. from IRC).

===== [stretch] Downtime hosts from IRC

It'll be useful if folks can downtime hosts from IRC, in a similar fashion to how we `!log` for example. This is useful during incidents since we're on IRC anyways and the Icinga UI can be clunky/slow; ditto for logging into the icinga host and issuing `downtime-host` for each host.

===== De-noise puppet failed runs

As of Jul 2019 "puppet failed run" is the most frequent alert on IRC. AFAICT one of the current failure modes causing the most noise relates to the master throwing 500s, sometimes for legitimate reasons (can't compile catalogs) and sometimes due to brief unavailability (e.g. can't `PUT` reports). (To get an overview on the puppet master frontend: `zgrep -v -F -e /200 -e /404 -e /400 /var/log/apache2/puppetmaster.puppet.log*`). On widespread unavailability (e.g. catalog fails for many hosts, puppetmaster down, etc) we get a lot of puppet failed run spam, with a low signal to noise ratio most of the time, especially on IRC. The idea is thus to:

[ ] Alert on aggregate puppet failures at the cluster level (e.g. CRITICAL if >1% of puppet runs have failed for any given cluster)
[ ] Relax the current per-host failed run check to go CRITICAL only if puppet has been failing for longer than X and/or for the last N runs
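
As a rough illustration of the two puppet checklist items above, here is a minimal Python sketch of the decision logic only. Everything in it is an assumption made for illustration: the function names `cluster_state` and `host_state` are hypothetical, the 1% threshold is the example figure from this task, and the default of N=3 runs is a placeholder since N has not been decided. How the failure counts and run history would actually be collected is not covered.

```python
from typing import Sequence

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def cluster_state(failed_hosts: int, total_hosts: int, crit_pct: float = 1.0) -> int:
    """Aggregate check: CRITICAL when more than crit_pct percent of the
    hosts in a cluster have a failed puppet run (1% is the example
    threshold from the task, not a tuned value)."""
    if total_hosts <= 0:
        return UNKNOWN  # no hosts reported, so nothing can be concluded
    failed_pct = 100.0 * failed_hosts / total_hosts
    return CRITICAL if failed_pct > crit_pct else OK


def host_state(recent_runs: Sequence[bool], max_failed_runs: int = 3) -> int:
    """Per-host check: CRITICAL only when the last max_failed_runs runs
    all failed. recent_runs is ordered oldest to newest and True means
    the run succeeded. max_failed_runs=3 is a hypothetical stand-in for N."""
    if len(recent_runs) < max_failed_runs:
        return OK  # not enough history to call it a sustained failure
    last_runs = recent_runs[-max_failed_runs:]
    return OK if any(last_runs) else CRITICAL
```

With these assumptions a single failed run on one host stays quiet (`host_state([True, True, False])` is `OK`), while 2 failing hosts out of 100 in a cluster trips the aggregate check (`cluster_state(2, 100)` is `CRITICAL`).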