"Puppet last run" status is currently checked with a 1 minute check interval, 1 minute retry interval, and 3 max check. So it should alert about 3 minutes after a failed puppet run. It's also checking many times in-between runs. Tuning this could help reduce the number of service checks icinga needs to perform.
- How often are run failure alerts actionable, i.e. not false positives caused by maintenance or by an issue someone is already aware of?
- Could the check interval be aligned with the Puppet run interval of 30 minutes? The two would not be synchronized, but perhaps feedback on a recent run failure is sufficient.
- Could this check alert only after some number of consecutive run failures? This may help reduce non-actionable alerts where someone is already actively fixing the issue. A longer time window could allow the fix to happen before alerting.
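
To make the trade-off concrete, here is a rough sketch of the arithmetic behind the numbers above. It assumes the usual Nagios/Icinga soft/hard state behaviour: after the first failed check, the service is rechecked at the retry interval until max check attempts is reached, and only then does it alert. The function names are hypothetical, and scheduling latency and check execution time are ignored.

```python
MINUTES_PER_DAY = 24 * 60


def checks_per_day(check_interval_min):
    """Checks per host per day while the service stays OK (no retries)."""
    return MINUTES_PER_DAY // check_interval_min


def worst_case_alert_delay_min(check_interval_min, retry_interval_min,
                               max_check_attempts):
    """Upper bound, in minutes, from a failed run to the alert.

    Assumption: the first failing check can come up to one full check
    interval after the failed run, and each remaining attempt adds one
    retry interval before the hard state is reached.
    """
    return check_interval_min + (max_check_attempts - 1) * retry_interval_min


if __name__ == "__main__":
    # (check interval, retry interval, max check attempts), all in minutes
    candidates = {
        "current": (1, 1, 3),
        "aligned with 30-minute runs": (30, 30, 3),
    }
    for name, (check, retry, attempts) in candidates.items():
        print(name,
              checks_per_day(check),
              worst_case_alert_delay_min(check, retry, attempts))
```

Under these assumptions the current settings give 1440 checks per host per day and a worst-case delay of 3 minutes, while a 30-minute check and retry interval gives 48 checks per day and a worst-case delay of 90 minutes. Note that a retry interval shorter than the Puppet run interval re-reads the result of the same run, so max check attempts only approximates "consecutive run failures" when the retry interval is close to 30 minutes.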