I'm summarizing here the discussion we had in the SRE Infrastructure Foundation meeting about hosts with Puppet disabled for long periods: their automatic removal from PuppetDB, the resulting removal from our monitoring, the related Netbox report that flags them, and the cases not covered by the current alerting.
We agreed that the removal from PuppetDB (currently set at 2 weeks) should stay, because keeping Puppet disabled for very long periods is not supported and can cause a range of issues.
These are the action items we should follow up on:
- Improve the Icinga alerting so that a single host with Puppet disabled for more than a week is reported as critical in the web UI and alerts on IRC, regardless of the aggregated Puppet disabled alert ( @jbond )
- Find a quick short-term solution to ensure that a host with Puppet disabled that disappears from PuppetDB is always flagged by the related Netbox report ( @Volans )
- Document clearly that no normal maintenance (including cookbooks) should need to keep Puppet disabled for more than a day or so. If a maintenance does need longer, its puppetization should be refactored to take that into account, so that Puppet can be kept running during the maintenance ( @Volans )
- Define a clearer contract for which state the power (on/off) and the switch port (open/closed) should be in at each of the Server Lifecycle steps, and have a plan to enforce the contract via our automation. (TBD)
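
To make the thresholds above concrete, here is a minimal sketch of how they relate to each other. It is only an illustration and not the actual Icinga check or Netbox report: the function names, the status labels and the plain hostname lists are all hypothetical, and the exact boundaries ("a day or so", "more than a week", "2 weeks") are taken literally from this summary.

```python
from datetime import timedelta

# Thresholds as stated in the summary; the constant names are hypothetical.
MAINTENANCE_LIMIT = timedelta(days=1)   # "a day or so" for normal maintenance
CRITICAL_AFTER = timedelta(weeks=1)     # single host becomes critical
PUPPETDB_EXPIRY = timedelta(weeks=2)    # host is removed from PuppetDB


def classify(disabled_for: timedelta) -> str:
    """Map how long Puppet has been disabled on a host to a status label."""
    if disabled_for >= PUPPETDB_EXPIRY:
        # The host is gone from PuppetDB and therefore from monitoring.
        return "expired"
    if disabled_for > CRITICAL_AFTER:
        return "critical"
    if disabled_for > MAINTENANCE_LIMIT:
        return "warning"
    return "ok"


def missing_from_puppetdb(netbox_active, puppetdb_hosts):
    """Return hosts that are active in Netbox but unknown to PuppetDB.

    Both arguments are plain iterables of hostnames here; how they would be
    fetched from Netbox and PuppetDB is deliberately left out.
    """
    return sorted(set(netbox_active) - set(puppetdb_hosts))
```

For example, `classify(timedelta(days=8))` falls in the "critical" band, and a host listed as active in Netbox but absent from the PuppetDB list is what the report should always flag.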