
Improve alerting for hosts with Puppet disabled for longer periods
Open, Medium, Public

Description

I'm summarizing here the discussion we had in the SRE Infrastructure Foundation meeting about hosts that have Puppet disabled for long periods: their automatic removal from PuppetDB, the resulting removal from our monitoring, the related Netbox report that flags them, and the cases not covered by the current alerting.

We've agreed that the removal from PuppetDB (currently set at 2 weeks) should stay, because keeping Puppet disabled for very long periods is not supported and might cause a range of issues.

These are the action items we should follow up on:

  1. Improve the Icinga alerting so that a single host with Puppet disabled for more than a week becomes a critical in the web UI and alerts on IRC, regardless of the aggregated Puppet disabled alert ( @jbond )
  2. Find a short-term, quick solution to ensure that a host with Puppet disabled that disappears from PuppetDB is always flagged by the related Netbox report ( @Volans )
  3. Document clearly that any normal maintenance (including cookbooks) should not need to keep Puppet disabled for more than a day or so. If it does, its puppetization should be refactored to take this into account, so that Puppet can be kept running during the maintenance ( @Volans )
  4. Define a clearer contract for the state the power (on/off) and the switch port (open/closed) should be in at each of the Server Lifecycle steps, and have a plan to enforce that contract via our automation. (TBD)
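
As a rough illustration of action item 2, the check boils down to a set difference between the hosts Netbox considers active and the hosts PuppetDB still knows about. This is only a minimal sketch: the function name `hosts_missing_from_puppetdb` and the plain-set inputs are assumptions made for the example, not the actual Netbox report code or any real Netbox/PuppetDB API.

```python
def hosts_missing_from_puppetdb(netbox_active, puppetdb_hosts):
    """Return the sorted hostnames that are active in Netbox but absent from PuppetDB.

    Both arguments are plain iterables of hostnames (hypothetical inputs;
    how they are fetched from Netbox and PuppetDB is out of scope here).
    """
    return sorted(set(netbox_active) - set(puppetdb_hosts))


# Hypothetical inventory data, for illustration only.
netbox_active = {"host1001", "host1002", "host2001"}
puppetdb_hosts = {"host1001", "host2001"}

missing = hosts_missing_from_puppetdb(netbox_active, puppetdb_hosts)
print(missing)
```

A host that was purged from PuppetDB after having Puppet disabled for too long would show up in `missing` for as long as Netbox still lists it as active.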

Event Timeline

Volans triaged this task as Medium priority. Mar 10 2021, 6:45 PM
Volans created this task.

One option for the longer term could also be to generate the list of mgmt hosts to monitor in Icinga from Netbox instead of from PuppetDB. Any host with a mgmt IP should be reachable by Icinga (except for a small race condition while it's being provisioned).
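
To make the idea concrete, here is a small sketch of deriving the mgmt monitoring targets from an inventory export. The `Device` record and its fields are hypothetical simplifications and not the real Netbox data model; the provisioning race condition mentioned above is not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """Hypothetical, simplified view of a Netbox device record."""
    name: str
    mgmt_ip: Optional[str]  # None when no mgmt IP is assigned


def mgmt_targets(devices):
    """Map hostname -> mgmt IP for every device that has a mgmt IP."""
    return {d.name: d.mgmt_ip for d in devices if d.mgmt_ip}


# Illustrative data only (addresses from the documentation range).
devices = [
    Device("host1001", "192.0.2.10"),
    Device("host1002", None),
    Device("host2001", "192.0.2.20"),
]
print(mgmt_targets(devices))
```

With this approach a host keeps its mgmt check even after it has been purged from PuppetDB, because the target list no longer depends on Puppet facts.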

Change 902764 had a related patch set uploaded (by Jbond; author: jbond):

[operations/alerts@master] team-sre/puppet-agent: Add alertmanager based check for disabled puppet

https://gerrit.wikimedia.org/r/902764

Change 902764 merged by Jbond:

[operations/alerts@master] team-sre/puppet-agent: Add alertmanager based check for disabled puppet

https://gerrit.wikimedia.org/r/902764

Improve the Icinga alerting so that a single host with Puppet disabled for more than a week becomes a critical in the web UI and alerts on IRC, regardless of the aggregated Puppet disabled alert ( @jbond )

I have added an Alertmanager check that goes critical once Puppet has been disabled for one week.
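
The threshold logic behind such a check can be sketched as follows. This is not the rule merged in operations/alerts; it is a standalone illustration, and the function name `disabled_too_long` and its parameters are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from this task: critical after one week of Puppet being disabled.
CRITICAL_AFTER = timedelta(weeks=1)


def disabled_too_long(disabled_since, now, threshold=CRITICAL_AFTER):
    """Return True if Puppet has been disabled for longer than the threshold.

    disabled_since is None when Puppet is currently enabled on the host.
    """
    if disabled_since is None:
        return False
    return now - disabled_since > threshold


now = datetime(2021, 3, 10, tzinfo=timezone.utc)
print(disabled_too_long(now - timedelta(days=8), now))
print(disabled_too_long(now - timedelta(days=2), now))
```

Keeping the threshold at one week leaves a margin before the 2-week removal from PuppetDB mentioned in the description.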

Document clearly that any normal maintenance (including cookbooks) should not need to keep Puppet disabled for more than a day or so.

I added a bit of text to the runbook I created for the above alert, but it likely needs to be improved, advertised, and possibly placed somewhere more prominent?