update failed puppet checks so that they go critical 24 hours
Closed, Resolved, Public

Description

The icinga puppet tests were recently updated so that a puppet failure on a host only puts the check into a warning state; we now send a critical alert only if a percentage of all hosts go into a failed state. However, this has caused us to miss failing hosts which perform a specific role and therefore don't reach the required percentage. As such, it would be useful to also go into an alerting state if puppet hasn't run for an extended period of time, e.g. 24 hours.
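The per-host behaviour requested above can be sketched roughly as follows. This is only an illustration of the intended logic, not the actual check_puppetrun code: the function name, its parameters, and the idea of passing in the time since the last successful run are all assumptions made for the example. The exit codes are the standard Nagios/Icinga plugin values.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

DAY_SECONDS = 24 * 60 * 60


def puppet_check_state(last_run_failed, seconds_since_success,
                       critical_after=DAY_SECONDS):
    """Return a plugin exit code for a single host (hypothetical helper).

    last_run_failed: True if the most recent puppet run failed.
    seconds_since_success: seconds since the last successful run.
    critical_after: threshold after which the check goes critical.
    """
    # Puppet has not run successfully for too long: alert regardless
    # of how many other hosts are failing.
    if seconds_since_success >= critical_after:
        return CRITICAL
    # A recent failure alone only produces a warning.
    if last_run_failed:
        return WARNING
    return OK
```

For example, `puppet_check_state(True, 600)` yields a warning, while `puppet_check_state(True, 25 * 3600)` yields a critical state.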

Event Timeline

jbond triaged this task as Medium priority. Oct 25 2019, 12:53 PM
jbond created this task.

Change 546165 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] check_puppetrun: dont alert for diabled puppet agents for 1 day

https://gerrit.wikimedia.org/r/546165

I took a look at how the mod gets the last puppet run data, and it just does the following: stat -c %Z /var/lib/puppet/state/classes.txt, which isn't really useful for our use case. I don't see any other metadata which gives the last failed state, so we may need to store it ourselves.
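To make the comment above concrete, here is a minimal sketch. The first function mirrors what stat -c %Z reports (the file's last status change time, which on Unix is st_ctime). The other two show one possible way to store a timestamp ourselves; the state file, its format, and all function names are hypothetical and not taken from the merged patch.

```python
import os
import time


def seconds_since_status_change(path, now=None):
    """Age in seconds of the file's ctime, the value `stat -c %Z` prints."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_ctime


def record_success(state_file, now=None):
    """Write the current epoch time to a state file (hypothetical format)."""
    now = time.time() if now is None else now
    with open(state_file, "w") as fh:
        fh.write(str(int(now)))


def read_last_success(state_file):
    """Return the stored epoch time, or None if missing or unreadable."""
    try:
        with open(state_file) as fh:
            return int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return None
```

A check could then compare `time.time() - read_last_success(...)` against a 24 hour threshold, treating a `None` result as a separate case to decide on.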

@jbond I had already opened T236345 for this. I guess that can probably be merged into this at this point.

herron renamed this task from update failed puppet checkes so that they go critical 24 hours to update failed puppet checks so that they go critical 24 hours. Oct 28 2019, 7:02 PM
herron updated the task description.
herron subscribed.

Change 546165 merged by Jbond:
[operations/puppet@production] check_puppetrun: don't alert for disabled puppet agents for 1 day

https://gerrit.wikimedia.org/r/546165

jbond claimed this task.

This is complete; please reopen if there are further issues or improvements.