
Puppet failing on the alert hosts should alert
Open, Medium, Public

Description

A syntax error in one of the Prometheus queries used for monitoring was hitting the "All exclamation marks in the query parameter must be escaped e.g. \!" check and making Puppet fail on the alert hosts.
The issue went unnoticed for ~19 hours and was noticed only because an alert for a decommissioned host was triggered.
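
For illustration only, the escaping rule quoted in that error message could be checked along these lines. This is a minimal sketch, not the actual Puppet validation code: the function names are hypothetical, and it assumes that a `!` counts as escaped whenever it is directly preceded by a backslash.

```python
import re

# Hypothetical helper, not the real Puppet check.
# Matches a "!" that is NOT directly preceded by a backslash.
# Simplification: an escaped backslash followed by "!" is treated as escaped too.
_UNESCAPED_BANG = re.compile(r"(?<!\\)!")


def has_unescaped_exclamation(query: str) -> bool:
    """Return True if the query contains a '!' that is not written as '\\!'."""
    return _UNESCAPED_BANG.search(query) is not None


def escape_exclamations(query: str) -> str:
    """Return the query with every unescaped '!' rewritten as '\\!'."""
    return _UNESCAPED_BANG.sub(r"\\!", query)
```

Running something like `has_unescaped_exclamation()` over the monitoring queries before merging a change would flag this class of error without waiting for a Puppet run to fail.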

Due to the special nature of the alert hosts, we could consider making them an exception to the aggregated Puppet check and the last Puppet run alerts, so that they alert and get noticed sooner.
I think that either a Puppet failure or Puppet being disabled on the alert hosts for more than a couple of hours should be considered a problem. Thoughts?

Event Timeline

Volans triaged this task as Medium priority. Wed, May 19, 10:08 AM
Volans created this task.

I tend to agree that Puppet failures on alert hosts are more critical than on other hosts. Implementation-wise, I think we could tweak the check_puppet_run thresholds on alert hosts to make failures go critical sooner. Alternatively (and as I'm writing this, I think this is my preferred option), we could add another alert specifically for Puppet failures on alert hosts that goes critical.
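
A rough sketch of the decision logic such a dedicated alert might use, under stated assumptions: the function name, its parameters and the two-hour threshold (taken from "a couple of hours" in the description) are hypothetical, and this is not how check_puppet_run is implemented.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed threshold, based on "a couple of hours" in the task description.
CRITICAL_AFTER = timedelta(hours=2)


def alert_host_puppet_state(
    now: datetime,
    last_success: datetime,
    disabled_since: Optional[datetime] = None,
    threshold: timedelta = CRITICAL_AFTER,
) -> str:
    """Return "CRITICAL" if Puppet has been disabled, or has not completed a
    successful run, for longer than the threshold; otherwise return "OK"."""
    # Puppet disabled for too long on an alert host.
    if disabled_since is not None and now - disabled_since > threshold:
        return "CRITICAL"
    # No successful Puppet run for too long (for example, repeated failures).
    if now - last_success > threshold:
        return "CRITICAL"
    return "OK"
```

With these assumed values, the ~19 hours without a successful run described above would have gone critical after two hours.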

lmata moved this task from Inbox to In progress on the observability board.