
sre.hosts.decommission: don't FAIL when unable to set icinga downtime
Open, Medium, Public

Description

During a recent sre.hosts.decommission run I noticed that the run was marked as failed because the script was unable to downtime the host in Icinga.

Since this step is already annotated with "(likely already removed)", I'd suggest we make the downtime steps optional, so that a failure to set downtime does not affect the overall exit status of the run.
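A minimal sketch of what "optional" could mean here, using invented names (downtime_host, IcingaError) rather than the cookbook's real API: the downtime failure is logged as a warning and the run continues with its exit status untouched.

```
import logging

logger = logging.getLogger(__name__)


class IcingaError(Exception):
    """Stand-in for whatever error is raised when setting a downtime fails."""


def downtime_host(host: str) -> None:
    """Placeholder for the real 'set Icinga downtime' step."""
    raise IcingaError(f"{host} is not known to Icinga")


def decommission(host: str) -> int:
    """Run the decommission steps; the downtime step is best-effort."""
    try:
        downtime_host(host)
    except IcingaError as exc:
        # Report the failure, but do not let it flip the whole run to FAIL.
        logger.warning("Unable to downtime %s in Icinga (likely already removed): %s", host, exc)

    # ... the remaining decommission steps would run here ...
    return 0  # overall exit status unaffected by the downtime outcome


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    raise SystemExit(decommission("icinga1001.example.org"))
```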

Here's a recent example https://phabricator.wikimedia.org/T279602#7062271

Event Timeline

Thanks for the task. While it's true that the host could have already been removed from Icinga, this might happen for other reasons too, and I'm not sure that hiding it completely is a great choice. We could look at improving the check so it distinguishes between a missing host and other failures.
Do you happen to know why that step failed in this specific case? And if it was because the host was no longer in Icinga, do you know why?
In the usual workflow a host that is about to be decommissioned should still be in Icinga.
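One way to make that distinction, sketched with invented exception names (IcingaError and HostNotFoundError are assumptions, not the actual Spicerack classes): only the "host not found" case is downgraded to a non-fatal note, while any other Icinga failure still fails the step.

```
class IcingaError(Exception):
    """Generic failure while setting the downtime (hypothetical name)."""


class HostNotFoundError(IcingaError):
    """The host is not present in Icinga at all (hypothetical name)."""


def downtime_step(host: str, set_downtime) -> str:
    """Return the step outcome; ``set_downtime`` is whatever callable talks to Icinga."""
    try:
        set_downtime(host)
    except HostNotFoundError:
        # Host already gone from Icinga: note it, but don't fail the run.
        return "SKIPPED (host not found in Icinga)"
    except IcingaError:
        # Any other Icinga failure is unexpected and should still fail the step.
        raise
    return "PASS"


def _missing(host: str) -> None:
    """Simulated backend for the example: the host is not in Icinga."""
    raise HostNotFoundError(f"{host} not found in Icinga configuration")


print(downtime_step("icinga1001.example.org", _missing))  # SKIPPED (host not found in Icinga)
```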

Volans triaged this task as Medium priority. May 5 2021, 5:05 PM

> I'm not sure that hiding it completely is a great choice. We could look at improving the check so it distinguishes between a missing host and other failures.

Sounds good, and yeah, I would want to know if it happens as well. Maybe we could consider making the step optional? In other words, still report on it, but its outcome wouldn't affect the overall PASS/FAIL status.

> Do you happen to know why that step failed in this specific case? And if it was because the host was no longer in Icinga, do you know why?

It's a bit of a weird case. These hosts (icinga[12]001) had role::alerting_host but were not the active_host at the time sre.hosts.decommission was run, so they didn't exist in Icinga at all. alert2001 is another host currently in this non-active state (but there are no plans to decommission alert2001 any time soon).

>> I'm not sure that hiding it completely is a great choice. We could look at improving the check so it distinguishes between a missing host and other failures.

> Sounds good, and yeah, I would want to know if it happens as well. Maybe we could consider making the step optional? In other words, still report on it, but its outcome wouldn't affect the overall PASS/FAIL status.

That's an option, sure, and I can do that, but I'm pretty sure nobody would look at the logs/task at all if it says PASS. Hence my doubt that this is the right move if it only covers a very special corner case, as opposed to something that happens often for other reasons (broken hosts out of puppetdb).
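One possible middle ground, sketched here with invented names (RunReport is not how the cookbooks actually report), is to keep the exit status green while still surfacing the skipped step prominently in the summary, e.g. as "PASS (with warnings)":

```
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunReport:
    """Collect per-step outcomes so warnings stay visible in the final summary."""

    failures: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)

    def exit_code(self) -> int:
        # Only hard failures flip the exit status; warnings are reported but non-fatal.
        return 1 if self.failures else 0

    def summary(self) -> str:
        if self.failures:
            status = "FAIL"
        elif self.warnings:
            status = "PASS (with warnings)"
        else:
            status = "PASS"
        lines = [f"Overall: {status}"]
        lines += [f"  WARNING: {w}" for w in self.warnings]
        lines += [f"  FAILURE: {f}" for f in self.failures]
        return "\n".join(lines)


if __name__ == "__main__":
    report = RunReport()
    report.warnings.append("unable to set Icinga downtime for icinga1001 (host not found)")
    print(report.summary())
    raise SystemExit(report.exit_code())  # warnings alone keep the exit code at 0
```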

>> Do you happen to know why that step failed in this specific case? And if it was because the host was no longer in Icinga, do you know why?

> It's a bit of a weird case. These hosts (icinga[12]001) had role::alerting_host but were not the active_host at the time sre.hosts.decommission was run, so they didn't exist in Icinga at all. alert2001 is another host currently in this non-active state (but there are no plans to decommission alert2001 any time soon).

Right, the passive Icinga host not being monitored... definitely a snowflake :)