
sre.hosts.reboot-single cookbook removes any and all downtimes after reboot
Closed, Declined (Public)

Description

During planned reboots of LVS hosts, the process requires manually downtiming the hosts (with sre.hosts.downtime). This manual downtime would be expected to remain after calling sre.hosts.reboot-single, but it gets removed automatically once the reboot completes. This causes alerts to fire, because the services still require manual intervention before we would remove the manual downtime.

Event Timeline

IMHO the solution here is to create a dedicated cookbook for the LVS hosts that has all the logic needed for LVS reboots (reboot the secondaries first, then the primaries, start from the lower-traffic, less important ones and move on to the more critical ones, disable puppet, stop pybal, etc.).
It could be either a cookbook that targets a given traffic-level+datacenter, or a rolling-restart-reboot one that can go over all of them or a subset of them.
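
To make the ordering idea concrete, here is a minimal sketch of how such a cookbook might sort its targets. It is only one reading of the rules above (all secondaries before any primary, and within each role the least critical traffic class first). The LvsHost type, the traffic_rank field, and reboot_order are hypothetical names for illustration, not part of any existing cookbook or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LvsHost:
    """Hypothetical description of an LVS host; not a real inventory type."""
    name: str
    role: str          # "secondary" or "primary"
    traffic_rank: int  # assumption: lower number = less critical traffic class


def reboot_order(hosts):
    """Return hosts in reboot order.

    Secondaries come before primaries; within each role, hosts serving
    less critical traffic come first. The name is used as a final
    tie-breaker so the order is deterministic.
    """
    role_priority = {"secondary": 0, "primary": 1}
    return sorted(
        hosts,
        key=lambda h: (role_priority[h.role], h.traffic_rank, h.name),
    )
```

A per-datacenter variant would filter the host list before sorting. Disabling puppet and stopping pybal would happen per host around the actual reboot and are left out here.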

The sre.hosts.reboot-single cookbook is in general meant to be used as a one-off for single hosts that don't have more specialized cookbooks and don't need particular care or pre/post operations.

This highlights the larger problem of the opacity of cookbooks, particularly those that purport to be generalized. Removing downtimes not related to its own operation is overreaching and IMHO the solution is for it to remove only its own downtime.

I don't think that's possible in Icinga due to the Icinga "APIs". For the Alertmanager downtime, it already removes only the downtime created by the cookbook itself.
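
For illustration, the "remove only its own downtime" behaviour can be sketched with a toy in-memory store that hands out an ID per downtime. This is an assumption-laden model, not the Alertmanager API or the cookbook's actual code: DowntimeStore and reboot_with_own_downtime are hypothetical names, and the sketch assumes the backend can address individual downtimes by ID.

```python
import itertools


class DowntimeStore:
    """Toy stand-in for an alerting backend that tracks downtimes by ID."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.active = {}  # downtime ID -> (host, reason)

    def add(self, host, reason):
        """Register a downtime and return its ID."""
        downtime_id = next(self._ids)
        self.active[downtime_id] = (host, reason)
        return downtime_id

    def remove(self, downtime_id):
        """Remove a single downtime; unknown IDs are ignored."""
        self.active.pop(downtime_id, None)


def reboot_with_own_downtime(store, host, do_reboot):
    """Downtime the host, reboot it, then remove only the downtime created here."""
    own_id = store.add(host, "reboot")
    try:
        do_reboot(host)
    finally:
        # Only the ID created above is removed, so a manual downtime on the
        # same host survives. This sketch also cleans up on failure; a real
        # cookbook might prefer to keep the downtime in that case.
        store.remove(own_id)
```

With a manual downtime added before the call, the store still contains that entry afterwards, which is the behaviour the task description expected.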

@BCornwall is there a specific downtime that you have in mind for the LVS servers? That would give us more context. As Riccardo mentioned, the Icinga "API" is not great; any chance that the downtime could become an Alertmanager one?

Thanks for the response, @elukey! Indeed, Icinga would ideally not even be used any more. Since the service in question is planned to be replaced in the upcoming months, it's not worth the porting effort. However, that is a good response: "Why are you using this dead alerting system in the first place? Migrate over to Prometheus/AM".

Now that I see that these are limitations on Icinga itself, I imagine this bug can either be closed or rot until someone else encounters this/complains.

Boldly resolving given that Icinga is being phased out.