
wmf-auto-reimage: downtime expiration after failed install
Closed, Declined (Public)

Description

Last night I started a reimage before going to sleep. The reimage got as far as running puppet but ultimately failed. Hours later (around 2AM my time) the downtimes set by the reimage script expired, causing a flurry of middle-of-the-night alerts.

I definitely should have seen that coming! Nevertheless, unattended overnight installs seem like a potentially useful thing for wmf-auto-reimage to support. Would it make sense for the default pre-reimage downtime to be 12 hours, or 72 hours, or some other long window, rather than the current 2 hours?

Event Timeline

Restricted Application added a subscriber: Aklapper.

The idea behind the short downtime was to avoid surprises long after the reimage has finished, when the host may even have been put back into production without anyone noticing that something is still "red".
With the migration of the reimage scripts to cookbooks (in progress) we will be able to take advantage of additional spicerack features, like checking the Icinga state of a host (though that doesn't include checks not attached to the host in Icinga terms) and even automatically removing the downtime once everything is "green".
In that case, if the host still has some failing checks after a while, we could do something "specific", still to be decided.
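
To make that flow concrete, here is a minimal sketch of "poll until green, then remove the downtime". Every name in it is hypothetical: `get_failing_checks` and `remove_downtime` are placeholder callables standing in for whatever the cookbook ends up using, not the actual spicerack API, and the timeout handling is only one possible choice.

```python
import time
from typing import Callable, List


def wait_until_green(
    get_failing_checks: Callable[[], List[str]],  # hypothetical: names of failing checks
    remove_downtime: Callable[[], None],          # hypothetical: clears the host downtime
    timeout_s: float,
    poll_interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Poll the host's checks until none are failing or the timeout elapses.

    Returns an empty list if the host went green (the downtime is removed),
    otherwise the checks still failing at the deadline (the downtime is left
    in place so the caller can decide what to do next).
    """
    deadline = clock() + timeout_s
    while True:
        failing = get_failing_checks()
        if not failing:
            # Everything is green: drop the downtime so real alerts flow again.
            remove_downtime()
            return []
        if clock() >= deadline:
            # Still red after the allowed window: hand the list back to the caller.
            return failing
        sleep(poll_interval_s)
```

The non-empty return value is where the "something specific" would hook in, for example extending the downtime or notifying only the person who started the reimage.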

I personally don't think that a longer downtime solves the issue, because the probability of forgetting about it grows with the length of the downtime, which risks causing even more surprise notifications, but I might be convinced otherwise.

Another possible approach, already available, is to turn the Icinga notification flag off in hiera. It's not ideal, but it might be another solution when the reimage and the first puppet run don't give you back a healthy host.

nskaggs triaged this task as Low priority.

I personally don't think that a longer downtime solves the issue, because the probability of forgetting about it grows with the length of the downtime, which risks causing even more surprise notifications, but I might be convinced otherwise.

I agree that it increases the risk of forgetting. My main question here is about using this script unattended/overnight. As it is, it's not safe to run in that mode. Maybe we could add a switch that specifies the post-rebuild downtime, so I could explicitly say "don't page me about this until I wake up"?
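
Roughly what I have in mind, as a sketch only: the `--downtime-hours` flag and the helper names below are made up for illustration and are not existing wmf-auto-reimage options. The default stays at the current 2 hours, so nothing changes unless someone asks for a longer window.

```python
import argparse
from datetime import timedelta

DEFAULT_DOWNTIME_HOURS = 2  # the current default mentioned in this task


def positive_hours(value: str) -> int:
    """argparse type: accept only a positive whole number of hours."""
    hours = int(value)
    if hours <= 0:
        raise argparse.ArgumentTypeError("downtime must be a positive number of hours")
    return hours


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Reimage a host (illustrative sketch)")
    parser.add_argument("host", help="host to reimage")
    # Hypothetical switch: how long to keep the host downtimed.
    parser.add_argument(
        "--downtime-hours",
        type=positive_hours,
        default=DEFAULT_DOWNTIME_HOURS,
        help="length of the downtime in hours (default: %(default)s)",
    )
    return parser.parse_args(argv)


def downtime_duration(args: argparse.Namespace) -> timedelta:
    """Convert the parsed option into a duration for the downtime call."""
    return timedelta(hours=args.downtime_hours)
```

An overnight run would then pass something like `--downtime-hours 12` explicitly, which keeps the long window a deliberate choice rather than a default that is easy to forget about.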