Decommission cookbook: stop when user inputs "abort"
Open, Medium, Public

Description

While I was decommissioning Elastic hosts in T358882, I made an operational error which caused one of our Elastic clusters to lose quorum (see this incident).

I realized the problem was related to the decommission, so I wanted to stop the cookbook ASAP. The cookbook was sitting at a prompt which read: 'Type "go" to proceed or "abort" to interrupt the execution'. I tried to halt the cookbook by typing "abort". But instead of stopping, the cookbook skipped to the next step and wiped the filesystem on one of the master hosts, which significantly complicated the recovery effort.

I'm creating this ticket to request that the sre.hosts.decommission cookbook abort completely when the user inputs "abort".

I've captured the decommission logs and my tmux buffer in /home/bking/decom/ on cumin2002. Line 1977 of the tmux buffer shows my "abort" input.

Thanks for taking a look. Please let me know if you need more info.

Event Timeline

I've looked at the logs and the code. Some clarifications, questions, and comments:

  1. because the cookbook was prompting the user, it was already stopped, waiting for user input. If no answer would have been entered the cookbook would have stayed there doing nothing, allowing for the operator to investigate the situation.
  2. the abort there is to interrupt the execution of that command raising an exception, and what a cookbook does after that is outside the scope of the confirmation prompt. In particular, that confirmation was about whether or not to commit the interface changes on the switch.
    • Was the decommission cookbook re-run on the aborted host (elastic2050) after the incident was resolved, to ensure all the decom steps were performed?
  3. the current implementation of the decommission cookbook executes the decom on all selected hosts, catching any exception from each single-host run and reporting them at the end. It could of course be changed, for example to ask the user on error whether to continue or not.
    • The current implementation doesn't catch a Ctrl+C, so that would have interrupted the cookbook execution altogether.
  4. the decommission cookbook performs destructive actions and as such has already various warnings and prompts to the user to make sure it is not run on the wrong hosts. The incident report (as of now) doesn't clarify why the cookbook was run on the wrong hosts and what could have prevented it.
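To make points 2 and 3 concrete, here is a minimal Python sketch of the control flow as described in this comment. It is not the actual sre.hosts.decommission code: `AbortError`, `confirm`, `decommission_host`, `run_all`, and the host names are hypothetical stand-ins, and the prompt answers are supplied up front instead of being read from the terminal. It only shows why an "abort" that raises an exception ends the run for one host while the loop carries on with the next one.

```python
class AbortError(Exception):
    """Hypothetical stand-in for the exception raised when the operator types "abort"."""


def confirm(answer: str) -> None:
    """Simulate the go/abort prompt using a pre-supplied answer (hypothetical helper)."""
    if answer == "abort":
        raise AbortError("Confirmation manually aborted")
    if answer != "go":
        raise ValueError(f"Unexpected answer: {answer!r}")


def decommission_host(host: str, answer: str, actions: list[str]) -> None:
    """Hypothetical single-host run: one confirmation, then the remaining steps."""
    confirm(answer)  # "abort" raises here, which skips the rest of THIS host's run only
    actions.append(f"run remaining decommission steps on {host}")


def run_all(answers: dict[str, str]) -> tuple[list[str], dict[str, str]]:
    """Run every host, collecting per-host failures to report at the end."""
    actions: list[str] = []
    failures: dict[str, str] = {}
    for host, answer in answers.items():
        try:
            decommission_host(host, answer, actions)
        except Exception as exc:  # AbortError lands here; KeyboardInterrupt would not
            failures[host] = f"{type(exc).__name__}: {exc}"
    return actions, failures


actions, failures = run_all({"host-a": "abort", "host-b": "go"})
print(actions)   # host-b is still processed after the abort on host-a
print(failures)  # host-a is only reported as a failure at the end
```

The Ctrl+C remark in point 3 follows from Python's exception hierarchy: `KeyboardInterrupt` derives from `BaseException`, not `Exception`, so an `except Exception` handler like the one in this sketch lets it propagate and stop the whole run.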

Hello Volans, thanks for the quick response. I'll try to respond inline.

> If no answer would have been entered the cookbook would have stayed there doing nothing, allowing for the operator to investigate the situation.

I thought it would be safer to abort the cookbook altogether at that point rather than leaving it hanging. I agree that I should have left it alone.

> the abort there is to interrupt the execution of that command raising an exception,

If you were a user who did not run these cookbooks every day and was not familiar with the inner workings, what would your expectations be around the message 'Type "go" to proceed or "abort" to interrupt the execution'?

> the decommission cookbook performs destructive actions and as such has already various warnings and prompts to the user

To be clear, the cookbook itself is not to blame for my error. I had already agreed to the destructive actions, but I thought I was aborting the cookbook and preventing further damage. Instead, it wiped the filesystem of the active master.

> The incident report (as of now) doesn't clarify why the cookbook was run on the wrong hosts and what could have prevented it.

The cookbook was running on the correct hosts; the problem was that I removed the hosts from Puppet before failing over the master.

What I am requesting is output that aligns with the principle of least surprise. I'm not likely to make this mistake again, so if you don't feel it is worth the effort, feel free to close this one out.
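For illustration, one possible shape for the least-surprise behavior is sketched below in Python: after any single-host failure (including an operator abort), the loop asks before touching the remaining hosts, and anything other than "go" stops the whole run. This is only a sketch under assumptions, not the cookbook's code: `run_with_stop_on_failure`, `run_host`, `ask`, `AbortError`, and the host names are all hypothetical.

```python
from typing import Callable, Iterable


class AbortError(Exception):
    """Hypothetical exception meaning 'the operator asked to stop'."""


def run_with_stop_on_failure(
    hosts: Iterable[str],
    run_host: Callable[[str], None],
    ask: Callable[[str], str],
) -> tuple[list[str], list[str], list[str]]:
    """Run run_host for each host; after a failure, ask before continuing.

    Returns (completed, failed, skipped). Any answer other than "go"
    leaves the remaining hosts untouched.
    """
    pending = list(hosts)
    completed: list[str] = []
    failed: list[str] = []
    while pending:
        host = pending.pop(0)
        try:
            run_host(host)
            completed.append(host)
        except Exception as exc:
            failed.append(host)
            prompt = f"{host} failed ({exc}). Continue with {pending}? [go/abort] "
            if pending and ask(prompt) != "go":
                break  # stop here; whatever is still pending is reported as skipped
    return completed, failed, pending


def fake_run(host: str) -> None:
    """Hypothetical single-host run in which the operator aborts on host-a."""
    if host == "host-a":
        raise AbortError("aborted by operator")


result = run_with_stop_on_failure(["host-a", "host-b"], fake_run, ask=lambda _prompt: "abort")
print(result)
```

In an interactive session `ask` could simply be the built-in `input`. Treating every answer except "go" as a stop is a deliberate fail-safe choice in this sketch, so a typo never results in more hosts being processed.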

Change #1018718 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.hosts.decommission: ask on failure

https://gerrit.wikimedia.org/r/1018718

Volans triaged this task as Medium priority.