
Host decommission improvements
Open, Normal · Public · 0 Story Points

Description

As a follow-up from the last SRE Summit, here are the agreed steps to simplify the host decommissioning process:

  • Enhance the decommissioning cookbook to add:
    • Wipe bootloaders to prevent the host from booting again
    • Shut down the host
    • Set Netbox state to Decommissioning
  • Update the Server_Lifecycle page on Wikitech
  • Update the Phabricator template for host decommissioning
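
The three cookbook additions above can be sketched as an ordered plan. This is only an illustration, not the actual sre.hosts.decommission code: the `DecomPlan` class, the step functions, and the host name in the usage note are all hypothetical, and the step bodies are left as placeholders. The order simply follows the list above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DecomPlan:
    """Ordered decommission steps for one host (hypothetical helper)."""

    host: str
    dry_run: bool = True
    log: List[str] = field(default_factory=list)

    def run_step(self, name: str, action: Callable[[str], None]) -> None:
        # In dry-run mode only record what would happen; never call the action.
        if self.dry_run:
            self.log.append(f"DRY-RUN {name}: {self.host}")
            return
        action(self.host)
        self.log.append(f"DONE {name}: {self.host}")


# Placeholder actions: the real implementations are not part of this sketch.
def wipe_bootloaders(host: str) -> None:
    raise NotImplementedError("hypothetical: wipe the bootloader on every disk")


def shutdown_host(host: str) -> None:
    raise NotImplementedError("hypothetical: power the host off")


def set_netbox_decommissioning(host: str) -> None:
    raise NotImplementedError("hypothetical: set the Netbox state to Decommissioning")


# Same order as the list in the task description.
STEPS = [
    ("wipe-bootloaders", wipe_bootloaders),
    ("shutdown", shutdown_host),
    ("netbox-decommissioning", set_netbox_decommissioning),
]


def decommission(host: str, dry_run: bool = True) -> List[str]:
    """Run (or, by default, only list) the decommission steps for a host."""
    plan = DecomPlan(host=host, dry_run=dry_run)
    for name, action in STEPS:
        plan.run_step(name, action)
    return plan.log
```

Calling `decommission("example1001.example.org")` with the default `dry_run=True` only returns the list of planned steps; passing `dry_run=False` would invoke the placeholder actions, which deliberately raise until real implementations are supplied.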

Event Timeline

Volans triaged this task as Normal priority. · Aug 23 2019, 9:18 AM
Volans created this task.
Restricted Application added a subscriber: Aklapper. · Aug 23 2019, 9:18 AM

Change 531897 had a related patch set uploaded (by Volans; owner: Volans):
[operations/cookbooks@master] sre.hosts.decommission: enhance capabilities

https://gerrit.wikimedia.org/r/531897

Volans moved this task from Backlog to In Progress on the SRE-tools board. · Aug 23 2019, 10:31 AM

Hi @Volans - I was wondering, in the meantime, would it be possible to give all the FTE dc-ops engineers the necessary permissions to install and decom hosts from beginning to end? Maybe either by adding these rights to a dc-ops group or by granting root access to Papaul? He's definitely going to need the ability to do all this in the next 1.5 months, since he'll be in Amsterdam refreshing the entire site. Thanks, Willy

Volans added a comment. · Fri, Sep 6, 4:34 PM

@wiki_willy the related patch above should already help a lot, but as you know I'm off these days and cannot give it the testing it needs before merging. If anyone else wants to volunteer to merge and test it, they are welcome ;) Otherwise I'll take care of it as soon as I'm back.
As a workaround it is clearly possible to add more permissions to dcops; it's a trivial change in Puppet that anyone can do. But it's not up to me to decide: that should be considered an access request, to be decided by the owners of the group (SRE, usually discussed in the weekly meeting).

Change 531897 merged by jenkins-bot:
[operations/cookbooks@master] sre.hosts.decommission: enhance capabilities

https://gerrit.wikimedia.org/r/531897

I've tested the cookbook with lithium (T229557) and it worked great: PuppetDB/Debmonitor entries were removed, the Puppet cert was revoked, Netbox was correctly updated to "Decommissioning" and the server was powered off. I connected to the mgmt interface and powered it up manually to validate that it's unbootable: the server correctly failed to boot from disk, stalled for a few seconds at "boot: " and then fell back to PXE boot. The first time I've really enjoyed an unbootable system!

I ran another test with iron (T220505) and it worked fine as well: PuppetDB/Debmonitor entries were removed, the Puppet cert was revoked, Netbox was correctly updated to "Decommissioning" and the server was powered off. I connected to the mgmt interface, powered it up manually to validate that it's unbootable, and manually tested both SATA disks as the boot device in the BIOS Boot Manager. Both failed \o/