I tried to use the sre.hosts.decommission cookbook to remove an obsolete Ganeti VM, which revealed a few issues:
First I tried to run the dry-run mode as documented on https://wikitech.wikimedia.org/wiki/Decom_script:
cumin2001:~# cookbook -d sre.hosts.decommission poolcounter1003.eqiad.wmnet -t 224572
DRY-RUN: Executing cookbook sre.hosts.decommission with args: ['poolcounter1003.eqiad.wmnet', '-t', '224572']
DRY-RUN: START - Cookbook sre.hosts.decommission
DRY-RUN: Resolved CNAME record for icinga.wikimedia.org: icinga.wikimedia.org. 300 IN CNAME icinga1001.wikimedia.org.
DRY-RUN: Executing commands ['puppet node clean poolcounter1003.eqiad.wmnet', 'puppet node deactivate poolcounter1003.eqiad.wmnet'] on 1 hosts: puppetmaster1001.eqiad.wmnet
DRY-RUN: Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: ['poolcounter1003.eqiad.wmnet']
DRY-RUN: Executing commands ['icinga-downtime -h "poolcounter1003" -d 14400 -r "Host decommission - jmm@cumin2001 - 224572"'] on 1 hosts: icinga1001.wikimedia.org
DRY-RUN: Skip removing host poolcounter1003.eqiad.wmnet from Debmonitor in DRY-RUN
DRY-RUN: Skip updating Phabricator task 224572 in DRY-RUN with comment: cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `poolcounter1003.eqiad.wmnet`
- Removed from Puppet master and PuppetDB
- Downtimed host on Icinga
- No management interface found (likely a VM)
- Removed from DebMonitor
DRY-RUN: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
I would have expected that instead of "Executing commands ['puppet node clean.." and "Executing commands ['icinga-downtime ..", these steps would also have printed "Skip foo in DRY-RUN".
Then I ran the decom cookbook without the dry-run option:
cumin2001:~# cookbook sre.hosts.decommission poolcounter1003.eqiad.wmnet -t 224572
START - Cookbook sre.hosts.decommission
Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: ['poolcounter1003.eqiad.wmnet']
Removed host poolcounter1003.eqiad.wmnet from Debmonitor
Updated Phabricator task 224572
END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
I would have expected this run to also print the steps for removing the host from Puppet/PuppetDB and downtiming it in Icinga. I assume it didn't because dry-run wasn't applied to those steps earlier, so they had already been executed. Looking at the cookbook, the Icinga and Puppetmaster actions come from other Spicerack modules (icinga and puppet_master), so either the dry-run flag isn't passed to them correctly or they lack support for it. If they lack support, the cookbook should reject running with "-d" rather than change things when told not to.
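To illustrate the behaviour I would expect, here is a minimal Python sketch. It is not Spicerack code: `Step`, `supports_dry_run` and `run_steps` are hypothetical names made up for this example. The idea is that in dry-run mode every step is only reported, and the run is rejected up front if any step cannot honour dry-run.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """Hypothetical cookbook step; NOT an actual Spicerack class."""
    name: str
    action: Callable[[], None]
    supports_dry_run: bool = True


def run_steps(steps: List[Step], dry_run: bool) -> List[str]:
    """Run all steps, or only report them when dry_run is set."""
    if dry_run:
        # Refuse to start rather than risk changing things in dry-run mode.
        unsupported = [s.name for s in steps if not s.supports_dry_run]
        if unsupported:
            raise RuntimeError(
                "Refusing dry-run, unsupported steps: " + ", ".join(unsupported)
            )

    log = []
    for step in steps:
        if dry_run:
            # Report the step without calling its action.
            log.append(f"DRY-RUN: Skip {step.name} in DRY-RUN")
        else:
            step.action()
            log.append(f"Done: {step.name}")
    return log
```

With something along these lines, the Puppet and Icinga steps would print a "Skip ... in DRY-RUN" line just like the Debmonitor and Phabricator steps do.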
Also, the server was not correctly removed from PuppetDB. It is still visible, for example, via Cumin/PuppetDB:
cumin2001:~$ sudo cumin poolcounter1003*
1 hosts will be targeted:
poolcounter1003.eqiad.wmnet
DRY-RUN mode enabled, aborting
I can also see it in PuppetDB. The Puppet cert was correctly dropped, but the "puppet node deactivate poolcounter1003.eqiad.wmnet" seems to have been lost. It worked for an earlier run of the decom script (for poolcounter1001), so maybe it needs a retry or similar to ensure the host is correctly removed?
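The retry idea could look roughly like the following Python sketch. Again, this is only an illustration under assumptions: `run_until_verified` is a hypothetical helper, and the `action` and `is_done` callables stand in for "run puppet node deactivate" and "check that the host is gone from PuppetDB", neither of which is implemented here.

```python
import time
from typing import Callable


def run_until_verified(
    action: Callable[[], None],
    is_done: Callable[[], bool],
    attempts: int = 3,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run action until is_done() is True; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        action()
        # Verify the result instead of trusting that the command succeeded.
        if is_done():
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise RuntimeError(f"Not verified after {attempts} attempts")
```

The point is that the cookbook would fail loudly if the host is still present after the last attempt, instead of reporting "Removed from Puppet master and PuppetDB" unconditionally.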