decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail
Open, Normal, Public

Description

I tried to use the sre.hosts.decommission cookbook to remove an obsolete Ganeti VM, which surfaced a few issues:

First I tried to run the dry-run mode as documented on https://wikitech.wikimedia.org/wiki/Decom_script:

cumin2001:~# cookbook -d sre.hosts.decommission poolcounter1003.eqiad.wmnet -t 224572
DRY-RUN: Executing cookbook sre.hosts.decommission with args: ['poolcounter1003.eqiad.wmnet', '-t', '224572']
DRY-RUN: START - Cookbook sre.hosts.decommission
DRY-RUN: Resolved CNAME record for icinga.wikimedia.org: icinga.wikimedia.org. 300 IN CNAME icinga1001.wikimedia.org.
DRY-RUN: Executing commands ['puppet node clean poolcounter1003.eqiad.wmnet', 'puppet node deactivate poolcounter1003.eqiad.wmnet'] on 1 hosts: puppetmaster1001.eqiad.wmnet
DRY-RUN: Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: ['poolcounter1003.eqiad.wmnet']
DRY-RUN: Executing commands ['icinga-downtime -h "poolcounter1003" -d 14400 -r "Host decommission - jmm@cumin2001 - 224572"'] on 1 hosts: icinga1001.wikimedia.org
DRY-RUN: Skip removing host poolcounter1003.eqiad.wmnet from Debmonitor in DRY-RUN
DRY-RUN: Skip updating Phabricator task 224572 in DRY-RUN with comment: cookbooks.sre.hosts.decommission executed by jmm@cumin2001 for hosts: `poolcounter1003.eqiad.wmnet`
- Removed from Puppet master and PuppetDB
- Downtimed host on Icinga
- No management interface found (likely a VM)
- Removed from DebMonitor
DRY-RUN: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)

I would have expected that instead of "Executing commands ['puppet node clean.." and "Executing commands ['icinga-downtime ..", these steps would also have printed something like "Skip foo in DRY-RUN".

Then I ran the decom cookbook without the dry-run option:

cumin2001:~# cookbook sre.hosts.decommission poolcounter1003.eqiad.wmnet -t 224572
START - Cookbook sre.hosts.decommission
Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: ['poolcounter1003.eqiad.wmnet']
Removed host poolcounter1003.eqiad.wmnet from Debmonitor
Updated Phabricator task 224572
END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)

I would have expected it to also print the steps for removing the host from Puppet/PuppetDB and downtiming it in Icinga now; I assume they are missing because dry-run wasn't applied to them earlier. Looking at the cookbook, the Icinga and Puppetmaster actions come from other Spicerack modules (icinga and puppet_master), so either the dry-run flag isn't passed to them correctly or they don't support it at all. If they don't support it, the cookbook should rather refuse to run with "-d" than change things while told not to.
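Something like the following is what I would have expected on the cookbook side. This is only an illustration; the spicerack.dry_run attribute and the puppet_master/icinga method names are assumptions, not the actual Spicerack API:

import logging
logger = logging.getLogger(__name__)

def decommission(fqdn, spicerack, puppet_master, icinga):
    if spicerack.dry_run:
        # Either skip the read-write steps explicitly...
        logger.info('Skip removing %s from Puppet master/PuppetDB in DRY-RUN', fqdn)
        logger.info('Skip downtiming %s on Icinga in DRY-RUN', fqdn)
        return
        # ...or, if a step cannot honour dry-run, refuse to run at all:
        # raise RuntimeError('this step does not support dry-run, re-run without -d')
    puppet_master.deactivate(fqdn)  # 'puppet node clean' + 'puppet node deactivate'
    icinga.downtime_hosts([fqdn])   # schedule the Icinga downtime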

Also, the server was not correctly removed from PuppetDB. It is e.g. still visible via Cumin/PuppetDB:

cumin2001:~$ sudo cumin poolcounter1003*
1 hosts will be targeted:
poolcounter1003.eqiad.wmnet
DRY-RUN mode enabled, aborting

and I can see it in PuppetDB. The Puppet cert was correctly dropped, but the "puppet node deactivate poolcounter1003.eqiad.wmnet" seems to have been lost. It worked for an earlier run of the decom script (for poolcounter1001), so maybe it needs a retry of some sort to ensure the host is correctly removed?

Event Timeline

Restricted Application added a subscriber: Aklapper. Aug 7 2019, 8:17 AM

After running the deactivate step a second time, poolcounter1003 was correctly removed. Looking at the PuppetDB logs, there might be some kind of race in PuppetDB:

jmm@puppetdb1001:~$ sudo grep " poolcounter1003" /var/log/puppetlabs/puppetdb/puppetdb.log | tail -n 5
2019-08-07 07:36:38,987 INFO  [p.p.command] [3158930-1565163398977] [4 ms] 'deactivate node' command processed for poolcounter1003.eqiad.wmnet
2019-08-07 07:37:26,106 INFO  [p.p.command] [3158993-1565163446040] [60 ms] 'replace facts' command processed for poolcounter1003.eqiad.wmnet
2019-08-07 07:37:32,105 INFO  [p.p.command] [3159010-1565163451663] [404 ms] 'replace catalog' command processed for poolcounter1003.eqiad.wmnet
2019-08-07 07:37:41,910 INFO  [p.p.command] [3159024-1565163461890] [15 ms] 'store report' puppet v5.5.10 command processed for poolcounter1003.eqiad.wmnet
2019-08-07 08:23:01,080 INFO  [p.p.command] [3162626-1565166181066] [7 ms] 'deactivate node' command processed for poolcounter1003.eqiad.wmnet

Maybe a Puppet run was still in progress while the "deactivate node" command was submitted, and that neutralised the deactivation? If that's the case, this will need a workaround on the cookbook side. I see two possible options:

  1. In the cookbook, disable Puppet and make sure no Puppet run is in progress when the cookbook gets started
  2. After submitting the "deactivate", wait for n seconds, check whether the PuppetDB entry is gone and, if not, re-submit the "deactivate" (a rough sketch follows below)
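A rough sketch of option 2; the PuppetDB URL and the exact response of the v4 nodes endpoint for a deactivated node are assumptions, and puppetmaster_hosts stands for whatever object the cookbook uses to run commands on puppetmaster1001:

import time
import requests

PUPPETDB_NODES = 'https://puppetdb1001.eqiad.wmnet/pdb/query/v4/nodes/'  # URL/port is an assumption

def deactivate_with_retry(puppetmaster_hosts, fqdn, attempts=3, wait=30):
    for _ in range(attempts):
        puppetmaster_hosts.run_sync('puppet node deactivate {f}'.format(f=fqdn))
        time.sleep(wait)  # give the PuppetDB command queue time to process the command
        resp = requests.get(PUPPETDB_NODES + fqdn)
        # Assumption: a 404 or a non-empty 'deactivated' timestamp means the node is gone
        if resp.status_code == 404 or resp.json().get('deactivated'):
            return
    raise RuntimeError('{f} is still active in PuppetDB after {n} attempts'.format(f=fqdn, n=attempts))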
Volans added a comment. Aug 7 2019, 8:44 AM

@MoritzMuehlenhoff the dry-run mode is passed to all modules, and all modules must implement it unless they only perform read-only actions.
The difference in logging is because dry-run automatically enables debug logging to stdout; if you look at the file logs (normal and detailed), they show all the steps.
The "Skip" wording was not added to those log messages given the DRY-RUN prefix in front, but of course it could be added if this is confusing.
PuppetDB has a command queue, and there is certainly an issue here to be investigated, see https://grafana.wikimedia.org/d/000000477/puppetdb?panelId=19&fullscreen&orgId=1&from=1564236470272&to=1565167416489
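For context, a minimal sketch of what implementing dry-run in a module could look like; the _dry_run attribute and the run_sync call are illustrative assumptions, not the actual Spicerack code:

import logging
logger = logging.getLogger(__name__)

class PuppetMasterExample(object):
    def __init__(self, master_hosts, dry_run=True):
        self._master_hosts = master_hosts  # object used to run commands on the puppetmaster
        self._dry_run = dry_run            # propagated from the cookbook's -d flag

    def deactivate(self, fqdn):
        if self._dry_run:
            logger.debug('Would run "puppet node deactivate %s" (skipped in DRY-RUN)', fqdn)
            return
        self._master_hosts.run_sync('puppet node deactivate {f}'.format(f=fqdn))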

There's more: next I ran the cookbook for a host on which the dry-run mode had not been used previously (to rule out that the incomplete dry-run skewed the effective run):
(Started at 2019-08-07 08:37:09,859)

cumin2001:~# cookbook sre.hosts.decommission poolcounter2001.codfw.wmnet  -t 224572
START - Cookbook sre.hosts.decommission
Scheduling downtime on Icinga server icinga1001.wikimedia.org for hosts: ['poolcounter2001.codfw.wmnet']
Removed host poolcounter2001.codfw.wmnet from Debmonitor
Updated Phabricator task 224572
END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)

The Puppet cert was removed and the PuppetDB entry remained (same issue as above), but this time the Debmonitor entry wasn't removed either: https://debmonitor.wikimedia.org/hosts/poolcounter2001.codfw.wmnet still shows all the data.

The removal in Debmonitor has a similar race to the PuppetDB removal: I seem to be really lucky, hitting two different races in two subsequent decom runs :-)

jmm@debmonitor1001:/srv/log/debmonitor$ sudo grep poolcounter2001 /srv/log/debmonitor/main.log | tail -n 5
[2019-08-07T08:07:40] [pid: 28697|app: 0|req: 3035/5546] 127.0.0.1 () {42 vars in 710 bytes} [Wed Aug  7 08:07:40 2019] POST /hosts/poolcounter2001.codfw.wmnet/update => generated 0 bytes in 202 msecs (HTTP/1.1 201) 6 headers in 397 bytes (1 switches on core 0)
[2019-08-07T08:37:16] [pid: 28699|app: 0|req: 2948/7147] 127.0.0.1 () {40 vars in 620 bytes} [Wed Aug  7 08:37:16 2019] DELETE /hosts/poolcounter2001.codfw.wmnet => generated 0 bytes in 33 msecs (HTTP/1.1 204) 6 headers in 400 bytes (1 switches on core 0)
[2019-08-07T08:37:23] [pid: 28697|app: 0|req: 3908/7151] 127.0.0.1 () {42 vars in 710 bytes} [Wed Aug  7 08:37:22 2019] POST /hosts/poolcounter2001.codfw.wmnet/update => generated 0 bytes in 969 msecs (HTTP/1.1 201) 6 headers in 397 bytes (1 switches on core 0)
[2019-08-07T08:50:48] [pid: 28699|app: 0|req: 3273/7891] 127.0.0.1 () {54 vars in 1208 bytes} [Wed Aug  7 08:50:48 2019] GET /hosts/poolcounter2001.codfw.wmnet => generated 16086 bytes in 68 msecs (HTTP/1.1 200) 7 headers in 449 bytes (1 switches on core 0)
[2019-08-07T08:54:13] [pid: 28699|app: 0|req: 3341/8057] 127.0.0.1 () {54 vars in 1209 bytes} [Wed Aug  7 08:54:13 2019] GET /hosts/poolcounter2001.codfw.wmnet => generated 16086 bytes in 51 msecs (HTTP/1.1 200) 7 headers in 449 bytes (1 switches on core 0)

The "08:07:40" and "08:37:22" are most certainly from the Cron-triggered /usr/local/sbin/puppet-run which triggers "apt-get update", so the DELETE from 08:37:16 was overwritten/neutralised by the 08:37:22. I see a few possible options:

  1. In the cookbook, disable the /etc/cron.d/puppet crontab entry. This fixes the race for the cron-triggered "apt-get update", but we could still hit other races, e.g. if a host gets decommed while someone runs a fleet-wide/DC-wide debdeploy update; still, I assume it greatly reduces the window
  2. After submitting the "curl -X DELETE", wait for n seconds, check whether the host entry is gone and, if not, re-submit the curl (a rough sketch follows below)
  3. Or maybe this rather needs a fix in Debmonitor itself to solve it for good: any POST that arrives after a DELETE could be discarded
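A rough sketch of option 2, using the Debmonitor endpoints visible in the log above (DELETE and GET on /hosts/<fqdn>); authentication is left out and the client-certificate paths are placeholders:

import time
import requests

DEBMONITOR_HOSTS = 'https://debmonitor.wikimedia.org/hosts/'
CLIENT_CERT = ('/path/to/client-cert.pem', '/path/to/client-key.pem')  # placeholder paths

def remove_from_debmonitor(fqdn, attempts=3, wait=30):
    url = DEBMONITOR_HOSTS + fqdn
    for _ in range(attempts):
        requests.delete(url, cert=CLIENT_CERT)
        time.sleep(wait)  # let any in-flight POST from the host land before re-checking
        if requests.get(url, cert=CLIENT_CERT).status_code == 404:  # entry is gone
            return
    raise RuntimeError('{f} is still present in Debmonitor after {n} attempts'.format(f=fqdn, n=attempts))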
MoritzMuehlenhoff renamed this task from "decom cookbook: dry-run mode not working / PuppetDB removal failed" to "decom cookbook: dry-run mode not working / PuppetDB and Debmonitor removals can fail". Aug 7 2019, 9:10 AM
ema added a subscriber: ema. Aug 7 2019, 9:16 AM
Volans added a comment. Aug 7 2019, 9:22 AM

The solution that was agreed at the SRE summit for this is to add a dd to overwrite the bootloader(s), so that the host can no longer boot, and then shut the host down (most likely via IPMI; for VMs probably directly via Ganeti).
At that point we can revoke certs in Puppet, remove the host from Debmonitor, etc. without the risk of any race or re-appearance.
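A rough sketch of that approach; the remote.query()/run_sync() calls and the /dev/sda device path are assumptions for illustration, and the actual power-off would go through IPMI or Ganeti rather than the host itself:

def wipe_bootloader_and_shutdown(remote, fqdn):
    host = remote.query(fqdn)
    # Overwrite the first 512 bytes (MBR/bootloader) so the host cannot boot again
    host.run_sync('dd if=/dev/zero of=/dev/sda bs=512 count=1')
    # Power the host off; in practice this would be done via IPMI (physical) or Ganeti (VMs)
    host.run_sync('shutdown -h now')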

ayounsi added a subscriber: ayounsi. Aug 8 2019, 5:56 PM

I ACKed the Netbox/PuppetDB alert (VM missing from Netbox: poolcounter2001), linking to this task.

CDanis triaged this task as Normal priority. Aug 16 2019, 12:59 PM