
sre.hosts.decommission -> generate_dns_snippets -> Cumin execution failed
Closed, Resolved · Public

Description

[cumin1001:~] $ sudo cookbook sre.hosts.decommission -t T274023 mwdebug1002.eqiad.wmnet
START - Cookbook sre.hosts.decommission
>>> ATTENTION: destructive action for 1 hosts: mwdebug1002.eqiad.wmnet
Are you sure to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Looking for matches in puppetmaster1001.eqiad.wmnet:/var/lib/git/operations/puppet
----- OUTPUT of 'cd /var/lib/git/...46[^0-9A-Za-z])'' -----                                                                
conftool-data/node/eqiad.yaml:    mwdebug1002.eqiad.wmnet: [apache2]                                                       
modules/install_server/files/dhcpd/linux-host-entries.ttyS0-115200:    fixed-address mwdebug1002.eqiad.wmnet;              
modules/profile/files/trafficserver/x-wikimedia-debug-routing.lua:        ["mwdebug1002.eqiad.wmnet"] = "mwdebug1002.eqiad.wmnet",                                                                                                                    
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.23hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cd /var/lib/git/...46[^0-9A-Za-z])''.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Looking for matches in puppetmaster1001.eqiad.wmnet:/srv/private
----- OUTPUT of 'cd /srv/private ...46[^0-9A-Za-z])'' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.72hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cd /srv/private ...46[^0-9A-Za-z])''.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Looking for matches in deploy1001.eqiad.wmnet:/srv/mediawiki-staging
----- OUTPUT of 'cd /srv/mediawik...46[^0-9A-Za-z])'' -----                                                                
debug.json:    "mwdebug1002.eqiad.wmnet",                                                                                  
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.16s/hosts]          
FAIL |                                                                           |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'cd /srv/mediawik...46[^0-9A-Za-z])''.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Found match(es) in the Puppet or mediawiki-config repositories (see above), proceed anyway?
Type "go" to proceed or "abort" to interrupt the execution
> go
Looking for Kerberos credentials on KDC kadmin node.
----- OUTPUT of 'find /srv/kerber...02.eqiad.wmnet*"' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.79hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'find /srv/kerber...02.eqiad.wmnet*"'.
----- OUTPUT of '/usr/local/sbin/...02.eqiad.wmnet*"' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.88hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/usr/local/sbin/...02.eqiad.wmnet*"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
No Kerberos credentials found.
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: ['mwdebug1002.eqiad.wmnet']
----- OUTPUT of 'icinga-downtime ...n1001 - T274023"' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  2.45hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'icinga-downtime ...n1001 - T274023"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Downtimed host on Icinga
Found Ganeti VM
Shutting down VM mwdebug1002.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet
----- OUTPUT of 'gnt-instance shu...1002.eqiad.wmnet' -----                                                                
Waiting for job 1134523 for mwdebug1002.eqiad.wmnet ...                                                                    
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:10<00:00, 10.66s/hosts]          
FAIL |                                                                           |   0% (0/1) [00:10<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'gnt-instance shu...1002.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
VM shutdown
----- OUTPUT of 'systemctl start ...iad_sync.service' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.87hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl start ...iad_sync.service'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Sleeping for 20s to avoid race conditions...
Removed host mwdebug1002.eqiad.wmnet from Debmonitor
Removed from DebMonitor
----- OUTPUT of 'puppet node clea...1002.eqiad.wmnet' -----                                                                
Notice: Revoked certificate with serial 2340                                                                               
Notice: Revoked certificate with serial 3962                                                                               
Notice: Revoked certificate with serial 5498                                                                               
mwdebug1002.eqiad.wmnet                                                                                                    
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.96s/hosts]          
FAIL |                                                                           |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node clea...1002.eqiad.wmnet'.
----- OUTPUT of 'puppet node deac...1002.eqiad.wmnet' -----                                                                
Submitted 'deactivate node' for mwdebug1002.eqiad.wmnet with UUID db4a3b35-c781-4438-ae8e-618a1689d227                     
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.83s/hosts]          
FAIL |                                                                           |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet node deac...1002.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Removed from Puppet master and PuppetDB
Issuing Ganeti remove command, it can take up to 15 minutes...
Removing VM mwdebug1002.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet. This may take a few minutes.
----- OUTPUT of 'gnt-instance rem...1002.eqiad.wmnet' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:05<00:00,  5.50s/hosts]          
FAIL |                                                                           |   0% (0/1) [00:05<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'gnt-instance rem...1002.eqiad.wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
VM removed
----- OUTPUT of 'systemctl start ...iad_sync.service' -----                                                                
================                                                                                                           
PASS |███████████████████████████████████████████████████████████████████| 100% (1/1) [00:00<00:00,  1.93hosts/s]          
FAIL |                                                                           |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl start ...iad_sync.service'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
Generating the DNS records from Netbox data. It will take a couple of minutes.
----- OUTPUT of 'cd /tmp && runus...e asset tag one"' -----                                                                
2021-02-12 22:56:23,745 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox                           
2021-02-12 23:02:15,871 [ERROR] Failed to run                                                                              
Traceback (most recent call last):                                                                                         
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 686, in main                                     
    batch_status, ret_code = run_commit(args, config, tmpdir)                                                              
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 590, in run_commit                               
    netbox.collect()                                                                                                       
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 156, in collect                                  
    self._collect_device(device, True)                                                                                     
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 197, in _collect_device                          
    if self.addresses[primary.id].dns_name:                                                                                
KeyError: 6398                                                                                                             
================                                                                                                           
PASS |                                                                           |   0% (0/1) [05:53<?, ?hosts/s]          
FAIL |██████████████████████████████████████████████████████████████████| 100% (1/1) [05:53<00:00, 353.00s/hosts]
100.0% (1/1) of nodes failed to execute command 'cd /tmp && runus...e asset tag one"': netbox1001.wikimedia.org
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'cd /tmp && runus...e asset tag one"'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Failed to run the sre.dns.netbox cookbook
Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 365, in run
    dns_netbox_run(dns_netbox_args, spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/dns/netbox.py", line 73, in run
    results = netbox_host.run_sync(command, is_safe=True)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 475, in run_sync
    batch_sleep=batch_sleep, is_safe=is_safe)
  File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 637, in _execute
    raise RemoteExecutionError(ret, 'Cumin execution failed')
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
**Failed to run the sre.dns.netbox cookbook**: Cumin execution failed (exit_code=2)
ERROR: some step failed, check the task updates.
Updated Phabricator task T274023
END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)

Event Timeline

Correct me if I'm wrong, but is this the VM that is replacing another VM of the same name? There might be some assumptions in Netbox about how things are connected.

It's just an attempt to remove an existing VM (after reimaging the existing VM resulted in it not coming back from reboot).

Mentioned in SAL (#wikimedia-operations) [2021-02-13T00:26:49Z] <mutante> ganeti - attempting to recreate VM mwdebug1002 with cookbook that was previously deleted manually (T274689 T274023)

Summary:

  • existing VM, actually very old (from 2016), worked fine
  • tried to install a new distro version on it, which was no problem on other VMs in codfw, but this one simply did not come back from reboot
  • checked the console: nothing; gnt-instance said the status was UP, Icinga disagreed and said it was down, and I couldn't SSH to it
  • tried to restart it again: nothing on the console, nothing happened
  • decided to just delete it, using the decom script I am supposed to use for removing VMs
  • the decom script failed with the errors reported here
  • manually deleted it with gnt-instance, to then attempt to recreate it with makevm under the same name
  • the makevm cookbook suggests adding a public IP even though I said I need a private one:
sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 4 --disk 50 --network private eqiad_A mwdebug1002.eqiad.wmnet
+mwdebug1002                              1H IN A 208.80.154.6
+mwdebug1002                              1H IN AAAA 2620:0:861:1:208:80:154:6
+mwdebug1002                              1H IN AAAA 2620:0:861:101:10:64:0:46

I ABORT because that is clearly wrong; it's not supposed to have a public IP.

I am now stuck on how to properly resolve this.

I'm doing a little looking around on this.

The last part about the public IP might have just been due to the ordering of the parameters I passed to makevm... give me a minute, I'm trying that one more time.

Nope, it wasn't. It is trying to assign a public IP again.

Cas is cleaning up Netbox and we will try it again.

+mwdebug1002                              1H IN A 10.64.0.93                                                                                
+mwdebug1002                              1H IN A 208.80.154.6                                                                              
+mwdebug1002                              1H IN AAAA 2620:0:861:1:208:80:154:6                                                              
+mwdebug1002                              1H IN AAAA 2620:0:861:101:10:64:0:46                                                              
+mwdebug1002                              1H IN AAAA 2620:0:861:101:10:64:0:93                                                              
 mwdebug1003                              1H IN A 10.64.32.9                                                                                
 mwdebug1003                              1H IN AAAA 2620:0:861:103:10:64:32:9

I've successfully got makevm to work as expected after deleting the IP addresses for mwdebug1002 that were left behind when the DNS generation failed at the decom step. That code failure in the DNS generation may be some VM/physical confusion, but it's not obvious from the code why this would have happened.
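To illustrate the failure mode, here is a minimal, self-contained reconstruction of the pattern in the traceback, with a defensive .get() guard; the collect_device function and its dict-based records are my own stand-ins, not the actual generate_dns_snippets.py code:

import logging

logger = logging.getLogger(__name__)

# Hypothetical reconstruction based only on the traceback above: collect()
# builds a dict of address id -> record, and _collect_device() then looks
# up each device's primary IP in it. If the decommission deletes the VM's
# addresses from Netbox between those two reads, the primary IP id (6398
# here) is missing from the dict and a plain [] lookup raises KeyError.
def collect_device(addresses, device):
    primary = device["primary_ip"]
    record = addresses.get(primary["id"])  # defensive .get() instead of []
    if record is None:
        logger.warning("primary IP %d of %s not in collected addresses, skipping",
                       primary["id"], device["name"])
        return
    if record["dns_name"]:
        print(record["dns_name"])

# Reproduces the race: address 6398 was deleted mid-run, so it is absent.
collect_device({}, {"name": "mwdebug1002", "primary_ip": {"id": 6398}})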

To be clear about the timeline:

  • Daniel attempted to reimage the Ganeti host, which failed
  • Attempted to reboot the Ganeti host, which failed
  • Attempted to decom the Ganeti host, which mostly worked (the host was removed, but DNS generation failed)
  • Attempted to create a new Ganeti host under the same name, which added a public IP address for some reason, and was aborted at DNS generation due to the weird diff above
  • Repeated the attempt, aborted for the same reason
  • I removed the IP addresses from Netbox and reattempted makevm, which worked.

So I think Daniel is unblocked, but there are open questions:

  • Why did decom fail for this box?
  • Why would makevm allocate a public address if the private address already existed, and/or why would it do this at all when --network private is passed? Or did this actually happen?

There's also a notable UX issue with makevm: when the operator is prompted to review the diff from the DNS generation, aborting does not actually clean up the changes made up to that point, so the addresses have already been allocated whether you like it or not, which is not the expected behavior.

This is also not an expected use case (basically recreating the same box), but I think it's a reasonable one, and it currently may always necessitate some manual cleanup of leftover IPs. It might be better if makevm offered an option to create a new VM with existing IPs. There could be safeguards, such as checking that the IP is currently unconnected to any VM and that no Ganeti host exists under the requested name, but it seems a reasonable compromise.
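As a sketch of what such a safeguard could look like, using the pynetbox client (the calls below are standard pynetbox API, but the URL, token, short-name handling, and the idea of wiring this into makevm are all assumptions on my part):

import pynetbox

def can_reuse_name(nb, fqdn):
    # Refuse if a VM already exists in Netbox under the requested name
    # (assumes Netbox stores the short hostname for VMs).
    hostname = fqdn.split(".")[0]
    if nb.virtualization.virtual_machines.get(name=hostname) is not None:
        return False
    # Leftover IPs are only reusable if they are no longer assigned to
    # any interface (assigned_object is None on Netbox >= 2.9).
    leftovers = nb.ipam.ip_addresses.filter(dns_name=fqdn)
    return all(ip.assigned_object is None for ip in leftovers)

nb = pynetbox.api("https://netbox.example.org", token="redacted")  # placeholders
print(can_reuse_name(nb, "mwdebug1002.eqiad.wmnet"))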

Yes, adding a revert of the Netbox changes on failure to the makevm cookbook was already on the TODO list. I didn't check whether there is already a task for it, but it's something we should definitely do.
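For reference, such a revert could follow the usual pattern of remembering every object created and deleting them in reverse order on failure; this is a minimal sketch with stand-in helpers, not the actual cookbook code:

class AbortError(Exception):
    """Raised when the operator answers 'abort' at the DNS diff prompt."""

def allocate_ips(fqdn, network):
    # Stand-in for the real Netbox IP allocation; it would return pynetbox
    # records, each of which exposes a .delete() method.
    raise NotImplementedError

def make_vm_with_rollback(fqdn):
    created = []  # Netbox records created so far, in creation order
    try:
        created.extend(allocate_ips(fqdn, network="private"))
        # ... create the VM, generate the DNS diff, prompt the operator ...
    except Exception:
        # Revert in reverse order so dependent objects are removed first,
        # leaving no leftover IPs behind for the next attempt.
        for record in reversed(created):
            record.delete()
        raise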

@crusnov Anything I should do for this task?

For the record: yes, I am unblocked, and this issue did NOT show up when I reimaged mwdebug1001, simply because that VM came back from reboot as normally expected, so nothing triggered an attempt to delete a VM.

The issue that remains, though, is whether this will happen again when the next VM is decommissioned.

Change 668505 had a related patch set uploaded (by Volans; owner: Volans):
[operations/cookbooks@master] sre.hosts.decommission: temporary fix for Netbox

https://gerrit.wikimedia.org/r/668505

Change 668505 merged by jenkins-bot:
[operations/cookbooks@master] sre.hosts.decommission: temporary fix for Netbox

https://gerrit.wikimedia.org/r/668505

Volans claimed this task.

The additional sleep in the above patch should have worked around the issue. Resolving for now; feel free to reopen if it happens again.
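For context, the shape of the workaround is roughly the following; the exact placement and the 60s duration are assumptions based on this thread, not taken from the merged patch:

import time

def run_dns_netbox_with_settle(dns_netbox_run, dns_netbox_args, spicerack, settle=60):
    # Give the forced Ganeti -> Netbox sync time to settle before
    # generating DNS snippets, so generate_dns_snippets does not read
    # Netbox mid-update and hit the KeyError seen above. The 60s default
    # is illustrative only.
    time.sleep(settle)
    return dns_netbox_run(dns_netbox_args, spicerack)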