Decommissioning a VM failed with a key error when generating DNS
Closed, Resolved, Public

Description

After having failed to run the decommissioning script a couple of days ago due to PEBKAC on my side, I've failed again today.

The command was:

sudo cookbook sre.hosts.decommission acrux.codfw.wmnet -t T277191

The issue at hand is:

Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
Sleeping for 3 minutes to get Netbox caches in sync
Generating the DNS records from Netbox data. It will take a couple of minutes.
----- OUTPUT of 'cd /tmp && runus...e asset tag one"' -----
2021-03-26 09:36:07,955 [INFO] Gathering devices, interfaces, addresses and prefixes from Netbox
2021-03-26 09:38:24,028 [WARNING] Device frqueue1002 of IP 10.64.40.204/26 not in devices, skipping.
2021-03-26 09:38:24,032 [WARNING] Device phab1003 of IP 10.65.1.16/16 not in devices, skipping.
2021-03-26 09:38:24,041 [ERROR] Failed to run
Traceback (most recent call last):
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 686, in main
    batch_status, ret_code = run_commit(args, config, tmpdir)
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 590, in run_commit
    netbox.collect()
  File "/srv/deployment/netbox-extras/dns/generate_dns_snippets.py", line 170, in collect
    address.assigned_object = self.virtual_interfaces[address.assigned_object_id]
KeyError: 19
================

A couple of informational points:

  • The VMs were already powered off
  • puppet node deactivate had already been run manually.

Event Timeline

Interestingly, rerunning the cookbook for a different host succeeded just fine and even merged the DNS removal of acrux.

akosiaris renamed this task from Decommissioning a VM failed with a key error 19 when generating DNS to Decommissioning a VM failed with a key error when generating DNS. Mar 26 2021, 9:45 AM

I think this is another issue with the Netbox API cache. Basically, we get all the data at the start of the cookbook and then merge it in Python to speed up the operations. I can add a try/except there and try to gather the data again on failure.
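For illustration, here is a minimal sketch of what such a retry could look like. It is not the actual generate_dns_snippets.py code: the `fetch` callable, the `collect_with_retry` helper, the `StaleCacheError` exception and the dict-based data shapes are all hypothetical stand-ins, and the only thing taken from the traceback is the join on `assigned_object_id` that raised the KeyError.

```python
import time


class StaleCacheError(Exception):
    """Raised when the collected data is still inconsistent after all attempts."""


def collect_with_retry(fetch, attempts=3, delay=0.0):
    """Fetch interfaces and addresses, join them, and refetch on a KeyError.

    `fetch` is a hypothetical callable returning (interfaces, addresses):
    a dict mapping interface id -> interface, and a list of address dicts
    that each carry an `assigned_object_id`.
    """
    for attempt in range(1, attempts + 1):
        interfaces, addresses = fetch()
        try:
            # Same kind of join as in the traceback: an address pointing at an
            # interface id missing from the fetched data raises KeyError.
            return [
                {**address, "assigned_object": interfaces[address["assigned_object_id"]]}
                for address in addresses
            ]
        except KeyError as error:
            if attempt == attempts:
                raise StaleCacheError(
                    f"interface {error} still missing after {attempts} attempts"
                ) from error
            # Give the cache some time to catch up before gathering again.
            time.sleep(delay)
```

The whole dataset is gathered again on failure rather than patching the single missing key, on the assumption that a stale cache may have left other objects inconsistent too. The number of attempts is bounded so that a genuine data problem still surfaces as an error instead of looping forever.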

Volans triaged this task as Medium priority.

Boldly resolving as we've not seen this recently and Netbox has been upgraded multiple times since then.