500 generated by Netbox while running the decom cookbook
Open, Medium, Public

Description

On the cookbook side:

elukey@cumin1001:~$ sudo cookbook sre.hosts.decommission analytics1054.eqiad.wmnet -t T267932
START - Cookbook sre.hosts.decommission
ATTENTION: destructive action for 1 hosts: analytics1054.eqiad.wmnet
Are you sure to proceed?
Type "done" to proceed
> done
Looking for matches in puppetmaster1001.eqiad.wmnet:/var/lib/git/operations/puppet
modules/install_server/files/dhcpd/linux-host-entries.ttyS1-115200:    fixed-address analytics1054.eqiad.wmnet;
Looking for matches in puppetmaster1001.eqiad.wmnet:/srv/private
Looking for matches in deploy1001.eqiad.wmnet:/srv/mediawiki-staging
Found match(es) in the Puppet or mediawiki-config repositories (see above), proceed anyway?
Type "done" to proceed
> done
Looking for Kerberos credentials on KDC kadmin node.
HTTP/analytics1054.eqiad.wmnet@WIKIMEDIA
hdfs/analytics1054.eqiad.wmnet@WIKIMEDIA
yarn/analytics1054.eqiad.wmnet@WIKIMEDIA
Please follow this guide to drop unused credentials: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Delete_Kerberos_principals_and_keytabs_when_a_host_is_decommissioned
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: ['analytics1054.eqiad.wmnet']
Downtimed host on Icinga
Management Password:
Found physical host
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: ['analytics1054.mgmt.eqiad.wmnet']
Downtimed management interface on Icinga
**Failed to wipe bootloaders, manual intervention required to make it unbootable**: Cumin execution failed (exit_code=2)
Running IPMI command: ipmitool -I lanplus -H analytics1054.mgmt.eqiad.wmnet -U root -E chassis power off
Powered off
Disable and reset potential vlans on asw2-a3-eqiad:ge-3/0/28 for local eno1
Delete IP 10.64.5.17/24 on eno1
Delete IP 2620:0:861:104:10:64:5:17/64 on eno1
Failed to call 'cookbooks.sre.hosts.decommission.update_netbox' [1/4, retrying in 3.00s]: The request failed with code 500 Internal Server Error but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.
Disable and reset potential vlans on asw2-a3-eqiad:ge-3/0/28 for local eno1
Failed to call 'cookbooks.sre.hosts.decommission.update_netbox' [2/4, retrying in 9.00s]: The request failed with code 500 Internal Server Error but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.
Disable and reset potential vlans on asw2-a3-eqiad:ge-3/0/28 for local eno1
Failed to call 'cookbooks.sre.hosts.decommission.update_netbox' [3/4, retrying in 27.00s]: The request failed with code 500 Internal Server Error but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.
Disable and reset potential vlans on asw2-a3-eqiad:ge-3/0/28 for local eno1
Host steps raised exception
Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 343, in run
    dcs.add(_decommission_host(fqdn, spicerack, reason))
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 146, in _decommission_host
    update_netbox(netbox, netbox_data, spicerack.dry_run)
  File "/usr/lib/python3/dist-packages/spicerack/decorators.py", line 103, in wrapper
    return func(*args, **kwargs)  # type: ignore
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 231, in update_netbox
    device.save()
  File "/usr/lib/python3/dist-packages/pynetbox/core/response.py", line 391, in save
    if req.patch({i: serialized[i] for i in diff}):
  File "/usr/lib/python3/dist-packages/pynetbox/core/query.py", line 409, in patch
    return self._make_call(verb="patch", data=data)

On the Netbox side:

[2020-11-24T08:43:18] Internal Server Error: /api/dcim/devices/253/
Traceback (most recent call last):
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/backends/base/base.py", line 243, in _commit
    return self.connection.commit()
psycopg2.errors.ForeignKeyViolation: insert or update on table "dcim_device" violates foreign key constraint "dcim_device_primary_ip4_id_2ccd943a_fk_ipam_ipaddress_id"
DETAIL:  Key (primary_ip4_id)=(3203) is not present in table "ipam_ipaddress".


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/viewsets.py", line 114, in view
    return self.dispatch(request, *args, **kwargs)
  File "./utilities/api.py", line 329, in dispatch
    return super().dispatch(request, *args, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/views.py", line 505, in dispatch
    response = self.handle_exception(exc)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/views.py", line 465, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/views.py", line 476, in raise_uncaught_exception
    raise exc
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/views.py", line 502, in dispatch
    response = handler(request, *args, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/mixins.py", line 82, in partial_update
    return self.update(request, *args, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/mixins.py", line 68, in update
    self.perform_update(serializer)
  File "./utilities/api.py", line 368, in perform_update
    return super().perform_update(serializer)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/mixins.py", line 78, in perform_update
    serializer.save()
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/rest_framework/serializers.py", line 207, in save
    self.instance = self.update(self.instance, validated_data)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/taggit_serializer/serializers.py", line 101, in update
    instance, validated_data)
  File "./extras/api/customfields.py", line 203, in update
    instance.custom_fields = custom_fields
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/cacheops/transaction.py", line 82, in __exit__
    self._no_monkey.__exit__(self, exc_type, exc_value, traceback)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/transaction.py", line 232, in __exit__
    connection.commit()
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/backends/base/base.py", line 267, in commit
    self._commit()
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/backends/base/base.py", line 243, in _commit
    return self.connection.commit()
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/srv/deployment/netbox/venv/lib/python3.7/site-packages/django/db/backends/base/base.py", line 243, in _commit
    return self.connection.commit()
django.db.utils.IntegrityError: insert or update on table "dcim_device" violates foreign key constraint "dcim_device_primary_ip4_id_2ccd943a_fk_ipam_ipaddress_id"
DETAIL:  Key (primary_ip4_id)=(3203) is not present in table "ipam_ipaddress".
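For illustration only, the constraint failure above can be reproduced in miniature. This is a sketch under assumptions, not Netbox's code: it uses SQLite from the Python standard library instead of PostgreSQL, the two tables are stripped-down stand-ins that reuse the names from the traceback, and the "stale copy" scenario is one plausible reading of the error (a device written back while still pointing at an IP row that has already been deleted). The function name `reproduce_fk_violation` is hypothetical.

```python
import sqlite3


def reproduce_fk_violation():
    """Write back a device that still references a deleted IP row."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE ipam_ipaddress (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE dcim_device ("
        " id INTEGER PRIMARY KEY,"
        " status TEXT,"
        " primary_ip4_id INTEGER"
        "   REFERENCES ipam_ipaddress(id) ON DELETE SET NULL)"
    )
    conn.execute("INSERT INTO ipam_ipaddress (id) VALUES (3203)")
    conn.execute("INSERT INTO dcim_device VALUES (253, 'active', 3203)")
    conn.commit()

    # A reader takes a copy of the device while the IP still exists.
    stale_primary_ip4 = conn.execute(
        "SELECT primary_ip4_id FROM dcim_device WHERE id = 253"
    ).fetchone()[0]

    # The IP is deleted; the database clears the device's reference.
    conn.execute("DELETE FROM ipam_ipaddress WHERE id = 3203")
    conn.commit()

    # Saving the stale copy tries to restore the dangling reference.
    try:
        conn.execute(
            "UPDATE dcim_device SET status = 'decommissioning',"
            " primary_ip4_id = ? WHERE id = 253",
            (stale_primary_ip4,),
        )
        conn.commit()
        outcome = "ok"
    except sqlite3.IntegrityError as exc:
        conn.rollback()
        outcome = f"IntegrityError: {exc}"

    row = conn.execute(
        "SELECT status, primary_ip4_id FROM dcim_device WHERE id = 253"
    ).fetchone()
    conn.close()
    return outcome, row


if __name__ == "__main__":
    print(reproduce_fk_violation())
```

In this sketch the update is rejected and the device keeps its old status, which matches the symptom seen here: the IPs are gone but the device is not marked as decommissioning.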

Event Timeline

Note that the switch port got updated correctly.

Apart from seeing whether those errors go away with the upcoming Netbox upgrade, we could try to invert the steps: mark the device as decommissioning first and then remove the IPs. That way we don't need to modify the object right after removing the IPs, and the Netbox cache can be updated at its own pace.

+1
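The reordering proposed above can be sketched roughly as follows. This is a minimal illustration of the idea, not the code from the cookbook or from any patch: `decommission_in_netbox` and the two stand-in classes are hypothetical, and the only assumption about the real objects is that they expose `save()` and `delete()` methods, as pynetbox records do.

```python
def decommission_in_netbox(device, ip_addresses, dry_run=False):
    """Mark the device as decommissioning first, then delete its IPs.

    `device` needs a writable `status` attribute and a `save()` method;
    each item in `ip_addresses` needs a `delete()` method.
    Returns the list of actions taken (or that would be taken in dry-run).
    """
    actions = ["set status to decommissioning"]
    device.status = "decommissioning"
    if not dry_run:
        # The device is saved while its IPs still exist, so no later
        # write to the device has to follow the IP deletion.
        device.save()

    for ip in ip_addresses:
        actions.append(f"delete IP {ip}")
        if not dry_run:
            ip.delete()
    return actions


# Hypothetical stand-ins used only to show the call order.
class FakeDevice:
    def __init__(self, log):
        self.status = "active"
        self._log = log

    def save(self):
        self._log.append(f"device.save status={self.status}")
        return True


class FakeIP:
    def __init__(self, address, log):
        self.address = address
        self._log = log

    def __str__(self):
        return self.address

    def delete(self):
        self._log.append(f"ip.delete {self.address}")
        return True


if __name__ == "__main__":
    calls = []
    device = FakeDevice(calls)
    ips = [FakeIP("10.64.5.17/24", calls),
           FakeIP("2620:0:861:104:10:64:5:17/64", calls)]
    decommission_in_netbox(device, ips)
    print("\n".join(calls))
```

The point of the ordering is that the last write to the device happens before any IP is deleted, so nothing has to save the device object while a cached copy might still reference a removed address.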

Change 648259 had a related patch set uploaded (by Volans; owner: Volans):
[operations/cookbooks@master] sre.hosts.decommission: try to avoid Netbox issue

https://gerrit.wikimedia.org/r/648259

Change 648259 merged by jenkins-bot:
[operations/cookbooks@master] sre.hosts.decommission: try to avoid Netbox issue

https://gerrit.wikimedia.org/r/648259

Volans triaged this task as Medium priority. Dec 14 2020, 8:43 AM

The above patch should have fixed it for now. Leaving the task open for a bit to see whether the error recurs.