
decommission analytics10[42-57]
Closed, Resolved · Public · Request

Description

This task will track the hardware decommission of servers analytics10[42-57].

With the launch of updates to the decom cookbook, the majority of these steps can be handled by the service owners directly. The DC Ops team only gets involved once the system has been fully removed from service and powered down by the decommission cookbook.

analytics1042

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task> (see the example invocation after this list). This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.
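
For example, with this task's ID (T267932, per the SAL entries below) and the first host's FQDN as it appears in the cookbook output, the invocation from a cumin host would look like: cookbook sre.hosts.decommission analytics1042.eqiad.wmnet -t T267932. This is purely illustrative; the exact runs are recorded in the timeline below.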

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1043

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1044

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1045

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1046

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1047

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1048

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1049

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1050

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1051

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1052

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1053

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1054

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1055

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1056

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

analytics1057

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while the reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host's entry from site.pp and replace it with role(spare::system); recommended to ensure services stay offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal.
  • - remove all remaining puppet references (including role::spare) and all host entries in the puppet repo
  • - remove ALL dns entries except the asset tag mgmt entries.
  • - reassign the task from the service owner to the DC Ops team member for the server's site.

End service owner steps / Begin DC-Ops team steps:

  • - disable switch port / set to asset tag if host isn't being unracked / remove from switch if being unracked.
  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

Event Timeline

elukey updated the task description.

Change 641195 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] cumin: change target for hadoop-worker-canary

https://gerrit.wikimedia.org/r/641195

Change 641195 merged by Elukey:
[operations/puppet@production] cumin: change target for hadoop-worker-canary

https://gerrit.wikimedia.org/r/641195

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1042.eqiad.wmnet

  • analytics1042.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Host steps raised exception: Failed to save Netbox status for host analytics1042 Active -> decommissioning

ERROR: some step on some host failed, check the bolded items above

In the logs I see:

2020-11-16 15:51:34,829 elukey 28288 [ERROR decommission.py:303 in run] Host steps raised exception
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/netbox.py", line 133, in put_host_status
    save_result = host.save()
  File "/usr/lib/python3/dist-packages/pynetbox/core/response.py", line 391, in save
    if req.patch({i: serialized[i] for i in diff}):
  File "/usr/lib/python3/dist-packages/pynetbox/core/query.py", line 409, in patch
    return self._make_call(verb="patch", data=data)
  File "/usr/lib/python3/dist-packages/pynetbox/core/query.py", line 274, in _make_call
    raise RequestError(req)
pynetbox.core.query.RequestError: The request failed with code 500 Internal Server Error but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 300, in run
    dcs.add(_decommission_host(fqdn, spicerack, reason))
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 142, in _decommission_host
    update_netbox(netbox, netbox_data, spicerack.dry_run)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 216, in update_netbox
    netbox.put_host_status(netbox_data['name'], 'Decommissioning')
  File "/usr/lib/python3/dist-packages/spicerack/netbox.py", line 137, in put_host_status
    ) from ex
spicerack.netbox.NetboxAPIError: Failed to save Netbox status for host analytics1042 Active -> decommissioning
2020-11-16 15:51:34,832 elukey 28288 [ERROR actions.py:99 in _action] Host steps raised exception: Failed to save Netbox status for host analytics1042 Active -> decommissioning

And on netbox's main.log:

[2020-11-16T15:51:34] [pid: 22503|app: 0|req: 3446/13332] 127.0.0.1 () {44 vars in 707 bytes} [Mon Nov 16 15:51:34 2020] PATCH /api/dcim/devices/380/ => generated 1970 bytes in 293 msecs (HTTP/1.1 500) 6 headers in 202 bytes (1 switches on core 0)

But can't find more logs. Going to fix this manually. Adding also @Volans :)

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1043.eqiad.wmnet

  • analytics1043.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Failed to wipe bootloaders, manual intervention required to make it unbootable: Cumin execution failed (exit_code=2)
    • Powered off
    • Host steps raised exception: Failed to save Netbox status for host analytics1043 Active -> decommissioning

ERROR: some step on some host failed, check the bolded items above

> But can't find more logs. Going to fix this manually. Adding also @Volans :)

Indeed there are no tracebacks in the logs; we should revisit the logging config for Netbox to make sure we get those at least on 5xx responses.
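
For reference, a minimal sketch of one way to get those tracebacks, assuming Netbox's standard Django LOGGING mechanism in configuration.py; the handler name and log path below are made-up placeholders, not the production setup:

# Hedged sketch, not the production config: route django.request errors
# (the unhandled exceptions behind 5xx responses) to a dedicated file.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "errors": {
            "class": "logging.FileHandler",
            "filename": "/var/log/netbox/errors.log",  # placeholder path
            "level": "ERROR",
        },
    },
    "loggers": {
        "django.request": {"handlers": ["errors"], "level": "ERROR", "propagate": False},
    },
}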

As for the error at hand, I can't repro it by re-running the same code the decommission cookbook runs (netbox.put_host_status(netbox_data['name'], 'Decommissioning')); it works just fine.
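
A minimal standalone sketch of the equivalent update done directly with pynetbox, the library the failing PATCH goes through in the traceback above; the URL, token, and error handling here are illustrative, not the cookbook's actual code:

# Sketch assuming standard pynetbox usage: fetch the device and PATCH its
# status, which is what spicerack's put_host_status ends up doing via save().
import pynetbox

nb = pynetbox.api("https://netbox.wikimedia.org", token="REDACTED")  # placeholder token

device = nb.dcim.devices.get(name="analytics1042")
if device is None:
    raise RuntimeError("device not found in Netbox")

device.status = "decommissioning"  # the target status from the failed step
if not device.save():              # save() issues the PATCH that returned HTTP 500
    raise RuntimeError("Netbox reported no changes saved")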

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1044.eqiad.wmnet

  • analytics1044.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1045.eqiad.wmnet

  • analytics1045.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Host steps raised exception: Failed to save Netbox status for host analytics1045 Active -> decommissioning

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1047.eqiad.wmnet

  • analytics1047.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Host steps raised exception: Failed to save Netbox status for host analytics1047 Active -> decommissioning

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1048.eqiad.wmnet

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1049.eqiad.wmnet

  • analytics1049.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Failed to wipe bootloaders, manual intervention required to make it unbootable: Cumin execution failed (exit_code=2)
    • Powered off
    • Host steps raised exception: The request failed with code 500 Internal Server Error but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: analytics1049.eqiad.wmnet

  • analytics1049.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Host steps raised exception: The request failed with code 500 Internal Server Error but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: analytics1049.eqiad.wmnet

  • analytics1049.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1046.eqiad.wmnet

  • analytics1046.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Host steps raised exception: The requested url: https://netbox.wikimedia.org/api/dcim/devices/analytics1046/ could not be found.

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1050.eqiad.wmnet

  • analytics1050.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Failed to power off, manual intervention required: Remote IPMI for analytics1050.mgmt.eqiad.wmnet failed (exit=1): b''
    • Host steps raised exception: The requested url: https://netbox.wikimedia.org/api/dcim/devices/analytics1050/ could not be found.

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1042.eqiad.wmnet

  • analytics1042.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Host steps raised exception: Invalid management FQDN analytics1042.mgmt.eqiad.wmnet for analytics1042.eqiad.wmnet

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1043.eqiad.wmnet

  • analytics1043.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Failed to power off, manual intervention required: Remote IPMI for analytics1043.mgmt.eqiad.wmnet failed (exit=1): b''
    • Host steps raised exception: A reserved ('id', 'pk', 'limit', 'offset') kwarg was passed. Please remove it try again.

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1047.eqiad.wmnet

  • analytics1047.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1042.eqiad.wmnet

  • analytics1042.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Host steps raised exception: Invalid management FQDN analytics1042.mgmt.eqiad.wmnet for analytics1042.eqiad.wmnet

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1043.eqiad.wmnet

  • analytics1043.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1044.eqiad.wmnet

  • analytics1044.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Host steps raised exception: Invalid management FQDN analytics1044.mgmt.eqiad.wmnet for analytics1044.eqiad.wmnet

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1045.eqiad.wmnet

  • analytics1045.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1046.eqiad.wmnet

  • analytics1046.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1047.eqiad.wmnet

  • analytics1047.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Host steps raised exception: Invalid management FQDN analytics1047.mgmt.eqiad.wmnet for analytics1047.eqiad.wmnet

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1048.eqiad.wmnet

  • analytics1048.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1049.eqiad.wmnet

  • analytics1049.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Host steps raised exception: Invalid management FQDN analytics1049.mgmt.eqiad.wmnet for analytics1049.eqiad.wmnet

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1050.eqiad.wmnet

  • analytics1050.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1051.eqiad.wmnet

  • analytics1051.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Failed to power off, manual intervention required: Remote IPMI for analytics1051.mgmt.eqiad.wmnet failed (exit=1): b''
    • Host steps raised exception: The request failed with code 500 Internal Server Error but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.

ERROR: some step on some host failed, check the bolded items above

Change 642276 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove analytics10[42-50] from puppet (decommed)

https://gerrit.wikimedia.org/r/642276

Change 642276 merged by Elukey:
[operations/puppet@production] Remove analytics10[42-50] from puppet (decommed)

https://gerrit.wikimedia.org/r/642276

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1052.eqiad.wmnet

  • analytics1052.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Failed to wipe bootloaders, manual intervention required to make it unbootable: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1051.eqiad.wmnet

  • analytics1051.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1042.eqiad.wmnet

  • analytics1042.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Host steps raised exception: Invalid management FQDN analytics1042.mgmt.eqiad.wmnet for analytics1042.eqiad.wmnet

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1052.eqiad.wmnet

  • analytics1052.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Host steps raised exception: Invalid management FQDN analytics1052.mgmt.eqiad.wmnet for analytics1052.eqiad.wmnet

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1054.eqiad.wmnet

  • analytics1054.eqiad.wmnet (FAIL)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Failed to wipe bootloaders, manual intervention required to make it unbootable: Cumin execution failed (exit_code=2)
    • Powered off
    • Host steps raised exception: The request failed with code 500 Internal Server Error but more specific details were not returned in json. Check the NetBox Logs or investigate this exception's error attribute.

ERROR: some step on some host failed, check the bolded items above

Mentioned in SAL (#wikimedia-operations) [2020-11-24T09:09:49Z] <elukey> drop principals and keytabs for analytics10[42-57] - T267932

Mentioned in SAL (#wikimedia-analytics) [2020-11-24T09:16:03Z] <elukey> drop principals and keytabs for analytics10[42-57] - T267932

analytics1054 is somehow special: it is alerting in Icinga as an unhandled CRIT (host down) with notifications disabled, but only that one. That is probably because the decom cookbook run for it a week ago failed for some reason.

> analytics1054 is somehow special: it is alerting in Icinga as an unhandled CRIT (host down) with notifications disabled, but only that one. That is probably because the decom cookbook run for it a week ago failed for some reason.

Yes, sorry, my bad: I kept some nodes to test some fixes for the decom cookbook with Riccardo, will follow up on 1054!

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1054.eqiad.wmnet

  • analytics1054.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1055.eqiad.wmnet

  • analytics1055.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1056.eqiad.wmnet

  • analytics1056.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1057.eqiad.wmnet

  • analytics1057.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Found physical host
    • Skipped downtime management interface on Icinga (likely already removed)
    • Unable to connect to the host, wipe of bootloaders will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Change 644478 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] install_server: remove lefovers of analytics105[1-7]

https://gerrit.wikimedia.org/r/644478

Change 644478 merged by Elukey:
[operations/puppet@production] install_server: remove lefovers of analytics105[1-7]

https://gerrit.wikimedia.org/r/644478

elukey updated the task description.

Ready for DCops to decom :)

Change 658087 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove old analytics decommed nodes

https://gerrit.wikimedia.org/r/658087

Change 658087 merged by Elukey:
[operations/puppet@production] Remove old analytics decommed nodes

https://gerrit.wikimedia.org/r/658087

@wiki_willy these are a lot of nodes that can free up space in eqiad :)

I realized that analytics1053 was not decommed for some reason, so I ran the cookbook and got an error while running homer:

Running Homer on asw2-a-eqiad.mgmt.eqiad.wmnet, it takes time ⏳ don't worry
Exception raised while executing cookbook sre.hosts.decommission:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 226, in run
    raw_ret = runner.run()
  File "/usr/lib/python3/dist-packages/spicerack/_module_api.py", line 19, in run
    return self._run(self.args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hosts/decommission.py", line 380, in run
    reason], check=True)
  File "/usr/lib/python3.7/subprocess.py", line 472, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.7/subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.7/subprocess.py", line 1453, in _execute_child
    restore_signals, start_new_session, preexec_fn)
TypeError: expected str, bytes or os.PathLike object, not Reason
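
This TypeError is the usual symptom of passing a non-string object straight into a subprocess argument list. A hedged, self-contained illustration of the bug class and its fix follows (with a stand-in Reason class, not the actual spicerack code; the real fix landed in the cookbook change referenced further down):

# Illustration only: subprocess.run() needs every argument to be
# str/bytes/PathLike, so an object like spicerack's Reason has to be
# stringified before being appended to the command.
import subprocess


class Reason:
    """Stand-in for spicerack's administrative Reason object."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __str__(self) -> str:
        return self.text


reason = Reason("host decommissioned")

# Raises TypeError, matching the traceback above:
#   subprocess.run(["homer", "commit", reason], check=True)
# Works once the Reason is converted explicitly:
subprocess.run(["echo", "homer", "commit", str(reason)], check=True)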

cookbooks.sre.hosts.decommission executed by elukey@cumin1001 for hosts: analytics1053.eqiad.wmnet

  • analytics1053.eqiad.wmnet (FAIL)
    • Failed downtime host on Icinga (likely already removed)
    • Host steps raised exception: Invalid management FQDN analytics1053.mgmt.eqiad.wmnet for analytics1053.eqiad.wmnet

ERROR: some step on some host failed, check the bolded items above

I fixed the homer error via https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/658565, and manually ran homer to set analytics1053's switch port to disabled. The decom can proceed :)

All of the servers have been removed from the racks, the Netbox script was run, and the cookbook was run on cumin.