decommission cloudservices1003.wikimedia.org
Closed, Resolved | Public | Request

Description

This task will track the decommission-hardware of server cloudservices1003.wikimedia.org.

With the launch of updates to the decom cookbook, the majority of these steps can be handled by the service owners directly. The DC Ops team only gets involved once the system has been fully removed from service and powered down by the decommission cookbook.

cloudservices1003.wikimedia.org

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the host from site.pp and replace it with role(spare::system). This is recommended to ensure services are offline, but not strictly required as long as the decom cookbook below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to DC ops team member and site project (ops-sitename) depending on site of server
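
As an illustration of the cookbook step above, here is a minimal sketch that assembles the command line and rejects malformed host names (for example an accidental double dot in the FQDN) before anything is run. The helper name build_decom_command and the task ID T123456 are hypothetical and used only for this example; only the command shape `cookbook sre.hosts.decommission <host fqdn> -t <phab task>` comes from the checklist.

```python
import re
import shlex

# Hypothetical helper for illustration; it is not part of the decom cookbook.
_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")
_TASK = re.compile(r"T\d+")


def build_decom_command(fqdn: str, task_id: str) -> str:
    """Return the decom cookbook command line for one host.

    Raises ValueError if the FQDN has an empty or malformed label
    (e.g. "wikimedia..org") or the task ID is not of the form T<number>.
    """
    labels = fqdn.lower().split(".")
    if len(labels) < 2 or not all(_LABEL.fullmatch(label) for label in labels):
        raise ValueError(f"not a valid FQDN: {fqdn!r}")
    if not _TASK.fullmatch(task_id):
        raise ValueError(f"not a valid Phabricator task ID: {task_id!r}")
    # Command shape taken from the checklist step above.
    return shlex.join(["cookbook", "sre.hosts.decommission", fqdn, "-t", task_id])


if __name__ == "__main__":
    # T123456 is a placeholder task ID, not a real task.
    print(build_decom_command("cloudservices1003.wikimedia.org", "T123456"))
```

The function only builds and validates the string; running it is still done by hand on the cumin host.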

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age: systems under 5 years old are reclaimed to spare, systems over 5 years old are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.
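
The age rule in the DC-Ops steps can be sketched as a small function. This is only an illustration: the names age_in_years and disposition are hypothetical, the dates in the usage lines are made up, and the checklist does not say what happens at exactly 5 years, so this sketch assumes such a host is decommissioned.

```python
from datetime import date

RECLAIM_AGE_YEARS = 5  # threshold taken from the checklist above


def age_in_years(purchased: date, today: date) -> int:
    """Whole years elapsed between the purchase date and today."""
    years = today.year - purchased.year
    # Subtract one if this year's anniversary has not been reached yet.
    if (today.month, today.day) < (purchased.month, purchased.day):
        years -= 1
    return years


def disposition(purchased: date, today: date) -> str:
    """Return 'reclaim to spare' for hosts under 5 years old, else 'decommission'.

    Assumption: a host that is exactly 5 years old is decommissioned.
    """
    if age_in_years(purchased, today) < RECLAIM_AGE_YEARS:
        return "reclaim to spare"
    return "decommission"


if __name__ == "__main__":
    # Made-up purchase dates, for illustration only.
    print(disposition(date(2019, 6, 1), date(2022, 9, 5)))
    print(disposition(date(2016, 1, 15), date(2022, 9, 5)))
```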

Event Timeline

Change 826645 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Remove refs to cloudservices1003

https://gerrit.wikimedia.org/r/826645

Change 826645 merged by Andrew Bogott:

[operations/puppet@production] Remove refs to cloudservices1003

https://gerrit.wikimedia.org/r/826645

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudservices1003

  • cloudservices1003 (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Andrew updated the task description.
Andrew added a project: ops-eqiad.

@Cmjohnson, this is another host that will need its drives wiped, as the cookbook seems to be bad at that lately. Thanks!

The host was not reachable by cumin at the time of decom, see T316292#8189015.

It seems that the cookbook either did not clean up PuppetDB and DebMonitor, or they were repopulated right after, as the host still shows up there:

https://debmonitor.wikimedia.org/hosts/cloudservices1003.wikimedia.org
https://puppetboard.wikimedia.org/node/cloudservices1003.wikimedia.org

You can see that the timestamp shown there falls during the decom.

This is triggering some alerts:
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netbox1002&service=Netbox+report+puppetdb_physical
Details of the report:
https://netbox.wikimedia.org/extras/reports/results/3712993/

So we should figure out how to clean those up (thanks @ayounsi for the notice on IRC).

The decom cookbook is meant to be idempotent, so you can safely re-run it. That said, I can look at the logs of the previous run next week to check whether something went obviously wrong. For example, if the host wasn't powered off, that could explain its re-appearance in PuppetDB and DebMonitor.

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudservices1003

  • cloudservices1003 (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • No DNS record found for the mgmt interface cloudservices1003.mgmt.eqiad.wmnet, trying the asset tag one: wmf7225.mgmt.eqiad.wmnet
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

It looks like the decom script can't find it in puppetdb even though the alert says it is in puppetdb.
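
One way to cross-check the alert against PuppetDB directly is to ask PuppetDB for the node record. Below is a minimal sketch using the PuppetDB v4 query API's nodes endpoint (/pdb/query/v4/nodes/<certname>). The base URL http://localhost:8080 and the helper names are assumptions for illustration; a production PuppetDB will normally need TLS and client credentials, which this sketch leaves out.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def node_url(base_url: str, certname: str) -> str:
    """Build the PuppetDB v4 URL for a single node record."""
    quoted = urllib.parse.quote(certname, safe="")
    return f"{base_url.rstrip('/')}/pdb/query/v4/nodes/{quoted}"


def describe_node(node) -> str:
    """Classify a decoded node record; None means PuppetDB returned 404."""
    if node is None:
        return "absent"
    # Deactivated or expired nodes carry a timestamp in these fields.
    if node.get("deactivated") or node.get("expired"):
        return "deactivated"
    return "active"


def fetch_node(base_url: str, certname: str):
    """Fetch the node record, or None if PuppetDB does not know the host."""
    try:
        with urllib.request.urlopen(node_url(base_url, certname), timeout=10) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


if __name__ == "__main__":
    # Assumed local PuppetDB endpoint, for illustration only.
    record = fetch_node("http://localhost:8080", "cloudservices1003.wikimedia.org")
    print(describe_node(record))
```

A result of "active" after the cookbook has run would suggest the host reported in again after being removed, which matches the repopulation theory discussed above.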

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudservices1003.wikimedia.org

  • cloudservices1003.wikimedia.org (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • No DNS record found for the mgmt interface cloudservices1003.mgmt.eqiad.wmnet, trying the asset tag one: wmf7225.mgmt.eqiad.wmnet
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: cloudservices1003.wikimedia.org

  • cloudservices1003.wikimedia.org (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • No DNS record found for the mgmt interface cloudservices1003.mgmt.eqiad.wmnet, trying the asset tag one: wmf7225.mgmt.eqiad.wmnet
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Andrew added a subscriber: Cmjohnson.

I am now officially out of ideas :) Over to you, @Volans!

Volans removed Volans as the assignee of this task. Sep 5 2022, 9:44 AM

@Andrew what is the issue that you're still seeing? It looks good to me.

I see that the host is correctly powered off:

>>> i.power_status()
INFO:spicerack.ipmi:Running IPMI command: ipmitool -I lanplus -H wmf7225.mgmt.eqiad.wmnet -U root -E chassis power status
'off'
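
For reference, the same power check can be sketched without spicerack by calling ipmitool directly with the arguments shown in the log line above. The function names here are hypothetical, and the parser assumes ipmitool prints a line of the form "Chassis Power is off"; the -E flag makes ipmitool read the password from the IPMI_PASSWORD environment variable.

```python
import subprocess


def power_status_command(mgmt_host: str, user: str = "root") -> list:
    """Argument list mirroring the ipmitool invocation in the log above."""
    return ["ipmitool", "-I", "lanplus", "-H", mgmt_host, "-U", user,
            "-E", "chassis", "power", "status"]


def parse_power_status(output: str) -> str:
    """Extract 'on' or 'off' from ipmitool output.

    Assumption: the output looks like "Chassis Power is off".
    """
    text = output.strip().lower()
    prefix = "chassis power is "
    if not text.startswith(prefix):
        raise ValueError(f"unexpected ipmitool output: {output!r}")
    return text[len(prefix):]


def power_status(mgmt_host: str) -> str:
    """Run ipmitool against the management interface and return the power state."""
    result = subprocess.run(power_status_command(mgmt_host),
                            capture_output=True, text=True, check=True)
    return parse_power_status(result.stdout)


if __name__ == "__main__":
    # Needs ipmitool installed, IPMI_PASSWORD set, and network access to the mgmt host.
    print(power_status("wmf7225.mgmt.eqiad.wmnet"))
```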

It's not present, as expected, in:

and it's marked as decommissioning in Netbox: https://netbox.wikimedia.org/dcim/devices/1596/

Jclark-ctr claimed this task.
Jclark-ctr updated the task description.

Finished Decom process