
decommission cloudelastic100[1-4].wikimedia.org
Closed, Resolved | Public | Request

Description

This task will track the decommission-hardware of servers cloudelastic100[1-4].wikimedia.org
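
For anyone scripting against this host list, the bracket notation in the title can be expanded into individual FQDNs. The sketch below is a minimal illustration that handles a single numeric `[a-b]` range only; `expand_host_range` is a hypothetical helper written for this note, not the parser Cumin itself uses.

```python
import re


def expand_host_range(pattern: str) -> list[str]:
    """Expand one numeric `[a-b]` range in a host pattern into full names.

    Hypothetical helper for illustration; supports a single range only.
    """
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        # No range present: the pattern is already a single host name.
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    width = len(match.group(1))  # keep any zero padding used in the range
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(start, end + 1)]


hosts = expand_host_range("cloudelastic100[1-4].wikimedia.org")
for host in hosts:
    print(host)
```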

Note: The decom cookbooks were run, but there was one small failure: cloudelastic1001 probably failed to fully wipe its disks. See https://phabricator.wikimedia.org/T357780#9561306 and/or https://phabricator.wikimedia.org/P57418:

wipefs: error: /dev/sdg: probing initialization failed: Read-only file system
wipefs: error: /dev/sdh: probing initialization failed: Read-only file system
wipefs: error: /dev/sdi: probing initialization failed: Read-only file system
/dev/sdj: 2 bytes were erased at offset 0x000001fe (dos): 55 aa
wipefs: error: /dev/sdj1: probing initialization failed: No such device or address
wipefs: error: /dev/sdk: probing initialization failed: Read-only file system
/dev/sdl: 2 bytes were erased at offset 0x000001fe (dos): 55 aa
wipefs: error: /dev/sdl1: probing initialization failed: No such device or address
**Failed to wipe swraid, partition-table and filesystem signatures, manual intervention required to make it unbootable**: Cumin execution failed (exit_code=2)
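
To see at a glance which devices still need manual attention, the wipefs output above can be grouped by outcome. This is a rough sketch: the regular expressions are assumptions fitted to the lines pasted in this task, not a general wipefs parser, and `summarize_wipefs` is a hypothetical helper.

```python
import re

# The wipefs lines quoted in the task description, copied verbatim.
WIPEFS_LOG = """\
wipefs: error: /dev/sdg: probing initialization failed: Read-only file system
wipefs: error: /dev/sdh: probing initialization failed: Read-only file system
wipefs: error: /dev/sdi: probing initialization failed: Read-only file system
/dev/sdj: 2 bytes were erased at offset 0x000001fe (dos): 55 aa
wipefs: error: /dev/sdj1: probing initialization failed: No such device or address
wipefs: error: /dev/sdk: probing initialization failed: Read-only file system
/dev/sdl: 2 bytes were erased at offset 0x000001fe (dos): 55 aa
wipefs: error: /dev/sdl1: probing initialization failed: No such device or address
"""

# Patterns are assumptions based only on the message shapes seen above.
ERROR_RE = re.compile(
    r"^wipefs: error: (?P<dev>/dev/[^:\s]+): probing initialization failed: (?P<reason>.+)$"
)
ERASED_RE = re.compile(r"^(?P<dev>/dev/[^:\s]+): \d+ bytes were erased at offset")


def summarize_wipefs(log: str) -> tuple[dict[str, str], list[str]]:
    """Return ({device: failure reason}, [devices with erased signatures])."""
    failed: dict[str, str] = {}
    erased: list[str] = []
    for line in log.splitlines():
        error = ERROR_RE.match(line)
        if error:
            failed[error["dev"]] = error["reason"]
            continue
        wiped = ERASED_RE.match(line)
        if wiped:
            erased.append(wiped["dev"])
    return failed, erased


failed, erased = summarize_wipefs(WIPEFS_LOG)
read_only = [dev for dev, reason in failed.items() if reason == "Read-only file system"]
print("signatures erased:", erased)
print("read-only, not wiped:", read_only)
```

On this log, the devices reported as read-only are the ones whose signatures were not wiped. The `sdj1` and `sdl1` errors come right after the partition table on the parent disk was erased.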

cloudelastic1001.wikimedia.org

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry, replace with role(spare::system). Recommended to ensure services are offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.
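
As a sketch of the cookbook step in the list above, the snippet below assembles the command line for each of the four hosts using the form quoted in the checklist. It only prints the commands and does not run them. `build_decom_command` is a hypothetical helper, and `TASK_ID` is a placeholder that must be replaced with the real Phabricator task ID.

```python
import shlex


def build_decom_command(fqdn: str, task: str) -> list[str]:
    """Build argv for: cookbook sre.hosts.decommission <host fqdn> -t <phab task>."""
    return ["cookbook", "sre.hosts.decommission", fqdn, "-t", task]


# The four hosts tracked by this task.
hosts = [f"cloudelastic100{n}.wikimedia.org" for n in range(1, 5)]

# Placeholder only (assumption): substitute the actual task ID before use.
TASK_ID = "T000000"

commands = [shlex.join(build_decom_command(host, TASK_ID)) for host in hosts]
for command in commands:
    print(command)
```

Building the argument list first and joining it with `shlex.join` keeps the quoting correct if the output is pasted into a shell on the cumin host.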

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age: systems under 5 years old are reclaimed to spare, systems over 5 years old are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.
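
The age rule in the DC-Ops steps above can be written as a small decision function. This is a minimal sketch under stated assumptions: the checklist does not say what happens at exactly 5 years, so this version treats that case as a decommission, and the dates in the usage lines are invented for illustration, not the real purchase dates of these hosts.

```python
from datetime import date


def disposition(purchase_date: date, today: date, threshold_years: int = 5) -> str:
    """Return 'reclaim to spare' for systems younger than the threshold,
    otherwise 'decommission'.

    Assumption: a system exactly `threshold_years` old is decommissioned.
    """
    try:
        cutoff = today.replace(year=today.year - threshold_years)
    except ValueError:
        # today is Feb 29 and the cutoff year is not a leap year.
        cutoff = today.replace(year=today.year - threshold_years, day=28)
    return "reclaim to spare" if purchase_date > cutoff else "decommission"


# Hypothetical purchase dates, for illustration only.
print(disposition(date(2020, 1, 1), today=date(2024, 2, 21)))
print(disposition(date(2018, 6, 1), today=date(2024, 2, 21)))
```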

cloudelastic1002.wikimedia.org

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry, replace with role(spare::system). Recommended to ensure services are offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age: systems under 5 years old are reclaimed to spare, systems over 5 years old are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cloudelastic1003.wikimedia.org

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry, replace with role(spare::system). Recommended to ensure services are offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age: systems under 5 years old are reclaimed to spare, systems over 5 years old are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cloudelastic1004.wikimedia.org

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry, replace with role(spare::system). Recommended to ensure services are offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age: systems under 5 years old are reclaimed to spare, systems over 5 years old are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

RKemper updated the task description.
VRiley-WMF changed the task status from Open to In Progress. Feb 21 2024, 5:44 PM
VRiley-WMF updated the task description.

These servers have been unracked, and the decommission script has been run on them.

@VRiley-WMF In netbox I see cloudelastic1003 still listed as decommissioning, whereas the other cloudelastic hosts are marked as Offline. Is it just the step to set netbox status to Offline that we're missing or are there other steps that still need to be run on cloudelastic1003 as well?

@RKemper Thanks for bringing this up! I missed running the script for this device. It's been run and decommissioned.


Wow, blazing fast turnaround! Much appreciated :)