
decommission druid100[7-8].eqiad.wmnet
Closed, Resolved | Public | Request

Description

This task tracks the decommission-hardware process for servers druid1007.eqiad.wmnet and druid1008.eqiad.wmnet.

With the launch of updates to the decom cookbook, the majority of these steps can be handled by the service owners directly. The DC Ops team only gets involved once the system has been fully removed from service and powered down by the decommission cookbook.

druid1007.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.
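
As an illustration of the cookbook step above, here is a minimal Python sketch that expands the host range from the task title and prints one decommission command per host. It only builds the command strings and does not execute anything. The helper names `expand_hosts` and `decommission_command` are hypothetical, the range expansion is a simplified stand-in (not the parser cumin itself uses), and the task ID T403801 is taken from the SAL entries in the timeline below.

```python
import re
import shlex


def expand_hosts(pattern: str) -> list[str]:
    """Expand one numeric range such as 'druid100[7-8].eqiad.wmnet'.

    Simplified, hypothetical helper: handles a single [start-end] range only.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        # No range in the pattern: treat it as a single host name.
        return [pattern]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding, e.g. [07-08]
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


def decommission_command(fqdn: str, task: str) -> str:
    """Build the command line given in the checklist for one host."""
    return shlex.join(["cookbook", "sre.hosts.decommission", fqdn, "-t", task])


for host in expand_hosts("druid100[7-8].eqiad.wmnet"):
    print(decommission_command(host, "T403801"))
```

The printed lines are meant to be reviewed and run by hand on a cumin host, one host at a time, so each run's result can be checked before the next.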

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age: systems under 5 years old are reclaimed to spare, systems over 5 years old are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.
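
The age rule in the DC-Ops steps above can be sketched in Python as follows. This is an illustration only: the function names and the example dates are hypothetical (they are not the actual purchase dates of these hosts), and treating a system that is exactly five years old as "over" is an assumption, since the checklist does not say which side of the threshold that case falls on.

```python
from datetime import date

RECLAIM_THRESHOLD_YEARS = 5  # threshold stated in the checklist


def age_in_years(purchased: date, today: date) -> int:
    """Whole years elapsed between the purchase date and today."""
    years = today.year - purchased.year
    # Subtract one if this year's anniversary has not been reached yet.
    if (today.month, today.day) < (purchased.month, purchased.day):
        years -= 1
    return years


def disposition(purchased: date, today: date) -> str:
    """Return the checklist outcome for a system of the given age."""
    # Assumption: exactly five years counts as "over 5 years".
    if age_in_years(purchased, today) < RECLAIM_THRESHOLD_YEARS:
        return "reclaim to spare"
    return "decommission"


# Hypothetical dates, for illustration only.
print(disposition(date(2021, 1, 1), date(2025, 9, 29)))
print(disposition(date(2018, 3, 1), date(2025, 9, 29)))
```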

druid1008.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age: systems under 5 years old are reclaimed to spare, systems over 5 years old are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

Change #1185840 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] druid: remove druid100[7-8] from druid_public_broker VIP

https://gerrit.wikimedia.org/r/1185840

The new druid hosts are online, proceeding with the decommissioning of the older hosts.

Mentioned in SAL (#wikimedia-analytics) [2025-09-24T11:40:57Z] <stevemunene> depool druid100[7-8] from the druid public cluster T403801

Mentioned in SAL (#wikimedia-analytics) [2025-09-24T11:44:17Z] <stevemunene> start decommissioning druid100[7-8] from the druid coordinator UI T403801

Change #1192147 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] remove mention of druid10[07-08] in puppet

https://gerrit.wikimedia.org/r/1192147

Mentioned in SAL (#wikimedia-analytics) [2025-09-29T15:06:12Z] <stevemunene> stop and disable druid services on druid100[7-8] T403801

Change #1185840 merged by Stevemunene:

[operations/puppet@production] druid: remove druid100[7-8] from druid_public_broker VIP

https://gerrit.wikimedia.org/r/1185840

Waiting on T405446 to be completed (that task contains the preliminary steps that need to be done by DPE SRE before handing over to DC Ops).

Icinga downtime and Alertmanager silence (ID=cd29316d-7828-4fe8-b8f0-b0d5f61f3fe3) set by brouberol@cumin1003 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Hosts are being decomissioned

an-druid1007.eqiad.wmnet

cookbooks.sre.hosts.decommission executed by btullis@cumin1003 for hosts: druid1007.eqiad.wmnet

  • druid1007.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by btullis@cumin1003 for hosts: druid1008.eqiad.wmnet

  • druid1008.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above

Change #1192147 merged by Stevemunene:

[operations/puppet@production] remove mention of druid10[07-08] in puppet

https://gerrit.wikimedia.org/r/1192147

cookbooks.sre.hosts.decommission executed by btullis@cumin1003 for hosts: druid1008.eqiad.wmnet

  • druid1008.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually

ERROR: some step on some host failed, check the bolded items above
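
Since the bold highlighting mentioned in the error line does not survive in plain text, a small Python sketch like the one below could pull the failed sections out of a pasted cookbook summary. It assumes the layout seen in this task (a `NAME (PASS)` or `NAME (FAIL)` header followed by indented step lines); the function name `failed_sections` and the shortened sample text are hypothetical.

```python
import re

# Shortened sample in the layout of the cookbook summary posted above.
SUMMARY = """\
  • druid1008.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Powered off
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook, run it manually
"""


def failed_sections(summary: str) -> dict[str, list[str]]:
    """Map each section marked FAIL to the step lines listed under it."""
    failed: dict[str, list[str]] = {}
    current = None  # name of the FAIL section being read, if any
    for line in summary.splitlines():
        text = line.strip().lstrip("•").strip()
        if not text:
            continue
        header = re.fullmatch(r"(\S+) \((PASS|FAIL)\)", text)
        if header:
            name, status = header.groups()
            current = name if status == "FAIL" else None
            if current is not None:
                failed[current] = []
        elif current is not None:
            failed[current].append(text)
    return failed


for section, steps in failed_sections(SUMMARY).items():
    print(section, steps)
```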

Running the sre.dns.netbox cookbook manually to complete the failed step.

Stevemunene updated the task description.
Stevemunene added a project: ops-eqiad.
VRiley-WMF updated the task description.

This has been completed.