Page MenuHomePhabricator

decommission cirrussearch1053.eqiad.wmnet + more (see description)
Closed, ResolvedPublicRequest

Description

This task will track the decommission-hardware of servers:

  • cirrussearch1053.eqiad.wmnet
  • elastic1054.eqiad.wmnet
  • cirrussearch1055.eqiad.wmnet
  • cirrussearch1056.eqiad.wmnet
  • cirrussearch1057.eqiad.wmnet
  • cirrussearch1058.eqiad.wmnet
  • cirrussearch1059.eqiad.wmnet
  • cirrussearch1060.eqiad.wmnet
  • cirrussearch1061.eqiad.wmnet
  • cirrussearch1062.eqiad.wmnet
  • cirrussearch1063.eqiad.wmnet
  • cirrussearch1064.eqiad.wmnet
  • cirrussearch1065.eqiad.wmnet
  • cirrussearch1066.eqiad.wmnet
  • elastic1067.eqiad.wmnet

cirrussearch1053.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

elastic1054.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cirrussearch1055.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cirrussearch1056.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cirrussearch1057.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cirrussearch1058.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cirrussearch1059.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cirrussearch1060.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cirrussearch1061.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cirrussearch1062.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cirrussearch1063.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cirrussearch1064.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cirrussearch1065.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

cirrussearch1066.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

elastic1067.eqiad.wmnet

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to no owner and ensure the site project (ops-sitename depending on site of server) is assigned.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

Change #1146041 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic/cirrussearch: prepare hosts for decommission

https://gerrit.wikimedia.org/r/1146041

bking renamed this task from decommission cirrussearch1053, elastic1054, cirrussearch1055 to decommission cirrussearch1053, elastic1054, cirrussearch1055, cirrussearch1056.May 14 2025, 7:49 PM
bking updated the task description. (Show Details)
bking renamed this task from decommission cirrussearch1053, elastic1054, cirrussearch1055, cirrussearch1056 to decommission cirrussearch1053, elastic1054, cirrussearch1055, cirrussearch1056, elastic1057.May 19 2025, 3:55 PM
bking renamed this task from decommission cirrussearch1053, elastic1054, cirrussearch1055, cirrussearch1056, elastic1057 to decommission cirrussearch1053.eqiad.wmnet,elastic1054.eqiad.wmnet,cirrussearch1055.eqiad.wmnet,cirrussearch1055.eqiad.wmnet,cirrussearch1057.eqiad.wmnet,cirrussearch1058.eqiad.wmnet,cirrussearch1059.eqiad.wmnet,cirrussearch1063.eqiad.wmnet,elastic1067.eqiad.wmnet,cirrussearch1103.eqiad.wmnet.May 27 2025, 6:51 PM
bking renamed this task from decommission cirrussearch1053.eqiad.wmnet,elastic1054.eqiad.wmnet,cirrussearch1055.eqiad.wmnet,cirrussearch1055.eqiad.wmnet,cirrussearch1057.eqiad.wmnet,cirrussearch1058.eqiad.wmnet,cirrussearch1059.eqiad.wmnet,cirrussearch1063.eqiad.wmnet,elastic1067.eqiad.wmnet,cirrussearch1103.eqiad.wmnet to decommission cirrussearch1053.eqiad.wmnet + more (see description).
bking updated the task description. (Show Details)

Change #1151294 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cirrussearch: add row F, remove soon-to-be-decom hosts

https://gerrit.wikimedia.org/r/1151294

Mentioned in SAL (#wikimedia-operations) [2025-05-27T19:11:58Z] <inflatador> bking@cumin2002 depool cirrussearch106[0-6] T394350

Icinga downtime and Alertmanager silence (ID=855137f1-2f92-41f8-8d3f-a77b56d226b3) set by bking@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: downtime until decom

elastic1067.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=1bdf7390-1c6f-4c94-941b-d1c183818dfd) set by bking@cumin2002 for 7 days, 0:00:00 on 1 host(s) and their services with reason: downtime until decom

elastic1054.eqiad.wmnet

Change #1151294 merged by Bking:

[operations/puppet@production] cirrussearch: add row F, remove soon-to-be-decom hosts

https://gerrit.wikimedia.org/r/1151294

Mentioned in SAL (#wikimedia-operations) [2025-05-28T15:17:33Z] <inflatador> bking@mwmaint1002 ban Elastic/CS hosts prior to decom T394350

Icinga downtime and Alertmanager silence (ID=aa7dc63c-8df3-4d64-9e10-00afae6d3783) set by bking@cumin2002 for 7 days, 0:00:00 on 3 host(s) and their services with reason: downtime until decom

elastic[1054,1067,1103].eqiad.wmnet

Change #1151784 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic/cirrussearch: prepare hosts for decom

https://gerrit.wikimedia.org/r/1151784

Change #1151784 merged by Bking:

[operations/puppet@production] elastic/cirrussearch: prepare hosts for decom

https://gerrit.wikimedia.org/r/1151784

cookbooks.sre.hosts.decommission executed by ryankemper@cumin2002 for hosts: elastic1054.eqiad.wmnet

  • elastic1054.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: cirrussearch1053.eqiad.wmnet

  • cirrussearch1053.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by ryankemper@cumin2002 for hosts: cirrussearch[1060-1062].eqiad.wmnet

  • cirrussearch1060.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cirrussearch1061.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cirrussearch1062.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: cirrussearch[1055-1059].eqiad.wmnet

  • cirrussearch1055.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cirrussearch1056.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cirrussearch1057.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cirrussearch1058.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cirrussearch1059.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: cirrussearch[1064-1066].eqiad.wmnet

  • cirrussearch1064.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cirrussearch1065.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • cirrussearch1066.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: elastic1067.eqiad.wmnet

  • elastic1067.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change #1152775 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic/cirrussearch: remove decomm'd hosts

https://gerrit.wikimedia.org/r/1152775

Change #1152775 merged by Ryan Kemper:

[operations/puppet@production] elastic/cirrussearch: remove decomm'd hosts

https://gerrit.wikimedia.org/r/1152775

bking removed bking as the assignee of this task.Jun 2 2025, 9:28 PM
bking updated the task description. (Show Details)
bking added a project: ops-eqiad.
bking added a subscriber: RKemper.

Data Platform SRE steps are finished (we think). Sending to DC Ops.

DC Ops, if we missed anything, feel free to ping me or @RKemper on IRC (inflatador and ryankemper, respectively).

Mentioned in SAL (#wikimedia-operations) [2025-06-09T17:21:13Z] <inflatador> bking@cumin1003 power down cirrussearch1063 to prevent logspam T394350

VRiley-WMF subscribed.

@bking It looks like cirrussearch1063 is saying it's still active, would you be able to put this into the decom status in netbox?

cookbooks.sre.hosts.decommission executed by bking@cumin2002 for hosts: cirrussearch1063.eqiad.wmnet

  • cirrussearch1063.eqiad.wmnet (FAIL)
    • Host not found on Icinga, unable to downtime it
    • Found physical host
    • Downtimed management interface on Alertmanager
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Failed to power off, manual intervention required: Remote IPMI for cirrussearch1063.mgmt.eqiad.wmnet failed (exit=1): b''
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

@VRiley-WMF Apologies, as it looks like we forgot to run the decom cookbook against cirrussearch1063. I just ran the cookbook and I can confirm that the host is now in decommissioning status . Again, sorry for missing that. If you need anything else, feel free to ping here or in IRC.

VRiley-WMF claimed this task.
VRiley-WMF updated the task description. (Show Details)

This is complete

Change #1146041 abandoned by Bking:

[operations/puppet@production] elastic/cirrussearch: prepare hosts for decommission

Reason:

We've already addressed this, see T394350

https://gerrit.wikimedia.org/r/1146041

Jclark-ctr subscribed.

@VRiley-WMF elastic1067 is still listed in Netbox in rack please Verify and update

@Jclark-ctr verified it was removed and the script has been run