Q4:rack/decom codfw unified decommission task
Closed, Resolved (Public)

Description

lvs2007

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place (likely done by script).
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the site.pp entry and replace it with role(spare::system); recommended to ensure services are offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - mgmt dns entries removed.

lvs2008

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place (likely done by script).
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the site.pp entry and replace it with role(spare::system); recommended to ensure services are offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - mgmt dns entries removed.

lvs2009

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place (likely done by script).
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the site.pp entry and replace it with role(spare::system); recommended to ensure services are offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - mgmt dns entries removed.

lvs2010

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place (likely done by script).
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the site.pp entry and replace it with role(spare::system); recommended to ensure services are offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - mgmt dns entries removed.

dns

dns2001

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place (likely done by script).
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the site.pp entry and replace it with role(spare::system); recommended to ensure services are offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - mgmt dns entries removed.

dns2002

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place (likely done by script).
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the site.pp entry and replace it with role(spare::system); recommended to ensure services are offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - mgmt dns entries removed.

dns2003

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place (likely done by script).
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove the site.pp entry and replace it with role(spare::system); recommended to ensure services are offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • - log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - mgmt dns entries removed.
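
The cookbook step in the checklists above takes a fully qualified host name. In the successful runs recorded in this task's event timeline, the lvs hosts were addressed as <name>.codfw.wmnet and the dns hosts as <name>.wikimedia.org. A minimal Python sketch of building the command lines under that assumption follows; the helper names and the suffix table are illustrative only and are not part of the cookbook itself.

```python
# Illustrative helper: build decommission cookbook command lines.
# The suffix table is an assumption derived from the runs logged in this task.
DOMAIN_BY_PREFIX = {
    "lvs": "codfw.wmnet",
    "dns": "wikimedia.org",
}

TASK = "T335777"


def fqdn(host: str) -> str:
    """Return the FQDN for a short host name such as 'lvs2007'."""
    for prefix, domain in DOMAIN_BY_PREFIX.items():
        if host.startswith(prefix):
            return f"{host}.{domain}"
    raise ValueError(f"no known domain for {host!r}")


def decom_command(host: str, task: str = TASK) -> str:
    """Return the cookbook invocation from the checklist for one host."""
    return f"cookbook sre.hosts.decommission {fqdn(host)} -t {task}"


if __name__ == "__main__":
    for name in ["lvs2007", "lvs2008", "lvs2009", "lvs2010",
                 "dns2001", "dns2002", "dns2003"]:
        print(decom_command(name))
```

Raising on an unknown prefix, rather than guessing a domain, keeps a mistyped host name from being passed to the cookbook.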

Details

Repo                      Branch       Lines +/-
operations/dns            master       +2 -0
operations/homer/public   master       +0 -1
operations/puppet         production   +4 -13
operations/homer/public   master       +0 -1
operations/puppet         production   +2 -10
operations/dns            master       +2 -0
operations/puppet         production   +0 -23
operations/homer/public   master       +0 -1
operations/puppet         production   +0 -7
operations/puppet         production   +1 -29
operations/homer/public   master       +0 -1
operations/puppet         production   +1 -21
operations/homer/public   master       +0 -1
operations/puppet         production   +9 -23
operations/homer/public   master       +0 -1
operations/puppet         production   +2 -10
operations/homer/public   master       +0 -1
operations/puppet         production   +2 -10

Event Timeline

ssingh triaged this task as Low priority. May 2 2023, 1:55 PM

Change 914341 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs2007: decommission host for codfw hardware refresh

https://gerrit.wikimedia.org/r/914341

Change 914343 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] depool codfw (emergency patch, do not merge)

https://gerrit.wikimedia.org/r/914343

Change 914344 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs2007 from lvs_neighbors

https://gerrit.wikimedia.org/r/914344

Mentioned in SAL (#wikimedia-operations) [2023-05-03T14:33:17Z] <sukhe> set routing-options static route 208.80.153.224/28 next-hop 10.192.49.7 [move static route for high-traffic1 to lvs2010]: T335777

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: lvs2007.codfw.wmnet

  • lvs2007.codfw.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 914341 merged by Ssingh:

[operations/puppet@production] lvs2007: decommission host for codfw hardware refresh

https://gerrit.wikimedia.org/r/914341

Change 914344 merged by Ssingh:

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs2007 from lvs_neighbors

https://gerrit.wikimedia.org/r/914344

Mentioned in SAL (#wikimedia-operations) [2023-05-03T14:52:55Z] <sukhe> homer "cr*-codfw*" commit "Gerrit: 914344 remove decommissioned host lvs2007": T335777

Mentioned in SAL (#wikimedia-operations) [2023-05-03T14:54:12Z] <sukhe> [finished] homer "cr*-codfw*" commit "Gerrit: 914344 remove decommissioned host lvs2007": T335777
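
The homer commits above target devices with the query "cr*-codfw*". As a rough illustration, and assuming the query behaves like a shell-style glob on device names (an assumption, not confirmed in this task), the selection can be sketched in Python as follows. The device names are hypothetical placeholders.

```python
from fnmatch import fnmatch

QUERY = "cr*-codfw*"

# Hypothetical device names, for illustration only.
devices = [
    "cr1-codfw.example.org",
    "cr2-codfw.example.org",
    "cr1-eqiad.example.org",
    "asw-a-codfw.example.org",
]

# Keep only the devices whose name matches the glob.
matched = [d for d in devices if fnmatch(d, QUERY)]
print(matched)
```

Under that assumption, only the names that start with "cr" and contain "-codfw" are selected.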

@Jhancock.wm can you run the netbox offline script and get lvs2007 out of the rack and into storage?
Thanks

ssingh updated the task description.

Mentioned in SAL (#wikimedia-operations) [2023-05-08T15:32:10Z] <sukhe> ns1: remove dns2001, add dns2004 next-hop [ 208.80.153.48 208.80.153.111 208.80.153.10 ]: T335777

Change 917364 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove dns2001 from anycast_neighbors (host decom)

https://gerrit.wikimedia.org/r/917364

Change 917365 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: decommission dns2001

https://gerrit.wikimedia.org/r/917365

Papaul moved this task from Backlog to Decommission on the ops-codfw board.

Change 917882 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs2008: decommission host for codfw hardware refresh

https://gerrit.wikimedia.org/r/917882

Change 917885 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs2008 from lvs_neighbors

https://gerrit.wikimedia.org/r/917885

Mentioned in SAL (#wikimedia-operations) [2023-05-09T14:15:21Z] <sukhe> set routing-options static route 208.80.153.240/28 next-hop 10.192.49.7 [move static route for high-traffic2 to lvs2010]: T335777
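
The two static-route SAL entries in this task (2023-05-03 and 2023-05-09) point 208.80.153.224/28 and 208.80.153.240/28 at next-hop 10.192.49.7, which the entries describe as lvs2010. A small Python sketch using the standard ipaddress module shows which addresses each /28 covers; the dictionary keys simply reuse the labels from the log lines.

```python
import ipaddress

# Prefixes and next-hop copied from the SAL entries in this task.
routes = {
    "high-traffic1": ipaddress.ip_network("208.80.153.224/28"),
    "high-traffic2": ipaddress.ip_network("208.80.153.240/28"),
}
next_hop = ipaddress.ip_address("10.192.49.7")

for name, net in routes.items():
    # First address, last address and size of each routed block.
    print(name, net.network_address, net.broadcast_address, net.num_addresses)

# The two blocks are adjacent and do not overlap.
print(routes["high-traffic1"].overlaps(routes["high-traffic2"]))

# The next-hop is an internal (RFC 1918) address.
print(next_hop.is_private)
```

Each /28 holds 16 addresses, so the two routes together cover 208.80.153.224 through 208.80.153.255.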

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: lvs2008.codfw.wmnet

  • lvs2008.codfw.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 917882 merged by Ssingh:

[operations/puppet@production] lvs2008: decommission host for codfw hardware refresh

https://gerrit.wikimedia.org/r/917882

Change 917885 merged by Ssingh:

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs2008 from lvs_neighbors

https://gerrit.wikimedia.org/r/917885

BCornwall changed the task status from Open to In Progress. May 10 2023, 7:20 PM
BCornwall moved this task from Backlog to Traffic team actively servicing on the Traffic board.

Change 917365 abandoned by Ssingh:

[operations/puppet@production] hiera: decommission dns2001

Reason:

probably best to start clean

https://gerrit.wikimedia.org/r/917365

Change 919340 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: decommission dns2001

https://gerrit.wikimedia.org/r/919340

Change 917364 merged by Ssingh:

[operations/homer/public@master] sites.yaml: remove dns2001 from anycast_neighbors (host decom)

https://gerrit.wikimedia.org/r/917364

Mentioned in SAL (#wikimedia-operations) [2023-05-12T14:13:40Z] <sukhe> homer "cr*-codfw*" commit "Gerrit: 917364 remove to-be decommissioned host dns2001": T335777

Mentioned in SAL (#wikimedia-operations) [2023-05-12T14:15:55Z] <sukhe> [done] homer "cr*-codfw*" commit "Gerrit: 917364 remove to-be decommissioned host dns2001": T335777

Change 919340 merged by Ssingh:

[operations/puppet@production] hiera: decommission dns2001

https://gerrit.wikimedia.org/r/919340

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: dns2001.wikimedia.wmnet

  • dns2001.wikimedia.wmnet (FAIL)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: dns2001.wikimedia.org

  • dns2001.wikimedia.org (FAIL)
    • Unable to find/resolve the mgmt DNS record, using the IP instead: 10.193.1.197
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
    • Host is already powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above
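
The cookbook summaries pasted into this task share one shape: a host line with an overall status in parentheses, followed by one line per step. As a sketch only, the following Python reads such a summary and lists the hosts whose signature wipe was skipped. The sample text is abbreviated from the summaries in this task; the function names and the matching string are illustrative assumptions, not part of the cookbook.

```python
import re

# Abbreviated sample, copied from cookbook summaries in this task.
SUMMARY = """\
lvs2007.codfw.wmnet (WARN)
  Wiped all swraid, partition-table and filesystem signatures
  Powered off
dns2001.wikimedia.org (FAIL)
  Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
  Host is already powered off
"""

# A host line looks like "<fqdn> (<STATUS>)".
HOST_LINE = re.compile(r"^(?P<host>\S+) \((?P<status>[A-Z]+)\)$")


def parse_summary(text):
    """Map each host to its overall status and its list of step messages."""
    results = {}
    current = None
    for raw in text.splitlines():
        # Drop indentation and any bullet character left by the page.
        line = raw.strip().lstrip("•").strip()
        if not line:
            continue
        match = HOST_LINE.match(line)
        if match:
            current = match["host"]
            results[current] = {"status": match["status"], "steps": []}
        elif current is not None:
            results[current]["steps"].append(line)
    return results


def hosts_with_skipped_wipe(results):
    """Hosts whose summary says the signature wipe was not performed."""
    return [
        host for host, info in results.items()
        if any("will not be performed" in step for step in info["steps"])
    ]


parsed = parse_summary(SUMMARY)
print(hosts_with_skipped_wipe(parsed))
```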

Change 920320 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove dns2002 from anycast_neighbors (host decom)

https://gerrit.wikimedia.org/r/920320

Change 920350 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: decommission dns2002

https://gerrit.wikimedia.org/r/920350

Change 920320 merged by Ssingh:

[operations/homer/public@master] sites.yaml: remove dns2002 from anycast_neighbors (host decom)

https://gerrit.wikimedia.org/r/920320

Mentioned in SAL (#wikimedia-operations) [2023-05-16T17:00:13Z] <sukhe> homer "cr*-codfw*" commit "Gerrit: 920320 remove to-be decommissioned host dns2002" T335777

Change 920350 merged by Ssingh:

[operations/puppet@production] hiera: decommission dns2002

https://gerrit.wikimedia.org/r/920350

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: dns2002.wikimedia.org

  • dns2002.wikimedia.org (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 920355 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: remove obsolete dns2001.yaml file

https://gerrit.wikimedia.org/r/920355

Change 920355 merged by Ssingh:

[operations/puppet@production] hiera: remove obsolete dns2001.yaml file

https://gerrit.wikimedia.org/r/920355

Change 920363 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove dns2003 from anycast_neighbors (host decom)

https://gerrit.wikimedia.org/r/920363

Change 920364 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] hiera: decommission dns2003

https://gerrit.wikimedia.org/r/920364

Change 920363 merged by jenkins-bot:

[operations/homer/public@master] sites.yaml: remove dns2003 from anycast_neighbors (host decom)

https://gerrit.wikimedia.org/r/920363

Mentioned in SAL (#wikimedia-operations) [2023-05-16T18:46:42Z] <sukhe> homer "cr*-codfw*" commit "Gerrit: 920363 remove to-be decommissioned host dns2003": T335777

Change 920364 merged by Ssingh:

[operations/puppet@production] hiera: decommission dns2003

https://gerrit.wikimedia.org/r/920364

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: dns2003.wikimedia.org

  • dns2003.wikimedia.org (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

@ssingh is it safe for me to physically remove lvs2008 and the three dns servers from the racks and offline them?

Yes, thanks! It's safe to remove lvs2008 and also dns200[1-3]. All hosts have been decommissioned from our end and are not in production.

Change 914343 abandoned by Ssingh:

[operations/dns@master] depool codfw (emergency patch, do not merge)

Reason:

no longer required

https://gerrit.wikimedia.org/r/914343

Change 927206 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs2009: decommission host for codfw hardware refresh

https://gerrit.wikimedia.org/r/927206

Change 927208 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs2009 from lvs_neighbors

https://gerrit.wikimedia.org/r/927208

Change 927214 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/dns@master] depool codfw (emergency patch, do not merge)

https://gerrit.wikimedia.org/r/927214

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: lvs2009.codfw.wmnet

  • lvs2009.codfw.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 927208 merged by Ssingh:

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs2009 from lvs_neighbors

https://gerrit.wikimedia.org/r/927208

Change 927206 merged by Ssingh:

[operations/puppet@production] lvs2009: decommission host for codfw hardware refresh

https://gerrit.wikimedia.org/r/927206

Mentioned in SAL (#wikimedia-operations) [2023-06-05T14:48:55Z] <sukhe> homer "cr*-codfw*" commit "Gerrit: 927208 remove decommissioned host lvs2009": T335777

Change 928067 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/puppet@production] lvs2010: decommission host for codfw hardware refresh

https://gerrit.wikimedia.org/r/928067

Change 928068 had a related patch set uploaded (by Ssingh; author: Ssingh):

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs2010 from lvs_neighbors

https://gerrit.wikimedia.org/r/928068

cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: lvs2010.codfw.wmnet

  • lvs2010.codfw.wmnet (WARN)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Management interface not found on Icinga, unable to downtime it
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 928067 merged by Ssingh:

[operations/puppet@production] lvs2010: decommission host for codfw hardware refresh

https://gerrit.wikimedia.org/r/928067

Change 928068 merged by Ssingh:

[operations/homer/public@master] sites.yaml: remove decommissioned host lvs2010 from lvs_neighbors

https://gerrit.wikimedia.org/r/928068

Cable IDs
em1 - 11995
em2 - 11997
nic2 port 1 - 11996
nic2 port 2 - 11998

Jhancock.wm updated the task description.

Servers have been removed from the racks but left in the hot aisle of row D. They will be moved to storage after the recycling pickup.

Change 927214 abandoned by Ssingh:

[operations/dns@master] depool codfw (emergency patch, do not merge)

Reason:

no longer required

https://gerrit.wikimedia.org/r/927214