
decommission mw2251-mw2255, mw2257-mw2258
Closed, Resolved (Public)

Description

With T290192 done, we should proceed to decommissioning mw2251-2258.

Service Owner steps

  • all system services confirmed offline from production use
  • set all icinga checks to maint mode/disabled while reclaim/decommission takes place. (likely done by script)
  • remove system from all lvs/pybal active configuration
  • any service group puppet/hiera/dsh config removed
  • remove the site.pp entry and replace it with role(spare::system). This is recommended to ensure services are offline, but not 100% required as long as the decom script below is run IMMEDIATELY.
  • log in to a cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and a homer run.
  • remove all remaining puppet references and all host entries in the puppet repo
  • reassign task from service owner to DC ops team member depending on site of server.
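
As an illustration of the cookbook step above, here is a minimal Python sketch that assembles the command line quoted in the checklist. Only the cookbook name and the `-t` flag come from the checklist; the helper name `decommission_command` is hypothetical, and `T000000` is a placeholder rather than this task's ID.

```python
import shlex


def decommission_command(hosts, task):
    """Build the argument list for the decom cookbook call quoted in the checklist."""
    # Only "cookbook sre.hosts.decommission <host fqdn> -t <phab task>" is taken
    # from the checklist; no other flags are assumed here.
    return ["cookbook", "sre.hosts.decommission", hosts, "-t", task]


# "T000000" is a placeholder task ID, used only for illustration.
cmd = decommission_command("mw2251.codfw.wmnet", "T000000")
print(shlex.join(cmd))
```

Building the command as a list and joining it with `shlex.join` keeps the quoting correct if the host argument ever contains shell metacharacters such as brackets.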

End service owner steps / Begin DC-Ops team steps

  • system disks removed (by onsite)
  • determine system age: systems under 5 years old are reclaimed to spare, systems over 5 years old are decommissioned.
  • IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • IF DECOM: mgmt dns entries removed.
  • IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

Event Timeline

N.B. this is only seven hosts, mw225[1-5,7-8] -- mw2256 was already decommed in T263065.

RLazarus renamed this task from decomission mw2251-mw2258 to decomission mw2251-mw2255, mw2257-mw2258. Jul 27 2022, 5:51 PM

Change 817869 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] Decom mw2251-2255,2257,2258

https://gerrit.wikimedia.org/r/817869

Reviewing the change above I was looking at the mw225* range in Netbox:

https://netbox.wikimedia.org/search/?q=mw225&obj_type=

I noticed that this ticket and the Gerrit change handle mw2251 through mw2253 (rack A4) and mw2254 through mw2258 (rack B3), but mw2259 is not in the list. After further review, though, it _does_ actually make sense, because the hosts from mw2259 upwards are from a newer procurement ticket.

So unless we have concerns over capacity if we reduce by these 7 servers, it all looks good to me.

Aklapper renamed this task from decomission mw2251-mw2255, mw2257-mw2258 to decommission mw2251-mw2255, mw2257-mw2258. Jul 27 2022, 9:53 PM

Icinga downtime and Alertmanager silence (ID=ed218867-a364-4f33-bbd9-60f66ba67f36) set by rzl@cumin2002 for 2:00:00 on 7 host(s) and their services with reason: Decom

mw[2251-2255,2257-2258].codfw.wmnet
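
To double-check which hosts a bracketed expression like the one above covers, a small expander can help. The sketch below is a simplified stand-in written for this illustration, not the parser the production tooling uses; `expand_hosts` is a hypothetical name, and it handles only a single bracket group of numbers and ranges.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand one bracketed list of numbers and ranges, e.g. 'mw[1-3,5].example'."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # no bracket group: treat the expression as a single host
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start  # a bare number is a one-element range
        width = len(start)  # preserve any zero padding
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


hosts = expand_hosts("mw[2251-2255,2257-2258].codfw.wmnet")
print(len(hosts))  # 7
print("mw2256.codfw.wmnet" in hosts)  # False
```

The same function also expands the shorter form used earlier in the timeline, `mw225[1-5,7-8]`, to the same seven short hostnames.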

cookbooks.sre.hosts.decommission executed by rzl@cumin2002 for hosts: mw[2251-2255,2257-2258].codfw.wmnet

  • mw2251.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2252.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2253.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2254.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2255.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2257.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • mw2258.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found physical host
    • Downtimed management interface on Icinga/Alertmanager
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
    • Configured the linked switch interface(s)
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 817869 merged by RLazarus:

[operations/puppet@production] Decom mw2251-2255,2257,2258

https://gerrit.wikimedia.org/r/817869

Mentioned in SAL (#wikimedia-operations) [2022-07-27T23:59:08Z] <tstarling@deploy1002> Synchronized wmf-config/InitialiseSettings.php: sync again now that scap proxy list is fixed T313730 T313496 (duration: 03m 25s)

RLazarus added a project: ops-codfw.
RLazarus updated the task description.
RLazarus subscribed.

@Papaul All yours!

Papaul triaged this task as Medium priority. Aug 1 2022, 2:58 PM
Papaul updated the task description.

complete