Page MenuHomePhabricator

Decom ms-be[1019-1026]
Closed, ResolvedPublic

Description

ms-be1019

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to DC ops team member depending on site of server.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

ms-be1020

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to DC ops team member depending on site of server.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

ms-be1021

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to DC ops team member depending on site of server.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

ms-be1022

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to DC ops team member depending on site of server.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

ms-be1023

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to DC ops team member depending on site of server.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

ms-be1024

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to DC ops team member depending on site of server.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

ms-be1025

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to DC ops team member depending on site of server.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.
  • - IF RECLAIM: set netbox state to 'inventory' and hostname to asset tag

ms-be1026

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommmission takes place. (likely done by script)
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system) recommended to ensure services offline but not 100% required as long as the decom script is IMMEDIATELY run below.
  • - login to cumin host and run the decom cookbook: cookbook sre.hosts.decommission <host fqdn> -t <phab task>. This does: bootloader wipe, host power down, netbox update to decommissioning status, puppet node clean, puppet node deactivate, debmonitor removal, and run homer.
  • - remove all remaining puppet references and all host entries in the puppet repo
  • - reassign task from service owner to DC ops team member depending on site of server.

End service owner steps / Begin DC-Ops team steps:

  • - system disks removed (by onsite)
  • - determine system age, under 5 years are reclaimed to spare, over 5 years are decommissioned.
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result and set state to offline
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2021-02-08T13:54:53Z] <godog> swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-02-09T09:22:34Z] <godog> swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-02-09T16:14:00Z] <godog> swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-02-10T08:19:05Z] <godog> swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-02-15T13:06:54Z] <godog> swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-02-16T08:37:25Z] <godog> swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-03-11T14:07:28Z] <godog> swift eqiad-prod: decrease weight for SSDs on ms-be[1019-1026] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-03-15T08:33:42Z] <godog> swift eqiad-prod remove decom hosts from account/container rings - T272836 T276193

Mentioned in SAL (#wikimedia-operations) [2021-03-16T07:51:09Z] <godog> swift eqiad-prod: less weight for ms-be[1019-1026] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-03-17T07:50:14Z] <godog> swift eqiad-prod: less weight for ms-be[1019-1026] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-03-18T08:27:48Z] <godog> swift eqiad-prod: less weight for ms-be[1019-1026] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-03-22T08:13:27Z] <godog> swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836 T268435

Mentioned in SAL (#wikimedia-operations) [2021-03-29T07:42:05Z] <godog> swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836 T268435

Mentioned in SAL (#wikimedia-operations) [2021-04-06T07:20:49Z] <godog> swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836 T268435

Mentioned in SAL (#wikimedia-operations) [2021-04-14T07:41:10Z] <godog> swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-04-19T08:07:19Z] <godog> swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-04-26T07:32:56Z] <godog> swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-04-27T07:26:57Z] <godog> swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836

Mentioned in SAL (#wikimedia-operations) [2021-04-27T10:18:35Z] <godog> swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836

Change 682920 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Decom ms-be[1019-1026]

https://gerrit.wikimedia.org/r/682920

Change 682920 merged by Filippo Giunchedi:

[operations/puppet@production] Decom ms-be[1019-1026]

https://gerrit.wikimedia.org/r/682920

fgiunchedi renamed this task from Decom ms-be[1019-1026] from swift to Decom ms-be[1019-1026].Apr 27 2021, 12:47 PM
fgiunchedi updated the task description. (Show Details)
fgiunchedi updated the task description. (Show Details)

cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: ms-be1019.eqiad.wmnet

  • ms-be1019.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by filippo@cumin1001 for hosts: ms-be[1020-1026].eqiad.wmnet

  • ms-be1020.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • ms-be1021.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • ms-be1022.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • ms-be1023.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • ms-be1024.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • ms-be1025.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
  • ms-be1026.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped bootloaders
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
fgiunchedi mentioned this in Unknown Object (Task).Jun 29 2021, 8:30 PM

Removed ms-be1026 to rack new host

Cmjohnson updated the task description. (Show Details)

removed from rack and updated