
reclaim to spares pool: restbase2007.codfw.wmnet, restbase2008.codfw.wmnet
Closed, Resolved | Public | Request

Description

This task will track the decommission of servers restbase2007.codfw.wmnet and restbase2008.codfw.wmnet.

@RobH notes that these are actually still in warranty in netbox until April 22, 2019, so we'll reclaim to spares rather than dispose of these.

The first 5 steps should be completed by the service owner who is returning the server to DC-Ops (for reclaim to spares or decommissioning, depending on server configuration and age).

restbase2007.codfw.wmnet
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as it will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
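
The DebMonitor removal step in the list above can be wrapped so that the host name is checked before any DELETE is sent. This is a minimal sketch, not part of wmf-decommission-host: the URL and certificate paths are copied from the checklist, while the function name `debmonitor_delete_cmd` and the choice to print the command instead of running it are assumptions made for illustration.

```shell
# Hypothetical helper: print the DebMonitor DELETE command for one host
# instead of running it, so it can be reviewed before execution.
debmonitor_delete_cmd() {
    host_fqdn="$1"
    case "$host_fqdn" in
        # Reject empty input or any character outside letters, digits, "." and "-".
        ""|*[!A-Za-z0-9.-]*) echo "invalid host: '$host_fqdn'" >&2; return 1 ;;
        # Accept names that contain at least one dot.
        *.*) ;;
        # Anything else is a short hostname, not an FQDN.
        *) echo "not an FQDN: '$host_fqdn'" >&2; return 1 ;;
    esac
    # URL and certificate paths are taken verbatim from the checklist step.
    echo "sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${host_fqdn}" \
         "--cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key"
}

debmonitor_delete_cmd restbase2007.codfw.wmnet
```

Printing the command keeps the sketch safe to run anywhere; piping the output to a shell on the appropriate host would be a separate, deliberate step.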

  • - system disks wiped (by onsite)
  • - switch port description changed to asset tag
  • - mgmt dns entries updated to just asset tag.
  • - netbox name updated to just asset tag

restbase2008.codfw.wmnet
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as it will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - switch port description changed to asset tag
  • - mgmt dns entries updated to just asset tag.
  • - netbox name updated to just asset tag

Event Timeline

Dzahn created this task. Apr 16 2019, 7:18 PM
Dzahn updated the task description.
Dzahn added a comment. Apr 16 2019, 7:24 PM

This has been started as part of T208087 and the change above has been merged so they are not in prod anymore. This is the decom ticket with the template to go with it for the next steps.

Dzahn updated the task description. Apr 16 2019, 7:31 PM
Dzahn added a subscriber: mobrovac.
Dzahn added a subscriber: Eevans.
Dzahn assigned this task to RobH. Apr 16 2019, 7:51 PM
Dzahn updated the task description.

Confirmed in Grafana that neither host is receiving network traffic anymore.

downtimed in Icinga:

[icinga1001:~] $ sudo icinga-downtime -h restbase2007 -r https://phabricator.wikimedia.org/T221134 -d 345600
[icinga1001:~] $ sudo icinga-downtime -h restbase2008 -r https://phabricator.wikimedia.org/T221134 -d 345600
[icinga1001:~] $
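
The `-d 345600` value in the commands above works out to four days, assuming `-d` takes a duration in seconds. A small sketch of that arithmetic; the helper name `days_to_seconds` is hypothetical and not part of `icinga-downtime`.

```shell
# Hypothetical helper: convert whole days to seconds for a downtime duration.
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

days_to_seconds 4   # prints 345600, the value passed to -d above
```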

using role::spare::system ... assigning to Rob

Dzahn updated the task description. Apr 16 2019, 7:52 PM
RobH added a comment. Apr 23 2019, 5:58 PM

restbase2007:asw-b-codfw:ge-1/0/2
restbase2008:asw-c-codfw:ge-1/0/2

yeah, they are in the same port on two different rows, not a mistake.
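
The `host:switch:port` notes above can be split into labelled fields when they need to be reused later, for example when the switch port is relabelled. A minimal sketch, assuming the colon-separated layout shown in the comment; the function name `parse_port_note` and the output format are made up for illustration.

```shell
# Hypothetical helper: split a "host:switch:port" note into labelled fields.
parse_port_note() {
    # Split on ":" only; the port name (e.g. ge-1/0/2) contains no colon.
    echo "$1" | {
        IFS=: read -r host switch port
        echo "host=$host switch=$switch port=$port"
    }
}

parse_port_note "restbase2007:asw-b-codfw:ge-1/0/2"
parse_port_note "restbase2008:asw-c-codfw:ge-1/0/2"
```

Keeping the switch name as its own field makes it obvious that the two hosts share a port number but sit on different switches.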

RobH updated the task description. Apr 23 2019, 6:02 PM

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: restbase2007.codfw.wmnet

  • restbase2007.codfw.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: restbase2008.codfw.wmnet

  • restbase2008.codfw.wmnet
    • Removed from Puppet master and PuppetDB
    • Downtimed host on Icinga
    • Downtimed management interface on Icinga
    • Removed from DebMonitor

RobH updated the task description. Apr 23 2019, 6:03 PM

Change 505849 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] restbase200[78] decom

https://gerrit.wikimedia.org/r/505849

Change 505850 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] restbase200[78] prod dns decom

https://gerrit.wikimedia.org/r/505850

Change 505849 merged by RobH:
[operations/puppet@production] restbase200[78] decom

https://gerrit.wikimedia.org/r/505849

Change 505850 merged by RobH:
[operations/dns@master] restbase200[78] prod dns decom

https://gerrit.wikimedia.org/r/505850

RobH renamed this task from "decommission restbase2007.codfw.wmnet, restbase2008.codfw.wmnet" to "reclaim to spares pool: restbase2007.codfw.wmnet, restbase2008.codfw.wmnet". Apr 23 2019, 6:08 PM
RobH updated the task description.
RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH added a subscriber: Papaul.

Ok, these are ready to have their SSDs securely erased by @Papaul, then the remaining checkboxes. This will return them to the 'spares pool' which is denoted now by being 'planned' status in netbox but with an asset tag as a hostname.

RobH reassigned this task from RobH to Papaul. Apr 23 2019, 6:25 PM
Papaul claimed this task. May 3 2019, 6:36 PM
Papaul added a comment. May 3 2019, 7:14 PM

papaul@asw-c-codfw# run show interfaces ge-1/0/2 descriptions 
Interface       Admin Link Description
ge-1/0/2        down  down DISABLED

Papaul updated the task description. May 16 2019, 3:31 PM

Change 510776 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt hostname for restbase200[1-8]

https://gerrit.wikimedia.org/r/510776

Change 510776 merged by Papaul:
[operations/dns@master] DNS: Remove mgmt hostname for restbase200[1-8]

https://gerrit.wikimedia.org/r/510776

Papaul closed this task as Resolved. May 16 2019, 8:31 PM
Papaul updated the task description.

Complete