
decommission: labtestvirt200[12].codfw.wmnet
Closed, Resolved (Public)

Description

The first 5 steps should be completed by the service owner who is returning the server to DC-ops (for reclaim to spare or decommissioning, depending on server configuration and age).

Decommission note: These two hosts are both 5+ years old and are therefore slated for decommission and disposal rather than being returned to spares.

labtestvirt2001

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps must not be interrupted, as doing so would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (including role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.
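The debmonitor removal step above can be sketched as a small helper that builds the curl invocation for a given host. This is only an illustration: the URL, certificate path, and key path are the ones quoted in the checklist, while the function name `debmonitor_delete_cmd` and the basic host name check are assumptions made for the example. In practice wmf-decommission-host handles this step.

```python
import shlex

# Values quoted in the checklist step above.
DEBMONITOR_URL = "https://debmonitor.discovery.wmnet/hosts/"
CERT = "/etc/debmonitor/ssl/cert.pem"
KEY = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_cmd(host_fqdn):
    """Return the argv list for the DELETE request quoted in the checklist.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    # Reject empty names and names that would change the URL path.
    if not host_fqdn or "/" in host_fqdn or " " in host_fqdn:
        raise ValueError(f"suspicious host name: {host_fqdn!r}")
    return [
        "sudo", "curl", "-X", "DELETE",
        DEBMONITOR_URL + host_fqdn,
        "--cert", CERT,
        "--key", KEY,
    ]


if __name__ == "__main__":
    for host in ("labtestvirt2001.codfw.wmnet", "labtestvirt2002.codfw.wmnet"):
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(debmonitor_delete_cmd(host)))
```

Building the command as a list rather than a string keeps the host name as a single argument, so it can be passed to `subprocess.run` without shell quoting concerns.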

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

labtestvirt2002

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:

The following steps must not be interrupted, as doing so would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (including role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

aborrero triaged this task as Medium priority. (Mar 11 2019, 1:04 PM)
aborrero updated the task description.
aborrero updated the task description.
aborrero moved this task from Backlog to Decommission on the ops-codfw board.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-operations) [2019-03-11T13:28:02Z] <arturo> disable active checks in icinga for labtestvirt200[12] (T218023)

Change 497293 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: decommision several codfw servers

https://gerrit.wikimedia.org/r/497293

aborrero renamed this task from Hardware decommission: labtestvirt200[12].codfw.wmnet to decommission: labtestvirt200[12].codfw.wmnet. (Mar 18 2019, 1:09 PM)
aborrero updated the task description.
aborrero added subscribers: RobH, Papaul.

Change 497293 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: decommision several codfw servers

https://gerrit.wikimedia.org/r/497293

Change 497293 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] wmcs: decommision several codfw servers

https://gerrit.wikimedia.org/r/497293

aborrero updated the task description.
aborrero updated the task description.

Mentioned in SAL (#wikimedia-operations) [2019-03-26T16:05:34Z] <robh> decom of labtestvirt200[12] started via T218023

asw-b-codfw:

ge-5/0/8 up down labtestvirt2002-eth0
ge-5/0/17 up down labtestvirt2001-eth0
ge-5/0/30 up down labtestvirt2002-eth1
ge-5/0/31 up down labtestvirt2001-eth1
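The port listing above records which switch ports belong to which host, for later removal. As a rough sketch, it can be grouped per host like this. The function name `ports_by_host` is hypothetical, and the code assumes that the four columns are interface, admin state, link state, and description, and that each description has the form `<host>-<nic>`.

```python
from collections import defaultdict

# The listing noted on this task for asw-b-codfw.
PORT_LISTING = """\
ge-5/0/8 up down labtestvirt2002-eth0
ge-5/0/17 up down labtestvirt2001-eth0
ge-5/0/30 up down labtestvirt2002-eth1
ge-5/0/31 up down labtestvirt2001-eth1
"""


def ports_by_host(listing):
    """Map each host to the switch ports whose description names it."""
    result = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        port, _admin, _link, description = fields
        # Assumed description format: "<host>-<nic>", e.g. "labtestvirt2001-eth0".
        host, _sep, _nic = description.rpartition("-")
        result[host or description].append(port)
    return dict(result)


if __name__ == "__main__":
    for host, ports in sorted(ports_by_host(PORT_LISTING).items()):
        print(host, ", ".join(ports))
```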

wmf-decommission-host was executed by robh for labtestvirt2001.codfw.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for labtestvirt2002.codfw.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

Change 499235 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom of labtestvirt200[12]

https://gerrit.wikimedia.org/r/499235

Change 499235 merged by RobH:
[operations/puppet@production] decom of labtestvirt200[12]

https://gerrit.wikimedia.org/r/499235

Change 499237 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom labtestvirt200[12] prod dns

https://gerrit.wikimedia.org/r/499237

Change 499237 merged by RobH:
[operations/dns@master] decom labtestvirt200[12] prod dns

https://gerrit.wikimedia.org/r/499237

RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH moved this task from Backlog to pending onsite steps (codfw) on the decommission-hardware board.
This comment was removed by Papaul.
papaul@asw-b-codfw# run show interfaces ge-5/0/8 descriptions     
Interface       Admin Link Description
ge-5/0/8        down  down DISABLED

{master:2}[edit]
papaul@asw-b-codfw# run show interfaces ge-5/0/17 descriptions   
Interface       Admin Link Description
ge-5/0/17       down  down DISABLED

{master:2}[edit]
papaul@asw-b-codfw# run show interfaces ge-5/0/30 descriptions    
Interface       Admin Link Description
ge-5/0/30       down  down DISABLED

{master:2}[edit]
papaul@asw-b-codfw# run show interfaces ge-5/0/31 descriptions    
Interface       Admin Link Description
ge-5/0/31       down  down DISABLED
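The switch output above shows all four ports administratively down with the description DISABLED. A minimal sketch of checking that automatically from pasted output follows. The function name `not_disabled` is hypothetical, and the parsing assumes the `Interface Admin Link Description` column layout shown above, with a single-word description.

```python
def not_disabled(output):
    """Return interfaces that are not admin-down with description DISABLED."""
    offenders = []
    for line in output.splitlines():
        fields = line.split()
        # Only data rows have four fields and start with an interface name;
        # prompts, headers, and "{master:2}[edit]" lines are skipped.
        if len(fields) != 4 or not fields[0].startswith("ge-"):
            continue
        port, admin, _link, description = fields
        if admin != "down" or description != "DISABLED":
            offenders.append(port)
    return offenders


if __name__ == "__main__":
    sample = """\
papaul@asw-b-codfw# run show interfaces ge-5/0/8 descriptions
Interface       Admin Link Description
ge-5/0/8        down  down DISABLED
"""
    print(not_disabled(sample))
```

An empty list means every interface row found in the output is disabled; any port still in its pre-decommission state is returned by name.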

Change 500637 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt and production DNS entries for labtestvirt200[1-2]

https://gerrit.wikimedia.org/r/500637

Change 500637 merged by Arturo Borrero Gonzalez:
[operations/dns@master] DNS: Remove mgmt and production DNS entries for labtestvirt200[1-2]

https://gerrit.wikimedia.org/r/500637

This is complete.