decommission: labtestvirt200[12].codfw.wmnet
Closed, Resolved · Public

Description

The first 5 steps should be completed by the service owner who is returning the server to DC-ops (for reclaim to spares or decommissioning, depending on server configuration and age).

Decommission note: These two hosts are both 5+ years old and are therefore slated for decommission and disposal rather than being returned to spares.
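The task title refers to both hosts with the bracket shorthand labtestvirt200[12].codfw.wmnet. As a minimal sketch of how that shorthand maps to individual FQDNs, the hypothetical helper below (`expand_hosts` is not part of any Wikimedia tooling) expands a single bracket group. It assumes only the two forms that appear on this task: a list of single characters such as [12], and a single-digit range such as [1-2].

```python
from __future__ import annotations

import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracket group in a host pattern into individual names.

    Hypothetical helper for illustration only. Supports "[12]" (character
    list) and "[1-2]" (single-digit range); anything else is returned as-is.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", pattern)
    if match is None:
        return [pattern]
    prefix, body, suffix = match.groups()
    digit_range = re.fullmatch(r"(\d)-(\d)", body)
    if digit_range:
        start, end = int(digit_range.group(1)), int(digit_range.group(2))
        members = [str(d) for d in range(start, end + 1)]
    else:
        members = list(body)
    return [f"{prefix}{member}{suffix}" for member in members]
```

For example, expand_hosts("labtestvirt200[12].codfw.wmnet") yields the two host names used in the per-host checklists below.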

labtestvirt2001

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on the steps below.

Steps for DC-Ops:

The following steps must not be interrupted, as doing so would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (including role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

labtestvirt2002

Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp, replace with role(spare::system)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for follow-up on the steps below.

Steps for DC-Ops:

The following steps must not be interrupted, as doing so would leave the system in an unfinished state.

Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove all remaining puppet references (including role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update netbox with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
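For reference, the DebMonitor removal step in the checklists above is a plain HTTPS DELETE authenticated with a client certificate. The sketch below mirrors the curl command from the checklist using only the Python standard library; the URL and certificate paths are copied from that command, while the function names are hypothetical. It is not how wmf-decommission-host implements the step, and it would need to run as a user able to read the key file (the checklist uses sudo).

```python
from __future__ import annotations

import ssl
import urllib.request

# Values taken from the curl command in the checklist.
DEBMONITOR_BASE = "https://debmonitor.discovery.wmnet/hosts/"
CERT = "/etc/debmonitor/ssl/cert.pem"
KEY = "/etc/debmonitor/ssl/server.key"


def build_delete_request(host_fqdn: str) -> urllib.request.Request:
    """Build the DELETE request for one host (hypothetical helper)."""
    return urllib.request.Request(DEBMONITOR_BASE + host_fqdn, method="DELETE")


def delete_host(host_fqdn: str) -> int:
    """Send the DELETE with the client certificate and return the HTTP status.

    urlopen raises urllib.error.HTTPError for error responses, so a normal
    return means the server accepted the request.
    """
    context = ssl.create_default_context()
    context.load_cert_chain(certfile=CERT, keyfile=KEY)
    request = build_delete_request(host_fqdn)
    with urllib.request.urlopen(request, context=context) as response:
        return response.status
```

Calling delete_host("labtestvirt2001.codfw.wmnet") on a host with access to the certificate would correspond to the curl invocation with HOST_FQDN set to that name.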

Event Timeline

Restricted Application added a project: Operations. · Mar 11 2019, 12:21 PM
aborrero triaged this task as Medium priority. · Mar 11 2019, 1:04 PM
aborrero updated the task description.
aborrero updated the task description.
aborrero moved this task from Backlog to Decommission on the ops-codfw board.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
aborrero updated the task description. · Mar 11 2019, 1:26 PM

Mentioned in SAL (#wikimedia-operations) [2019-03-11T13:28:02Z] <arturo> disable active checks in icinga for labtestvirt200[12] (T218023)

aborrero updated the task description. · Mar 11 2019, 1:29 PM

Change 497293 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: decommision several codfw servers

https://gerrit.wikimedia.org/r/497293

aborrero renamed this task from Hardware decommission: labtestvirt200[12].codfw.wmnet to decommission: labtestvirt200[12].codfw.wmnet. · Mar 18 2019, 1:09 PM
aborrero updated the task description.
aborrero added subscribers: RobH, Papaul.

Change 497293 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: decommision several codfw servers

https://gerrit.wikimedia.org/r/497293

Change 497293 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] wmcs: decommision several codfw servers

https://gerrit.wikimedia.org/r/497293

aborrero reassigned this task from aborrero to RobH. · Mar 21 2019, 5:11 PM
aborrero updated the task description.
aborrero updated the task description.

Mentioned in SAL (#wikimedia-operations) [2019-03-26T16:05:34Z] <robh> decom of labtestvirt200[12] started via T218023

RobH added a comment. · Mar 26 2019, 4:06 PM

asw-b-codfw:

ge-5/0/8 up down labtestvirt2002-eth0
ge-5/0/17 up down labtestvirt2001-eth0
ge-5/0/30 up down labtestvirt2002-eth1
ge-5/0/31 up down labtestvirt2001-eth1
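The port list above is what later steps rely on when the switch configuration is removed, so it can help to have it grouped per host. The following is a small sketch, assuming the four columns are interface, admin state, link state and description (the same layout as the `show interfaces ... descriptions` output later in this task). The function name is hypothetical.

```python
from __future__ import annotations

from collections import defaultdict

# Port listing copied from the comment above (asw-b-codfw).
PORTS = """\
ge-5/0/8 up down labtestvirt2002-eth0
ge-5/0/17 up down labtestvirt2001-eth0
ge-5/0/30 up down labtestvirt2002-eth1
ge-5/0/31 up down labtestvirt2001-eth1
"""


def ports_by_host(text: str) -> dict[str, list[tuple[str, str]]]:
    """Group switch ports by host: {host: [(interface, nic), ...]}."""
    grouped: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        interface, _admin, _link, description = fields
        # Descriptions look like "<host>-<nic>", e.g. "labtestvirt2001-eth0".
        host, _, nic = description.rpartition("-")
        grouped[host].append((interface, nic))
    return dict(grouped)
```

With the listing above, each host maps to two ports on asw-b-codfw, one for eth0 and one for eth1.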

wmf-decommission-host was executed by robh for labtestvirt2001.codfw.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for labtestvirt2002.codfw.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

RobH updated the task description.

Change 499235 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom of labtestvirt200[12]

https://gerrit.wikimedia.org/r/499235

Change 499235 merged by RobH:
[operations/puppet@production] decom of labtestvirt200[12]

https://gerrit.wikimedia.org/r/499235

Change 499237 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom labtestvirt200[12] prod dns

https://gerrit.wikimedia.org/r/499237

Change 499237 merged by RobH:
[operations/dns@master] decom labtestvirt200[12] prod dns

https://gerrit.wikimedia.org/r/499237

RobH reassigned this task from RobH to Papaul. · Mar 26 2019, 5:47 PM
RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH moved this task from Backlog to pending onsite steps (codfw) on the decommission board.
This comment was removed by Papaul.
Papaul updated the task description. · Apr 1 2019, 11:49 PM
papaul@asw-b-codfw# run show interfaces ge-5/0/8 descriptions     
Interface       Admin Link Description
ge-5/0/8        down  down DISABLED

{master:2}[edit]
papaul@asw-b-codfw# run show interfaces ge-5/0/17 descriptions   
Interface       Admin Link Description
ge-5/0/17       down  down DISABLED

{master:2}[edit]
papaul@asw-b-codfw# run show interfaces ge-5/0/30 descriptions    
Interface       Admin Link Description
ge-5/0/30       down  down DISABLED

{master:2}[edit]
papaul@asw-b-codfw# run show interfaces ge-5/0/31 descriptions    
Interface       Admin Link Description
ge-5/0/31       down  down DISABLED
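The switch output above confirms that all four ports recorded earlier are now administratively down and labelled DISABLED. A check like that could be scripted roughly as follows. This is a sketch that assumes the pasted output format shown above (prompt lines, a header row, then one data row per interface); the function name is hypothetical and nothing here talks to a switch.

```python
from __future__ import annotations

# Output copied from the comment above, with trailing whitespace removed.
OUTPUT = """\
papaul@asw-b-codfw# run show interfaces ge-5/0/8 descriptions
Interface       Admin Link Description
ge-5/0/8        down  down DISABLED

{master:2}[edit]
papaul@asw-b-codfw# run show interfaces ge-5/0/17 descriptions
Interface       Admin Link Description
ge-5/0/17       down  down DISABLED

{master:2}[edit]
papaul@asw-b-codfw# run show interfaces ge-5/0/30 descriptions
Interface       Admin Link Description
ge-5/0/30       down  down DISABLED

{master:2}[edit]
papaul@asw-b-codfw# run show interfaces ge-5/0/31 descriptions
Interface       Admin Link Description
ge-5/0/31       down  down DISABLED
"""


def disabled_status(output: str) -> dict[str, bool]:
    """Map each interface to True if it is admin down and described as DISABLED."""
    status: dict[str, bool] = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly four columns and start with the interface name;
        # prompt lines, header rows and "{master:2}[edit]" are skipped.
        if len(fields) == 4 and fields[0].startswith("ge-"):
            interface, admin, _link, description = fields
            status[interface] = admin == "down" and description == "DISABLED"
    return status
```

Comparing the keys of the result against the ports noted earlier on the task is a quick way to spot a port that was missed.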
Papaul updated the task description. · Apr 2 2019, 12:02 AM

Change 500637 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt and production DNS entries for labtestvirt200[1-2]

https://gerrit.wikimedia.org/r/500637

Papaul updated the task description. · Apr 2 2019, 12:09 AM

Change 500637 merged by Arturo Borrero Gonzalez:
[operations/dns@master] DNS: Remove mgmt and production DNS entries for labtestvirt200[1-2]

https://gerrit.wikimedia.org/r/500637

Papaul closed this task as Resolved. · Apr 2 2019, 2:37 PM

This is complete.