
Return graphite100[13] to spares pool (or decom)
Closed, Resolved, Public

Description

These two machines are now unused, having been replaced by graphite1004. They can be returned to the spares pool at the beginning of December, which leaves some time to fall back to them if graphite1004 doesn't work for some reason.

This task will track the decommission-hardware of servers graphite1001 & graphite1003. Both of these systems are out of warranty and need to have their return to spares or disposal approved. Their warranties expired in January and February of 2018, so they were purchased in 2015. They are now both over 4 years old. @RobH coordinated with @faidon via IRC on 2019-02-14 and confirmed that we should decommission them.
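The task title uses the shorthand graphite100[13] for the two hosts named above. As a minimal sketch, assuming the brackets are read as a glob-style character class (one host per character), the expansion could look like this. The helper name `expand_hosts` is hypothetical and is not part of any tooling mentioned on this task.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand a single glob-style character class, e.g. 'graphite100[13]'.

    Assumption: each character inside the brackets yields one hostname.
    Patterns without brackets are returned unchanged.
    """
    match = re.fullmatch(r"(.*)\[([^\[\]]+)\](.*)", pattern)
    if match is None:
        return [pattern]
    prefix, chars, suffix = match.groups()
    return [f"{prefix}{char}{suffix}" for char in chars]


if __name__ == "__main__":
    print(expand_hosts("graphite100[13]"))
```

Note that this differs from range-style host sets, where a comma would be needed between 1 and 3; the sketch only covers the reading used in this task's title.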

The first 5 steps should be completed by the service owner that is returning the server to DC-ops (for reclaim to spare or decommissioning, dependent on server configuration and age.)

graphite1001:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role::spare::system if system isn't shut down immediately during this process.)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as it will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) asw2-c-eqiad:ge-4/0/6
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
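The debmonitor removal step in the checklist above substitutes the host's FQDN into a fixed curl command. As a rough sketch, the command line could be assembled like this; the URL and certificate paths are copied from the checklist, while the function name `debmonitor_delete_cmd` is hypothetical. The sketch only builds and prints the command and does not run it, since the checklist notes the step is handled by wmf-decommission-host.

```python
import shlex

# Values copied from the checklist step; not verified against a live system.
DEBMONITOR_HOSTS_URL = "https://debmonitor.discovery.wmnet/hosts/"
CERT_PATH = "/etc/debmonitor/ssl/cert.pem"
KEY_PATH = "/etc/debmonitor/ssl/server.key"


def debmonitor_delete_cmd(host_fqdn: str) -> list[str]:
    """Return the argv for the DELETE request named in the checklist."""
    return [
        "sudo", "curl", "-X", "DELETE",
        DEBMONITOR_HOSTS_URL + host_fqdn,
        "--cert", CERT_PATH,
        "--key", KEY_PATH,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line for one of the hosts on this task.
    print(shlex.join(debmonitor_delete_cmd("graphite1001.eqiad.wmnet")))
```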

graphite1003:
Steps for service owner:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role::spare::system if system isn't shut down immediately during this process.)
  • - unassign service owner from this task, check off completed steps, and assign to @RobH for followup on below steps.

Steps for DC-Ops:
The following steps cannot be interrupted, as it will leave the system in an unfinished state.
Start non-interrupt steps:

  • - disable puppet on host
  • - power down host
  • - update netbox status to Inventory (if decom) or Planned (if spare)
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) asw-a-eqiad:ge-3/0/15
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate (handled by wmf-decommission-host)
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key (handled by wmf-decommission-host)

End non-interrupt steps.

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.
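Both checklists branch on whether a host is decommissioned or returned to spares: the netbox status differs, and the "IF DECOM" steps apply only in the first case. A minimal sketch of that branching follows; the step texts are taken from the checklists above, and the `Step` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    text: str
    decom_only: bool = False  # True for steps prefixed "IF DECOM" in the checklist


# Steps that follow the non-interrupt block, as listed in the checklists.
ONSITE_STEPS = [
    Step("system disks wiped (by onsite)"),
    Step("system unracked and decommissioned (by onsite), update racktables with result", decom_only=True),
    Step("switch port configuration removed from switch once system is unracked", decom_only=True),
    Step("add system to decommission tracking google sheet", decom_only=True),
    Step("mgmt dns entries removed", decom_only=True),
]


def netbox_status(decom: bool) -> str:
    """Netbox status per the checklist: Inventory if decom, Planned if spare."""
    return "Inventory" if decom else "Planned"


def onsite_steps(decom: bool) -> list[str]:
    """Return the steps that apply for the chosen outcome."""
    return [step.text for step in ONSITE_STEPS if decom or not step.decom_only]


if __name__ == "__main__":
    for decom in (True, False):
        print(netbox_status(decom), onsite_steps(decom))
```

For the two hosts on this task the decom branch applies, so all five steps are in scope.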

Event Timeline

Restricted Application added a subscriber: Aklapper. Nov 13 2018, 1:24 PM
colewhite triaged this task as Low priority. Nov 13 2018, 4:57 PM

Change 477269 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] site: use spare::system for old graphite hosts

https://gerrit.wikimedia.org/r/477269

Change 477269 merged by Filippo Giunchedi:
[operations/puppet@production] site: use spare::system for old graphite hosts

https://gerrit.wikimedia.org/r/477269

Change 477601 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] remove graphite1001, graphite1003 from DHCP

https://gerrit.wikimedia.org/r/477601

Change 477602 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] switch graphite host for dev_cluster from graphite1003 to graphite1004

https://gerrit.wikimedia.org/r/477602

Change 477604 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] switch graphite host for prod cassandra to graphite1004

https://gerrit.wikimedia.org/r/477604

Dzahn added a subscriber: Dzahn. Dec 4 2018, 6:22 PM

we should also switch the state of these spare graphite hosts in netbox

Change 477601 merged by Dzahn:
[operations/puppet@production] remove graphite1001, graphite1003 from DHCP

https://gerrit.wikimedia.org/r/477601

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board. Dec 20 2018, 10:20 AM

Change 477602 merged by Dzahn:
[operations/puppet@production] switch graphite host for dev_cluster from graphite1003 to 'none'

https://gerrit.wikimedia.org/r/477602

Change 477604 merged by Dzahn:
[operations/puppet@production] switch graphite host for prod cassandra from graphite1003 to 'none'

https://gerrit.wikimedia.org/r/477604

fgiunchedi updated the task description. Dec 24 2018, 8:51 AM
fgiunchedi assigned this task to RobH. Dec 24 2018, 11:51 AM
fgiunchedi added a subscriber: RobH.

@RobH graphite100[13] confirmed ready for wipe/decom

fgiunchedi moved this task from Radar to Other on the User-fgiunchedi board. Jan 2 2019, 10:37 AM
RobH updated the task description.
RobH added a subscriber: faidon.
RobH updated the task description. Feb 14 2019, 6:30 PM
RobH updated the task description. Feb 14 2019, 6:35 PM
RobH updated the task description. Feb 14 2019, 6:42 PM

wmf-decommission-host was executed by robh for graphite1001.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

wmf-decommission-host was executed by robh for graphite1003.eqiad.wmnet and performed the following actions:

  • Revoked Puppet certificate
  • Removed from PuppetDB
  • Downtimed host on Icinga
  • Downtimed mgmt interface on Icinga
  • Removed from DebMonitor

Change 490669 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom graphite100[13] prod dns

https://gerrit.wikimedia.org/r/490669

Change 490672 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom graphite100[12]

https://gerrit.wikimedia.org/r/490672

Change 490672 merged by RobH:
[operations/puppet@production] decom graphite100[12]

https://gerrit.wikimedia.org/r/490672

Change 490669 merged by RobH:
[operations/dns@master] decom graphite100[13] prod dns

https://gerrit.wikimedia.org/r/490669

RobH reassigned this task from RobH to Cmjohnson. Feb 14 2019, 6:49 PM
RobH edited projects, added ops-eqiad; removed Patch-For-Review.
RobH updated the task description.
RobH moved this task from Backlog to Decommission on the ops-eqiad board.
fgiunchedi moved this task from Other to Radar on the User-fgiunchedi board. Jul 1 2019, 12:51 PM
Cmjohnson added a subscriber: Cmjohnson.

Please wipe these, especially 1001, to make some space for ms-be servers.

Jclark-ctr updated the task description. Oct 11 2019, 10:44 PM
Jclark-ctr updated the task description. Nov 1 2019, 10:38 PM
Papaul added a subscriber: Papaul. Nov 5 2019, 9:09 PM
papaul@asw2-c-eqiad# show | compare 
[edit interfaces]
-   ge-4/0/6 {
-       description graphite1001;
-   }
Papaul updated the task description. Nov 5 2019, 9:10 PM

Change 548888 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/dns@master] DNS: Remove mgmt DNS with graphite100[13]

https://gerrit.wikimedia.org/r/548888

Change 548888 merged by Papaul:
[operations/dns@master] DNS: Remove mgmt DNS with graphite100[13]

https://gerrit.wikimedia.org/r/548888

Papaul closed this task as Resolved. Nov 5 2019, 9:18 PM
Papaul updated the task description.

complete