
Decommission hafnium
Closed, ResolvedPublic

Description

This checklist can be copied and pasted into Phabricator hardware-request tasks for reclaiming systems to spare or decommissioning them.

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry (replace with role(spare::system) if the system isn't shut down immediately during this process)
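
Before the cleanup change is merged, it can help to confirm that nothing in the puppet/hiera/dsh config still mentions the host. A minimal sketch, assuming a local checkout of operations/puppet and GNU grep/find; the helper name `find_host_refs` is hypothetical, not an existing WMF tool:

```shell
#!/bin/sh
# find_host_refs HOST DIR
# Hypothetical helper: report anything under DIR (e.g. a local
# operations/puppet checkout) that still refers to HOST.
# Exit status: 0 = references remain, non-zero = none found.
find_host_refs() {
    local host="$1" dir="$2" status names
    # File contents mentioning the host as a whole word (fixed string, not regex).
    grep -rnwF --exclude-dir=.git -e "$host" "$dir"
    status=$?
    # Files whose names contain the host, e.g. per-host hiera files.
    names="$(find "$dir" -path "$dir/.git" -prune -o -name "*$host*" -print)"
    if [ -n "$names" ]; then
        printf '%s\n' "$names"
    fi
    # Succeed if either search found something.
    [ "$status" -eq 0 ] || [ -n "$names" ]
}
```

Checking file names as well as contents matters because a per-host file may never repeat the hostname inside it. After the change merges, a non-zero exit status means no references were found.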

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) asw2-c-eqiad:ge-4/0/4
  • - remove all remaining puppet references (including role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key

END NON-INTERRUPTIBLE STEPS
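
The non-interruptible steps run on several different machines, so a dry-run printer that fills in the host, switch, and port can help keep them in checklist order. A sketch under stated assumptions: power-down via `shutdown` on the host, Junos configuration-mode syntax on the switch, and `puppet node clean`/`deactivate` run on the puppetmaster. `decom_plan` is a hypothetical name; only the debmonitor command is taken verbatim from the checklist:

```shell
#!/bin/sh
# decom_plan FQDN SWITCH PORT
# Hypothetical dry-run helper: prints the non-interruptible steps for one
# host in checklist order. It executes nothing; each printed command is
# meant to be run by hand on the machine named in the comment above it.
decom_plan() {
    local fqdn="$1" switch="$2" port="$3"
    cat <<EOF
# 1. on ${fqdn}: stop puppet so it cannot restart anything
sudo puppet agent --disable "decom ${fqdn}"
# 2. on ${fqdn}: power down (assumed; could also be done via mgmt)
sudo shutdown -h now
# 3. on ${switch}, Junos configuration mode (assumed): disable the port
set interfaces ${port} disable
commit
# 4. note on the task: ${switch}:${port}
# 5. merge the changes removing puppet references and production dns entries
# 6. on the puppetmaster: purge the node
sudo puppet node clean ${fqdn}
sudo puppet node deactivate ${fqdn}
# 7. on neodymium/sarin: remove the debmonitor entry
sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${fqdn} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key
EOF
}

# Example: the values recorded on this task.
decom_plan hafnium.eqiad.wmnet asw2-c-eqiad ge-4/0/4
```

Printing rather than executing keeps the helper safe to run anywhere; the operator still controls when each irreversible step happens.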

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to the decommission tracking Google Sheet
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

Imarlier created this task.

Changeset to remove the service group/hiera/etc. config and the site.pp entry: https://gerrit.wikimedia.org/r/#/c/429825/

I don't have permission to downtime services in icinga, unfortunately, so I need someone to help me out with that.

Once that change is merged, perf will no longer have root, so either we'll need to coordinate shutting down prod services before puppet actually runs, or someone with global root will need to run these commands to disable the running services on the host:

  • sudo service navtiming stop
  • sudo service statsv stop

(Doing that now isn't going to help, since puppet will just restart them on next run.)
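
Because puppet restarts the services on its next run, one way to make a manual stop stick is to disable the agent first, as the non-interruptible steps do. A minimal dry-run sketch, assuming `puppet agent --disable` is acceptable on the host; `stop_services_cmds` is a hypothetical helper that only prints commands:

```shell
#!/bin/sh
# stop_services_cmds REASON SERVICE...
# Hypothetical helper: print (not run) the commands that first disable the
# puppet agent, so it cannot restart the services, then stop each service.
stop_services_cmds() {
    local svc
    printf 'sudo puppet agent --disable "%s"\n' "$1"
    shift
    for svc in "$@"; do
        printf 'sudo service %s stop\n' "$svc"
    done
}

stop_services_cmds "T193420: decom hafnium" navtiming statsv
```

Putting the task ID in the reason string is a design choice: Puppet reports the disable message when it skips a run, so anyone who finds the agent disabled can trace it back to this task.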

14:57 < mutante> [einsteinium:~] $ sudo icinga-downtime -h hafnium -r "T193420"

14:58 < mutante> marlier: downtimed in icinga, merging your change

15:01 < mutante> !log hafnium - sudo service navtiming stop; sudo service statsv stop - downtimed in icinga, decom

Change 429858 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP/partman: remove hafnium.eqiad.wmnet

https://gerrit.wikimedia.org/r/429858

Change 429858 merged by Dzahn:
[operations/puppet@production] DHCP/partman: remove hafnium.eqiad.wmnet

https://gerrit.wikimedia.org/r/429858

Dzahn removed Dzahn as the assignee of this task. Apr 30 2018, 7:14 PM
Dzahn edited projects, added ops-eqiad; removed Patch-For-Review.
Dzahn subscribed.

From here, @Cmjohnson, you can continue on the ticket.

RobH updated the task description.

Change 452014 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom hafnium prod dns entries

https://gerrit.wikimedia.org/r/452014

Change 452016 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] hafnium decom

https://gerrit.wikimedia.org/r/452016

Change 452014 merged by RobH:
[operations/dns@master] decom hafnium prod dns entries

https://gerrit.wikimedia.org/r/452014

Change 452016 merged by RobH:
[operations/puppet@production] hafnium decom

https://gerrit.wikimedia.org/r/452016

RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH moved this task from Backlog to pending onsite steps (eqiad) on the decommission-hardware board.
RobH subscribed.

Change 454320 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns for decom host hafnium

https://gerrit.wikimedia.org/r/454320

Change 454320 merged by Cmjohnson:
[operations/dns@master] Removing mgmt dns for decom host hafnium

https://gerrit.wikimedia.org/r/454320

Cmjohnson updated the task description.