
Decommission hafnium
Closed, Resolved (Public)

Description

This checklist can be copied and pasted into Phabricator hardware request tasks for reclaiming systems to spares or decommissioning them.

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry (replace with role(spare::system) if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) asw2-c-eqiad:ge-4/0/4
  • - remove all remaining puppet references (include role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate
  • - remove debmonitor entries on neodymium/sarin: sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${HOST_FQDN} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key

END NON-INTERRUPTIBLE STEPS
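
The Puppet and debmonitor cleanup named in the checklist could be scripted roughly as follows. This is a minimal sketch, not official tooling for this procedure: the function name `decom_cleanup_plan` is hypothetical, and it only prints the commands from the checklist so they can be reviewed before anyone runs them. The example FQDN is the one used in the DHCP/partman change on this task.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the cleanup commands from the
# checklist for a single host, so they can be reviewed first.
decom_cleanup_plan() {
    host_fqdn="${1:-}"
    if [ -z "$host_fqdn" ]; then
        echo "usage: decom_cleanup_plan HOST_FQDN" >&2
        return 1
    fi
    # Puppet cleanup, as named in the checklist.
    echo "puppet node clean ${host_fqdn}"
    echo "puppet node deactivate ${host_fqdn}"
    # debmonitor removal, using the URL and cert paths given in the checklist.
    echo "sudo curl -X DELETE https://debmonitor.discovery.wmnet/hosts/${host_fqdn} --cert /etc/debmonitor/ssl/cert.pem --key /etc/debmonitor/ssl/server.key"
}

decom_cleanup_plan hafnium.eqiad.wmnet
```

Printing instead of executing is a deliberate choice here: these steps are not meant to be interrupted, so it seems safer to confirm the host name before any of them run.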

  • - system disks wiped (by onsite)
  • - IF DECOM: system unracked and decommissioned (by onsite), update racktables with result
  • - IF DECOM: switch port configuration removed from switch once system is unracked.
  • - IF DECOM: add system to decommission tracking google sheet
  • - IF DECOM: mgmt dns entries removed.

Event Timeline

Imarlier triaged this task as Normal priority. Apr 30 2018, 6:27 PM
Imarlier created this task.
Imarlier updated the task description. Apr 30 2018, 6:31 PM

Changeset to remove service group/hiera/etc, and for site.pp: https://gerrit.wikimedia.org/r/#/c/429825/

I don't have permission to downtime services in icinga, unfortunately, so I need someone to help me out with that.

Once that change is merged, perf will no longer have root, so either we'll need to coordinate to shut down prod services before puppet actually runs, or someone with global root will need to run these commands to disable the running services on the host:

  • sudo service navtiming stop
  • sudo service statsv stop

(Doing that now isn't going to help, since puppet will just restart them on next run.)
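
The two stop commands above could be wrapped in a small loop. This is a sketch only: the function name `stop_services` and the `DRY_RUN` variable are hypothetical, and by default it prints the commands rather than running them. As noted above, stopping the services only sticks once puppet no longer manages them on the host.

```shell
#!/bin/sh
# Hypothetical wrapper around the two commands listed above.
# DRY_RUN=1 (the default here) prints each command instead of running it.
stop_services() {
    for svc in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sudo service ${svc} stop"
        else
            sudo service "${svc}" stop
        fi
    done
}

stop_services navtiming statsv
```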

Dzahn claimed this task. Apr 30 2018, 6:53 PM
Imarlier updated the task description. Apr 30 2018, 6:54 PM
Dzahn updated the task description. Apr 30 2018, 6:57 PM
Dzahn added a comment. Apr 30 2018, 7:02 PM

14:57 < mutante> [einsteinium:~] $ sudo icinga-downtime -h hafnium -r "T193420"

14:58 < mutante> marlier: downtimed in icinga, merging your change

15:01 < mutante> !log hafnium - sudo service navtiming stop; sudo service statsv stop - downtimed in icinga, decom

Dzahn updated the task description. Apr 30 2018, 7:02 PM

Change 429858 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP/partman: remove hafnium.eqiad.wmnet

https://gerrit.wikimedia.org/r/429858

Change 429858 merged by Dzahn:
[operations/puppet@production] DHCP/partman: remove hafnium.eqiad.wmnet

https://gerrit.wikimedia.org/r/429858

Dzahn removed Dzahn as the assignee of this task. Apr 30 2018, 7:14 PM
Dzahn edited projects, added ops-eqiad; removed Patch-For-Review.
Dzahn added a subscriber: Dzahn.
Imarlier moved this task from Inbox to Radar on the Performance-Team board. Apr 30 2018, 7:15 PM
Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.

From here: @Cmjohnson you can continue on the ticket

Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. May 2 2018, 2:33 PM
Vvjjkkii renamed this task from Decommission hafnium to jydaaaaaaa. Jul 1 2018, 1:13 AM
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: Aklapper, gerritbot.
CommunityTechBot renamed this task from jydaaaaaaa to Decommission hafnium. Jul 2 2018, 2:57 PM
CommunityTechBot lowered the priority of this task from High to Normal.
CommunityTechBot updated the task description.
CommunityTechBot added subscribers: Aklapper, gerritbot.
RobH claimed this task. Aug 10 2018, 10:14 PM
RobH updated the task description.
RobH updated the task description. Aug 10 2018, 10:17 PM

Change 452014 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom hafnium prod dns entries

https://gerrit.wikimedia.org/r/452014

Change 452016 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] hafnium decom

https://gerrit.wikimedia.org/r/452016

Change 452014 merged by RobH:
[operations/dns@master] decom hafnium prod dns entries

https://gerrit.wikimedia.org/r/452014

Change 452016 merged by RobH:
[operations/puppet@production] hafnium decom

https://gerrit.wikimedia.org/r/452016

RobH reassigned this task from RobH to Cmjohnson. Aug 10 2018, 10:38 PM
RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH moved this task from Backlog to pending onsite steps (eqiad) on the decommission board.
RobH added a subscriber: RobH.

Change 454320 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt dns for decom host hafnium

https://gerrit.wikimedia.org/r/454320

Change 454320 merged by Cmjohnson:
[operations/dns@master] Removing mgmt dns for decom host hafnium

https://gerrit.wikimedia.org/r/454320

Cmjohnson closed this task as Resolved. Aug 21 2018, 5:08 PM
Cmjohnson updated the task description.