
Decommission mw1196
Closed, Resolved (Public)

Description

mw1196 failed to recover from the reboot. Several attempts were made to power it off, drain flea power, and reseat the DIMM. After talking with Faidon: these servers are already on the short list to be decommissioned, so it's been decided to decom this server sooner.

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry (replace with role::spare if system isn't shut down immediately during this process)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) asw-c-eqiad:ge-6/0/35
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked.
  • - mgmt dns entries removed.
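
For reference, the "puppet node clean, puppet node deactivate, salt key removed" step of the checklist could be sketched roughly as below. This is only a dry-run helper that builds and prints the commands; it does not run them. The FQDN `mw1196.eqiad.wmnet`, the assumption that the salt minion ID equals the FQDN, and the helper names `cleanup_commands` and `render` are illustrative and not taken from this task.

```python
import shlex


def cleanup_commands(fqdn):
    """Return the node-cleanup commands for one host, in checklist order.

    Assumes the puppet certname and the salt minion ID are both the FQDN.
    """
    return [
        ["puppet", "node", "clean", fqdn],
        ["puppet", "node", "deactivate", fqdn],
        ["salt-key", "-d", fqdn],
    ]


def render(commands):
    """Turn each argument list into a shell-safe command line for review."""
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    # Hypothetical FQDN; substitute the real one before using the output.
    for line in render(cleanup_commands("mw1196.eqiad.wmnet")):
        print(line)
```

Printing the commands rather than executing them keeps the person doing the decom in control of these steps, which sit inside the non-interruptible block.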

Event Timeline

Restricted Application added a subscriber: Aklapper. Jul 12 2017, 5:00 PM

Change 364918 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] decom mw1196

https://gerrit.wikimedia.org/r/364918

RobH updated the task description.

In the future, before starting any actual decommission steps, the checklist MUST be populated in the task description. This ensures we follow the proper procedure (which has been an issue in the past). Additionally, all decoms should have the hardware-requests project tagged.

I've corrected the above by adding the checklist as well as the project.

Change 364918 abandoned by Dzahn:
decom mw1196

https://gerrit.wikimedia.org/r/364918

RobH claimed this task. Jul 12 2017, 11:44 PM
RobH updated the task description.
RobH moved this task from Backlog to Reclaim (Spares/Decommission) on the hardware-requests board.
Dzahn removed a subscriber: Dzahn. Jul 12 2017, 11:46 PM
RobH moved this task from Backlog to Not urgent on the ops-eqiad board. Jul 18 2017, 9:31 PM

Change 366284 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decommission of mw1196

https://gerrit.wikimedia.org/r/366284

Change 366285 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom of mw1196

https://gerrit.wikimedia.org/r/366285

RobH updated the task description. Jul 19 2017, 4:43 PM

Change 366284 merged by RobH:
[operations/puppet@production] decommission of mw1196

https://gerrit.wikimedia.org/r/366284

Change 366285 merged by RobH:
[operations/dns@master] decom of mw1196

https://gerrit.wikimedia.org/r/366285

RobH reassigned this task from RobH to Cmjohnson. Jul 19 2017, 4:45 PM
RobH triaged this task as Medium priority.

This system is ready for disk wipe and the remaining decom steps.

Cmjohnson moved this task from Not urgent to Decommission on the ops-eqiad board. Jul 20 2017, 3:23 PM

Have the "non-interruptible steps" really been completed? mw1196 still has a salt key and shows up on https://servermon.wikimedia.org/hosts/

RobH added a comment. Jul 26 2017, 3:58 PM

Have the "non-interruptible steps" really been completed? mw1196 still has a salt key and shows up on https://servermon.wikimedia.org/hosts/

I thought they were, but the main non-interrupt is the switch port disable, which I've reconfirmed is done.

I've gone ahead and reviewed all the steps a second time to ensure completion. Thanks for catching this!

The puppet node clean/deactivate is now confirmed done (I've just run it), and I've re-confirmed the switch port deactivation along with the other steps I took.
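
A check like the one raised above (the host still holding a salt key and still listed in servermon) could be sketched as a small helper that reports which inventories still mention a host. How each host list is obtained is out of scope here; the function name `lingering_references`, the inventory names, and the sample hostnames other than mw1196 are made up for illustration.

```python
def lingering_references(host, inventories):
    """Return the names of inventories that still list `host`, sorted.

    `inventories` maps an inventory name to an iterable of hostnames.
    """
    return sorted(
        name for name, hosts in inventories.items() if host in set(hosts)
    )


if __name__ == "__main__":
    # Hypothetical sample data, not real output from any system.
    inventories = {
        "salt_keys": ["mw1195", "mw1196"],
        "servermon": ["mw1196", "mw1197"],
        "puppet_nodes": ["mw1195"],
    }
    print(lingering_references("mw1196", inventories))
```

An empty result would suggest the cleanup steps took effect; anything else names the places that still need attention.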

Cmjohnson closed this task as Resolved. Mar 28 2018, 5:52 PM
Cmjohnson updated the task description.