
Decommission stat1003.eqiad.wmnet
Closed, Resolved · Public

Description

This checklist can be copied and pasted into Phabricator hardware-request tasks for reclaiming systems to spares or decommissioning them.

  • - all system services confirmed offline from production use (T173094)
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp entry (replace with role::spare if the system isn't shut down immediately during this process)

We have already migrated all the data (home dirs, etc.) to stat1006, so in theory we could proceed with the decom straight away, but I'd wait until Monday (Sept 11th) before starting, to allow for any last-minute requests from the old stat1003 users.

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (include role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) - asw-c-eqiad:ge-4/0/19
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked.
  • - mgmt dns entries removed.
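
As a rough illustration of the non-interruptible block above, here is a minimal dry-run sketch that only builds the list of steps and runs nothing. The `Step` type and the `non_interruptible_steps` helper are hypothetical names introduced for this example. The task does not record the exact command lines that were used, so the mapping from checklist items to `puppet agent --disable`, `poweroff`, `puppet node clean`, `puppet node deactivate` and `salt-key -d` is an assumption, as is the machine each command would be run on.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    where: str    # machine or system the step applies to (assumed, not from the task)
    command: str  # command line to run, or a MANUAL note for human-only steps


def non_interruptible_steps(fqdn: str, task: str) -> List[Step]:
    """Build a dry-run plan for the non-interruptible decom steps.

    Hypothetical helper: it only returns the plan in checklist order
    and never executes anything.
    """
    reason = f"decom {task}"
    return [
        Step(fqdn, f'puppet agent --disable "{reason}"'),
        Step("puppet repo", "MANUAL: remove all remaining puppet references"),
        Step(fqdn, "poweroff"),
        Step("switch", "MANUAL: disable switch port and note it on the task"),
        Step("dns repo", "MANUAL: remove production dns entries"),
        Step("puppetmaster", f"puppet node clean {fqdn}"),
        Step("puppetmaster", f"puppet node deactivate {fqdn}"),
        Step("salt master", f"salt-key -d {fqdn}"),
    ]


if __name__ == "__main__":
    for number, step in enumerate(
        non_interruptible_steps("stat1003.eqiad.wmnet", "T175150"), start=1
    ):
        print(f"{number}. [{step.where}] {step.command}")
```

Keeping the plan as data rather than executing it directly makes it easy to paste into the task for review before anyone starts the block, which matters because these steps are meant to be completed in one go.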

Event Timeline

elukey created this task. Sep 6 2017, 12:33 PM
Restricted Application added a subscriber: Aklapper. Sep 6 2017, 12:33 PM
elukey updated the task description. Sep 6 2017, 12:38 PM
Dzahn triaged this task as Medium priority. Sep 7 2017, 2:37 PM
Cmjohnson moved this task from Backlog to Decommission on the ops-eqiad board. Sep 7 2017, 4:01 PM
Nuria moved this task from Next Up to Parent Tasks on the Analytics-Kanban board. Sep 14 2017, 4:33 PM
elukey updated the task description. Sep 15 2017, 8:13 AM

I removed puppet/salt credentials, powered down and wiped puppet, but didn't proceed any further since I didn't want to interfere with the DC-Ops procedure. I was under the impression that even the non-interruptible steps could be done without DC-Ops supervision, but I might have misinterpreted the procedure (if so, I am really sorry about that).

Halfak added a subscriber: Halfak. Oct 27 2017, 5:24 PM

Please do not wipe these disks until T179189: Locate data from /srv on stat1003 is resolved.

@Cmjohnson it looks like there were some files that @Halfak needs which I did not fully sync over from stat1003 to stat1006. I'm going to try to power stat1003 back on and see if I can find what @Halfak is missing.

@Ottomata okay, please update the task when it's okay to wipe. Thanks.

Ok, I've saved halfak's data. This should be ok to wipe.

@Cmjohnson, stat1003 is still powered up. I can shut it down (sudo poweroff?), or let you do it. (I'm not sure if you have some more specific way of shutting down a machine for decom.)

Dzahn added a subscriber: Dzahn. Edited Dec 6 2017, 10:34 PM

While decommissioning Ganglia from everything (T177225), I noticed this host was up and running but not in site.pp and had no roles. It's gone from Icinga but still running, which we try to avoid, because such hosts don't get security upgrades but are also not fully down. I am adding it back to site.pp with role(spare) per the decom workflow, because this prevents a couple of issues: Ganglia will be removed and the host will keep getting security upgrades.

Change 395874 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add stat1003 ghost back to site

https://gerrit.wikimedia.org/r/395874

Change 395874 merged by Dzahn:
[operations/puppet@production] site: add stat1003 ghost back to site

https://gerrit.wikimedia.org/r/395874

Mentioned in SAL (#wikimedia-operations) [2017-12-06T22:55:25Z] <mutante> stat1003 - re-enabled puppet after putting role::spare on it (T175150)

RobH claimed this task. Feb 8 2018, 10:51 PM
RobH edited projects, added hardware-requests; removed Patch-For-Review.
RobH updated the task description. Feb 8 2018, 10:53 PM

Change 409180 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] stat1003 decom

https://gerrit.wikimedia.org/r/409180

Change 409181 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom stat1003 prod dns entries

https://gerrit.wikimedia.org/r/409181

Change 409180 merged by RobH:
[operations/puppet@production] stat1003 decom

https://gerrit.wikimedia.org/r/409180

Change 409181 merged by RobH:
[operations/dns@master] decom stat1003 prod dns entries

https://gerrit.wikimedia.org/r/409181

RobH reassigned this task from RobH to Cmjohnson. Feb 8 2018, 10:59 PM
RobH removed a project: Patch-For-Review.
RobH updated the task description.
RobH moved this task from Backlog to Reclaim (Spares/Decommission) on the hardware-requests board.
RobH added a subscriber: RobH.

Change 428719 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing mgmt entries stat1003

https://gerrit.wikimedia.org/r/428719

Change 428719 merged by Cmjohnson:
[operations/dns@master] Removing mgmt entries stat1003

https://gerrit.wikimedia.org/r/428719

Cmjohnson closed this task as Resolved. Apr 24 2018, 7:23 PM
Cmjohnson updated the task description.