
decommission elastic1021
Closed, ResolvedPublic

Description

elastic1021 has a failed DIMM, recorded in T188595. The discovery team reviewed options with @RobH & @faidon and decided to simply decommission this particular server.

elastic1021:

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - remove site.pp (replace with role(spare::system) if system isn't shut down immediately during this process.)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal) - asw-c-eqiad:ge-4/0/35
  • - remove all remaining puppet references (including role::spare)
  • - remove production dns entries
  • - puppet node clean, puppet node deactivate

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked.
  • - add system to decommission tracking google sheet
  • - mgmt dns entries removed.
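
As an illustration of the "puppet node clean, puppet node deactivate" step above, here is a minimal dry-run sketch that only builds the two commands in order and prints them; it does not execute anything. The helper name `puppet_cleanup_commands` is hypothetical, and the FQDN `elastic1021.eqiad.wmnet` is an assumption (this task only gives the short hostname), so substitute the certname actually known to the puppet master.

```python
from typing import List


def puppet_cleanup_commands(fqdn: str) -> List[List[str]]:
    """Return the puppet cleanup commands for a decommissioned host, in order.

    Hypothetical helper: it only builds argument lists and never runs them.
    """
    if not fqdn or " " in fqdn:
        raise ValueError("fqdn must be a non-empty hostname without spaces")
    return [
        # Removes the node's certificate and cached data on the puppet master.
        ["puppet", "node", "clean", fqdn],
        # Marks the node as deactivated in PuppetDB.
        ["puppet", "node", "deactivate", fqdn],
    ]


if __name__ == "__main__":
    # Assumed FQDN; the task itself only names the host as elastic1021.
    for cmd in puppet_cleanup_commands("elastic1021.eqiad.wmnet"):
        print(" ".join(cmd))
```

Keeping the commands as data makes it easy to review the exact invocations before anyone runs them by hand on the puppet master.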

Event Timeline

RobH triaged this task as Medium priority. Mar 14 2018, 7:13 PM
RobH created this task.

So this system was already offline for memory testing, but I want to confirm with @Gehel we're good to start decommission, which includes wipe of data.

@Gehel: Can you confirm you don't need any information off this host? Additionally, please ensure no other configuration files expect it. I can take over after ' - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.'

Thanks!

Change 419702 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] elastic: decommission elastic1021

https://gerrit.wikimedia.org/r/419702

Preliminary decommissioning steps are done (pending the merge of https://gerrit.wikimedia.org/r/#/c/419702/). A few notes:

Since elastic1021 is down and does not want to restart, the services normally running on it have not been stopped / masked. Also, modifying the icinga checks would require hacking into puppetdb. The checks are silenced and will be purged during the puppet node cleanup phase.

From a service perspective, that host is banned from the cluster and marked as inactive in conftool. Even if it comes back from the dead for whatever reason, it should not cause any trouble.
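
For readers unfamiliar with banning a node: one common way to do it in Elasticsearch is the shard allocation exclusion setting, sent via `PUT /_cluster/settings`. The sketch below only builds that request body. Whether this exact setting was used here is an assumption, as are the helper name `ban_node_settings`, the use of `transient` settings, and the node name `elastic1021`.

```python
import json


def ban_node_settings(node_name: str) -> str:
    """Build a cluster-settings body that excludes one node from shard allocation.

    Hypothetical helper: it returns JSON only and does not contact any cluster.
    """
    body = {
        "transient": {
            # Shards are moved off, and never allocated to, nodes matching this name.
            "cluster.routing.allocation.exclude._name": node_name,
        }
    }
    return json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    # Assumed node name, taken from the short hostname in this task.
    print(ban_node_settings("elastic1021"))
```

Note that this setting holds a comma-separated list, so writing a single name replaces any exclusions already in place; check the current value before sending it.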

Thanks @Gehel, I'll steal and proceed from here!

Change 419702 merged by Gehel:
[operations/puppet@production] elastic: decommission elastic1021

https://gerrit.wikimedia.org/r/419702

Since the host is down, I cannot power it on and disable puppet. I have disabled the switch port, so if it does power on it will be fine.

Change 419811 had a related patch set uploaded (by RobH; owner: RobH):
[operations/puppet@production] decom elastic1021

https://gerrit.wikimedia.org/r/419811

Change 419811 merged by RobH:
[operations/puppet@production] decom elastic1021

https://gerrit.wikimedia.org/r/419811

Change 419813 had a related patch set uploaded (by RobH; owner: RobH):
[operations/dns@master] decom elastic1021 prod dns

https://gerrit.wikimedia.org/r/419813

RobH removed a project: Patch-For-Review.
RobH updated the task description.

This is now ready for disk wipe and unracking. Please note that since the system won't power on, you'll need to move the disks to a working system to wipe. (Perhaps another system you are already doing disk wipes on.)

Thanks!

Removing search backend team from this ticket, nothing left to do on our side.

Change 427420 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Removing dns for elastic1021

https://gerrit.wikimedia.org/r/427420

Change 427420 merged by Cmjohnson:
[operations/dns@master] Removing dns for elastic1021

https://gerrit.wikimedia.org/r/427420

Cmjohnson updated the task description.

Updated racktables and tracking sheet.

Change 419813 abandoned by RobH:
decom elastic1021 prod dns

https://gerrit.wikimedia.org/r/419813