Decommission old memcached hosts - mc1001->mc1018
Open, Normal, Public

Description

The old mc1001->mc1018 hosts need to be decommissioned (new nodes are already serving traffic).

  • - all system services confirmed offline from production use
  • - set all icinga checks to maint mode/disabled while reclaim/decommission takes place.
  • - remove system from all lvs/pybal active configuration
  • - any service group puppet/hiera/dsh config removed
  • - Set role::spare (system was not shut down)

START NON-INTERRUPTIBLE STEPS

  • - disable puppet on host
  • - remove all remaining puppet references (including role::spare)
  • - power down host
  • - disable switch port
  • - switch port assignment noted on this task (for later removal)
  • - remove production dns entries https://gerrit.wikimedia.org/r/#/c/346823/
  • - puppet node clean, puppet node deactivate, salt key removed

END NON-INTERRUPTIBLE STEPS

  • - system disks wiped (by onsite)
  • - system unracked and decommissioned (by onsite), update racktables with result
  • - switch port configuration removed from switch once system is unracked.
  • - mgmt dns entries removed.
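
A minimal sketch for tracking the non-interruptible steps per host, assuming the range mc1001->mc1018 is inclusive and uses short hostnames only. The helper names (expand_hosts, new_checklist, pending_hosts) are hypothetical and not part of any existing tooling; the step labels are paraphrased from the checklist above.

```python
from typing import Dict, List

# Step labels paraphrased from the non-interruptible section, in order.
NON_INTERRUPTIBLE_STEPS = [
    "disable puppet on host",
    "remove all remaining puppet references",
    "power down host",
    "disable switch port",
    "note switch port assignment on this task",
    "remove production dns entries",
    "puppet node clean, puppet node deactivate, salt key removed",
]


def expand_hosts(prefix: str, first: int, last: int) -> List[str]:
    """Return short hostnames prefix+N for N in first..last, inclusive."""
    if first > last:
        raise ValueError("first must not exceed last")
    return [f"{prefix}{n}" for n in range(first, last + 1)]


def new_checklist(hosts: List[str]) -> Dict[str, Dict[str, bool]]:
    """Build a per-host checklist with every step initially not done."""
    return {host: {step: False for step in NON_INTERRUPTIBLE_STEPS} for host in hosts}


def pending_hosts(state: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return hosts that still have at least one step not done."""
    return [host for host, steps in state.items() if not all(steps.values())]


if __name__ == "__main__":
    hosts = expand_hosts("mc", 1001, 1018)
    state = new_checklist(hosts)
    # Mark every step done for the first host only, as an illustration.
    for step in NON_INTERRUPTIBLE_STEPS:
        state["mc1001"][step] = True
    print(len(hosts), "hosts,", len(pending_hosts(state)), "still pending")
```

Since the steps must not be interrupted once started, a host is only dropped from the pending list when every step is marked done.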
elukey created this task. May 3 2017, 7:47 AM
Restricted Application added a subscriber: Aklapper. May 3 2017, 7:47 AM
Cmjohnson moved this task from Backlog to Not urgent on the ops-eqiad board. May 3 2017, 1:13 PM
Cmjohnson moved this task from Not urgent to Up next on the ops-eqiad board. May 9 2017, 3:55 PM
elukey updated the task description. May 17 2017, 8:55 AM
elukey moved this task from Backlog to Ops Backlog on the User-Elukey board. May 18 2017, 4:42 PM

Change 354453 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove any reference of mc1001->mc1018 for decom

https://gerrit.wikimedia.org/r/354453

elukey updated the task description. May 23 2017, 6:49 AM
elukey added a subscriber: Cmjohnson.

@Cmjohnson: The hosts are ready for the non-interruptible steps. Those include https://gerrit.wikimedia.org/r/354453, so I haven't merged it yet. Icinga alarms are off.

elukey assigned this task to Cmjohnson. May 23 2017, 6:50 AM
elukey moved this task from Ops Backlog to Stalled on the User-Elukey board. May 26 2017, 12:42 PM
Cmjohnson moved this task from Up next to Not urgent on the ops-eqiad board. May 30 2017, 4:32 PM
elukey moved this task from Stalled to Done on the User-Elukey board. Jun 3 2017, 5:55 AM
Cmjohnson moved this task from Not urgent to Decommission on the ops-eqiad board. Jul 20 2017, 3:24 PM
elukey added a comment. Aug 4 2017, 9:39 AM

Any news on this? :)

@elukey, @Joe, @Cmjohnson: to test the migration of the reimage script from salt to cumin, could I grab mc100[1-2] as test hosts in the next few days?

They already have the spare role in puppet, but let me know if you see any reason not to take them.

Mentioned in SAL (#wikimedia-operations) [2017-09-08T14:58:08Z] <volans> testing wmf-auto-reimage also on mc1002 T166300 T164341