Reclaim/Decommission old codfw mc2001->mc2016 hosts
Closed, Resolved
Public

Description

The old mc2001->mc2016 hosts are not used anymore and need to be decommissioned (new nodes are already serving traffic).

  • - confirm system is not serving traffic
  • - remove from any dsh groups, lvs, hiera, all puppet references
  • - if system will stay on for a while, assign it role::spare in site.pp. if turning off, this can be skipped
  • - power down system
  • - disable system network switch port (must be done after system is powered off!)
  • - remove production dns entries (leave mgmt until system is unracked.)
  • - system disks wiped (by onsite)
  • - system unracked for decom (by onsite), racktables updated
  • - system mgmt dns entries removed.
  • - system's network port info wiped from switches
elukey created this task. Feb 9 2017, 12:34 PM

Change 336784 had a related patch set uploaded (by Elukey):
Assign role spare to mc2001->2016

https://gerrit.wikimedia.org/r/336784

Change 336784 merged by Elukey:
Assign role spare to mc2001->2016

https://gerrit.wikimedia.org/r/336784

elukey triaged this task as "Normal" priority. Feb 9 2017, 4:22 PM
elukey assigned this task to Papaul.
RobH edited the task description. Feb 10 2017, 4:51 PM
elukey moved this task from Backlog to Ops Backlog on the User-Elukey board. Feb 23 2017, 1:07 PM
Papaul reassigned this task from Papaul to elukey. Feb 23 2017, 9:17 PM
Papaul added a subscriber: Papaul.
elukey edited the task description. Feb 24 2017, 9:32 AM

Mentioned in SAL (#wikimedia-operations) [2017-02-24T09:39:50Z] <elukey> stop Redis and Memcached on mc2001->mc2016 as extra precautionary step before decom - T157675

Change 339611 had a related patch set uploaded (by Elukey):
Remove last settings for mc2001->mc2016 from puppet

https://gerrit.wikimedia.org/r/339611

elukey added subscribers: Cmjohnson, RobH. Edited Feb 24 2017, 10:03 AM

Hello @Papaul, @RobH and @Cmjohnson!

While reviewing https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_DC-OPS I decided to stop at https://gerrit.wikimedia.org/r/339611, since I saw the warning "These steps, once started, must be completed without interruption." and I don't have much context on what you prefer to do in these cases.

I am a bit swamped with other memcached/redis tasks; if any of you could pick up the remaining work, that would be great. As a precaution I have shut down and masked redis/memcached on all the hosts, and I have confirmed that none of them serves live traffic.

@elukey I can take over from here. Thanks

RobH reassigned this task from elukey to Papaul. Feb 24 2017, 4:54 PM

Change 340195 had a related patch set uploaded (by Papaul; owner: Papaul):
DNS/Decom Remove production DNS for mc2001-mc2016

https://gerrit.wikimedia.org/r/340195

Papaul edited the task description. Feb 27 2017, 9:41 PM

Disk wipe in progress

Dzahn added a subscriber: Dzahn. Feb 28 2017, 1:38 AM

Please remove mc2001 through mc2007 from Icinga: they are reported as CRITICAL "host down" alerts and were neither acknowledged nor in downtime.

https://gerrit.wikimedia.org/r/#/c/339611 is ready to be merged and should remove everything, but according to https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_DC-OPS it seems we need to wait until the switch ports can be disabled as well.

@Papaul let me know when you are ready for the switch ports, I'll take care of merging the code review.

Change 339611 merged by Dzahn:
Remove last settings for mc2001->mc2016 from puppet

https://gerrit.wikimedia.org/r/339611

Change 340195 merged by Dzahn:
DNS/Decom Remove production DNS for mc2001-mc2016

https://gerrit.wikimedia.org/r/340195

Mentioned in SAL (#wikimedia-operations) [2017-02-28T22:26:56Z] <mutante> (T157675) - revoke puppet certs, deactivate nodes, rm from icinga. [puppetmaster1001:~] $ for mcnode in $(seq 2001 2016); do puppet node clean mc${mcnode}.codfw.wmnet && puppet node deactivate mc${mcnode}.codfw.wmnet ; done

Mentioned in SAL (#wikimedia-operations) [2017-02-28T22:29:48Z] <mutante> (T157675) - delete salt keys - [neodymium:~] $ for mcnode in $(seq 2001 2016); do sudo salt-key -d mc${mcnode}.codfw.wmnet; done

The hosts are gone from icinga now after the commands above and running puppet on einsteinium.

mc2008-mc2016 are not properly removed from Puppet; they still show up in servermon, for example.

Dzahn added a comment. Mar 2 2017, 4:50 PM

@Papaul @elukey can you confirm mc2008-mc2016 are physically shut down?

Papaul added a comment. Mar 2 2017, 4:56 PM

@Dzahn mc2008-mc2012 are physically shut down because the disk wipe is in progress on those systems; mc2013-mc2016 are not.

Dzahn added a comment. Mar 2 2017, 5:00 PM

Can we shut down mc2013-mc2016 as well?

Dzahn added a comment. Mar 2 2017, 5:01 PM

I ran the exact same commands from above again; this removed 2008, 2009 and 2012 from servermon.

Papaul added a comment. Mar 2 2017, 5:03 PM

@Dzahn mc2013-mc2016 are down

Dzahn added a comment. Mar 2 2017, 5:03 PM

The same applies to the salt keys: they were back, and I had to repeat the same command because the servers were not shut down yet (2008 through 2016).

Dzahn added a comment. Mar 2 2017, 5:07 PM

Thanks. I ran the commands a third time. Running puppet on einsteinium. We need to physically shut servers down (or at least stop puppet and salt services) before these steps start.
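
To make that ordering harder to get wrong, the cleanup loops from the SAL entries could be wrapped in a check that skips any host that still answers. This is only a minimal sketch: the function names are made up for illustration, a single ping is assumed to be a good enough "is it powered off?" test, and the commands are printed rather than executed so they can be reviewed and then run on the puppetmaster and the salt master respectively.

```shell
#!/usr/bin/env bash
# Sketch only: prints the cleanup commands instead of running them.

# Expand a numeric range into the FQDNs used in this task.
mc_fqdns() {
  local first=$1 last=$2 n
  for n in $(seq "$first" "$last"); do
    echo "mc${n}.codfw.wmnet"
  done
}

# Assumption: one ICMP echo with a 2 second timeout (Linux ping syntax)
# is enough to decide whether a host is still powered on.
host_is_up() {
  ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Print the puppet and salt cleanup commands for every host that is down.
# Hosts that still answer are skipped with a warning on stderr.
print_cleanup_cmds() {
  local fqdn
  for fqdn in $(mc_fqdns "$1" "$2"); do
    if host_is_up "$fqdn"; then
      echo "SKIP ${fqdn}: still up, shut it down first" >&2
      continue
    fi
    # Run on the puppetmaster:
    echo "puppet node clean ${fqdn} && puppet node deactivate ${fqdn}"
    # Run on the salt master:
    echo "sudo salt-key -d ${fqdn}"
  done
}
```

Calling `print_cleanup_cmds 2001 2016` would list the commands for the whole range. Any host reported as skipped still has puppet and salt running and would re-register itself, which is what happened with mc2008-mc2016 here.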

Papaul added a comment. Edited Mar 2 2017, 5:12 PM

@RobH if you want to disable the switch ports, the switch port information is below.

Rack A2
mc2001 xe-2/0/0
mc2002 xe-2/0/1
mc2003 xe-2/0/2

Rack A7
mc2004 xe-7/0/0
mc2005 xe-7/0/1
mc2006 xe-7/0/2

Rack B2
mc2007 xe-2/0/0
mc2008 xe-2/0/1
mc2009 xe-2/0/2

Rack B7
mc2010 xe-7/0/0
mc2011 xe-7/0/1
mc2012 xe-7/0/2

Rack C2
mc2013 xe-2/0/0
mc2014 xe-2/0/1
mc2015 xe-2/0/2

Rack C7
mc2016 xe-7/0/0
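
For review purposes, the list above can be turned into one Junos `disable` statement per port. This is a rough sketch, not the procedure actually used on this task: the function names are invented for illustration, and because the task does not say which switch each rack's ports live on, the output only labels each line with the rack and host and leaves choosing the right switch to whoever applies it.

```shell
#!/usr/bin/env bash
# Rack, host and port triples copied from the list above.
port_map() {
  cat <<'EOF'
A2 mc2001 xe-2/0/0
A2 mc2002 xe-2/0/1
A2 mc2003 xe-2/0/2
A7 mc2004 xe-7/0/0
A7 mc2005 xe-7/0/1
A7 mc2006 xe-7/0/2
B2 mc2007 xe-2/0/0
B2 mc2008 xe-2/0/1
B2 mc2009 xe-2/0/2
B7 mc2010 xe-7/0/0
B7 mc2011 xe-7/0/1
B7 mc2012 xe-7/0/2
C2 mc2013 xe-2/0/0
C2 mc2014 xe-2/0/1
C2 mc2015 xe-2/0/2
C7 mc2016 xe-7/0/0
EOF
}

# Print a review sheet: rack and host, then the Junos statement that
# administratively disables the port. Nothing is sent to any switch.
junos_disable_cmds() {
  local rack host port
  port_map | while read -r rack host port; do
    echo "rack ${rack} ${host}: set interfaces ${port} disable"
  done
}
```

Running `junos_disable_cmds` prints 16 lines, one per host. Note that the same port name appears in more than one rack (for example xe-2/0/0 in A2, B2 and C2), so the rack label matters when applying these.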

RobH edited the task description. Mar 2 2017, 6:03 PM
RobH added a comment. Mar 2 2017, 6:15 PM

All switch ports have been set to disabled. Once the systems are unracked, please assign this task back to me to remove the switch port configuration for these, thanks!

RobH edited the task description. Mar 2 2017, 6:15 PM
Papaul edited the task description. Mar 6 2017, 3:56 PM
Papaul edited the task description. Mar 6 2017, 5:28 PM

Change 341352 had a related patch set uploaded (by pt1979):
[operations/dns] DNS/Decom Remove mgmt dns for mc2001-mc2016

https://gerrit.wikimedia.org/r/341352

Papaul edited the task description. Mar 6 2017, 5:42 PM
Papaul reassigned this task from Papaul to RobH. Mar 6 2017, 5:46 PM

@RobH I am done with this task; you can go ahead and remove the port information on the switches.

Note
Some of those ports will be reused soon for the new ms-be servers

Dzahn added a comment. Mar 6 2017, 6:05 PM

@RobH when done can you confirm https://gerrit.wikimedia.org/r/#/c/341352/ is ready to go?

RobH added a comment. Mar 6 2017, 6:07 PM

That can merge, but leave this ticket open and assigned to me until I remove the port descriptions from the switches.

Change 341352 merged by Dzahn:
[operations/dns] DNS/Decom Remove mgmt dns for mc2001-mc2016

https://gerrit.wikimedia.org/r/341352

RobH edited the task description. Mar 6 2017, 11:14 PM
RobH closed this task as "Resolved".

Removed the descriptions from the disabled switch ports; resolving the task.