Reclaim/Decommission old codfw mc2001->mc2016 hosts
Closed, Resolved (Public)

Description

The old mc2001->mc2016 hosts are not used anymore and need to be decommissioned (new nodes are already serving traffic).

  • - confirm system is not serving traffic
  • - remove from any dsh groups, lvs, hiera, all puppet references
  • - if system will stay on for a while, assign it role::spare in site.pp. if turning off, this can be skipped
  • - power down system
  • - disable system network switch port (must be done after system is powered off!)
  • - remove production dns entries (leave mgmt until system is unracked.)
  • - system disks wiped (by onsite)
  • - system unracked for decom (by onsite), racktables updated
  • - system mgmt dns entries removed.
  • - system's network port info wiped from switches

Event Timeline

Change 336784 had a related patch set uploaded (by Elukey):
Assign role spare to mc2001->2016

https://gerrit.wikimedia.org/r/336784

Change 336784 merged by Elukey:
Assign role spare to mc2001->2016

https://gerrit.wikimedia.org/r/336784

elukey triaged this task as Medium priority.

Mentioned in SAL (#wikimedia-operations) [2017-02-24T09:39:50Z] <elukey> stop Redis and Memcached on mc2001->mc2016 as extra precautionary step before decom - T157675

Change 339611 had a related patch set uploaded (by Elukey):
Remove last settings for mc2001->mc2016 from puppet

https://gerrit.wikimedia.org/r/339611

Hello @Papaul, @RobH and @Cmjohnson!

While reviewing https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_DC-OPS I decided to stop at https://gerrit.wikimedia.org/r/339611 since I saw the warning "These steps, once started, must be completed without interruption." and I don't have a lot of context about what you prefer to do in these cases.

I am a bit swamped with other tasks for memcached/redis; if any of you could pick up the remaining work it would be really great. I have shut down and masked redis/memcached on all the hosts as a precaution, and confirmed that no live traffic is served by those hosts.

@elukey I can take over from here. Thanks

Change 340195 had a related patch set uploaded (by Papaul; owner: Papaul):
DNS/Decom Remove production DNS for mc2001-mc2016

https://gerrit.wikimedia.org/r/340195

please remove mc2001 thru mc2007 from Icinga - they are reported as CRITical "host down" alerts and were also not acked or in downtime.

https://gerrit.wikimedia.org/r/#/c/339611 is ready to be merged, it should remove everything but in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Steps_for_DC-OPS it seems that we'd need to wait to also disable the switch ports.

@Papaul let me know when you are ready for the switch ports, I'll take care of merging the code review.

Change 339611 merged by Dzahn:
Remove last settings for mc2001->mc2016 from puppet

https://gerrit.wikimedia.org/r/339611

Change 340195 merged by Dzahn:
DNS/Decom Remove production DNS for mc2001-mc2016

https://gerrit.wikimedia.org/r/340195

Mentioned in SAL (#wikimedia-operations) [2017-02-28T22:26:56Z] <mutante> (T157675) - revoke puppet certs, deactivate nodes, rm from icinga. [puppetmaster1001:~] $ for mcnode in $(seq 2001 2016); do puppet node clean mc${mcnode}.codfw.wmnet && puppet node deactivate mc${mcnode}.codfw.wmnet ; done

Mentioned in SAL (#wikimedia-operations) [2017-02-28T22:29:48Z] <mutante> (T157675) - delete salt keys - [neodymium:~] $ for mcnode in $(seq 2001 2016); do sudo salt-key -d mc${mcnode}.codfw.wmnet; done

The hosts are gone from icinga now after the commands above and running puppet on einsteinium.

mc2008-mc2016 are not properly removed from puppet, they still show up in servermon e.g.

@Papaul @elukey can you confirm mc2008-mc2016 are physically shut down?

@Dzahn mc2008-mc2012 are physically shut down since a disk wipe is in progress on those systems; mc2013-mc2016 are not.

Can we shut down mc2013-mc2016 as well?

I ran the exact same commands from above again, it removed 2008, 2009 and 2012 from servermon.

Same for salt-keys: they were back, and I had to repeat the same command because the servers (2008 thru 2016) were not shut down yet.

@Dzahn mc2013-mc2016 are down

Thanks. I ran the commands a third time. Running puppet on einsteinium. We need to physically shut servers down (or at least stop puppet and salt services) before these steps start.
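
The lesson above (hosts that are still up re-register with puppet and salt) could be turned into a guard in front of the cleanup loop. The following is only a minimal sketch, not something that was run for this task: the function names `host_is_down` and `plan_cleanup` and the `PING_CMD` override are hypothetical, the `-W 2` timeout assumes Linux iputils `ping`, and the function only prints the puppet commands quoted earlier instead of executing them.

```shell
# Sketch only: hostnames follow the mcNNNN.codfw.wmnet pattern used above.
# PING_CMD is a hypothetical override hook so the check can be stubbed out.
PING_CMD="${PING_CMD:-ping -c 1 -W 2}"

# Succeeds (exit 0) when the host does NOT answer a single ping.
host_is_down() {
    ! $PING_CMD "$1" >/dev/null 2>&1
}

# Prints the cleanup command for every host in the range that is down,
# and a warning on stderr for every host that still answers.
plan_cleanup() {
    first="$1"
    last="$2"
    for mcnode in $(seq "$first" "$last"); do
        host="mc${mcnode}.codfw.wmnet"
        if host_is_down "$host"; then
            echo "puppet node clean ${host} && puppet node deactivate ${host}"
        else
            echo "SKIP ${host}: still answering ping, power it off first" >&2
        fi
    done
}
```

Calling `plan_cleanup 2001 2016` would list the commands that are safe to run; the same guard could presumably wrap the `salt-key -d` step. A ping check is only a rough proxy: stopping the puppet and salt services on the host, as suggested above, would be the more direct precondition.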

@RobH if you want to disable switch ports please see below for switch port information

Rack A2
mc2001 xe-2/0/0
mc2002 xe-2/0/1
mc2003 xe-2/0/2

Rack A7
mc2004 xe-7/0/0
mc2005 xe-7/0/1
mc2006 xe-7/0/2

Rack B2
mc2007 xe-2/0/0
mc2008 xe-2/0/1
mc2009 xe-2/0/2

Rack B7
mc2010 xe-7/0/0
mc2011 xe-7/0/1
mc2012 xe-7/0/2

Rack C2
mc2013 xe-2/0/0
mc2014 xe-2/0/1
mc2015 xe-2/0/2

Rack C7
mc2016 xe-7/0/0
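
A listing in this shape can be converted mechanically into per-port configuration lines. The sketch below is an illustration under assumptions: it assumes Junos-style syntax (`set interfaces <port> disable`), which the thread does not confirm, and the function name `ports_to_disable_cmds` is hypothetical. Note that the same port name (for example xe-2/0/0) appears in several racks, so each line only makes sense on the switch serving that rack; the switch hostnames are not given here, so the rack is kept as a comment.

```shell
# Sketch: read the "Rack X" / "host port" listing on stdin and print one
# Junos-style disable line per host, annotated with rack and hostname.
ports_to_disable_cmds() {
    awk '
        $1 == "Rack" { rack = $2; next }   # remember the current rack heading
        NF == 2 {                          # "hostname port" rows; blank lines are skipped
            printf "# rack %s, %s\nset interfaces %s disable\n", rack, $1, $2
        }
    '
}
```

For example, `printf 'Rack A2\nmc2001 xe-2/0/0\n' | ports_to_disable_cmds` prints a `# rack A2, mc2001` comment followed by `set interfaces xe-2/0/0 disable`.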

All switch ports have been set to disabled. Once the systems are unracked, please assign this task back to me to remove the switch port configuration for these, thanks!

Change 341352 had a related patch set uploaded (by pt1979):
[operations/dns] DNS/Decom Remove mgmt dns for mc2001-mc2016

https://gerrit.wikimedia.org/r/341352

@RobH I am done with this task; you can go ahead and remove the port information on the switches.

Note
Some of those ports will be reused soon for the new ms-be servers

That can merge, but leave this ticket open and assigned to me until I remove the port descriptions from the switches.

Change 341352 merged by Dzahn:
[operations/dns] DNS/Decom Remove mgmt dns for mc2001-mc2016

https://gerrit.wikimedia.org/r/341352

RobH updated the task description.

removed descriptions from disabled switch ports, resolving task.