
Unprovision cache_misc @ ulsfo
Closed, ResolvedPublic

Description

With T164609 not resolving anytime soon due to complexity and time constraints, and the impending replacement of ulsfo cache hardware according to a new two-cluster plan in T164327, we need a shorter-term fixup that removes cache_misc @ ulsfo only. This is tricky, as there are assumptions built into some puppetization (e.g. ipsec) about "all clusters at all DCs", and similar assumptions at the gdnsd level. It's still easier than resolving T164609 itself.

Event Timeline

BBlack created this task. May 5 2017, 6:00 PM
Restricted Application added a project: Operations. May 5 2017, 6:00 PM
Restricted Application added a subscriber: Aklapper.
faidon added a subscriber: faidon. May 8 2017, 1:17 PM

Undeploying cache_misc sounds unfortunate… Why not keep the existing 4-year-old ulsfo hardware for cache_misc still, perhaps keeping some of the other old servers in there for parts?

BBlack added a subscriber: RobH. Edited May 8 2017, 2:00 PM

We could do so as a goal at the end of the process, depending how we arrange things.

@RobH says we're short on power there to plug in all the new systems while the old ones are running. So there are a few basic shapes to the options (details aside):

  1. Decom 8/20 old (cache_misc + cache_maps) via this ticket and the (much easier than misc) maps->upload cluster rollup. This frees enough power to bring up 6/12 of the new servers and transition either text or upload, then do the other, and we're complete without any need for big downtimes.
  2. The "all at once" option would be to downtime ulsfo long enough to wipe/decom all the old systems that are going away and unrack them, rack the new in their correct positions, install/vet them, and then bring them back into service from cold. This would probably allow keeping the current 4x old boxes for misc, for now. It would be at least a whole day's downtime, and no smooth transition of the big clusters by pooling nodes in and out.

The only way option 1 works for keeping misc is if we power it down during the transition and then bring it back afterwards. Option 2 lets it stay powered up (I'm pretty sure anyways), but with all the downtime / unsmooth-transition caveats there.

But really, there's not a huge amount of value to having cache_misc exist specifically in ulsfo (temporarily) in the first place. There are two other US-based datacenters to terminate it at, and we're far enough out from similar work at esams to get the software side done ahead of it. And there's the upside of getting rid of old-class machines in our configs.

ema moved this task from Triage to Caching on the Traffic board. May 9 2017, 8:35 AM
RobH added a comment. Edited May 9 2017, 3:59 PM

So I'll add a few options/items for review:

  • The router/switches are racked at the tops of the racks. If we move them to mid rack level, there are no power plugs at the middle of the rack to be blocked by the router/switches connections. Right now, at the top of the racks, 5 power outlets are blocked per PDU tower by the router/switch/connections. The cross-rack access panel for our two racks is located at mid-rack already, so these racks would well support the shift to mid-rack-located networking (much like codfw).
  • These racks are quite shallow, so the fibers, mgmt, and power cables for each server are all smashed together against the PDU towers and the power ports/plugs there. I'd like to swap the fibers over to DAC to avoid them being broken every time the rack door is opened or shut. Additionally, longer DAC cables may allow me to route them in front of the PDU, deeper towards the middle line of the racks. I wouldn't want to shove bare fiber inside the rack like that, but the DAC cables are far tougher.
    • I've also emailed UnitedLayer to ask about using deeper rack doors. Some racks in our row have special doors on the hot side that project an additional 3" from the back of the rack, resulting in more cable routing space in the rack. I've asked about availability, and whether it would cost us anything.
  • DAC cable order is on T164846.

I'm not sure if it's feasible to shift the networking equipment down in the rack at the same time as the cache re-provisioning (if option 2, all at once, is selected).

  • For disk wipes, if we want to do all at once, we can leave just 5 old systems total, and use them to house 8 SFF HDDs each and wipe them all at the same time. It seems all the older cp boxes are either dual SSD, or dual SFF HDD. That means to wipe them all, we need 5 systems to house all disks at the same time. This would just cover the older cp box wipes, not the bastion/backup server disk wipes.
  • The old backup system in ulsfo is decommissioned; I'll start a disk wipe next time I'm onsite so it can get unracked and give us more space.
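
The host count in the disk-wipe item above follows from simple arithmetic. A minimal sketch in Python, assuming the 20 old cp boxes implied by the "8/20 old" figure earlier in this task, two disks per box, and eight SFF bays per wipe host; the function name and its parameters are hypothetical and only for illustration:

```python
import math


def wipe_hosts_needed(boxes: int, disks_per_box: int, bays_per_host: int) -> int:
    """Return how many hosts are needed to hold every disk at the same time."""
    total_disks = boxes * disks_per_box
    # Round up: a partially filled host still has to stay racked for the wipe.
    return math.ceil(total_disks / bays_per_host)


# Assumed inputs: 20 old cp boxes, dual-disk each, 8 SFF bays per wipe host.
hosts = wipe_hosts_needed(boxes=20, disks_per_box=2, bays_per_host=8)
print(hosts)  # 5
```

Under these assumptions, 40 disks fit exactly into 5 hosts with no spare bays, so the count only works if the wipe hosts' own two disks are counted among the eight bays.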

Change 361777 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/dns@master] cache_misc: take ulsfo IPs out of effective service

https://gerrit.wikimedia.org/r/361777

Change 361779 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] cache_misc: remove ulsfo nodes and IPs

https://gerrit.wikimedia.org/r/361779

Change 361777 merged by BBlack:
[operations/dns@master] cache_misc: take ulsfo IPs out of effective service

https://gerrit.wikimedia.org/r/361777

Change 361779 merged by BBlack:
[operations/puppet@production] cache_misc: remove ulsfo nodes and IPs

https://gerrit.wikimedia.org/r/361779

Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts:

['cp4001.ulsfo.wmnet', 'cp4002.ulsfo.wmnet', 'cp4003.ulsfo.wmnet', 'cp4004.ulsfo.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201706272216_bblack_20362.log.

Completed auto-reimage of hosts:

['cp4001.ulsfo.wmnet', 'cp4002.ulsfo.wmnet', 'cp4003.ulsfo.wmnet', 'cp4004.ulsfo.wmnet']

Of which those FAILED:

set(['cp4001.ulsfo.wmnet', 'cp4003.ulsfo.wmnet'])
BBlack closed this task as Resolved. Jun 27 2017, 11:22 PM
BBlack claimed this task.

I had to manually fix up salt keys and do final reboots on cp4001 and cp4003; all should be sane and consistent now (except for a couple of IPMI temp checks showing UNKNOWN in icinga).
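
For reference, the reimage report above splits into hosts that finished cleanly and hosts that needed manual follow-up. A minimal sketch in Python, using the host lists exactly as printed in the report; the helper name `split_reimage_results` is hypothetical and is not part of wmf_auto_reimage:

```python
def split_reimage_results(attempted, failed):
    """Split attempted hosts into (succeeded, needs_fixup), preserving order."""
    failed = set(failed)
    succeeded = [host for host in attempted if host not in failed]
    needs_fixup = [host for host in attempted if host in failed]
    return succeeded, needs_fixup


# Host names copied from the reimage report in this task.
attempted = [
    "cp4001.ulsfo.wmnet",
    "cp4002.ulsfo.wmnet",
    "cp4003.ulsfo.wmnet",
    "cp4004.ulsfo.wmnet",
]
failed = {"cp4001.ulsfo.wmnet", "cp4003.ulsfo.wmnet"}

succeeded, needs_fixup = split_reimage_results(attempted, failed)
print("succeeded:", succeeded)
print("needs fixup:", needs_fixup)
```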