Traffic Infrastructure support for Mar 2016 codfw rollout
Closed, Resolved · Public

Description

This is an overview of all the pieces, intended to link out to blockers for specific bits of work!

Our tentative plan for switching just the cache layers to codfw as the primary site, ahead of and independent of any application layer switch, and while preserving eqiad's cache contents in case of corruption during switchover testing:

  1. Switch ulsfo's backend source DC from eqiad to codfw
    • Can/should be done days ahead of the rest of these steps
    • this effectively makes ulsfo "tier-3" in practice
    • proves that 3-tiered caching is ok
    • helps to make codfw's caches warmer than they are with their normal light traffic today.
    • Likely to keep this as permanent normal state, regardless of the rest of these goals' status
  2. Depool eqiad frontends from gdnsd (we know this works; it's a normal operation)
    • Should try to get this in 24h ahead of the real switching below, to allow time for bad DNS to expire
  3. Switch esams as with step 1 above.
    • (but only temporarily, for the switch test)
  4. Switch codfw's source from "eqiad caches" to "eqiad applayer"
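
As a rough illustration of the sequence above, the routing changes can be modeled as updates to a small table mapping each cache DC to its source. This is only a sketch: the dictionary layout, the `apply_step` helper, and the assumed starting state (eqiad going direct to the applayer, every other DC sourcing from eqiad's caches) are illustrative assumptions, not the actual puppet or hieradata structure. Step 2 is a DNS change in gdnsd and does not touch this table.

```python
# Hypothetical model of cache routing: each cache DC maps either to another
# DC's caches or to "direct" (contact the applayer without another cache tier).
# The starting state is an assumption inferred from the plan's wording.
initial = {
    "eqiad": "direct",
    "codfw": "eqiad",
    "esams": "eqiad",
    "ulsfo": "eqiad",
}


def apply_step(table, dc, source):
    """Return a copy of the table with one DC's source changed."""
    updated = dict(table)
    updated[dc] = source
    return updated


# Step 1: ulsfo's backend source moves from eqiad to codfw.
after_step1 = apply_step(initial, "ulsfo", "codfw")
# Step 3: esams does the same (temporarily, for the switch test).
after_step3 = apply_step(after_step1, "esams", "codfw")
# Step 4: codfw stops using eqiad's caches and talks to the applayer directly.
after_step4 = apply_step(after_step3, "codfw", "direct")
```

Each step is a single-value change in this model, which is the property the plan is after: every switch should reduce to a one-line data update.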

Separately, after the above, we need to switch codfw's source from "eqiad applayer" to "codfw applayer" at the right moment in time for the application layer testing. That switch should be controllable per-service (e.g. MediaWiki, RestBase, Swift). While this is coordinated with the application layer switch, there's no hard requirement on synchronicity with anything else (or for the caches themselves to do a perfectly synchronous source switch).

All of these things probably can be done in some sense today, but each step would require some heavy, tricky, and error-prone commits all over the puppet repo to change the relevant VCL. What we're lacking is improvements to all of that puppetization such that the tasks above can be accomplished with reliable, simple actions when necessary at runtime. Preferably those actions happen through confd/confctl, but even getting it down to where the switches involve one-liner data updates somewhere in hieradata/ would be an improvement, and a possible first step towards eventual confd control. The capabilities that need implementing (considering the above as well as future related directions that make sense) are:

  1. Our backend VCL needs to be transitive where necessary (when 3-tier traffic flows through backends).
  2. All DCs need the ability to switch their underlying "source" between any of the other available DCs and "direct", where "direct" means contacting the applayer directly instead of using caches at another DC.
  3. The applayer services that caches contact when set to "direct" should also be switchable, at the granularity of per-cache-DC + per-service, between multiple choices of applayer DC.
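
The three capabilities above can be sketched together as two lookups: a transitive walk through cache tiers until a DC marked "direct" is reached, followed by a per-cache-DC, per-service choice of applayer DC. The function names, table shapes, and service keys below are hypothetical and chosen only for illustration; they are not the real VCL or puppet data.

```python
def cache_path(route_table, start_dc):
    """Follow cache-to-cache routing from start_dc until a DC routes 'direct'.

    Returns the ordered list of cache DCs a request passes through.
    """
    path = [start_dc]
    while route_table[path[-1]] != "direct":
        path.append(route_table[path[-1]])
        # Guard against misconfigured tables that never reach "direct".
        if len(path) > len(route_table):
            raise ValueError(f"routing from {start_dc} never reaches 'direct'")
    return path


def applayer_dc(app_routes, cache_dc, service):
    """Pick the applayer DC for a service, as seen from one 'direct' cache DC."""
    return app_routes[cache_dc][service]


# Illustrative data only: ulsfo -> codfw -> direct is a 3-tier flow.
route_table = {"eqiad": "direct", "codfw": "direct", "esams": "eqiad", "ulsfo": "codfw"}
app_routes = {
    "eqiad": {"mediawiki": "eqiad", "restbase": "eqiad"},
    "codfw": {"mediawiki": "eqiad", "restbase": "codfw"},
}

path = cache_path(route_table, "ulsfo")
target = applayer_dc(app_routes, path[-1], "restbase")
```

In this example a request entering at ulsfo passes through codfw's caches and then reaches the RestBase applayer in codfw, while MediaWiki traffic from the same caches would still go to eqiad.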

Subtasks related to accomplishing all of the above will block this task.

Event Timeline

BBlack created this task. Feb 2 2016, 1:12 PM
BBlack raised the priority of this task to Medium.
BBlack updated the task description.
BBlack added subscribers: BBlack, faidon.
Restricted Application added projects: Operations, codfw-rollout. Feb 2 2016, 1:12 PM
Restricted Application added a subscriber: Aklapper.
BBlack added a comment. Feb 2 2016, 1:16 PM

Should also note: while the above list of steps 1-5 sounds roughly correct for a true switch, we probably want to sort out some minor alterations to the plan such that we can avoid harming eqiad's disk caches if things go wrong (in other words: if codfw appservers/caches serve horrible broken content, we don't want that to pollute eqiad disk cache and make revert difficult - it's nice to still have good content in them throughout initial testing as a backup).

mark added a subscriber: mark. Feb 10 2016, 3:43 PM

Do you think it's reasonable to not use eqiad caches at all for a while? In theory codfw should be able to handle the load, be mostly warmed up anyway, and this wouldn't present any risk to the eqiad cache content.

Yeah. Basically we'd insert these two pre-steps before the steps listed at the top, and outside of the primary switch/outage time...

-1. Depool eqiad from user traffic via geodns (takes 10 minutes for bulk of traffic, may want to wait even longer...)
0. Block remaining junk traffic at eqiad frontends from clients that don't pay attention to DNS updates (should be tiny, but even if we waited days I bet there would still be some, from horribly broken client-level and/or recursor-level caching and/or hardcoded IPs out there in the world).

... and then skip Step 4 where eqiad backends are reconfigured to talk to codfw

Copying in notes from meeting etherpad, which capture some assumptions/thinking beyond what's currently in this ticket:

  • Varnish/Traffic codfw tier-2 -> tier-1 promotion + eqiad demotion
    • *if* we're willing to drop the requirement of x-dc PII crypto for the duration of the trial switch, this gets much simpler and async, maybe
      • (or alternatively, actually fix x-dc PII for traffic tier1 with proxies or new varnish code, but unlikely by EOQ?)
    • if not, the cache and app layer switchovers are probably synchronously intertwined under present plans
  • text and upload can be handled independently (swift test/move separate from MW+RB)
  • cache_parsoid - will be gone, assume it doesn't exist for these purposes
  • Restbase - currently assuming we can treat it like MediaWiki and fail traffic over in the same way at the same time (switch from eqiad.wmnet endpoints to codfw.wmnet endpoints for varnish backending)...
    • not a dealbreaker if they need to switch independently, just a little more work/complexity
BBlack renamed this task from "Ability to switch Traffic infrastructure Tier-1 to codfw manually" to "Traffic Infrastructure support for Mar 2016 codfw rollout". Feb 19 2016, 5:26 PM
BBlack updated the task description.
BBlack set Security to None.

Top description updated with current plans/thinking, and morphed into a meta-task that others naturally fit underneath.

BBlack updated the task description. Feb 19 2016, 5:49 PM
BBlack added a comment. Mar 4 2016, 2:23 PM

Status Update: The first chunk of work is done: we can supposedly do all of the switching in steps 1-4 in the description today via hieradata updates, and will test some of it early next week. The latter part in support of the actual applayer switches (from e.g. mw10xx to mw20xx as the backend that the caches ultimately speak to) is still Not Yet Implemented, but should be by the end of next week.

Change 275531 had a related patch set uploaded (by BBlack):
cache_upload: separate applayer backend for thumbs

https://gerrit.wikimedia.org/r/275531

BBlack added a comment. Mar 7 2016, 5:24 PM

Removing the 2x confd-related blocker tasks: they'll still be open tasks tagged for codfw-rollout, but they're not essential for this particular EOQ goal. They're more like nice-to-have long term improvements to the process, which aren't strictly necessary today.

BBlack added a comment. Mar 7 2016, 5:27 PM

Overall status update: work is complete at the configuration level to support the necessary switches. What remains is testing (this week!) of various switching capabilities at the Traffic-only layer, independently of the big application-layer switches, including the probably permanent switch of ulsfo's backend routing to codfw instead of eqiad.

Change 275531 merged by BBlack:
cache_upload: separate applayer backend for thumbs

https://gerrit.wikimedia.org/r/275531

BBlack added a comment. Mar 8 2016, 6:26 PM

Status update: The only remaining work here ahead of the big switches of the applayer services is:

  1. Test codfw direct traffic: set codfw as direct in cache::route_table for one or more clusters temporarily and ensure it works as expected.
  2. Test applayer route switch to codfw: temporarily set the route attribute of one or more application services to codfw. Good candidates would be appservers_debug (easy) or RB (theoretically should work!).
  3. Stretch: implement split applayer routing as in the WIP patch here: https://gerrit.wikimedia.org/r/#/c/275497/ , and then test it on appservers_debug and/or RB as well.
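
Before flipping a cluster's routing for tests like these, a small sanity check over the routing data can catch obvious mistakes. The sketch below is an assumption-laden illustration, not part of the actual puppetization: it assumes one table per cache cluster mapping each DC to a source DC or "direct", and the function name and data shape are hypothetical.

```python
def find_route_problems(cluster_tables):
    """Return human-readable problems found in per-cluster routing tables.

    Checks that every source is either "direct" or a known DC, and that
    following sources from each DC eventually reaches "direct".
    """
    problems = []
    for cluster, table in cluster_tables.items():
        for dc in table:
            seen = [dc]
            current = dc
            while True:
                source = table[current]
                if source == "direct":
                    break
                if source not in table:
                    problems.append(f"{cluster}: {current} routes to unknown DC {source}")
                    break
                if source in seen:
                    problems.append(f"{cluster}: routing from {dc} loops and never goes direct")
                    break
                seen.append(source)
                current = source
    return problems


# Illustrative tables: cache_text has codfw set to direct for the test,
# while the second table contains a deliberate eqiad <-> codfw loop.
good = {"cache_text": {"eqiad": "direct", "codfw": "direct", "esams": "eqiad", "ulsfo": "codfw"}}
bad = {"cache_upload": {"eqiad": "codfw", "codfw": "eqiad", "ulsfo": "codfw"}}

good_problems = find_route_problems(good)
bad_problems = find_route_problems(bad)
```

A check like this only validates the shape of the data; it says nothing about whether traffic actually behaves as expected, which is what the live tests above are for.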

Change 276252 had a related patch set uploaded (by BBlack):
appservers_debug: switch to codfw for cache->app

https://gerrit.wikimedia.org/r/276252

Change 276252 merged by BBlack:
appservers_debug: switch to codfw for cache->app

https://gerrit.wikimedia.org/r/276252

BBlack added a comment. Mar 9 2016, 8:27 PM

Status updates on the three items listed in the Mar 8 update above:

  1. (codfw direct): not yet tested, but planning to later today. Worst case should be perf impact, mostly for logged-in users. Will revert after success.
  2. (applayer switch a service eqiad->codfw): tested via appservers_debug in https://gerrit.wikimedia.org/r/276252 - works as expected! Can leave this going for better debugging during (1) above.
  3. (split) the commit works as intended. Will deploy this ahead of (1) so that 'split' can be tested on appservers_debug before the revert as well.

Change 276259 had a related patch set uploaded (by BBlack):
cache_text: codfw->direct routing

https://gerrit.wikimedia.org/r/276259

Change 276260 had a related patch set uploaded (by BBlack):
appservers_debug: split routing

https://gerrit.wikimedia.org/r/276260

Change 276259 merged by BBlack:
cache_text: codfw->direct routing

https://gerrit.wikimedia.org/r/276259

Change 276260 merged by BBlack:
appservers_debug: split routing

https://gerrit.wikimedia.org/r/276260

Status Update: all of the basic switch functionality is now live-tested.

2016-03-09 23:05 - 23:55 was the approximate test window during which cache_text had both codfw and eqiad directly accessing eqiad appservers. Before that, appservers_debug had already been used to test applayer backend switching (codfw and/or eqiad accessing foo.svc.codfw instead of foo.svc.eqiad).

The 'split' work still has puppetization problems, and really at this point I'm beginning to rethink its design. In any case, it's not necessary for this EOQ.

I would close this as resolved, but I've realized we need one more small chunk of work (filed as new blocker T129424) to support the switch testing. Should be able to knock that out tonight or tomorrow.

BBlack closed this task as Resolved. Mar 23 2016, 4:02 PM
BBlack claimed this task.
BBlack moved this task from Backlog to Done on the codfw-rollout-Jan-Mar-2016 board.