This is an overview of all the pieces, intended to link out to blockers for specific bits of work!
The tentative plan is to switch just the cache layers to codfw as the primary site, ahead of and independently of any application layer switch, while preserving eqiad's cache contents in case of corruption during switchover testing:
- Switch ulsfo's backend source DC from eqiad to codfw
  - Can/should be done days ahead of the rest of these steps
  - This effectively makes ulsfo "tier-3" in practice
  - Proves that 3-tiered caching is ok
  - Helps make codfw's caches warmer than they are with today's normally light traffic
  - Likely to remain the permanent normal state, regardless of the status of the rest of these goals
- Depool eqiad frontends from gdnsd (we know this works; it's a normal operation)
  - Should try to get this in 24h ahead of the real switching below, to allow time for cached DNS answers pointing at eqiad to expire
- Switch esams's backend source DC from eqiad to codfw, the same way as the ulsfo step above
  - (but only temporarily, for the switch test)
- Switch codfw's source from "eqiad caches" to "eqiad applayer"
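To make the ordering concrete, here is a minimal sketch that models the steps above as updates to a small routing table. Everything in it is hypothetical: the `source` dict, the `DIRECT` marker, the `cache_chain` helper, and the assumed starting state (eqiad going direct, all four sites pooled) are illustrations, not the actual puppet or VCL data.

```python
DIRECT = "direct"  # hypothetical marker: contact the applayer, not another cache DC

def cache_chain(source, dc):
    """List the cache DCs a request entering at `dc` traverses, in order."""
    chain = [dc]
    while source[chain[-1]] != DIRECT:
        upstream = source[chain[-1]]
        if upstream in chain:
            raise ValueError(f"routing loop: {chain + [upstream]}")
        chain.append(upstream)
    return chain

# Assumed starting state: eqiad is primary and every other DC sources from it.
source = {"eqiad": DIRECT, "codfw": "eqiad", "ulsfo": "eqiad", "esams": "eqiad"}
pooled_frontends = {"eqiad", "codfw", "ulsfo", "esams"}

# Switch ulsfo to codfw days ahead: ulsfo becomes the third cache tier.
source["ulsfo"] = "codfw"
chain_after_ulsfo = cache_chain(source, "ulsfo")

# Depool eqiad frontends from DNS, roughly 24h before the real switch.
pooled_frontends.discard("eqiad")

# Switch esams the same way (temporary, for the switch test).
source["esams"] = "codfw"

# Point codfw at the applayer instead of eqiad's caches. Which applayer DC
# it talks to (eqiad at this stage) is a separate setting not modelled here.
source["codfw"] = DIRECT
final_chains = {dc: cache_chain(source, dc) for dc in sorted(pooled_frontends)}

print(chain_after_ulsfo)
print(final_chains)
```

In this model, no pooled site routes through eqiad's caches once the last step is applied, which is what leaves their contents untouched during testing.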
Separately, after the above, we need to switch codfw's source from "eqiad applayer" to "codfw applayer" at the right moment in time for the application layer testing. That switch should be controllable per-service (e.g. MediaWiki, RestBase, Swift, etc). While this is coordinated with the application layer switch, there's no hard requirement for synchronicity with anything else (or for the caches themselves to do a perfectly synchronous source switch, etc).
All of these things can probably be done in some sense today, but each step would require heavy, tricky, and error-prone commits all over the puppet repo to change the relevant VCL. What we lack is puppetization improved to the point where the tasks above can be accomplished with simple, reliable actions at runtime when necessary. Preferably those actions happen through confd/confctl, but even getting to where each switch is a one-line data update somewhere in hieradata/ would be an improvement, and a possible first step towards eventual confd control. The capabilities that need implementing (considering the above as well as related future directions that make sense) are:
- Our backend VCL needs to be transitive where necessary (when 3-tier traffic flows through backends).
- All DCs need the ability to switch their underlying "source" between any of the other available DCs and "direct", meaning the DC contacts the applayer directly instead of using caches at another DC.
- The applayer services that caches contact when set to "direct" should also be switchable, at the granularity of per-cache-DC + per-service, between multiple choices of applayer DC.
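One possible shape for the data behind these three capabilities is sketched below. It is only an illustration under stated assumptions: `cache_route`, `applayer`, `resolve`, and the lowercase service keys are hypothetical names, not existing hieradata keys or confctl objects.

```python
DIRECT = "direct"  # hypothetical marker: this DC's backends contact the applayer

# Per-cache-DC source: another cache DC, or DIRECT.
cache_route = {"eqiad": DIRECT, "codfw": DIRECT, "ulsfo": "codfw", "esams": "codfw"}

# Per-cache-DC + per-service applayer choice, consulted only for DIRECT DCs.
applayer = {
    "eqiad": {"mediawiki": "eqiad", "restbase": "eqiad", "swift": "eqiad"},
    "codfw": {"mediawiki": "eqiad", "restbase": "eqiad", "swift": "eqiad"},
}

def resolve(dc, service):
    """Return (cache DCs traversed, applayer DC) for a request entering at `dc`."""
    chain = [dc]
    # Follow sources transitively until a DC that goes direct is reached.
    while cache_route[chain[-1]] != DIRECT:
        upstream = cache_route[chain[-1]]
        if upstream in chain:
            raise ValueError(f"routing loop: {chain + [upstream]}")
        chain.append(upstream)
    return chain, applayer[chain[-1]][service]

before = resolve("ulsfo", "mediawiki")

# The per-service switch is then a one-line data change for one cache DC.
applayer["codfw"]["mediawiki"] = "codfw"

after = resolve("ulsfo", "mediawiki")
other_service = resolve("esams", "swift")

print(before, after, other_service)
```

Keeping the cache-to-cache route and the applayer choice in separate tables means the MediaWiki switch above does not touch the routes for RestBase or Swift, and the loop check guards against two DCs being pointed at each other.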
Subtasks related to accomplishing all of the above will block this task.