Sideways Only-If-Cached on misses at a primary DC
Closed, Declined · Public

Description

The TL;DR on this is "if eqiad is the active primary: if eqiad misses, and the request didn't arrive through codfw, do an Only-If-Cached request to codfw first before trying the applayer". This can be done in both directions simultaneously in the future active:active scenario as well. We already have datacenter-loop-prevention VCL to avoid issues with this, and the Only-If-Cached part would build on the same headers. Because the primaries have very resilient and relatively low-latency connectivity to each other, there's little downside to this approach. The upside is that requests coming directly into the frontends of a (or the) primary DC have access to remote cache contents in various cache-wipe scenarios.
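To make the flow concrete, here is a minimal Python simulation of the lookup order described above. It is only a sketch, not VCL and not the production logic. The `Cache` class, the `via` list (a stand-in for the existing loop-prevention headers), and the `applayer` callable are all hypothetical names. The one piece taken from HTTP itself is that a cache answering `Cache-Control: only-if-cached` without a stored response replies with 504.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Status an HTTP cache returns for an only-if-cached request it cannot satisfy.
GATEWAY_TIMEOUT = 504


@dataclass
class Cache:
    """Hypothetical stand-in for one DC's backend cache."""
    name: str
    store: Dict[str, str] = field(default_factory=dict)
    peer: Optional["Cache"] = None

    def only_if_cached(self, url: str) -> Tuple[int, Optional[str]]:
        # Answer from local storage only; never go to a backend.
        if url in self.store:
            return 200, self.store[url]
        return GATEWAY_TIMEOUT, None

    def fetch(self, url: str, via: List[str],
              applayer: Callable[[str], str]) -> Tuple[str, str]:
        # `via` lists the DCs the request already passed through
        # (assumed to be derivable from the loop-prevention headers).
        if url in self.store:
            return self.store[url], "local-hit"
        # Local miss: side-check the peer unless the request came through it.
        if self.peer is not None and self.peer.name not in via:
            status, body = self.peer.only_if_cached(url)
            if status == 200 and body is not None:
                self.store[url] = body
                return body, "peer-hit"
        # Peer miss (or peer skipped): fall through to the applayer.
        body = applayer(url)
        self.store[url] = body
        return body, "applayer"
```

With `codfw` holding `/a` and a freshly wiped `eqiad` whose `peer` is `codfw`, `eqiad.fetch("/a", [], applayer)` is served from the peer without calling `applayer`, while the same request with `via=["codfw"]` skips the side-check and goes straight to the applayer.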

I'm not sure this can be implemented sanely in Varnish 3, but it definitely can be in Varnish 4 with how backend-side request-restart works. So it's best to block this until after the Varnish 4 migration.

BBlack created this task. Aug 12 2016, 2:39 PM
Restricted Application added a project: Operations. Aug 12 2016, 2:39 PM
Restricted Application added a subscriber: Aklapper.
faidon added a subscriber: faidon. Aug 12 2016, 10:38 PM

That would add at least one round trip of latency to every true miss (a new or just-purged page) that hits eqiad/esams/ulsfo, wouldn't it?

On a true miss, yes, it would add a codfw<->eqiad round-trip. That's only ~35ms, though, which may be hard for MW to beat on average. True misses should be rare, except when a backend cache has been wiped, in which case we'd rather spam eqiad<->codfw to side-reload the cache than spam the applayer.
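The trade-off can be put into rough numbers. The sketch below assumes a simple model: with the side-check, every eligible miss pays the inter-DC round trip, and only the fraction that also misses at the peer pays the applayer cost on top. The ~35ms round trip comes from the comment above; the applayer latency and peer hit rate are hypothetical inputs, not measurements.

```python
def expected_miss_latency_ms(peer_hit_rate: float, rtt_ms: float,
                             applayer_ms: float, side_check: bool = True) -> float:
    """Expected latency of a local miss under the simple model described above."""
    if not side_check:
        return applayer_ms
    # Always pay the round trip; pay the applayer only on a peer miss.
    return rtt_ms + (1.0 - peer_hit_rate) * applayer_ms


def break_even_peer_hit_rate(rtt_ms: float, applayer_ms: float) -> float:
    """Peer hit rate above which the side-check lowers expected latency."""
    # rtt + (1 - h) * applayer < applayer  <=>  h > rtt / applayer
    return rtt_ms / applayer_ms
```

In this model a true miss (peer hit rate 0) costs the full round trip on top of the applayer fetch, which is the latency concern raised above, while a wiped cache next to a warm peer (peer hit rate near 1) costs little more than the round trip.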

This wouldn't apply to passes from hit-for-pass, though. It also shouldn't apply to a normal page expiry within the grace window. I'm unsure about purges; it probably depends on whether we use 'softpurge' (which leaves the page contents in grace briefly; that seems like a fine choice given purging isn't synchronous anyway, if we can keep the grace window small).
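The eligibility rules in this paragraph amount to a small predicate. The following is a sketch only: the `Lookup` enum and its member names are hypothetical labels for the local lookup outcomes, not Varnish identifiers. Purges are deliberately left out, since how they should behave is still an open question here.

```python
from enum import Enum


class Lookup(Enum):
    """Hypothetical labels for the outcome of the local cache lookup."""
    HIT = "hit"
    GRACE_HIT = "grace-hit"        # expired, but still inside the grace window
    HIT_FOR_PASS = "hit-for-pass"  # object known to be uncacheable
    MISS = "miss"


def eligible_for_side_check(result: Lookup) -> bool:
    # Only a plain miss triggers the sideways Only-If-Cached request;
    # hit-for-pass and in-grace expiries keep their normal behavior.
    return result is Lookup.MISS
```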

Oh, re-reading your question, you mentioned specific DCs. In the current layout where only eqiad is "primary", the side-checks from eqiad to codfw would only happen for requests that initially enter through esams or eqiad. ulsfo flows through codfw on its way to eqiad.

Of course, if MW can beat an eqiad<->codfw trip for the same page... we could look at other ways to structure this so it doesn't kick in all the time. Perhaps trigger it only in the first X minutes after a fresh varnishd restart, or only when we hit the connection parallelism limit to the real MediaWiki backend?
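The two trigger ideas could be expressed as a gate in front of the side-check. This is a rough sketch with hypothetical parameter names. Combining the two conditions with "or" is an assumption, since the comment offers them as alternatives, and the warm-up length ("X minutes") is left as a parameter because no value was proposed.

```python
def side_check_enabled(started_at: float, now: float, warmup_seconds: float,
                       active_backend_conns: int, max_backend_conns: int) -> bool:
    """Decide whether a miss should try the peer DC before the applayer."""
    # Idea 1: only during the first X minutes after a fresh varnishd restart.
    recently_restarted = (now - started_at) < warmup_seconds
    # Idea 2: only once the applayer connection limit is exhausted.
    backend_saturated = active_backend_conns >= max_backend_conns
    return recently_restarted or backend_saturated
```

For example, with a 600-second warm-up, a daemon started 60 seconds ago would side-check every eligible miss, while one that has been up for an hour would do so only when its backend connections are exhausted.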

ema triaged this task as Normal priority. Aug 15 2016, 9:45 AM
ema moved this task from Triage to Caching on the Traffic board. Sep 30 2016, 2:37 PM
BBlack closed this task as Declined. Oct 6 2016, 5:53 PM

This seems really complicated to get "right", and it's only in corner cases that it even helps us much. There are potential downsides for the pattern-adaptation of each DC's backend storage to its regional variance, and of course there's @faidon's latency argument. If/when we move backend cache storage to ATS, we can re-evaluate similar ideas there (they may be able to make miss-fetch attempts in multiple directions in parallel with multicast, etc.).