Sideways Only-If-Cached on misses at a primary DC
Closed, Declined · Public

Assigned To

None

Authored By

	BBlack
	Aug 12 2016, 2:39 PM

Description

The TL;DR on this is "if eqiad is the active primary: if eqiad misses, and the request didn't arrive through codfw, do an Only-If-Cached request to codfw first before trying the applayer". This can be done in both directions simultaneously in the future active:active scenario as well. We already have datacenter-loop-prevention VCL to avoid issues with this, and the Only-If-Cached part would build on the same headers. Because the primaries have very resilient and relatively low-latency connectivity to each other, there's little downside to this approach. The upside is that requests coming directly into the frontends of a (or the) primary DC have access to remote cache contents in various cache-wipe scenarios.
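As a rough illustration of the decision described above (not actual VCL), the miss handling could be sketched like this. The function name `plan_miss` and the `via` list are hypothetical; `via` stands in for whatever the existing loop-prevention headers record about the DCs a request has already passed through.

```python
def plan_miss(local_dc: str, peer_dc: str, via: list[str]) -> list[str]:
    """Return the ordered fetch attempts for a miss at the primary DC `local_dc`.

    `via` is a hypothetical stand-in for the loop-prevention headers: the DCs
    the request already traversed before reaching `local_dc`.
    """
    if peer_dc in via:
        # The request already went through the peer's caches, so asking the
        # peer again would only create a loop. Go straight to the applayer.
        return ["applayer"]
    # Otherwise ask the peer for a cached copy first, then fall back.
    return [f"{peer_dc}:only-if-cached", "applayer"]
```

In the active:active case the same function would simply be called with the roles swapped, e.g. `plan_miss("codfw", "eqiad", via)`.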

I'm not sure this can be implemented sanely in Varnish 3, but it definitely can be in Varnish 4 with how backend-side request-restart works. So it's best to block this until post-Varnish 4.
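For reference, the fallback flow that a backend-side restart would need to implement looks roughly like the sketch below. This is plain Python simulating the control flow, not Varnish code; `fetch_with_side_check`, `peer_fetch`, and `app_fetch` are hypothetical names. It relies on the HTTP caching rule that a cache which cannot satisfy an `only-if-cached` request from its stored responses answers 504 (Gateway Timeout).

```python
from typing import Callable, Dict, Tuple

# A fetcher takes request headers and returns (status, body).
Fetcher = Callable[[Dict[str, str]], Tuple[int, str]]


def fetch_with_side_check(
    peer_fetch: Fetcher, app_fetch: Fetcher, side_check: bool
) -> Tuple[int, str, str]:
    """Try the peer DC with only-if-cached, then fall back to the applayer.

    Returns (status, body, source), where source is "peer" or "applayer".
    """
    if side_check:
        status, body = peer_fetch({"Cache-Control": "only-if-cached"})
        if status != 504:
            # Anything other than 504 means the peer answered from its cache.
            return status, body, "peer"
        # 504 signals "not cached at the peer": retry against the applayer.
    status, body = app_fetch({})
    return status, body, "applayer"
```

The retry after the 504 is the step that corresponds to restarting the backend request with a different backend.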

Related Objects

Status	Assigned	Task
Declined	None	T142841 Sideways Only-If-Cached on misses at a primary DC
Resolved	• ema	T131499 Upgrade all cache clusters to Varnish 4
Resolved	• ema	T126206 Upgrade to Varnish 4: things to remember
Resolved	• ema	T128788 Port varnishlog.py to new VSL API
Resolved	• ema	T131353 Port remaining scripts depending on varnishlog.py to new VSL API
Resolved	• ema	T131501 Convert misc cluster to Varnish 4
Resolved	• ema	T134989 WDQS empty response - transfer clsoed with 15042 bytes remaining to read
Resolved	• ema	T131502 Convert upload cluster to Varnish 4
Resolved	BBlack	T131761 Solve large-object/stream/pass/chunked in upload cluster better
Resolved	• ema	T142076 Analyze Range requests on cache_upload frontend
Resolved	• ema	T142233 Varnish 4 stalls with two consecutive Range requests using HTTP persistent connections
Resolved	• ema	T131503 Convert text cluster to Varnish 4
Resolved	BBlack	T135696 Sort out vcl_deliver vs vcl_synth mess with v4 VCL
Resolved	• ema	T150660 Post Varnish 4 migration cleanup

Event Timeline

BBlack created this task. Aug 12 2016, 2:39 PM

Restricted Application added a project: SRE. Aug 12 2016, 2:39 PM

Restricted Application added a subscriber: Aklapper.

BBlack added a parent task: T131499: Upgrade all cache clusters to Varnish 4. Aug 12 2016, 2:39 PM

BBlack removed a parent task: T131499: Upgrade all cache clusters to Varnish 4.

BBlack added a subtask: T131499: Upgrade all cache clusters to Varnish 4.

BBlack mentioned this in T142848: Stop using persistent storage in our backend varnish layers. Aug 12 2016, 4:50 PM

That would add at least one round trip of latency on every true miss that hits eqiad/esams/ulsfo (a new or just-purged page), wouldn't it?

On a true miss, yes, it would add a codfw<->eqiad round trip. That's only ~35ms, though, which may be hard for MW to beat on average. True misses should be rare, except when a backend cache has been wiped, in which case we'd rather spam eqiad<->codfw to side-reload the cache than spam the applayer.
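To make the trade-off concrete, here's a back-of-the-envelope sketch. It assumes a peer hit costs just the inter-DC round trip and a peer miss costs the round trip plus the normal applayer fetch; the function names and any applayer latency figure you plug in are illustrative, not measured values.

```python
def expected_miss_latency_ms(
    rtt_ms: float, app_ms: float, peer_hit_ratio: float
) -> float:
    """Expected latency of a local miss when the side-check is enabled.

    Every local miss pays the inter-DC round trip; only the fraction that
    also misses at the peer goes on to pay the applayer fetch.
    """
    return rtt_ms + (1.0 - peer_hit_ratio) * app_ms


def break_even_hit_ratio(rtt_ms: float, app_ms: float) -> float:
    """Peer hit ratio above which the side-check beats going straight to MW.

    Derived from rtt + (1 - h) * app < app, i.e. h > rtt / app.
    """
    return rtt_ms / app_ms
```

With the ~35ms round trip, the side-check only pays off on average when the peer's hit ratio for these misses exceeds 35 divided by the applayer's typical response time in milliseconds. After a cache wipe that ratio should be high; in steady state, where local misses are mostly true misses, it should be close to zero.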

This wouldn't apply to passes from hit-for-pass, though. It also shouldn't apply to a normal page expiry within the grace window. I'm unsure about purges; it probably depends on whether we use 'softpurge' or not (which allows the page contents to remain in grace briefly, and seems like a fine choice given that purging isn't synchronous anyway, if we can keep the grace window small).

Oh, re-reading your question, you mentioned specific DCs. In the current layout where only eqiad is "primary", the side-checks from eqiad to codfw would only happen for requests that initially enter through esams or eqiad. ulsfo flows through codfw on its way to eqiad.
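Spelled out as a sketch, with the routing table written the way it's described above (the `ROUTES` dict and `side_check_applies` are hypothetical names, and the table is a simplification of the real routing config):

```python
# Cache-to-cache path each entry DC takes toward the applayer, in the layout
# where only eqiad is "primary".
ROUTES = {
    "eqiad": ["eqiad"],
    "codfw": ["codfw", "eqiad"],
    "esams": ["esams", "eqiad"],
    "ulsfo": ["ulsfo", "codfw", "eqiad"],
}


def side_check_applies(entry_dc: str, primary: str = "eqiad", peer: str = "codfw") -> bool:
    """True if a miss at `primary` for this entry DC would side-check `peer`."""
    path = ROUTES[entry_dc]
    # Only requests that end at the primary without having already passed
    # through the peer's caches are candidates for the side-check.
    return path[-1] == primary and peer not in path
```

Only the `esams` and `eqiad` entries qualify here; `ulsfo` and `codfw` traffic has already been through codfw's caches by the time it misses in eqiad.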

Of course, if MW can beat an eqiad<->codfw trip for the same page... we could look at other ways to structure this so it doesn't kick in all the time. Perhaps trigger it only in the first X minutes after a fresh varnishd restart, or only when we hit the connection parallelism limit to the real MediaWiki backend?
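A minimal sketch of that kind of gating, assuming both triggers are combined with a simple OR (they could just as well be used separately). All parameter names are hypothetical, and `warmup_s` is the "X minutes" window expressed in seconds.

```python
def should_side_check(
    uptime_s: float, warmup_s: float, app_conns_in_use: int, app_conns_max: int
) -> bool:
    """Decide whether a miss should try the peer DC before the applayer."""
    # Trigger 1: varnishd restarted recently, so the local cache is still cold.
    recently_restarted = uptime_s < warmup_s
    # Trigger 2: no spare connections to the applayer backend.
    applayer_saturated = app_conns_in_use >= app_conns_max
    return recently_restarted or applayer_saturated
```

Outside those two conditions, misses would go straight to the applayer as they do today, so steady-state true misses wouldn't pay the extra round trip.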

• ema triaged this task as Medium priority. Aug 15 2016, 9:45 AM

• ema moved this task from Backlog to Caching on the Traffic board. Sep 30 2016, 2:37 PM

This seems really complicated to get "right", and it only helps us much in corner cases. There are potential downsides for the pattern-adaptation of each DC's backend storage to its regional variance, and of course there's @faidon's latency argument. If/when we move backend cache storage to ATS, we can re-evaluate similar ideas there (they may be able to make miss-fetch attempts in multiple directions in parallel, with multicast, etc.).

• ema closed subtask T131499: Upgrade all cache clusters to Varnish 4 as Resolved. Nov 24 2016, 3:07 PM
