Page MenuHomePhabricator

Solve large-object/stream/pass/chunked in upload cluster better
Closed, ResolvedPublic

Description

Basically, we should adapt the current upload VCL that deals with these issues to more-closely mirror the pattern established recently for this misc cluster in T128813 . Ideally we should resolve that just before/during the Varnish 4 transition of cache_upload.

Old Description:
The basics of the situation look like this:

  1. do_stream is not universally-good (it has negatives about slowing down concurrent requests for the same resource if the response is cacheable/shareable as opposed to "pass"), but for large objects we tend to want to turn it on to avoid taking multiple buffering delays through all of the cache layers. Therefore, in at least some places we'll want a content-length-sensitive block that turns on beresp.do_stream.
  2. For objects that are even larger than the above threshold, sometimes we want to just create a hit-for-pass object and not cache them at all. This usually only applies to frontend layers, since they have smaller total cache dataset sizes, and it's not very expensive to fetch them from the immediate (local) backend caches.
  3. However: some applayer backends, and sometimes varnish itself for inter-cache requests, use TE:chunked to stream out a response. This means the receiving cache doesn't get a Content-Length header in vcl_fetch on which to conditionally decide on whether to do_stream and/or hit-for-pass.

Currently, code related to the above only exists on the misc and upload clusters. In the misc case, by way of how the defaulting works, chunked responses are treated as if they break all size thresholds and are very large. In the upload case we do the opposite, and only treat objects as sufficiently-large to pass checks if they do have an explicit Content-Length which is above the limit (chunked are considered small responses).

Misc's behavior is a big part of what leads to the issues in T128813 , upload's behavior is also not optimal in the general case (but may be mostly-ok in practice at present there because the behavior of Swift is well-defined), and then the other services could probably use this sort of logic as well but don't have it at all.

There are multiple ways we could approach solving this set of issues for the general case (and move this to shared VCL, perhaps parameterizing the size cutoffs if necessary?), and this task is about exploring them.

My thoughts currently are:

  1. Regardless of how we solve the "default for unknown sizes" problem, if we used the same cutoff for do_stream that we do for hit-for-pass on the frontends, we'd avoid ever causing issues with do_stream and concurrent clients (which is that if it's cacheable and fetching for do_stream, the client triggering the fetch sets the speed at which the cache fetches the object for all concurrent clients. if this is a very slow client connection, it slows down the fetch for all the faster clients which stack up in the meantime). It should be a relatively low-incidence issue (as it's only on cache refresh of this cacheable object), but it's still not ideal. This doesn't solve the issue on backends though, which wouldn't hit-for-pass all streamable objects in an ideal world...
  1. We could simply never do_stream on the 'direct' (backend-most) backend caches (as they should fill fast anyways! the primary benefit of do_stream is crossing higher-latency inter-cache WAN links, and the final link from the frontend to the user), and send the length even if the varnish output is TE:chunked (as in, set resp.http.X-Size = obj.length or something along those sorts of lines), so that we can still make stream/pass decisions further up the caching chain regardless of TE:chunked. Don't know if that's possible...
  1. One of the biggest question-marks in my mind right now is: when does varnish choose to do TE:chunked output in general, and how does this affect inter-cache fetches and these size checks in general today?

Event Timeline

On point 3 above (when does varnish send TE:chunked?), my best observations/code-searching indicate:

  1. Obviously a do_stream of a chunked fetch is a chunked response to the client (real client or upstream varnish). I'm not sure about do_stream when the fetch from underneath isn't chunked.
  2. Even if it's not a do_stream fetch, if varnish has to gzip or gunzip on the fly for the response to the client (cache object is gzipped but client has no AE:gzip, or vice-versa (object not gzipped, client has AE:gzip) and the Content-Type matches our do_gzip list of compressible types), varnish sends TE:chunked
  3. It doesn't chunk for http/1.0 obviously. And it does chunk for ESI (which we don't use).

And on point 2 above: since varnish seems to be smart about using TE:chunked only when the response length isn't easy to know, there's not much wiggle room for injecting a valid alternative to Content-Length. Also, there is no metadata like object.length that VCL can access, that I can see.

Going into more detail on the current behaviors of misc and upload clusters:

cache_misc - regardless of tier/layer, it sets do_stream for objects >= 1MB, does hit-for-pass on objects >= 10MB, and treats all chunked fetches as large objects (above both thresholds).

cache_upload

  • Frontend, in datacenters which are 'direct': if object is >= 32MB, do_stream + hfp. All chunked fetches treated as large objects.
  • Frontend, in non-direct DCs: if object is >= 1MB, do_stream. If object >= 32MB, hfp. chunked fetches are considered large objects.
  • Backends, all DCs: if object >= 1MB, do_stream. chunked fetches are considered small objects in this case.
  • Note also that cache_upload has some do_stream/hfp/hash_ignore_busy behaviors related to Range requests, but that's separate

Also, confirmed that do_stream of a non-chunked fetch doesn't cause chunked response on cache_upload.

Another input here: in the common case, it seems MediaWiki outputs content with TE:chunked, too.

Change 281499 had a related patch set uploaded (by BBlack):
VCL: remove all non-default between_bytes_timeout

https://gerrit.wikimedia.org/r/281499

So I've spent a few days pondering all of this. There are definitely some improvements we could make to cache_upload's way of doing things, and those improvements would apply well to other clusters (especially cache_misc's, whose behavior is horrible under our current setup and services). But with the tradeoffs involved, it's very tricky to reason about and even more tricky to tune the magic values. The crux of the problem is that varnish's streaming has downsides for mixed client speeds.

In brief (without going into pages of logic behind it), compared to cache_upload's behavior today, I think we'd be served well by raising the do_stream limit on direct backends, streaming chunked transfers on indirect backends, and aligning the pass+stream limits on the frontends at a larger size limit than we use today on the remote FEs, and then applying that logic to all clusters.

However, the really good news is that I've dug around in Varnish 4 a bit, and it turns out that one of the lesser documented amazing improvements there is they've removed the downsides of do_stream (it now buffers correctly to de-couple concurrent client transfer speeds) and made do_stream default to on instead of off. In Varnish 4, do_stream is only turned off internally when necessary for things like ESI (which we don't use) and client-facing dynamic gzip/gunzip operations (which should be rare for normal clients+content). IMHO, rather than go through the significant risks and testing of improving our streaming code, we should just let this all block on Varnish 4 upgrades.

cache_misc is the one cluster most in need of immediate improvements here, and also happens to be next on our conversion list anyways.

I still think we should apply the BBT patch above (it's an improvement both now and under varnish4), and I'll add a note to our "Upgrade to Varnish 4: things to remember" ticket T126206 about removing the useless code in cache_misc/upload VCL along the way...

Change 281499 merged by BBlack:
VCL: remove all non-default between_bytes_timeout

https://gerrit.wikimedia.org/r/281499

akosiaris triaged this task as Medium priority.Apr 20 2016, 11:26 AM

Note: the work in T128813 looks pretty good, and is probably the general pattern we should follow, especially as clusters move to Varnish 4.

That being said, once a clusters steps at all into the realm of making content-length-based decisions, there are always tradeoffs, even under varnish 4. For misc and upload, we need to make those tradeoffs. For text and maps, it may be better to simply assume we do not normally have any response objects large enough to matter for thrashing frontend caches for now, instead of injecting the tradeoffs into the equation.

So, re-titling this so that we remember to fix it for cache_upload as we head into its Varnish 4 migration and blocker-ing it into that chain of tickets.

BBlack renamed this task from Solve large-object/stream/pass/chunked in our shared VCL to Solve large-object/stream/pass/chunked in upload cluster better.Apr 20 2016, 1:42 PM
BBlack updated the task description. (Show Details)
BBlack claimed this task.

So, we've reviewed the VCL and the Swift output, and the bottom line is this is a non-issue for cache_upload today and doesn't need to block conversion to Varnish4, because Swift outputs Content-Length, and also most of our size-based decisions are about do_stream=true, which become no-ops in Varnish4.