
Varnish seems to sometimes mangle uncompressed API results
Closed, ResolvedPublic


Compare the last characters of:

curl -H 'Host:' -H 'X-Forwarded-Proto: https' 'http://mw1017/w/api.php?action=wbgetentities&redirects=no&props=claims&format=json&ids=Q18407428|Q18407424|Q18407421|Q18407418|Q18407262|Q18407245|Q18407242|Q18407240|Q18407236|Q18407234|Q18407233|Q18407228|Q18407225|Q18407224|Q18407221|Q18407218|Q18407127|Q18407044|Q18407038|Q18407009|Q18407008|Q18407006|Q18407005|Q18407004'


curl '|Q18407424|Q18407421|Q18407418|Q18407262|Q18407245|Q18407242|Q18407240|Q18407236|Q18407234|Q18407233|Q18407228|Q18407225|Q18407224|Q18407221|Q18407218|Q18407127|Q18407044|Q18407038|Q18407009|Q18407008|Q18407006|Q18407005|Q18407004'

This only happens if we ask for the non-gzipped result.

I can't see any obviously wrong headers here, so I'm assuming Varnish is to blame.

Event Timeline

Do you know if some normal traffic is affected, such that we'd know a start date for a recent change in behavior? Or is it suspected that it was always this way?

I've been digging through some debugging on this URL (which is an applayer chunked-response with no-cache headers), and it's definitely happening at the varnish<->MW boundary (as opposed to further up, at varnish<->varnish or nginx<->varnish), and only for non-AE:gzip requests. The length of the result is correct, but there's corruption in the trailing bytes.

Do you know if some normal traffic is affected, such that we'd know a start date for a recent change in behavior? Or is it suspected that it was always this way?

I'm not aware of normal traffic being affected, it possibly started around 2016-04-11 as per

Just jotting down the things I know so far from investigating this morning. I still don't have a good answer yet.

Based on just the test URL, debugging it extensively at various layers:

  1. The response size of that URL is in the ballpark of 32KB uncompressed, so this is not a large-objects issue. It also streams out of the backend reliably and quickly, there are no significant timing pauses.
  2. Without a doubt, anytime the client doesn't use AE:gzip when talking to the public endpoint, the response is corrupt.
  3. Slightly deeper, even when testing requests against a single layer of varnish internally, the response to a non-AE:gzip client request is corrupt.
  4. It's definitely happening at the Varnish<->Apache/MediaWiki boundary (disagreement) or within a single varnishd process as it prepares the response. It's not a Varnish<->Varnish, Varnish<->Nginx, or Nginx<->client issue.
  5. All of the gzip/chunked stuff looks basically correct in the headers at varnish/apache boundary and varnish/client boundary. We do send AE:gzip when expected, we do get CE:gzip when expected (only when asked for), the gzip-ness of MW/Apache's output always correctly follows its CE:gzip header (or lack thereof), etc.
  6. Curl has no problem parsing the output of a direct fetch from Apache/MediaWiki, whether using --compressed to set AE:gzip or not, and the results hash the same (identical content).
  7. Varnish emits the corrupt content for the non-AE:gzip client regardless of whether I tweak the test scenario to ensure that varnish is the one gzipping the content, or that Apache/Mediawiki are the ones gzipping the content. So it's not an error in gzip compression of the response by just one party or the other. The error happens when gunzipping the response for the non-AE:gzip client.
  8. However, when I run through a similar set of fully-debugged test scenarios for , which is ~1MB in uncompressed length, and similarly TE:chunked with backend gzip capabilities and do_gzip=true, and on the same cluster and VCL (and even same test machine), I don't get the corruption for a non-AE:gzip client, even though varnish is decompressing that on the fly as with the bad test URL.
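The check in point 6 can be sketched in Python: fetch one URL twice, with and without AE:gzip, and compare content hashes. This is an illustrative sketch, not code from the investigation; `decode_body()` is the pure part, and the URL passed in would be the test URL (elided above).

```python
# Sketch of the point-6 check: the content should hash identically whether
# or not the server applied CE:gzip on the wire.
import gzip
import hashlib
import urllib.request

def decode_body(raw, content_encoding):
    """Undo CE:gzip if the server applied it; otherwise pass through."""
    if content_encoding == "gzip":
        return gzip.decompress(raw)
    return raw

def body_md5(url, ask_gzip):
    """Fetch url, optionally advertising AE:gzip, and hash the decoded body."""
    req = urllib.request.Request(url)
    if ask_gzip:
        req.add_header("Accept-Encoding", "gzip")
    with urllib.request.urlopen(req) as resp:
        body = decode_body(resp.read(), resp.headers.get("Content-Encoding"))
    return hashlib.md5(body).hexdigest()

# On a healthy path, body_md5(url, True) == body_md5(url, False).
```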

The obvious distinctions between the Barack article and the failing test URL don't amount to much: api.svc vs appservers.svc shouldn't matter, since they're both behind the same apache and hhvm configs. This leaves me guessing that there's something special about the specific output of the test URL that's causing this.

There's almost certainly a varnish bug involved here, but the question is: is this a pure varnish gunzip bug that's sensitive to certain conditions which exist for the test URL but not the Barack one? Is the output of Apache/MW buggy in some way for the test URL such that it trips the bug (in which case it's still a varnish bug that it doesn't reject the buggy response and turn it into a 503 or similar)? Or is the output non-buggy, but "special" in a way that trips a varnish bug?

Did some further testing on an isolated test machine, using our current varnish3 package.

  • Got 2833-byte test file from uncorrupted (CE:gzip) output on prod. This is the exact compressed content bytes emitted by MW/Apache for the broken test URL.
  • Configured a test backend (nginx) to serve static files, and to always set CE:gzip.
  • Placed the gzipped 2833 byte file in test directory, fetched over curl w/ --compressed, md5sum comes out right.
  • When fetched through our varnishd with a default config and do_gzip turned on, varnish does decompress this file for the curl client, and there is no corruption (still same md5sum).
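The nginx side of this test setup might look roughly like the following. This is a hypothetical sketch (paths and port are made up); forcing CE:gzip onto static files is deliberately the unusual part, to mimic MW/Apache's gzipped output.

```nginx
# Hypothetical test backend: serve the pre-gzipped 2833-byte payload as-is,
# but label it as gzip-encoded so varnish treats it like MW/Apache output.
server {
    listen 8080;
    root /srv/gztest;    # directory holding the captured gzipped file
    location / {
        add_header Content-Encoding gzip;
    }
}
```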

This rules out the possibility that this is some pure, data-sensitive varnish bug with gunzipping the content itself. However, the notable difference between this test and reality is that nginx serving a static pre-gzipped file is (a) not emitting it as TE:chunked and (b) even if it did, it probably wouldn't use the same chunk boundaries, nor is it likely to share any TE:chunked encoding bugs or varnish-bug-triggers...

Thanks for merging in the probably-related tasks. I had somehow missed really noticing T123159 earlier... So probably digging into gunzip itself isn't a fruitful path. I'm going to open a separate blocker for this that's private, so we can keep merging public tickets into this...

BBlack triaged this task as High priority. May 7 2016, 3:49 PM

Using the exact URL from the task description, I'm consistently getting a bad response:

>>> from subprocess import check_output
>>> args = ['curl', '-s', '|Q18407424|Q18407421|Q18407418|Q18407262|Q18407245|Q18407242|Q18407240|Q18407236|Q18407234|Q18407233|Q18407228|Q18407225|Q18407224|Q18407221|Q18407218|Q18407127|Q18407044|Q18407038|Q18407009|Q18407008|Q18407006|Q18407005|Q18407004']
>>> output = check_output(args)
>>> output[-30:]

I got called away while testing. Using the same URL later, I'm now getting a different bad response:

>>> output[-30:]

I thought both api.php request URLs and responses were typically not cached. However, if I append a cache-buster (in this case, &4), I get non-corrupt output:

>>> from subprocess import check_output
>>> args = ['curl', '-s', '|Q18407424|Q18407421|Q18407418|Q18407262|Q18407245|Q18407242|Q18407240|Q18407236|Q18407234|Q18407233|Q18407228|Q18407225|Q18407224|Q18407221|Q18407218|Q18407127|Q18407044|Q18407038|Q18407009|Q18407008|Q18407006|Q18407005|Q18407004&4']
>>> output = check_output(args)
>>> output[-30:]


It also seems to be related to the size of the response: 32 KB (32768 bytes) is the critical size, and the bad response's size is 32766 = 0x7FFE bytes.

Copying @Anomie's comment from T132159:

While investigating T132123, I discovered that responses with lengths near to multiples of 32768 bytes will have their final bytes corrupted. A response with a length exactly 32768 bytes will have the final 8 bytes corrupted (in my test, they are replaced with 27 a5 ae 23 8b 7f 00 00). A file with length 1–7 bytes less than such a multiple will have the final 7–1 bytes corrupted.

Steps to reproduce:

  1. Create a file with a length that is a multiple of 32768 someplace where it will be served by our infrastructure, e.g. using dd if=/dev/zero of=/srv/mediawiki/w/test.txt bs=1 count=32768 on mw1017.
  2. Fetch the file from the public Internet without using compression, e.g. using curl --header 'X-Wikimedia-Debug: backend=mw1017.eqiad.wmnet' -v ""

I've tested this with backends mw1017, mw1099, and mw2099 (frontend via eqiad) with identical results. Going through different load balancers (e.g. using curl -k --header 'Host:' --header 'X-Wikimedia-Debug: backend=mw2099.codfw.wmnet' -v "") also produces the corruption, but the specific values of the corrupted bytes are different.
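The length-to-corruption relationship Anomie describes can be modeled with a small helper (an illustrative sketch derived from the quoted rule, not code from the investigation):

```python
def corrupted_tail_bytes(length, window=32768):
    """How many trailing bytes the reported bug corrupts for a response of
    the given length: 8 bytes at an exact multiple of 32768, down to 1 byte
    at 7 bytes under a multiple, and 0 everywhere else."""
    deficit = -length % window   # distance up to the next multiple of 32768
    if deficit <= 7:
        return 8 - deficit
    return 0
```

Applied to this task's 32766-byte bad response, the rule predicts 6 corrupted trailing bytes.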

I did len(output) during my testing in T133866#2272654. The bad response (using the exact URL provided in this task) has a length of 32766 bytes, while the good response (using a cache-buster) has a length of 32822 bytes.

BBlack closed subtask Restricted Task as Resolved.May 9 2016, 1:28 PM

So, as it turns out, this is a general varnishd bug in our specific varnishd build. For purposes of this bug, our varnishd code is essentially 3.0.7 plus a bunch of ancient forward-ported 'plus' patches related to streaming, and we're missing (do_gzip + do_stream content-length bug for HTTP/1.0 reqs, which is eerily similar to this issue, but not quite the same) because it doesn't apply cleanly/sanely to our codebase due to conflicts with those patches.

What I can reliably and predictably observe and control for now is: we have a response-length-specific response corruption bug, only when both of these conditions are met:

  1. do_stream is in effect for this request (for text cluster, this means it's pass or initial miss+chfp(Created-Hit-For-Pass) traffic)
  2. the response has to be gunzipped for the client (client does not advertise gzip support, but backend response is gzipped by the applayer, or gzipped by varnish due to do_gzip rules).
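In VCL terms, the mitigation amounts to not enabling streaming on those paths. A minimal varnish3-style sketch (not the actual production VCL, which the changes below revert piecemeal):

```vcl
sub vcl_fetch {
    # Previously: set beresp.do_stream = true; for pass / hit-for-pass
    # traffic. Falling back to the default (no streaming) avoids the
    # stream+gunzip tail corruption for non-AE:gzip clients.
    set beresp.do_stream = false;
}
```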

In a lot of the test scenarios/requests that others and I were using previously, we weren't necessarily controlling for these variables well, which led to a lot of inconsistent results (notably, X-Wikimedia-Debug effectively turns non-pass traffic into pass traffic when debugging, but the same might not be true if testing directly from varnish to mw1017 without X-Wikimedia-Debug).

The do_gzip (and related gunzip) behaviors have been in place for a long time. What's new lately is the do_stream behavior, added to the cache_text cluster in the past couple of months for the pass-traffic cases. cache_upload has had do_stream for certain requests for a very long time, but various constraints there conspire to make it accidentally unlikely that we'll observe this bug on cache_upload for legitimate traffic. cache_misc probably suffers from this as well, though the conditions under which it does are trickier; almost surely this is related to T133490 as well.

So the basic game plan for this bug is:

  • cache_text - revert the relatively-recent do_stream-enabling VCL patches.
  • cache_misc - will resolve itself with the varnish4 upgrade, which is imminent for this cluster.
  • cache_upload - keep ignoring what is probably a non-problem in practice there for now; will eventually get fixed up with the varnish4 upgrade.
  • cache_maps - already varnish4, wouldn't have this issue.

Change 287623 had a related patch set uploaded (by BBlack):
Remove X-Pass-Stream support from cache_misc size stream VCL

Change 287624 had a related patch set uploaded (by BBlack):
Revert "text VCL: do_stream when creating hit-for-pass"

Change 287625 had a related patch set uploaded (by BBlack):
Revert "Varnish: protect against external streampass header setting"

Change 287626 had a related patch set uploaded (by BBlack):
Revert "Varnish: stream all pass traffic"

Change 287623 merged by BBlack:
Remove X-Pass-Stream support from cache_misc size stream VCL

Change 287624 merged by BBlack:
Revert "text VCL: do_stream when creating hit-for-pass"

Change 287625 merged by BBlack:
Revert "Varnish: protect against external streampass header setting"

Change 287626 merged by BBlack:
Revert "Varnish: stream all pass traffic"

Change 287633 had a related patch set uploaded (by BBlack):
cache_misc: remove all do_stream=true

Change 287633 merged by BBlack:
cache_misc: remove all do_stream=true

My test cases on cache_text work now; this should be resolved!