
PHP fatal errors causing Varnish to return 503 - "Junk after gzip data"
Closed, Resolved · Public

Description

In varnishlog output for http://meta.wikimedia.beta.wmflabs.org/wiki/Special:OAuthManageConsumers/proposed/1a31f6085491372db93394ed6cbff4c4 (and some other bugs that have since been fixed), this shows up:

30 ObjHeader    c Content-Type: text/html; charset=utf-8
30 FetchError   c Junk after gzip data
30 Gzip         c u F - 20 0 80 80 90
30 VCL_call     c error deliver
30 VCL_call     c deliver deliver
30 TxProtocol   c HTTP/1.1
30 TxStatus     c 503
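
For reference, a crude way to capture similar context on the cache host is to grep the client-side varnishlog stream around the error (a sketch only; the exact flags are an assumption about the Varnish tooling in use at the time):

varnishlog -c | grep -B 8 -A 4 'Junk after gzip data'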

If you request the page directly from apache with curl, you get an error Call to undefined method Message::toJson()
Note that this particular example requires being logged in, probably as an (un?)privileged user.

P633 also showed something similar happening on such errors.

Event Timeline

Krenair raised the priority of this task to Needs Triage.
Krenair updated the task description.
Krenair subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.
Krenair set Security to None.

Also, HTTPS makes Varnish return just a blank page (HTTP 200), not a 503.

<halfak> I'm getting a 503 on Special:OAuthManageConsumers on Meta.

Sounds like this affects prod

I'm getting a 503 on Meta. Here's the copy-paste details:

Request from 10.128.0.118 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 1945312116
Forwarded for: 198.73.209.4, 10.128.0.117, 10.128.0.117, 10.128.0.118
Error: 503, Service Unavailable at Sat, 06 Feb 2016 02:25:30 GMT

Full error message is Fatal error: Call to undefined method Message::toJson() in /srv/mediawiki/php-1.27.0-wmf.12/extensions/OAuth/frontend/specialpages/SpecialMWOAuthManageConsumers.php on line 337 (and there is no stack trace - @bd808 did something in fatal logging break?)

This is apparently by design: if the user does not have permission, the DAO object returns a Message object instead of the field value. The idea is that the user just sees a "this field is private" message without the higher-level code having to worry about access control, but the DAO framework also allows non-string return types via MWOAuthDAO::decodeRow, and the two features don't go well together.

@Halfak if this is blocking you, you can probably get the same information by using Special:OAuthListConsumers instead.

Should have read T125939 first, duh.

curl -vs --raw -H 'Accept-Encoding: gzip' -H 'Host:meta.wikimedia.beta.wmflabs.org' -H 'Cookie: centralauth_User=Tgr; centralauth_Token=5263b3788d3bf08b76c90d5617c45966; centralauth_Session=ea9983f76254565161fc48c40126c5cd; metawikiUserID=55; metawikiUserName=Tgr; metawikiSession=vhmo634o7pq9q4fkuts41sflarfdmis6' 'http://deployment-mediawiki01/wiki/Special:OAuthManageConsumers/proposed/1a31f6085491372db93394ed6cbff4c4'
* About to connect() to deployment-mediawiki01 port 80 (#0)
*   Trying 10.68.17.170... connected
> GET /wiki/Special:OAuthManageConsumers/proposed/1a31f6085491372db93394ed6cbff4c4 HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Accept: */*
> Accept-Encoding: gzip
> Host:meta.wikimedia.beta.wmflabs.org
> Cookie: centralauth_User=Tgr; centralauth_Token=5263b3788d3bf08b76c90d5617c45966; centralauth_Session=ea9983f76254565161fc48c40126c5cd; metawikiUserID=55; metawikiUserName=Tgr; metawikiSession=vhmo634o7pq9q4fkuts41sflarfdmis6
> 
< HTTP/1.1 200 OK
< Date: Sat, 06 Feb 2016 03:41:32 GMT
< Server: Apache
< X-Powered-By: HHVM/3.6.5
< X-Content-Type-Options: nosniff
< Cache-control: no-cache
< Content-Encoding: gzip
< Vary: Accept-Encoding
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=utf-8
< 
14
<U+001F><U+0003><U+0003>
2000
<...non-gzip-encoded HTML...>

HHVM can do its own gzip compression, so presumably this is a bug with that.

HHVM can do its own gzip compression, so presumably this is a bug with that.

In general, it's probably best to disable gzip output compression at the application layer and let the cache layers do it. Basically, if the applayer sends gzip content with TE: chunked (no content length), Varnish doesn't cache it. This has caused pretty severe issues in some isolated cases in the past with Apache (which will send a Content-Length for small gzipped outputs, but switches to chunked beyond a certain size so that it can gzip while streaming the output). Ref T124195
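
For anyone checking a backend for this, one way to see which combination the application layer is actually emitting is to look only at the relevant response headers (a sketch; host and URL are taken from the repro earlier in this task, adjust as needed):

curl -sv -o /dev/null -H 'Accept-Encoding: gzip' -H 'Host: meta.wikimedia.beta.wmflabs.org' 'http://deployment-mediawiki01/wiki/Main_Page' 2>&1 | grep -iE '^< (content-encoding|content-length|transfer-encoding)'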

Full error message is Fatal error: Call to undefined method Message::toJson() in /srv/mediawiki/php-1.27.0-wmf.12/extensions/OAuth/frontend/specialpages/SpecialMWOAuthManageConsumers.php on line 337 (and there is no stack trace - @bd808 did something in fatal logging break?)

I only see these messages in hhvm.log. We never did figure out how to get that log stream to include a PHP stacktrace. We do seem to still be getting stacktraces in fatal.log, but as noted at T89169#1573731 our ability to trap fatals inside the interpreter can be disrupted if the error handler we use to do that is replaced with another handler. OAuth doesn't seem to call set_error_handler() directly so I'm not sure if that is what is happening here or not.

Andrew triaged this task as Medium priority. Apr 14 2016, 7:57 PM

In general, it's probably best to disable gzip output compression at the application layer and let the cache layers do it. Basically, if the applayer sends gzip content with TE: chunked (no content length), Varnish doesn't cache it. This has caused pretty severe issues in some isolated cases in the past with Apache (which will send a Content-Length for small gzipped outputs, but switches to chunked beyond a certain size so that it can gzip while streaming the output). Ref T124195

As long as I'm noticing this popping up in my email: the above turned out to be a myth; it's not generally true that our varnishes can't cache TE:chunked responses.

Is this still reproducible? Did we decide whether varnish or hhvm was at fault?

I just got this again after making MW fatal in beta. Varnish started sending 503s instead, and varnishlog showed "Junk after gzip data"

When you make MW fatal in beta, does hhvm emit some kind of valid error output, or is it truncating its output due to the fatal exception (possibly in the midst of an encoded gzip stream)?
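
(One quick way to answer that, assuming the same beta host used earlier in this task plus whatever cookies trigger the fatal: curl removes the chunked transfer-encoding but leaves Content-Encoding alone, so piping the body through zcat shows whether the gzip stream is intact.)

curl -s -H 'Accept-Encoding: gzip' -H 'Host: meta.wikimedia.beta.wmflabs.org' 'http://deployment-mediawiki01/wiki/Main_Page' | zcat > /dev/null && echo 'gzip stream OK' || echo 'gzip stream truncated/corrupt'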

My change seems to make MW fatal when Varnish contacts it but not when I try to curl. So it's difficult to tell.

Can you give me a way to reproduce it?

I just started typing instructions, went to test them (in my separate chrome session - which exists for this extra -WMF account in Phab) and realised the bug I was triggering was tied to my cookies, which explains why I didn't get it with curl earlier.

ssh root@deployment-tin
cd /srv/mediawiki-staging/php-master/extensions/VisualEditor
git reset --hard d70485244d4b192fd8382c9364a4c8dca80057ad
sudo -u jenkins-deploy scap sync-dir php-master/extensions/VisualEditor

Now curl like this:
curl -v -H "Accept-Encoding: gzip" -H "Host: en.wikipedia.beta.wmflabs.org" -H "Cookie: VEE=visualeditor; enwikiUserName=Krenair; centralauth_User=Krenair; centralauth_Session=SECRET" http://deployment-mediawiki01/wiki/Main_Page --compressed, where SECRET is replaced by the value in /home/krenair/T125938-sess
curl will fail if all of the pieces are aligned to trigger the bug I was dealing with - try again without the --compressed

Then to revert:

git pull origin master
sudo -u jenkins-deploy scap sync-dir php-master/extensions/VisualEditor

I put the results without the --compressed part of the curl command in P3194 - note the Content-Encoding: gzip response header and the first three characters of the response body (not shown in the browser) before <!DOCTYPE. I believe these bytes are 0x1F, 0x03 and 0x03.
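
A valid gzip stream starts with the magic bytes 0x1F 0x8B, so dumping the first few raw bytes makes the bogus header easy to spot. A sketch, assuming $COOKIES holds the same Cookie header value as the command above:

curl -s -H "Accept-Encoding: gzip" -H "Host: en.wikipedia.beta.wmflabs.org" -H "Cookie: $COOKIES" http://deployment-mediawiki01/wiki/Main_Page | head -c 16 | xxd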

Ok so, yeah, the repro is a little complicated, but this presumably isn't a Varnish fault since hhvm's output is obviously buggy. The next questions are probably:

  1. Does this happen for all MW fatals? Or only certain kinds of fatals at certain stages of processing a response?
  2. Is MW at fault here (e.g. for failing fatally in some inappropriate way and/or at an inappropriate critical time), or HHVM? I'm inclined to say HHVM is at fault, but who knows.
  3. Does HHVM at least always close the connection after emitting the buggy output, even if the request to it was over an HTTP/1.1 connection that would otherwise keepalive? If it keeps the connection open after the buggy output, this could very well be what causes the slight 503 elevations seen every time we've tried to turn on HTTP/1.1 from nginx->varnish for the text cluster in T107749. (A rough way to check this is sketched right after this list.)
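
For question 3, a rough check, assuming the beta host from the repro above and $COOKIES set to the cookies that trigger the fatal: curl reuses a connection across multiple URLs given on one command line and says so in its verbose output, so requesting the failing URL twice shows whether the backend closed the connection after the buggy response.

curl -v -H 'Accept-Encoding: gzip' -H 'Host: en.wikipedia.beta.wmflabs.org' -H "Cookie: $COOKIES" -o /dev/null -o /dev/null 'http://deployment-mediawiki01/wiki/Main_Page' 'http://deployment-mediawiki01/wiki/Main_Page' 2>&1 | grep -iE 're-using|left intact|closing connection'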

Yeah, those are definitely the right questions, but I don't have any useful answers and am not likely to have time to dig in further soon. I'm not sure if it affects all MW fatals; although I also suspect HHVM, I haven't ruled MW out yet; and I have no idea whether it always instantly closes the connection.

We saw this again today. There was a bug in TimedMediaHandler causing a 500 Internal Server Error only for (at least some) users with valid CentralAuth credentials in their cookies, for pages containing video content.

When reproducing internally against appservers.svc (no Traffic-layer involvement) by copying valid CA cookies from the browser, if curl was used with Accept-Encoding: none to avoid gzip, the 500 error would show correctly in the output. If curl --compressed was used, the result was corrupted gzip output (which Varnish then turns into a user-facing 503).
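
Roughly what those two internal reproductions look like as commands (a sketch: the Host header and article path are placeholders for whatever video page triggered the bug, $CA_COOKIES stands for the copied CentralAuth cookies, and appservers.svc is the internal service name mentioned above):

curl -sv -o /dev/null -H 'Host: en.wikipedia.org' -H "Cookie: $CA_COOKIES" -H 'Accept-Encoding: none' 'http://appservers.svc/wiki/SOME_VIDEO_PAGE'
# ...the status line shows the real 500, whereas:
curl -sv -o /dev/null -H 'Host: en.wikipedia.org' -H "Cookie: $CA_COOKIES" --compressed 'http://appservers.svc/wiki/SOME_VIDEO_PAGE'
# ...chokes while decoding the corrupted gzip body (the same thing Varnish turns into a user-facing 503).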

Given the number of issues we've seen over time with hhvm/php gzip output under various error conditions causing big confusion (by hiding interesting 500s behind 503s), I really think we should disable hhvm gzip output entirely. The cache layers will gzip gzippable content for the cache storage and the user regardless, without inducing this buggy confusion.

Change 314317 had a related patch set uploaded (by Chad):
Disable $wgUseGzip

https://gerrit.wikimedia.org/r/314317

Change 314317 merged by jenkins-bot:
Disable $wgUseGzip

https://gerrit.wikimedia.org/r/314317

Mentioned in SAL (#wikimedia-operations) [2016-10-05T18:39:11Z] <demon@tin> Synchronized wmf-config/CommonSettings.php: disable gzip internally, T125938 (duration: 00m 50s)

^ The above turns out to be confusingly-named but unrelated. We still haven't quite figured out how to disable php/hhvm's zlib output compression.

Halfak removed a subscriber: aaron.

If I understood hphp/runtime/server/transport.cpp#L989-L1008 correctly, compression would be disabled if BrotliCompressionEnabled, BrotliChunkedCompressionEnabled, and GzipCompressionLevel were all set to 0. Note that the ini-style variant for these config directives is lowercase_with_underscore.
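
If that reading is right, a minimal sketch of what the ini-style configuration might look like (the hhvm.server.* prefix and the config file path are both assumptions derived only from the directive names above, not verified settings):

# run as root; path and setting names are guesses per the note above
cat >> /etc/hhvm/php.ini <<'EOF'
; disable all output compression in HHVM (0 = off, per transport.cpp above)
hhvm.server.gzip_compression_level = 0
hhvm.server.brotli_compression_enabled = 0
hhvm.server.brotli_chunked_compression_enabled = 0
EOF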

In general, zlib supports a defined compression level of 0, which means "no compression" but is not the same as turning compression off entirely: the data is still wrapped in a zlib/gzip stream with the relevant header data, it just isn't actually run through a compression algorithm. So I would've thought zero would mean something different here, but the lines you're pointing at do seem to indicate it would turn it off completely, assuming enableCompression() controls it.

So, recapping this ticket that's been stale for quite a while:

  • We've had past applayer bugs with gzipped outputs in error cases, and I had rationalized that we should just disable all application-layer gzipping to avoid error edge cases, and let Varnish handle gzipping things on the way to its cache storage and the user:

We saw this again today. There was a bug in TimedMediaHandler causing a 500 Internal Server Error only for (at least some) users with valid CentralAuth credentials in their cookies, for pages containing video content.

When reproducing internally against appservers.svc (no Traffic-layer involvement) by copying valid CA cookies from the browser, if curl was used with Accept-Encoding: none to avoid gzip, the 500 error would show correctly in the output. If curl --compressed was used, the result was corrupted gzip output (which Varnish then turns into a user-facing 503).

Given the number of issues we've seen over time with hhvm/php gzip output under various error conditions causing big confusion (by hiding interesting 500s behind 503s), I really think we should disable hhvm gzip output entirely. The cache layers will gzip gzippable content for the cache storage and the user regardless, without inducing this buggy confusion.

Then we merged a patch we thought would do that ( https://gerrit.wikimedia.org/r/#/c/314317/ ), and then it turned out it did nothing:

^ The above turns out to be confusingly-named but unrelated. We still haven't quite figured out how to disable php/hhvm's zlib output compression.

At that point things trailed off and stalled back in 2016.


I'm back here today because I happened to notice some "odd" gzip output from MediaWiki. It's not invalid, just strange, and it happens to be a case where Varnish is also behaving strangely, so there could be a subtle bug-triggering interaction going on here. The odd case in question is the 302 response from Special:Random. This response reasonably has no body output and normally emits Content-Length: 0, but if the UA (or Varnish) sends Accept-Encoding: gzip, these empty bodies become compressed: the response towards Varnish has Content-Encoding: gzip and 20 pointless body bytes which are just the gzip header for an empty file.
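
For the record, here is roughly how that can be observed against an appserver directly (a sketch; the service name is the one mentioned earlier in this task, and the grep just trims curl's verbose output down to the interesting response headers):

curl -sv -o /dev/null -H 'Host: en.wikipedia.org' -H 'Accept-Encoding: gzip' 'http://appservers.svc/wiki/Special:Random' 2>&1 | grep -iE '^< (HTTP/|location|content-encoding|content-length)'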

I'd like to revive this ticket a bit and pursue actually disabling gzip output at the application layer for internal services behind Varnish (in general this would be nice, but the MW case is the most important one). There are at least a couple of different ways we can approach this problem.

It would be simplest and least-buggy/confusing to actually disable gzip in Apache (and/or hhvm as appropriate). The potential downside is lack of compression for WMF-internal requests which don't flow through Varnish, but my current opinion is that isn't a huge issue.
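
If we went the Apache route, on a Debian-style install that would look roughly like the following (module name and tooling are assumptions about the local setup, not something checked against the production puppetization):

sudo a2dismod deflate
sudo apache2ctl graceful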

Another option would be to leave the internal services' configs alone as they are, and hack on the Varnish side to have it never emit "AE: gzip" headers towards the application (but still accept CE:gzip from buggy servers, and still do its own gzipping of gzippable content into storage and on towards the user). That's not ideal either, but it's an option.

Change 421267 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] VCL: Do not request AE:gzip from applications

https://gerrit.wikimedia.org/r/421267

Change 421267 merged by BBlack:
[operations/puppet@production] VCL: Do not request AE:gzip from applications

https://gerrit.wikimedia.org/r/421267

BBlack claimed this task.

The above took care of it from the Varnish side (seems easier for the case we care about!). It's been live for a bit over 24h now, and stats reflect that it's doing the right things (no obvious errors caused by it, network bandwidth increased from mw*->cp*, cp* CPU% went up a bit, cp* network output stayed about the same). It has probably cleared up some corner-case errors and/or inefficiencies as described in various places in this ticket, too.

It's hard to say whether this was a major factor in larger issues like T181315 or T189892 yet. Time will tell!

Change 500011 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: unset AE:gzip

https://gerrit.wikimedia.org/r/500011

Change 500011 merged by Ema:
[operations/puppet@production] ATS: unset Accept-Encoding

https://gerrit.wikimedia.org/r/500011