
PHP fatal errors causing Varnish to return 503 - "Junk after gzip data"
Closed, Resolved · Public

Description

In the varnishlog output for http://meta.wikimedia.beta.wmflabs.org/wiki/Special:OAuthManageConsumers/proposed/1a31f6085491372db93394ed6cbff4c4 (and some other bugs that have since been fixed), this shows up:

30 ObjHeader    c Content-Type: text/html; charset=utf-8
30 FetchError   c Junk after gzip data
30 Gzip         c u F - 20 0 80 80 90
30 VCL_call     c error deliver
30 VCL_call     c deliver deliver
30 TxProtocol   c HTTP/1.1
30 TxStatus     c 503

If you request the page directly from Apache with curl, you get an error: Call to undefined method Message::toJson().
Note that this particular example requires being logged in, probably as an (un?)privileged user.

P633 also showed something similar happening on such errors.
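For later readers, the condition Varnish is reporting can be checked outside Varnish. A minimal sketch (Python, not part of the original report) of what "junk after gzip data" means — a body that keeps going after the gzip stream has ended:

```python
import gzip
import zlib

def trailing_junk(body: bytes) -> bytes:
    """Return whatever follows the end of the gzip stream in body.

    An empty return value means the body is one clean gzip stream;
    anything else is the "junk after gzip data" Varnish complains about.
    """
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # expect a gzip header
    d.decompress(body)
    return d.unused_data

clean = gzip.compress(b"<html>ok</html>")
print(trailing_junk(clean))                       # b''
print(trailing_junk(clean + b"<!DOCTYPE html>"))  # b'<!DOCTYPE html>'
```

The second case mirrors the curl transcript later in this ticket, where plain HTML follows a broken gzip prefix.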

Details

Related Gerrit Patches:
operations/puppet (production): ATS: unset Accept-Encoding
operations/puppet (production): VCL: Do not request AE:gzip from applications
operations/mediawiki-config (master): Disable $wgUseGzip

Event Timeline

Krenair created this task. · Feb 5 2016, 3:33 AM
Krenair raised the priority of this task to Needs Triage.
Krenair updated the task description.
Krenair added a subscriber: Krenair.
Restricted Application added a project: Operations. · Feb 5 2016, 3:33 AM
Restricted Application added subscribers: StudiesWorld, Aklapper.

Also showed up for T125432 - P2552

Krenair updated the task description. (Show Details)Feb 5 2016, 3:46 AM
Krenair set Security to None.
Krenair added a comment.EditedFeb 5 2016, 3:59 AM

Also, over HTTPS Varnish returns just a blank page (HTTP 200) instead of a 503.

Krenair added a comment (edited). · Feb 6 2016, 2:24 AM

<halfak> I'm getting a 503 on Special:OAuthManageConsumers on Meta.

Sounds like this affects production.

Halfak added a subscriber: Halfak. · Feb 6 2016, 2:26 AM

I'm getting a 503 on Meta. Here are the copy-pasted details:

Request from 10.128.0.118 via cp1068 cp1068 ([10.64.0.105]:3128), Varnish XID 1945312116
Forwarded for: 198.73.209.4, 10.128.0.117, 10.128.0.117, 10.128.0.118
Error: 503, Service Unavailable at Sat, 06 Feb 2016 02:25:30 GMT
Tgr added subscribers: bd808, Tgr. · Feb 6 2016, 2:52 AM

The full error message is Fatal error: Call to undefined method Message::toJson() in /srv/mediawiki/php-1.27.0-wmf.12/extensions/OAuth/frontend/specialpages/SpecialMWOAuthManageConsumers.php on line 337 (and there is no stack trace; @bd808, did something in fatal logging break?)

This is apparently by design: if the user does not have permission, the DAO object returns a message object. The idea is that the user just sees a "this field is private" message without the higher-level code having to worry about access control, but the DAO framework also allows non-string return types via MWOAuthDAO::decodeRow, and the two features don't go well together.
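To make the clash concrete, here is a hypothetical stand-in (Python, not the actual OAuth extension code; the names Message and get_field are invented for illustration):

```python
import json

class Message:
    """Hypothetical stand-in for MediaWiki's i18n Message class."""
    def __init__(self, key: str):
        self.key = key

def get_field(row: dict, field: str, user_can_see: bool):
    """DAO-style accessor (sketch): return the decoded value, or a
    Message meaning "this field is private" when access is denied."""
    if not user_can_see:
        return Message("mwoauthmanageconsumers-field-private")
    return json.loads(row[field])  # decodeRow-style non-string decoding

# The clash: a caller expecting the decoded value (here, a list) can get
# a Message object instead, and calling a list/JSON method on it blows
# up -- the same shape as the undefined Message::toJson() fatal above.
value = get_field({"grants": '["editpage"]'}, "grants", user_can_see=False)
print(isinstance(value, Message))  # True: not the list the caller expected
```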

Tgr added a comment. · Feb 6 2016, 3:17 AM

@Halfak if this is blocking you, you can probably get the same information by using Special:OAuthListConsumers instead.

Tgr added a comment. · Feb 6 2016, 3:51 AM

Should have read T125939 first, duh.

curl -vs --raw -H 'Accept-Encoding: gzip' -H 'Host:meta.wikimedia.beta.wmflabs.org' -H 'Cookie: centralauth_User=Tgr; centralauth_Token=5263b3788d3bf08b76c90d5617c45966; centralauth_Session=ea9983f76254565161fc48c40126c5cd; metawikiUserID=55; metawikiUserName=Tgr; metawikiSession=vhmo634o7pq9q4fkuts41sflarfdmis6' 'http://deployment-mediawiki01/wiki/Special:OAuthManageConsumers/proposed/1a31f6085491372db93394ed6cbff4c4'
* About to connect() to deployment-mediawiki01 port 80 (#0)
*   Trying 10.68.17.170... connected
> GET /wiki/Special:OAuthManageConsumers/proposed/1a31f6085491372db93394ed6cbff4c4 HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Accept: */*
> Accept-Encoding: gzip
> Host:meta.wikimedia.beta.wmflabs.org
> Cookie: centralauth_User=Tgr; centralauth_Token=5263b3788d3bf08b76c90d5617c45966; centralauth_Session=ea9983f76254565161fc48c40126c5cd; metawikiUserID=55; metawikiUserName=Tgr; metawikiSession=vhmo634o7pq9q4fkuts41sflarfdmis6
> 
< HTTP/1.1 200 OK
< Date: Sat, 06 Feb 2016 03:41:32 GMT
< Server: Apache
< X-Powered-By: HHVM/3.6.5
< X-Content-Type-Options: nosniff
< Cache-control: no-cache
< Content-Encoding: gzip
< Vary: Accept-Encoding
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=utf-8
< 
14
<U+001F><U+0003><U+0003>
2000
<...non-gzip-encoded HTML...>

HHVM can do its own gzip compression, so presumably this is a bug in that.
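Note that the body captured above begins with 0x1F 0x03 0x03, while every valid gzip stream must begin with the magic bytes 0x1F 0x8B, so this output cannot be decoded as gzip at all. A quick illustration (Python; the byte values are taken from the transcript above):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # mandatory first two bytes of any gzip stream

def decodes_as_gzip(body: bytes) -> bool:
    """Return True if body is a decodable gzip stream."""
    try:
        gzip.decompress(body)
        return True
    except OSError:  # gzip.BadGzipFile subclasses OSError
        return False

observed = b"\x1f\x03\x03"  # first response-body bytes in the capture above
print(observed.startswith(GZIP_MAGIC))                 # False: never a gzip header
print(decodes_as_gzip(observed + b"<!DOCTYPE html>"))  # False
```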

Krenair updated the task description. · Feb 6 2016, 5:05 AM
BBlack added a subscriber: BBlack. · Feb 6 2016, 5:22 AM

HHVM can do its own gzip compression, so presumably this is a bug in that.

In general, it's probably best to disable gzip output compression in the application layer outputs and let the cache layers do it. Basically if the applayer sends gzip content with TE: chunked (no content length), varnish doesn't cache it. This has caused pretty severe issues in some isolated cases in the past with Apache (which will send content length for small gzipped outputs, but switches to chunked beyond a certain size so that it can gzip while streaming the output). Ref T124195

The full error message is Fatal error: Call to undefined method Message::toJson() in /srv/mediawiki/php-1.27.0-wmf.12/extensions/OAuth/frontend/specialpages/SpecialMWOAuthManageConsumers.php on line 337 (and there is no stack trace; @bd808, did something in fatal logging break?)

I only see these messages in hhvm.log. We never did figure out how to get that log stream to include a PHP stacktrace. We do seem to still be getting stacktraces in fatal.log, but as noted at T89169#1573731 our ability to trap fatals inside the interpreter can be disrupted if the error handler we use to do that is replaced with another handler. OAuth doesn't seem to call set_error_handler() directly so I'm not sure if that is what is happening here or not.

Andrew triaged this task as Normal priority. · Apr 14 2016, 7:57 PM

In general, it's probably best to disable gzip output compression in the application layer outputs and let the cache layers do it. Basically if the applayer sends gzip content with TE: chunked (no content length), varnish doesn't cache it. This has caused pretty severe issues in some isolated cases in the past with Apache (which will send content length for small gzipped outputs, but switches to chunked beyond a certain size so that it can gzip while streaming the output). Ref T124195

As long as I'm noticing this popping up in my email: the above turned out to be a myth; it's not generally true that our Varnishes can't cache TE:chunked.

Is this still reproducible? Did we decide whether varnish or hhvm was at fault?

I just got this again after making MW fatal in beta. Varnish started sending 503s instead, and varnishlog showed "Junk after gzip data".

BBlack added a comment. · Jun 1 2016, 2:07 AM

When you make MW fatal in beta, does hhvm emit some kind of valid error output, or is it truncating its output due to the fatal exception (possibly in the midst of an encoded gzip stream)?

My change seems to make MW fatal when Varnish contacts it but not when I try to curl. So it's difficult to tell.

BBlack added a comment. · Jun 1 2016, 2:40 AM

Can you give me a way to reproduce it?

I just started typing instructions, went to test them (in my separate Chrome session, which exists for this extra -WMF account in Phab), and realised the bug I was triggering was tied to my cookies, which explains why I didn't get it with curl earlier.

ssh root@deployment-tin
cd /srv/mediawiki-staging/php-master/extensions/VisualEditor
git reset --hard d70485244d4b192fd8382c9364a4c8dca80057ad
sudo -u jenkins-deploy scap sync-dir php-master/extensions/VisualEditor

Now curl like this, where SECRET is replaced by the value in /home/krenair/T125938-sess:

curl -v -H "Accept-Encoding: gzip" -H "Host: en.wikipedia.beta.wmflabs.org" -H "Cookie: VEE=visualeditor; enwikiUserName=Krenair; centralauth_User=Krenair; centralauth_Session=SECRET" http://deployment-mediawiki01/wiki/Main_Page --compressed

curl will fail if all of the pieces are aligned to trigger the bug I was dealing with; try again without the --compressed.

Then to revert:

git pull origin master
sudo -u jenkins-deploy scap sync-dir php-master/extensions/VisualEditor

I put the results without the --compressed part of the curl command in P3194; note the Content-Encoding: gzip response header and the first three characters of the response body (not shown in the browser) before <!DOCTYPE. I believe these characters are 0x1F, 0x03, and 0x03.

BBlack added a comment. · Jun 1 2016, 3:35 PM

OK so, yeah, the repro is a little complicated, but this presumably isn't a Varnish fault, since HHVM's output is obviously buggy. The next questions are probably:

  1. Does this happen for all MW fatals? Or only certain kinds of fatals at certain stages of processing a response?
  2. Is MW at fault here (e.g. for failing fatally in some inappropriate way and/or at an inappropriate critical time), or HHVM? I'm inclined to say HHVM is at fault, but who knows.
  3. Does HHVM at least always close the connection after emitting the buggy output, even if the request to it was over an HTTP/1.1 connection that would otherwise keepalive? If it keeps the connection open after the buggy output, this could very well be what causes the slight 503 elevations seen every time we've tried to turn on HTTP/1.1 from nginx->varnish for the text cluster in T107749.

Yeah, those are definitely the right questions, but I don't have any useful answers and am not likely to have time to dig in further soon. I'm not sure if it affects all MW fatals; although I also suspect HHVM, I haven't ruled MW out yet; and I have no idea whether it always instantly closes the connection.

BBlack moved this task from Triage to Watching on the Traffic board. · Oct 4 2016, 1:39 PM
BBlack added a comment. · Oct 5 2016, 6:31 PM

We saw this again today. There was a bug in TimedMediaHandler causing a 500 Internal Server Error only for (at least some) users with valid CentralAuth credentials in their cookies, for pages containing video content.

When reproducing internally against appservers.svc (no Traffic-layer involvement) by copying valid CA cookies from the browser, if curl was used with Accept-Encoding: none to avoid gzip, the 500 error would show correctly in the output. If curl --compressed was used, the result was corrupted gzip output (which Varnish then turns into a user-facing 503).

Given the number of issues we've seen over time with hhvm/php gzip output under various error conditions causing big confusion (by hiding interesting 500s behind 503s), I really think we should disable hhvm gzip output entirely. The cache layers will gzip gzippable content for the cache storage and the user regardless, without inducing this buggy confusion.

Change 314317 had a related patch set uploaded (by Chad):
Disable $wgUseGzip

https://gerrit.wikimedia.org/r/314317

Change 314317 merged by jenkins-bot:
Disable $wgUseGzip

https://gerrit.wikimedia.org/r/314317

Mentioned in SAL (#wikimedia-operations) [2016-10-05T18:39:11Z] <demon@tin> Synchronized wmf-config/CommonSettings.php: disable gzip internally, T125938 (duration: 00m 50s)

BBlack added a comment. · Oct 5 2016, 7:06 PM

^ The above turns out to be confusingly named but unrelated. We still haven't quite figured out how to disable php/hhvm's zlib output compression.

Halfak edited subscribers, added: aaron; removed: Halfak. · Oct 5 2016, 7:30 PM
Halfak removed a subscriber: aaron.
BBlack added a subscriber: ori. · Oct 6 2016, 5:47 PM
ori added a comment. · Oct 6 2016, 6:09 PM

If I understood hphp/runtime/server/transport.cpp#L989-L1008 correctly, compression would be disabled if BrotliCompressionEnabled, BrotliChunkedCompressionEnabled, and GzipCompressionLevel were all set to 0. Note that the ini-style variant for these config directives is lowercase_with_underscore.
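If that reading is right, the corresponding server config might look roughly like this. This is a sketch only: hhvm.server.gzip_compression_level is a documented HHVM setting, while the two brotli ini names are assumed here to follow the mechanical CamelCase-to-lowercase_with_underscore mapping ori describes.

```ini
; Hypothetical HHVM server.ini fragment to disable output compression
hhvm.server.gzip_compression_level = 0
hhvm.server.brotli_compression_enabled = 0
hhvm.server.brotli_chunked_compression_enabled = 0
```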

BBlack added a comment. · Oct 6 2016, 6:24 PM

In general, zlib supports a defined compression level of 0, which means "no compression", but is not the same as turning off compression. The data is still encoded by zlib and has relevant header data and whatnot, it just doesn't use a compression algorithm on the encoded data. So I would've thought zero would mean something different here, but the lines you're pointing at do seem to indicate it would turn it off completely, assuming enableCompression() controls it.
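The distinction being drawn here (level 0 is still encoding, just without compression) is easy to demonstrate; a quick sketch using Python's zlib binding:

```python
import zlib

data = b"x" * 1000
stored = zlib.compress(data, level=0)  # level 0: "stored" (uncompressed) deflate blocks

# The output is still a framed zlib stream: a 2-byte header, per-block
# stored framing, and a 4-byte Adler-32 trailer. It is slightly LARGER
# than the input, and the receiver still has to run it through a decoder.
print(len(stored) > len(data))          # True
print(zlib.decompress(stored) == data)  # True
```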

BBlack added subscribers: Krinkle, ema, Joe. (edited) · Mar 13 2018, 7:05 PM

So, recapping this ticket that's been stale for quite a while:

  • We've had past applayer bugs with gzipped outputs in error cases, and I had rationalized that we should just disable all application-layer gzipping to avoid error edge cases, and let Varnish handle gzipping things on the way to its cache storage and the user:

We saw this again today. There was a bug in TimedMediaHandler causing a 500 Internal Server Error only for (at least some) users with valid CentralAuth credentials in their cookies, for pages containing video content.
When reproducing internally against appservers.svc (no Traffic-layer involvement) by copying valid CA cookies from the browser, if curl was used with Accept-Encoding: none to avoid gzip, the 500 error would show correctly in the output. If curl --compressed was used, the result was corrupted gzip output (which Varnish then turns into a user-facing 503).
Given the number of issues we've seen over time with hhvm/php gzip output under various error conditions causing big confusion (by hiding interesting 500s behind 503s), I really think we should disable hhvm gzip output entirely. The cache layers will gzip gzippable content for the cache storage and the user regardless, without inducing this buggy confusion.

Then we merged a patch we thought would do that ( https://gerrit.wikimedia.org/r/#/c/314317/ ), and then it turned out it did nothing:

^ The above turns out to be confusingly-named but unrelated. We still haven't quite figured out how to disable php/hhvm's zlib output compression.

At that point things trailed off and stalled back in 2016.


I'm back here today because I happened to notice some "odd" gzip output from MediaWiki. It's not invalid, just strange, and it happens to be a case where Varnish is also behaving strangely, so there could be a subtle bug-triggering interaction going on here. The odd case in question is the 302 response from Special:Random. This response reasonably has no body output and normally emits Content-Length: 0, but if the UA (or Varnish) sends Accept-Encoding: gzip, these empty bodies become compressed: the response towards Varnish has Content-Encoding: gzip and 20 pointless body bytes, which are a complete gzip encoding of an empty body.
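The 20-byte figure is exactly what gzip framing of a zero-byte body costs; a quick check (Python, added for illustration):

```python
import gzip

empty = gzip.compress(b"")
# A 10-byte gzip header + a 2-byte empty deflate block + an 8-byte
# trailer (CRC-32 and original length) = 20 bytes of pure framing
# wrapped around zero payload bytes.
print(len(empty))              # 20
print(gzip.decompress(empty))  # b''
```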

I'd like to revive this ticket a bit and pursue actually disabling gzip output at the application layer for internal services behind Varnish (doing it in general would be nice, but the MW case is the most important one). There are at least a couple of different ways we can approach this problem.

It would be simplest and least buggy/confusing to actually disable gzip in Apache (and/or hhvm as appropriate). The potential downside is lack of compression for WMF-internal requests which don't flow through Varnish, but my current opinion is that this isn't a huge issue.

Another option would be to leave the internal services' configs alone as they are, and hack on the Varnish side to have it never emit "AE: gzip" headers towards the application (but still accept CE:gzip from buggy servers, and still do its own gzipping of gzippable content into storage and on towards the user). That's not ideal either, but it's an option.
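For the Varnish-side option, the change amounts to very little VCL. This is a hypothetical sketch against Varnish 4 syntax; the patch actually merged later in this ticket (r421267) is the authoritative version:

```
sub vcl_backend_fetch {
    # Stop advertising gzip to the application, so it always responds
    # with identity encoding. Varnish still gzips cacheable responses
    # itself (beresp.do_gzip) on the way into storage and to the client.
    unset bereq.http.Accept-Encoding;
}
```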

BBlack moved this task from Watching to Caching on the Traffic board. · Mar 13 2018, 7:23 PM
Krinkle moved this task from Inbox to Radar on the Performance-Team board. · Mar 19 2018, 8:07 PM
Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.

Change 421267 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] VCL: Do not request AE:gzip from applications

https://gerrit.wikimedia.org/r/421267

Change 421267 merged by BBlack:
[operations/puppet@production] VCL: Do not request AE:gzip from applications

https://gerrit.wikimedia.org/r/421267

BBlack closed this task as Resolved. · Mar 23 2018, 1:10 PM
BBlack claimed this task.

The above took care of it from the Varnish side (seems easier for the case we care about!). It's been live for a bit over 24h now, and stats reflect that it's doing the right things (no obvious errors caused by it, network bandwidth increased from mw*->cp*, cp* CPU% went up a bit, cp* network output stayed about the same). It has probably cleared up some corner-case errors and/or inefficiencies as described in various places in this ticket, too.

It's hard to say whether this was a major factor in larger issues like T181315 or T189892 yet. Time will tell!

Change 500011 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: unset AE:gzip

https://gerrit.wikimedia.org/r/500011

Change 500011 merged by Ema:
[operations/puppet@production] ATS: unset Accept-Encoding

https://gerrit.wikimedia.org/r/500011