Page MenuHomePhabricator

WDQS empty response - transfer clsoed with 15042 bytes remaining to read
Closed, ResolvedPublic

Description

At seemingly random intervals, WDQS responses are truncated after headers (no body is sent). Example of the error:

$ curl -v 'https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fi%20WHERE%20{%0A%20%20%20%3Fi%20wdt%3AP31%2Fwdt%3AP279*%20wd%3AQ44%20.%0A}'
*   Trying 91.198.174.217...
* Connected to query.wikidata.org (91.198.174.217) port 443 (#0)
* found 173 certificates in /etc/ssl/certs/ca-certificates.crt
* found 704 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_ECDSA_AES_128_GCM_SHA256
* 	 server certificate verification OK
* 	 server certificate status verification SKIPPED
* 	 common name: *.wikipedia.org (matched)
* 	 server certificate expiration date OK
* 	 server certificate activation date OK
* 	 certificate public key: EC
* 	 certificate version: #3
* 	 subject: C=US,ST=California,L=San Francisco,O=Wikimedia Foundation\, Inc.,CN=*.wikipedia.org
* 	 start date: Thu, 10 Dec 2015 23:22:05 GMT
* 	 expire date: Sat, 10 Dec 2016 22:46:04 GMT
* 	 issuer: C=BE,O=GlobalSign nv-sa,CN=GlobalSign Organization Validation CA - SHA256 - G2
* 	 compression: NULL
* ALPN, server accepted to use http/1.1
> GET / HTTP/1.1
> Host: query.wikidata.org
> User-Agent: curl/7.43.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Wed, 11 May 2016 12:31:07 GMT
< Content-Type: text/html
< Content-Length: 15042
< Connection: keep-alive
< Server: nginx/1.9.4
< Last-Modified: Sat, 07 May 2016 20:29:34 GMT
< ETag: W/"572e502e-3ac2"
< X-Served-By: wdqs1002
< X-Varnish: 1682612771 1682600741, 170952985 170935056, 25959566 29460572
< Via: 1.1 varnish, 1.1 varnish-v4, 1.1 varnish-v4
< Vary: Accept-Encoding
< Age: 11
< X-Cache: cp1061 hit(2), cp3007 hit(1), cp3008 frontend hit(1)
< Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
< Set-Cookie: WMF-Last-Access=11-May-2016;Path=/;HttpOnly;secure;Expires=Sun, 12 Jun 2016 12:00:00 GMT
< X-Analytics: https=1;nocookies=1
< X-Client-IP: 178.196.145.38
< Accept-Ranges: bytes
< 
* transfer closed with 15042 bytes remaining to read
* Closing connection 0
curl: (18) transfer closed with 15042 bytes remaining to read

Related Objects

StatusAssignedTask
DuplicateNone
DuplicateNone
OpenNone
OpenNone
DuplicateNikerabbit
Opendaniel
DuplicateBBlack
StalledNone
OpenNone
OpenNone
ResolvedYurik
ResolvedSmalyshev
OpenNone
Resolvedema
DeclinedNone
Resolvedema
ResolvedBBlack
ResolvedBBlack
ResolvedBBlack
Resolvedema
Resolvedema

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
BBlack added a subscriber: BBlack.May 11 2016, 1:39 PM

Assuming there was no transient issue (which became cached) on the wdqs end of things, then this was likely a transient thing from nginx experiments or the cache_misc varnish4 upgrade. I banned all wdqs objects from cache_misc and now your test URL works fine. Can you confirm?

Addshore moved this task from incoming to monitoring on the Wikidata board.May 11 2016, 1:47 PM
Gehel triaged this task as High priority.May 11 2016, 1:51 PM
Gehel claimed this task.
Gehel added a comment.May 11 2016, 1:54 PM

I can't reproduce the issue. Since it was random, I might just be lucky. Closing this anyway for the moment, I'll reopen if I see the issue again.

Gehel closed this task as Resolved.May 11 2016, 1:54 PM
Gehel reopened this task as Open.May 11 2016, 2:09 PM
Gehel added a comment.May 11 2016, 2:11 PM

Seems the issue (or a similar one is still happening), I'm trying to reproduce...

Change 288213 had a related patch set uploaded (by BBlack):
tlsproxy: use http/1.1 only in conjunction with varnish4

https://gerrit.wikimedia.org/r/288213

Change 288213 merged by BBlack:
tlsproxy: use http/1.1 only in conjunction with varnish4

https://gerrit.wikimedia.org/r/288213

I can confirm I am still having read errors every few requests.

curl -v 'https://query.wikidata.org/vendor/danml/download.min.js'
* Hostname was NOT found in DNS cache
*   Trying 91.198.174.217...
* Connected to query.wikidata.org (91.198.174.217) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using ECDHE-ECDSA-AES128-GCM-SHA256
* Server certificate:
* 	 subject: C=US; ST=California; L=San Francisco; O=Wikimedia Foundation, Inc.; CN=*.wikipedia.org
* 	 start date: 2015-12-10 23:22:05 GMT
* 	 expire date: 2016-12-10 22:46:04 GMT
* 	 subjectAltName: query.wikidata.org matched
* 	 issuer: C=BE; O=GlobalSign nv-sa; CN=GlobalSign Organization Validation CA - SHA256 - G2
* 	 SSL certificate verify ok.
> GET /vendor/danml/download.min.js HTTP/1.1
> User-Agent: curl/7.35.0
> Host: query.wikidata.org
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Wed, 11 May 2016 19:23:44 GMT
< Content-Type: application/javascript
< Transfer-Encoding: chunked
< Connection: keep-alive
* Server nginx/1.9.4 is not blacklisted
< Server: nginx/1.9.4
< Last-Modified: Tue, 26 Apr 2016 13:46:58 GMT
< X-Served-By: wdqs1001
< ETag: W/"571f7152-764"
< Vary: Accept-Encoding
< X-Varnish: 16729101 16729065, 16403773 16758912, 8447966
< Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
< Age: 114
< Accept-Ranges: bytes
< Age: 114
< X-Cache: cp1051 hit(1), cp3007 hit(4), cp3009 frontend pass(0)
< Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
< Set-Cookie: WMF-Last-Access=11-May-2016;Path=/;HttpOnly;secure;Expires=Sun, 12 Jun 2016 12:00:00 GMT
< X-Analytics: https=1;nocookies=1
< X-Client-IP: 78.55.227.77
< Accept-Ranges: bytes
< 
* SSLv3, TLS alert, Client hello (1):
* transfer closed with outstanding read data remaining
* Closing connection 0
* SSLv3, TLS alert, Client hello (1):
curl: (18) transfer closed with outstanding read data remaining
hashar added subscribers: hashar, KDDLB, Urbanecm and 3 others.

From T135038:

curl -I https://releases.wikimedia.org/mediawiki/1.26/mediawiki-1.26.2.tar.gz

Example response headers (bad)

HTTP/1.1 200 OK
Date: Wed, 11 May 2016 19:40:30 GMT
Content-Type: application/x-gzip
Content-Length: 0
Connection: keep-alive
Server: Apache
ETag: "18026c7-5275e33fa3d00"
Last-Modified: Mon, 21 Dec 2015 01:13:56 GMT
Backend-Timing: D=304 t=1462988431650045
X-Varnish: 4106268 42887, 19455418, 4857682
Age: 152
Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
Age: 152
Accept-Ranges: bytes
Age: 152
X-Cache: cp1045 hit(4), cp3007 miss(0), cp3008 frontend miss(0)
Strict-Transport-Security: max-age=31536000
Set-Cookie: WMF-Last-Access=11-May-2016;Path=/;HttpOnly;secure;Expires=Sun, 12 Jun 2016 12:00:00 GMT
X-Analytics: https=1;nocookies=1
X-Client-IP: 91.218.200.230
Accept-Ranges: bytes

Example response headers (good)

HTTP/1.1 200 OK
Date: Wed, 11 May 2016 19:41:08 GMT
Content-Type: application/x-gzip
Content-Length: 25175751
Connection: keep-alive
Server: Apache
ETag: "18026c7-5275e33fa3d00"
Last-Modified: Mon, 21 Dec 2015 01:13:56 GMT
Backend-Timing: D=304 t=1462988431650045
X-Varnish: 368674 42887, 19377459 18598167, 16596255
Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
Age: 190
Accept-Ranges: bytes
Age: 190
X-Cache: cp1045 hit(1), cp3010 hit(1), cp3008 frontend pass(0)
Strict-Transport-Security: max-age=31536000
Set-Cookie: WMF-Last-Access=11-May-2016;Path=/;HttpOnly;secure;Expires=Sun, 12 Jun 2016 12:00:00 GMT
X-Analytics: https=1;nocookies=1
X-Client-IP: 91.218.200.230
Accept-Ranges: bytes

All of the bad responses I've seen so far have X-Cache: cp1045 hit(4), cp3007 miss(0), cp3008 frontend miss(0).

Word diff wdiff bad good|colordiff:

Unified diff:

--- bad	2016-05-11 21:57:43.000000000 +0200
+++ good	2016-05-11 21:57:51.000000000 +0200
@@ -1,21 +1,20 @@
 HTTP/1.1 200 OK
-Date: Wed, 11 May 2016 19:40:30 GMT
+Date: Wed, 11 May 2016 19:41:08 GMT
 Content-Type: application/x-gzip
-Content-Length: 0
+Content-Length: 25175751
 Connection: keep-alive
 Server: Apache
 ETag: "18026c7-5275e33fa3d00"
 Last-Modified: Mon, 21 Dec 2015 01:13:56 GMT
 Backend-Timing: D=304 t=1462988431650045
-X-Varnish: 4106268 42887, 19455418, 4857682
-Age: 152
+X-Varnish: 368674 42887, 19377459 18598167, 16596255
 Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
-Age: 152
+Age: 190
 Accept-Ranges: bytes
-Age: 152
-X-Cache: cp1045 hit(4), cp3007 miss(0), cp3008 frontend miss(0)
+Age: 190
+X-Cache: cp1045 hit(1), cp3010 hit(1), cp3008 frontend pass(0)
 Strict-Transport-Security: max-age=31536000
 Set-Cookie: WMF-Last-Access=11-May-2016;Path=/;HttpOnly;secure;Expires=Sun, 12 Jun 2016 12:00:00 GMT
 X-Analytics: https=1;nocookies=1
 X-Client-IP: 91.218.200.230
 Accept-Ranges: bytes

Status update: we've been debugging this off and on all day. It's some kind of bug fallout from cache_misc's upgrade to Varnish 4. It's a very complicated bug, and we don't really understand it yet. We've made some band-aid fixes to VCL for now which should keep the problem at bay while investigating further.

Change 288350 had a related patch set uploaded (by Ema):
Add debian/patches/0005-handle-eof-http1.1.patch

https://gerrit.wikimedia.org/r/288350

Change 288350 merged by Ema:
Add debian/patches/0005-handle-eof-http1.1.patch

https://gerrit.wikimedia.org/r/288350

Mentioned in SAL [2016-05-12T07:05:38Z] <ema> running varnish 4.1.2-1wm3 in misc esams (T134989)

Gehel removed Gehel as the assignee of this task.May 12 2016, 7:32 AM

Mentioned in SAL [2016-05-12T07:47:03Z] <ema> upgrading misc codfw to 4.1.2-1wm3 and wiping caches (T134989)

Mentioned in SAL [2016-05-12T07:51:40Z] <ema> upgrading misc ulsfo to varnish 4.1.2-1wm3 and wiping caches (T134989)

ema added a subscriber: ema.May 12 2016, 8:04 AM

Mentioned in SAL [2016-05-12T08:33:24Z] <ema> wiping misc caches once again (T134989)

Now I get content-length 0 for query.wikidata.org

jonkr@C134:~$ curl -v 'https://query.wikidata.org/'
* Hostname was NOT found in DNS cache
*   Trying 91.198.174.217...
* Connected to query.wikidata.org (91.198.174.217) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using ECDHE-ECDSA-AES128-GCM-SHA256
* Server certificate:
* 	 subject: C=US; ST=California; L=San Francisco; O=Wikimedia Foundation, Inc.; CN=*.wikipedia.org
* 	 start date: 2015-12-10 23:22:05 GMT
* 	 expire date: 2016-12-10 22:46:04 GMT
* 	 subjectAltName: query.wikidata.org matched
* 	 issuer: C=BE; O=GlobalSign nv-sa; CN=GlobalSign Organization Validation CA - SHA256 - G2
* 	 SSL certificate verify ok.
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Host: query.wikidata.org
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Thu, 12 May 2016 11:52:00 GMT
< Content-Type: text/html
< Content-Length: 0
< Connection: keep-alive
* Server nginx/1.9.4 is not blacklisted
< Server: nginx/1.9.4
< Last-Modified: Sat, 07 May 2016 20:29:34 GMT
< ETag: "572e502e-3ac2"
< X-Served-By: wdqs1002
< Content-Encoding: gzip
< Vary: Accept-Encoding
< X-Varnish: 26641945, 25319777, 27963618 25227953
< Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
< Age: 57
< X-Cache: cp1061 hit+miss(0), cp3007 hit+miss(0), cp3007 frontend hit(5)
< Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
< Set-Cookie: WMF-Last-Access=12-May-2016;Path=/;HttpOnly;secure;Expires=Mon, 13 Jun 2016 00:00:00 GMT
< X-Analytics: https=1;nocookies=1
< X-Client-IP: 87.138.110.76
< Accept-Ranges: bytes
< 
* Connection #0 to host query.wikidata.org left intact

Change 288387 had a related patch set uploaded (by Ema):
Add debian/patches/0006-IMS-duplicate-headers-proper-handling.patch

https://gerrit.wikimedia.org/r/288387

Change 288387 merged by Ema:
Add debian/patches/0006-IMS-duplicate-headers-proper-handling.patch

https://gerrit.wikimedia.org/r/288387

Mentioned in SAL [2016-05-12T13:34:23Z] <ema> upgrading misc to varnish 4.1.2-1wm4 and wiping caches (T134989)

BBlack added a subscriber: Kghbln.

In the merged ticket above, it's browser access to status.wm.o, and the browser's getting a 304 Not Modified and complaining about it (due to missing character encoding supposedly, but it's entirely likely it's missing everything and that's just the first thing it notices).

due to missing character encoding supposedly, but it's entirely likely it's missing everything and that's just the first thing it notices.

I believe that this observation is correct. Thumbs up for a fix!

Mentioned in SAL [2016-05-12T15:52:43Z] <ema> testing file storage on misc eqiad+esams T134989

Kghbln removed a subscriber: Kghbln.May 12 2016, 3:59 PM

So we're currently have several experiments in play trying to figure this out:

  1. We've got 2x upstream bugfixes applied to our varnishd on cache_misc: https://github.com/varnishcache/varnish-cache/commit/d828a042b3fc2c2b4f1fea83021f0d5508649e50 + https://github.com/varnishcache/varnish-cache/commit/e142a199c53dd9331001cb29678602e726a35690
  1. We've removed all of our Content-Length sensitive VCL that was on cache_misc temporarily (basically https://gerrit.wikimedia.org/r/#/c/288231/ , which at one point we partially put back, but then removed again)
  1. We've switched from persistent to file storage on the misc backends, manually with puppet disabled. Puppetization to make that semi-permanent if need be: https://gerrit.wikimedia.org/r/288440 (untested)

I don't think anyone has reproduced the problem since (3) went live everywhere. So we're in a new state and needing proof that things are still messed up (or not!).

Has anyone been able to reproduce any of the problems in the tickets merged into here, since roughly the timestamp of the above message?

I don't see any weirdness anymore on query.wikidata.org/ @Jonas ?

Problem seems to be fixed. Thank you!

ema added a comment.May 13 2016, 7:44 AM
  1. We've got 2x upstream bugfixes applied to our varnishd on cache_misc: https://github.com/varnishcache/varnish-cache/commit/d828a042b3fc2c2b4f1fea83021f0d5508649e50 + https://github.com/varnishcache/varnish-cache/commit/e142a199c53dd9331001cb29678602e726a35690
  2. We've removed all of our Content-Length sensitive VCL that was on cache_misc temporarily (basically https://gerrit.wikimedia.org/r/#/c/288231/ , which at one point we partially put back, but then removed again)
  3. We've switched from persistent to file storage on the misc backends, manually with puppet disabled. Puppetization to make that semi-permanent if need be: https://gerrit.wikimedia.org/r/288440 (untested)

    I don't think anyone has reproduced the problem since (3) went live everywhere. So we're in a new state and needing proof that things are still messed up (or not!).

I'll keep on trying to reproduce the issue today, and ask others to do the same, including @Gehel whose tests have been really helpful during the past days. Will put back our Content-Length sensitive code (2) if nothing comes up.

Mentioned in SAL [2016-05-13T10:27:01Z] <ema> depooling cp1061 trying to reproduce T134989

Issue appearing again

curl -v 'https://query.wikidata.org/i18n/en.json'
* Hostname was NOT found in DNS cache
*   Trying 91.198.174.217...
* Connected to query.wikidata.org (91.198.174.217) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
* SSL connection using ECDHE-ECDSA-AES128-GCM-SHA256
* Server certificate:
* 	 subject: C=US; ST=California; L=San Francisco; O=Wikimedia Foundation, Inc.; CN=*.wikipedia.org
* 	 start date: 2015-12-10 23:22:05 GMT
* 	 expire date: 2016-12-10 22:46:04 GMT
* 	 subjectAltName: query.wikidata.org matched
* 	 issuer: C=BE; O=GlobalSign nv-sa; CN=GlobalSign Organization Validation CA - SHA256 - G2
* 	 SSL certificate verify ok.
> GET /i18n/en.json HTTP/1.1
> User-Agent: curl/7.35.0
> Host: query.wikidata.org
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Fri, 13 May 2016 10:53:06 GMT
< Content-Type: application/json
< Content-Length: 735
< Connection: keep-alive
* Server nginx/1.9.4 is not blacklisted
< Server: nginx/1.9.4
< Last-Modified: Wed, 27 Apr 2016 22:48:11 GMT
< X-Served-By: wdqs1002
< ETag: W/"572141ab-2df"
< Vary: Accept-Encoding
< X-Varnish: 9318317, 175488687, 162750753 170268466
< Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
< Age: 58
< X-Cache: cp1045 hit+miss(0), cp3009 hit+miss(0), cp3007 frontend hit(4)
< Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
< Set-Cookie: WMF-Last-Access=13-May-2016;Path=/;HttpOnly;secure;Expires=Tue, 14 Jun 2016 00:00:00 GMT
< X-Analytics: https=1;nocookies=1
< X-Client-IP: 87.138.110.76
< Accept-Ranges: bytes
< 
* SSLv3, TLS alert, Client hello (1):
* transfer closed with 735 bytes remaining to read
* Closing connection 0
* SSLv3, TLS alert, Client hello (1):
curl: (18) transfer closed with 735 bytes remaining to read
ema added a comment.May 13 2016, 11:04 AM

Issue appearing again

[...]

< Age: 58
< X-Cache: cp1045 hit+miss(0), cp3009 hit+miss(0), cp3007 frontend hit(4)

On @Gehel's repro from this morning (https://phabricator.wikimedia.org/P3067) we had:

< Age: 87
< X-Cache: cp1061 hit+miss(0), cp3010 hit(2), cp3009 frontend miss(0)

Might be a coincidence, but hit+miss seemed to be involved in other repros as well:
https://phabricator.wikimedia.org/P3055 https://www.irccloud.com/pastebin/u0ti8RQe/

Mentioned in SAL [2016-05-13T13:13:56Z] <ema> depooling cp3007 to try reproducing T134989

Change 288655 had a related patch set uploaded (by BBlack):
cache_misc: downgrade almost all to varnish3

https://gerrit.wikimedia.org/r/288655

Change 288655 merged by BBlack:
cache_misc: downgrade almost all to varnish3

https://gerrit.wikimedia.org/r/288655

hashar moved this task from Triage to In Progress on the Traffic board.May 13 2016, 8:25 PM

I forgot one of our temporary hacks in the list above in T134989#2290254:

  1. https://gerrit.wikimedia.org/r/#/c/288656/ - we also enabled a critical small bit here in v4 vcl_hit. I reverted this for now during the varnish3 downgrade. Need to remember that once we find the right bug and start cleaning everything up for upgrade again...
BBlack added a comment.EditedMay 13 2016, 9:16 PM

Current State:

  • cp3007 and cp1045 are depooled from user traffic, icinga-downtimed for several days, and have puppet disabled. Please do not re-enable puppet on these! They also have confd shut down, and are running custom configs to continue debugging this issue under varnish4.
  • The rest of cache_misc is reverted to varnish3, which should temporarily resolve this issue for user traffic over the weekend and into next week while we continue isolated investigation using the two nodes above.
  • Please do not resolve this ticket - this is a bandaid, and we still have a lot of digging to do.
  • Please do report any similar user-facing failures from here forward, as there shouldn't be any while the cluster is reverted to varnish3.
Nirzar removed a subscriber: Nirzar.May 13 2016, 9:26 PM
ema added a comment.May 14 2016, 11:14 PM

The issue can be reproduced running two varnishd on the same host, without any VCL. A web server listening on port 80 is also needed, serving any type of file. I've used a 612 bytes HTML page.

/usr/sbin/varnishd -a :3128 -b localhost:80   -p default_ttl=5 -p default_grace=10 -F
/usr/sbin/varnishd -a :81   -b localhost:3128 -p default_ttl=4 -p default_grace=9 -F

https://phabricator.wikimedia.org/P3103 is an example of how to trigger it (roughly around iteration 61 on my tests). Varnish 4.1.2-2 from Debian Sid is also affected.

ema added a comment.May 15 2016, 2:11 PM

Update: setting default_grace to 0 makes the issue unreproducible. 4.1.0 is not affected, while 4.1.1 is, so I started looking for grace-related commits between the two versions.

Looks like the bug was introduced by this one: https://github.com/varnishcache/varnish-cache/commit/8b7cb51b76ad616143040ee955f1f6d9306251b9

Testing on cp1045.

ema added a comment.May 15 2016, 3:22 PM

I haven't been able to reproduce the bug so far after reverting the commit mentioned above. cp1045 and cp3007 are both running fine with an updated version of varnish.

Please use cp3007:80 if you want to play around, it will only hit itself and cp1045 to ensure we're testing Varnish 4 only.

curl -v -H 'X-Forwarded-Proto: https' -H 'Host: query.wikidata.org' http://cp3007.esams.wmnet/style.css?test=3

Updated packages will follow soon.

Change 288817 had a related patch set uploaded (by Ema):
Revert 'Allow grace-hits on hit-for-pass objects'

https://gerrit.wikimedia.org/r/288817

Change 288817 merged by Ema:
Revert 'Allow grace-hits on hit-for-pass objects'

https://gerrit.wikimedia.org/r/288817

cache_maps cluster switched to the new varnish package today

Change 289148 had a related patch set uploaded (by Ema):
Revert "cache_misc: downgrade almost all to varnish3"

https://gerrit.wikimedia.org/r/289148

Change 289148 merged by Ema:
Revert "cache_misc: downgrade almost all to varnish3"

https://gerrit.wikimedia.org/r/289148

Mentioned in SAL [2016-05-17T08:13:43Z] <ema> upgrading eqiad cache_misc to varnish 4 (T126206, T134989)

Mentioned in SAL [2016-05-17T08:31:56Z] <ema> upgrading codfw cache_misc to varnish 4 (T126206, T134989)

Mentioned in SAL [2016-05-17T09:02:23Z] <ema> upgrading esams cache_misc to varnish 4 (T126206, T134989)

Mentioned in SAL [2016-05-17T09:14:57Z] <ema> upgrading ulsfo cache_misc to varnish 4 (T131501, T134989)

ema added a comment.May 17 2016, 9:32 AM

I've upgraded cache_misc to Varnish 4 again. We're now running a patched version of Varnish (4.1.2-1wm5) including the following patches:

The first two have been introduced while debugging, and although they don't solve the bug they look legit and in general it feels like a good idea to keep them. The 3rd patch is the one that we think solves the empty response issue.

Although we think the problem should now be fixed, I'd keep this bug open a little longer. Please keep on testing and report any issues you might find.

ema closed this task as Resolved.May 24 2016, 9:41 AM
ema claimed this task.

Nobody reproduced this issue during the past week. Closing.

Change 301374 had a related patch set uploaded (by BBlack):
Revert "Revert "Revert "cache_misc: do not deliver expired cached objects"""

https://gerrit.wikimedia.org/r/301374

Change 301374 merged by BBlack:
Revert "Revert "Revert "cache_misc: do not deliver expired cached objects"""

https://gerrit.wikimedia.org/r/301374