Page MenuHomePhabricator

esams cache layer mangles downloads of specific url
Open, LowPublic

Description

I 've tried multiple times today to download https://releases.wikimedia.org/blubber/linux-amd64/blubber, always failing, regardless of the use of HTTP/2 or HTTP/1.1 with a response many times smaller than the reported one. Content-Length is always set (correctly) to 9298262
When bypassing the cache by appending a query string, the size of the response varies between 500k and 1.5MB.

$ for i in `seq 1 10`; do curl --http1.1 -v -o /dev/null "https://releases.wikimedia.org/blubber/linux-amd64/blubber?cachebust=$(date +%s)"; done |& grep remaining
* transfer closed with 8349231 bytes remaining to read
curl: (18) transfer closed with 8349231 bytes remaining to read
* transfer closed with 8515022 bytes remaining to read
curl: (18) transfer closed with 8515022 bytes remaining to read
* transfer closed with 8333862 bytes remaining to read
curl: (18) transfer closed with 8333862 bytes remaining to read
* transfer closed with 8139612 bytes remaining to read
curl: (18) transfer closed with 8139612 bytes remaining to read
* transfer closed with 8252255 bytes remaining to read
curl: (18) transfer closed with 8252255 bytes remaining to read
* transfer closed with 8211355 bytes remaining to read
curl: (18) transfer closed with 8211355 bytes remaining to read
* transfer closed with 8051261 bytes remaining to read
curl: (18) transfer closed with 8051261 bytes remaining to read
* transfer closed with 8188283 bytes remaining to read
curl: (18) transfer closed with 8188283 bytes remaining to read
* transfer closed with 8268910 bytes remaining to read
curl: (18) transfer closed with 8268910 bytes remaining to read
* transfer closed with 8235928 bytes remaining to read
curl: (18) transfer closed with 8235928 bytes remaining to read

Apache on releases1001, consistently reports 200 OK with a file size of 9298262 in logs.

When forcing curl to resolve eqiad/codfw/ulsfo/eqsin (with --resolve releases.wikimedia.org:443:<ip>) instead the response is returned just fine.

Looking at the X-Cache header in 10 subsequent runs on esams the only thing that seems common is cp3030.

for i in `seq 1 10`; do curl --resolve releases.wikimedia.org:443:91.198.174.192 --http1.1 -v -o /dev/null "https://releases.wikimedia.org/blubber/linux-amd64/blubber?cachebust=$(date +%s)" ; done |& grep X-Cache
< X-Cache: cp1081 pass, cp3033 miss, cp3030 miss
< X-Cache-Status: miss
< X-Cache: cp1075 pass, cp3032 miss, cp3030 miss
< X-Cache-Status: miss
< X-Cache: cp1075 pass, cp3041 miss, cp3030 miss
< X-Cache-Status: miss
< X-Cache: cp1085 pass, cp3030 miss, cp3030 miss
< X-Cache-Status: miss
< X-Cache: cp1087 pass, cp3033 miss, cp3030 miss
< X-Cache-Status: miss
< X-Cache: cp1087 pass, cp3033 hit/1, cp3030 miss
< X-Cache-Status: hit-local
< X-Cache: cp1083 pass, cp3033 miss, cp3030 miss
< X-Cache-Status: miss
< X-Cache: cp1089 pass, cp3040 miss, cp3030 miss
< X-Cache-Status: miss
< X-Cache: cp1075 pass, cp3040 miss, cp3030 miss
< X-Cache-Status: miss
< X-Cache: cp1081 pass, cp3030 miss, cp3030 miss
< X-Cache-Status: miss

Event Timeline

akosiaris created this task.Wed, Feb 6, 9:42 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptWed, Feb 6, 9:42 AM
akosiaris triaged this task as Normal priority.Wed, Feb 6, 9:42 AM
akosiaris updated the task description. (Show Details)Wed, Feb 6, 9:49 AM
akosiaris raised the priority of this task from Normal to High.Wed, Feb 6, 10:00 AM

Mentioned in SAL (#wikimedia-operations) [2019-02-06T10:01:51Z] <akosiaris> restart varnish-frontend on cp3030 T215389

akosiaris lowered the priority of this task from High to Low.Wed, Feb 6, 10:08 AM

The restart of varnish-frontend on cp3030 indeed resolved the issue. I 'll lower priority but leave task open. Feel free to resolve however.

Mentioned in SAL (#wikimedia-operations) [2019-02-06T11:56:43Z] <vgutierrez> restarting varnish-fe in cp3042 - T215389

Mentioned in SAL (#wikimedia-operations) [2019-02-06T12:20:56Z] <vgutierrez> restarting varnish-fe safely across esams/text cluster - T215389

Checking the rest of the text cluster in esams from bast3002 showed that all of them where affected. After restarting varnish-frontend the issue is gone. I'll leave the task open for further discussion with @BBlack

Thanks @akosiaris!