
docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header`
Closed, Resolved · Public

Description

Sometimes (but not always) `docker-compose up` fails with `ERROR: missing or empty Content-Length header`. I cannot reproduce it consistently.

~/Documents/gerrit/mediawiki/core$ sw_vers      
ProductName:	macOS
ProductVersion:	11.1
BuildVersion:	20C69

~/Documents/gerrit/mediawiki/core$ docker --version
Docker version 20.10.0, build 7287ab3

~/Documents/gerrit/mediawiki/core$ docker-compose --version
docker-compose version 1.27.4, build 40524192

~/Documents/gerrit/mediawiki/core$ cat .env 
MW_DOCKER_PORT=8080
MW_SCRIPT_PATH=/w
MW_SERVER=http://localhost:8080
MEDIAWIKI_USER=Admin
MEDIAWIKI_PASSWORD=dockerpass
XDEBUG_CONFIG=''

~/Documents/gerrit/mediawiki/core$ docker-compose up -d
Pulling mediawiki (docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2-xdebug:0.6.1-s1)...
ERROR: missing or empty Content-Length header

Event Timeline

Sometimes it works fine.

~/Documents/gerrit/mediawiki/core$ sw_vers                 
ProductName:	Mac OS X
ProductVersion:	10.15.7
BuildVersion:	19H114

~/Documents/gerrit/mediawiki/core$ docker --version        
Docker version 20.10.0, build 7287ab3

~/Documents/gerrit/mediawiki/core$ docker-compose --version
docker-compose version 1.27.4, build 40524192

~/Documents/gerrit/mediawiki/core$ cat .env 
MW_DOCKER_PORT=8080
MW_SCRIPT_PATH=/w
MW_SERVER=http://localhost:8080
MEDIAWIKI_USER=Admin
MEDIAWIKI_PASSWORD=dockerpass
XDEBUG_CONFIG=''

...
~/Documents/gerrit/mediawiki/core$ docker-compose up -d
Pulling mediawiki (docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2-xdebug:0.6.1-s1)...
0.6.1-s1: Pulling from dev/stretch-php72-fpm-apache2-xdebug
7b04e34b88d3: Pull complete
e01d5733b39b: Pull complete
1106614dbeca: Pull complete
ab32434d3734: Pull complete
0b392d63fa40: Pull complete
b30916698108: Pull complete
7da1640ae982: Pull complete
84f6be64c2fb: Pull complete
2c93bba5818d: Pull complete
f6d31b5c8678: Pull complete
1ed18c6fb732: Pull complete
4f658097acbb: Pull complete
8fe3ac333cf5: Pull complete
49d982d621f9: Pull complete
af20cdbf290b: Pull complete
1bdd896f7585: Pull complete
61901e556d6d: Pull complete
5e83e724f1ac: Pull complete
d5cbae0612bd: Pull complete
d087ba968d42: Pull complete
68479603b8b3: Pull complete
5681a3c55f1f: Pull complete
eec1a5d0d874: Pull complete
fbe01eb49875: Pull complete
Digest: sha256:1bc2a861502cacbbe93c7efa2b3a369eee4af9ec9d39006b9ee42d84d658c368
Status: Downloaded newer image for docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2-xdebug:0.6.1-s1
Pulling mediawiki-jobrunner (docker-registry.wikimedia.org/dev/stretch-php72-jobrunner:0.1.0)...
0.1.0: Pulling from dev/stretch-php72-jobrunner
f4950aafeb34: Pull complete
8e64bee919d5: Pull complete
7ecb091f2267: Pull complete
2c4564e6cff0: Pull complete
01450fc4f6e3: Pull complete
a2c9dae68ea5: Pull complete
1c17fbd0146a: Pull complete
8df273e99cc1: Pull complete
c4d62258b865: Pull complete
4281cbeb4eb7: Pull complete
49eaf43db8cc: Pull complete
46ff7a85e9d2: Pull complete
4499dbd10051: Pull complete
ba0d42befcdd: Pull complete
7873b6eaade3: Pull complete
Digest: sha256:242ca9c3a6f9e54acfb2d6669d3486873f49c02742813fb6e2f5d0111fc85a7f
Status: Downloaded newer image for docker-registry.wikimedia.org/dev/stretch-php72-jobrunner:0.1.0
Creating core_mediawiki_1           ... done
Creating core_mediawiki-jobrunner_1 ... done

I saw this earlier when pulling a different image from the dev repo. I think it is some intermittent issue with our docker registry. Cc @hashar

zeljkofilipin updated the task description.
hashar renamed this task from `docker-compose up` fails with `ERROR: missing or empty Content-Length header` to docker pull from docker-registry fails with `ERROR: missing or empty Content-Length header`. (Dec 16 2020, 1:44 PM)
$ docker pull docker-registry.wikimedia.org/releng/node10-test-browser
Using default tag: latest
Error response from daemon: missing or empty Content-Length header

I can't get more verbose output from docker with the debug/log-level options, but maybe I'm doing something wrong.

Are we using Harbor (I was scanning T209271)? If so, perhaps this is the same issue as https://github.com/goharbor/harbor/issues/13740

> I cannot reproduce it consistently.

@zeljkofilipin so eventually it worked for you? I am seeing this consistently now; I can't pull anything (Docker version 20.10.0, build 7287ab3890).

@kostajh it was failing consistently on my macOS 11 machine and working fine on my macOS 10 machine. While trying to figure out whether it's a macOS version problem, it succeeded after several tries on macOS 11 and failed a few times on the macOS 10 machine. 🤷‍♂️

> Are we using Harbor (I was scanning T209271)? If so, perhaps this is the same issue as https://github.com/goharbor/harbor/issues/13740

We're currently using the standard Docker registry implementation (docker-registry) and that has not changed recently. I'm currently unable to reproduce, but it reminds me of T266373 ...

Could you please check if you see any additional errors/hints in the docker daemon logs? The image layers are not actually pulled by the client but by the daemon, so a hint might be in there. You could also enable debugging there (https://docs.docker.com/config/daemon/#enable-debugging), which could lead us somewhere.

> Could you please check if you see any additional errors/hints in the docker daemon logs? The image layers are not actually pulled by the client but by the daemon, so a hint might be in there. You could also enable debugging there (https://docs.docker.com/config/daemon/#enable-debugging), which could lead us somewhere.

Hmm, the logs are extremely verbose; I don't see anything obvious so far.

I can reproduce the error on macOS 11.0.1 + Docker for Mac 3.0.1 (Docker version 20.10.0) on two machines, one on Intel hardware and the other on M1/Apple Silicon (also Docker for Mac 3.0.1, but a preview build). Pulling from Docker Hub works fine on both machines; pulling from our Wikimedia registry almost always fails, although I did successfully pull one image (node10-test-browser).

The logs don't seem to show that much useful information. I set "debug: true" in the daemon options in Docker for Mac and restarted it. Then:

$ pred='process matches ".*(ocker|vpnkit).*"
  || (process in {"taskgated-helper", "launchservicesd", "kernel"} && eventMessage contains[c] "docker")'
$ /usr/bin/log show --debug --info --style syslog --start '2020-12-17 09:39:0' --predicate "$pred"
$ docker -D -l debug --debug pull docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2-xdebug:0.6.1-s1
Timestamp                       (process)[PID]    
2020-12-17 09:39:01.296884+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:39:01.301589+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (4.723583ms)
2020-12-17 09:39:06.305801+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:39:06.317205+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (11.404667ms)
2020-12-17 09:39:11.338368+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:39:11.348841+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (10.531458ms)
2020-12-17 09:39:16.362406+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:39:16.368079+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (5.682875ms)
2020-12-17 09:39:21.379440+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:39:21.388837+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (9.454708ms)
2020-12-17 09:39:26.393348+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:39:26.400256+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (6.923291ms)
2020-12-17 09:39:31.404876+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:39:31.415565+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (10.709708ms)
2020-12-17 09:39:36.420066+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:39:36.426577+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (6.514375ms)
2020-12-17 09:39:41.431444+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:39:41.440836+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (9.448125ms)
2020-12-17 09:39:46.444530+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:39:46.449377+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (4.861292ms)
2020-12-17 09:39:51.453277+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:39:51.461164+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (7.885375ms)
2020-12-17 09:39:56.466511+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:39:56.476600+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (10.112333ms)
2020-12-17 09:39:59.814320+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> HEAD /_ping
2020-12-17 09:39:59.816166+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << HEAD /_ping (1.85875ms)
2020-12-17 09:39:59.834824+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> HEAD /_ping
2020-12-17 09:39:59.836566+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << HEAD /_ping (1.75325ms)
2020-12-17 09:39:59.854889+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> HEAD /_ping
2020-12-17 09:39:59.856629+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << HEAD /_ping (1.749916ms)
2020-12-17 09:39:59.907073+0100  localhost docker-credential-desktop[92818]: (libsystem_info.dylib) Created Activity ID: 0x512af0, Description: Retrieve User by ID
2020-12-17 09:39:59.937977+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Daemon status changed from 0 to 1
2020-12-17 09:39:59.938463+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Received configuration update from daemon and there was no CoreAnalyticsFramework external config.
2020-12-17 09:39:59.938464+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Received configuration update from daemon (initial)
2020-12-17 09:39:59.939465+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.939491+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.946375+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.946400+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.946407+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.946422+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.946425+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.946463+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.946475+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.946578+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.946582+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.946589+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.946592+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.947495+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.947498+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.947501+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.947503+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.947510+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.947512+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.947515+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.947517+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.947901+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.947903+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.948879+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.948881+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Dropping com.apple.security.LegacyAPICounts as it isn't used in any transform (not in the config or budgeted?)
2020-12-17 09:39:59.952254+0100  localhost docker-credential-osxkeychain[92819]: (CarbonCore) [com.apple.CarbonCore:nameddata] _CSGetNamedData _CSGetNamedData(/System/Library/Frameworks/Security.framework/Resources/en.lproj/SecErrorMessages.strings, err=0 data=0x4780000/70772
2020-12-17 09:39:59.953256+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Entering exit handler.
2020-12-17 09:39:59.953272+0100  localhost docker-credential-osxkeychain[92819]: (CoreAnalytics) [com.apple.CoreAnalytics:client] Exiting exit handler.
2020-12-17 09:39:59.955452+0100  localhost com.docker.driver.amd64-linux[92602]: usage imagePullCliLinux + 1
2020-12-17 09:39:59.955542+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> POST /v1.41/images/create?fromImage=docker-registry.wikimedia.org%2Fdev%2Fstretch-php72-fpm-apache2-xdebug&tag=0.6.1-s1
2020-12-17 09:40:00.229401+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << POST /v1.41/images/create?fromImage=docker-registry.wikimedia.org%2Fdev%2Fstretch-php72-fpm-apache2-xdebug&tag=0.6.1-s1 (273.766917ms)
2020-12-17 09:40:00.242120+0100  localhost com.docker.backend[90751]: received new cli usage: {Command:pull Context:moby}
2020-12-17 09:40:00.243637+0100  localhost com.docker.backend[90751]: cli: POST /usage 200 "Go-http-client/1.1" ""
2020-12-17 09:40:01.486531+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:40:01.494314+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (7.783458ms)
2020-12-17 09:40:06.501136+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:40:06.511637+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (10.534334ms)
2020-12-17 09:40:11.519183+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:40:11.527905+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (8.787541ms)
2020-12-17 09:40:16.535690+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:40:16.540596+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (4.927542ms)
2020-12-17 09:40:21.545072+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:40:21.554728+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (9.693042ms)
2020-12-17 09:40:26.559665+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:40:26.568688+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (9.068042ms)
2020-12-17 09:40:31.573674+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:40:31.581177+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (7.527875ms)
2020-12-17 09:40:36.597383+0100  localhost com.docker.driver.amd64-linux[92602]: proxy >> GET /images/json
2020-12-17 09:40:36.606210+0100  localhost com.docker.driver.amd64-linux[92602]: proxy << GET /images/json (8.851625ms)

Unfortunately I'm still unable to reproduce, even with docker & docker-engine 20.10.1 (on Linux).

I can reproduce on Debian unstable with docker pull docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2-xdebug:0.6.1-s1

Client: Docker Engine - Community
Version:           20.10.1
API version:       1.41
Go version:        go1.13.15
Git commit:        831ebea
Built:             Tue Dec 15 04:34:48 2020
OS/Arch:           linux/amd64
Context:           default
Experimental:      true

Server: Docker Engine - Community
Engine:
  Version:          20.10.1
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       f001486
  Built:            Tue Dec 15 04:32:45 2020
  OS/Arch:          linux/amd64
  Experimental:     false
containerd:
  Version:          1.4.3
  GitCommit:        269548fa27e0089a8b8278fc4fc781d7f65a939b
runc:
  Version:          1.0.0-rc92
  GitCommit:        ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

I was able to reproduce as well now. Seems like we are missing the "content-length" header (at least on the manifest API endpoint) if the response is *not* served from cache. P13565 can reproduce that when using "docker-registry.wikimedia.org" but not when using "docker-registry.discovery.wmnet".

Traffic to the rescue :)

JMeybohm raised the priority of this task from Low to High. (Dec 17 2020, 10:50 AM)

Also raising the priority, as I guess this will affect more and more users/devs as they upgrade to Docker 20.10.

Change 650156 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: do not stream responses to docker

https://gerrit.wikimedia.org/r/650156

Change 650156 merged by Ema:
[operations/puppet@production] vcl: do not stream responses to docker

https://gerrit.wikimedia.org/r/650156

JMeybohm claimed this task.
JMeybohm added a subscriber: ema.

For the record:
We were sending "Content-Type: application/vnd.docker.distribution.manifest.v2+json" as a gzip stream to the clients (hence no Content-Length), and docker 20.10 seems to no longer accept that, failing instead.

There is now an exception in Varnish that disables streaming if http.user-agent ~ "^docker".
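
A minimal VCL sketch of what such an exception could look like (a hypothetical snippet for illustration only; the actual change is https://gerrit.wikimedia.org/r/650156 in operations/puppet, and its exact conditions may differ):

sub vcl_backend_response {
    # Hypothetical sketch: buffer the whole response for docker clients instead
    # of streaming it, so that Varnish can emit a Content-Length header.
    if (bereq.http.User-Agent ~ "^docker") {
        set beresp.do_stream = false;
    }
}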

Thanks @ema !

This should now be fixed by using https://gerrit.wikimedia.org/r/c/operations/puppet/+/650156/ as a VCL workaround; please try to reproduce again!

> This should now be fixed by using https://gerrit.wikimedia.org/r/c/operations/puppet/+/650156/ as a VCL workaround; please try to reproduce again!

Thank you @ema and @JMeybohm for your quick work on this!

I'm still seeing the problem though:

$ docker pull docker-registry.wikimedia.org/dev/stretch-php72-jobrunner:0.1.0
Error response from daemon: missing or empty Content-Length header

Maybe Mac sends a different user-agent? That would be fun...

> Maybe Mac sends a different user-agent? That would be fun...

Nope, the User-Agent of @kostajh's client is "docker/20.10.1 go/go1.13.15 git-commit/f001486[...]". I captured a varnishlog while he reproduced the issue, and it boils down to just:

GET /v2/ -> Content-Length: 2
HEAD /v2/dev/stretch-php72-jobrunner/manifests/0.1.0 -> no CL

Further digging needed.

The current theory is that the problem boils down to the following HEAD request returning no Content-Length if served by cp3052:

$ curl -I -H "User-Agent: docker/20.10.0+dfsg2" -H "Accept: application/vnd.docker.distribution.manifest.v2+json" https://docker-registry.wikimedia.org/v2/dev/stretch-php72-jobrunner/manifests/0.1.0
[...]
x-cache: cp3054 miss, cp3052 miss

The same request served by another cache node does return Content-Length:

$ curl -I -H "User-Agent: docker/20.10.0+dfsg2" -H "Accept: application/vnd.docker.distribution.manifest.v2+json" https://docker-registry.wikimedia.org/v2/dev/stretch-php72-jobrunner/manifests/0.1.0
[...]
x-cache: cp3054 miss, cp3058 miss
content-length: 3452

cp3052 is running an old version of Varnish (5.1.3), which clearly behaves differently in this regard compared to 6.0.x.

That being said, I suspect that our VCL trying to do_gzip on a HEAD request might confuse things. Let's try to only do_gzip for GET.

Change 650248 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: do not gzip responses to HEAD requests

https://gerrit.wikimedia.org/r/650248

Change 650248 merged by Ema:
[operations/puppet@production] vcl: do not gzip responses to HEAD requests

https://gerrit.wikimedia.org/r/650248

Change 650256 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: do not gzip docker-registry responses

https://gerrit.wikimedia.org/r/650256

Change 650256 merged by Ema:
[operations/puppet@production] vcl: do not gzip docker-registry responses

https://gerrit.wikimedia.org/r/650256

Mentioned in SAL (#wikimedia-operations) [2020-12-17T21:01:50Z] <ema> cp3052: ban 'req.http.host == "docker-registry.wikimedia.org"' T270270

> That being said, I suspect that our VCL trying to do_gzip on a HEAD request might confuse things. Let's try to only do_gzip for GET.

Varnish opportunistically converts cache misses for HEAD requests into backend GETs so it can potentially cache the response, so excluding bereq.method ~ "HEAD" did nothing.
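
In other words, a condition like the one below (a hypothetical sketch of the attempted exclusion, shown in vcl_backend_response for illustration; not the actual production VCL) still matches on a cache miss, because the backend request has already been turned into a GET:

sub vcl_backend_response {
    # A HEAD request that misses the cache is fetched from the backend as a GET,
    # so bereq.method is "GET" at this point; the exclusion below never triggers
    # and the object is still stored gzip-compressed.
    if (bereq.method !~ "HEAD") {
        set beresp.do_gzip = true;
    }
}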

> Change 650256 merged by Ema:
> [operations/puppet@production] vcl: do not gzip docker-registry responses
> https://gerrit.wikimedia.org/r/650256

This did work, instead. I've banned all docker-registry objects cached on cp3052 as many surely had no Content-Length. docker pull should work now.

> Change 650256 merged by Ema:
> [operations/puppet@production] vcl: do not gzip docker-registry responses
> https://gerrit.wikimedia.org/r/650256
>
> This did work, instead. I've banned all docker-registry objects cached on cp3052 as many surely had no Content-Length. docker pull should work now.

As indeed it does. Thank you! \o/

Change 650191 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "vcl: do not stream responses to docker"

https://gerrit.wikimedia.org/r/650191

> Change 650191 had a related patch set uploaded (by Ema; owner: Ema):
> [operations/puppet@production] Revert "vcl: do not stream responses to docker"
> https://gerrit.wikimedia.org/r/650191

I think we can revert the no-streaming patch then, will check tomorrow. Thanks @kostajh!

Change 650191 merged by Ema:
[operations/puppet@production] Revert "vcl: do not stream responses to docker"

https://gerrit.wikimedia.org/r/650191

Change 650502 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnish: add 37-docker-registry-cl-head.vtc

https://gerrit.wikimedia.org/r/650502

Change 650502 merged by Ema:
[operations/puppet@production] varnish: add 37-docker-registry-cl-head.vtc

https://gerrit.wikimedia.org/r/650502

> I think we can revert the no-streaming patch then, will check tomorrow. Thanks @kostajh!

Reverted the do-not-stream patch; docker pull still seems to work fine against both cp3052 (Varnish 5.1.3-1wm15) and cp3062 (Varnish 6.0.7-1wm1). To target cp3052 specifically I used this trick with 127.0.0.1 docker-registry.wikimedia.org in /etc/hosts. Here's how I tried to reproduce the issue:

# on the cache node my IP is c-hashed to, log all requests performed by my docker client.
# see https://phabricator.wikimedia.org/P13603 for some of the output
varnishlog -n frontend -q "ReqHeader:User-Agent ~ \"docker\" and ReqHeader:X-Client-IP ~ \"$bigsecret\"" | tee ema-docker-18-Dec.log

# workstation, docker pull works
docker pull docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2-xdebug:0.6.1-s1

# purge all related objects on cache node to get cache misses on next "docker pull" 
awk '/ReqURL/ {print $3}' ema-docker-18-Dec.log | while read uri ; do curl -X PURGE -H "Host: docker-registry.wikimedia.org" http://127.0.0.1$uri; done

# all the commands below on my workstation
docker image rm docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2-xdebug:0.6.1-s1
# cache misses
docker pull docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2-xdebug:0.6.1-s1

docker image rm docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2-xdebug:0.6.1-s1
# cache hits
docker pull docker-registry.wikimedia.org/dev/stretch-php72-fpm-apache2-xdebug:0.6.1-s1

The VCL patch fixing the problem is "vcl: do not gzip docker-registry responses", and I now know more about the Varnish bug itself. This VTC test fails on both Varnish 6.0.7 and 5.1.3:

varnishtest "Is Content-Length sent in response to HEAD requests for objects cached with do_gzip=true?"

server s1 {
    rxreq
    txresp -bodylen 1024
} -start

varnish v1 -arg "-p vcc_err_unref=false" -vcl+backend {
    sub vcl_backend_response {
        set beresp.do_gzip = true;
    }
} -start

client c1 {
    txreq -req "HEAD"
    rxresp -no_obj
    expect resp.status == 200
    # The following expect fails on both v5.1.3 and 6.0.7
    expect resp.http.content-length == 1024

    txreq -req "HEAD"
    rxresp -no_obj
    expect resp.status == 200
    # The following expect fails on v5.1.3 only
    expect resp.http.content-length == 1024
} -connect "${v1_addr} ${v1_port}" -run

Which is to say: on Varnish 5.1.3, responses to HEAD requests for objects cached with beresp.do_gzip = true have no Content-Length regardless of cache lookup status (hit vs miss). On 6.0.7 they have Content-Length only upon cache hit. This explains why for @kostajh docker pull was failing all the time (his IP was c-hashed to a Varnish 5.1.3 instance), while for others the failure was intermittent (their requests ended up on cache nodes running Varnish 6.0.7). For clients hitting a Varnish 6.0.7 node, the first docker pull would fail (cache miss, hence no Content-Length) and the next ones would work.

Commenting out the set beresp.do_gzip = true; line in vcl_backend_response in the test above, both hits and misses get Content-Length: 1024 regardless of the Varnish version. The workaround is thus simply to avoid beresp.do_gzip for docker-registry altogether.
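
A minimal sketch of that workaround (a hypothetical snippet; the actual change is https://gerrit.wikimedia.org/r/650256 in operations/puppet, where the condition for enabling gzip is more involved):

sub vcl_backend_response {
    # Workaround sketch: never set do_gzip for docker-registry responses, so the
    # objects are cached uncompressed and HEAD responses keep their Content-Length.
    if (bereq.http.Host != "docker-registry.wikimedia.org") {
        set beresp.do_gzip = true;
    }
}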

Thank you @ema for the full explanation (and for the fix of course)!