
Error pulling image from docker registry
Closed, Resolved (Public)

Description

When attempting to pull the Kask v1.0.3 image from docker-registry.wikimedia.org, I get the following error:

Error response from daemon: received unexpected HTTP status: 502 connect failed

The error was traced to the manifest request against the registry and can be reproduced with, e.g., curl https://docker-registry.wikimedia.org/v2/wikimedia/mediawiki-services-kask/manifests/v1.0.3

Snippet of the error HTML:

<p>See the error message at the bottom of this page for more&nbsp;information.</p>
</div>
</div>
<div class="footer"><p>If you report this error to the Wikimedia System Administrators, please include the details below.</p><p class='text-muted'><code>Request from 2601:406:4300:872d:c042:21e9:5c83:e275 via cp1075.eqiad.wmnet, ATS/8.0.5<br>Error: 502, connect failed at 2019-08-27 21:22:22 GMT</code></p></div>
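The footer of the error page names the cache host and proxy software that produced the 502, which is what points at cp1075 and ATS. As a minimal sketch, the details could be extracted like this. The regular expression is inferred from the single snippet above, so the footer layout is an assumption, and `parse_error_footer` is a hypothetical helper, not part of any Wikimedia tooling. The client address in the usage example is a documentation placeholder.

```python
import re
from html import unescape

# Layout assumed from the one footer quoted in this task:
#   Request from <client> via <cache host>, <server><br>
#   Error: <status>, <reason> at <timestamp>
FOOTER_RE = re.compile(
    r"Request from (?P<client>\S+) via (?P<cache_host>[^,\s]+), "
    r"(?P<server>[^<]+?)\s*<br\s*/?>\s*"
    r"Error: (?P<status>\d{3}), (?P<reason>.+?) at (?P<timestamp>[^<]+?)\s*</code>",
    re.IGNORECASE,
)


def parse_error_footer(page_html):
    """Return the diagnostic fields from an error page, or None if absent."""
    match = FOOTER_RE.search(page_html)
    if match is None:
        return None
    details = {key: unescape(value.strip()) for key, value in match.groupdict().items()}
    details["status"] = int(details["status"])
    return details


if __name__ == "__main__":
    sample = (
        "<p class='text-muted'><code>Request from 2001:db8::1 via "
        "cp1075.eqiad.wmnet, ATS/8.0.5<br>Error: 502, connect failed at "
        "2019-08-27 21:22:22 GMT</code></p>"
    )
    print(parse_error_footer(sample))
```

Returning `None` for pages without the footer keeps the helper safe to run against successful responses as well.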

Event Timeline

Jdforrester-WMF subscribed.

Tagging in Traffic; this is the server (cp1075) running ATS, not Varnish, right?

BBlack added subscribers: ema, BBlack.

Assigning to @ema to investigate (yes, this is the live test server for ATS backends for these servers). Most likely the problem is specific to ATS<->docker-registry, probably because the underlying service TLS certificate's SAN list doesn't match the public name docker-registry.wikimedia.org.
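To illustrate the suspected failure mode: a TLS client that verifies the origin accepts a certificate only if the name it connects to matches an entry in the certificate's SAN list. The sketch below is a simplified version of that matching (exact match, or a `*.` wildcard covering exactly one leftmost label). The SAN entries in the usage example are hypothetical; the task does not list the actual certificate contents, and how ATS performs this check internally is not shown here.

```python
def dns_name_matches(pattern, hostname):
    """Simplified DNS name matching: exact, or '*.' wildcard for one leftmost label."""
    pattern = pattern.lower().rstrip(".")
    hostname = hostname.lower().rstrip(".")
    if pattern == hostname:
        return True
    if pattern.startswith("*."):
        suffix = pattern[1:]  # e.g. ".wikimedia.org"
        head, sep, tail = hostname.partition(".")
        # The wildcard must stand in for exactly one non-empty label.
        return bool(head) and sep == "." and "." + tail == suffix
    return False


def hostname_in_san(san_dns_names, hostname):
    """Return True if any SAN dNSName entry covers the hostname."""
    return any(dns_name_matches(name, hostname) for name in san_dns_names)


if __name__ == "__main__":
    # Hypothetical SAN list that only carries an internal service name.
    internal_only = ["registry.example.internal"]
    print(hostname_in_san(internal_only, "docker-registry.wikimedia.org"))

    # Hypothetical SAN list that also carries the public name.
    with_public = ["registry.example.internal", "docker-registry.wikimedia.org"]
    print(hostname_in_san(with_public, "docker-registry.wikimedia.org"))
```

Python's own `ssl` module performs an equivalent check during the handshake when `check_hostname` is enabled, so a helper like this is only useful for inspecting a SAN list offline.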

greg triaged this task as Unbreak Now! priority. Aug 27 2019, 9:44 PM
greg subscribed.

This is blocking CI runs.

This comment was removed by ayounsi.

Depooled cp1075 ats-be service via confctl, can someone retry and confirm mitigated?

It works!

Please leave this open for now so @ema can look at a more-permanent fixup tomorrow!

Jdforrester-WMF lowered the priority of this task from Unbreak Now! to Medium. Aug 27 2019, 10:00 PM

De-prioritising.

A proper fix for this issue is blocked on cergen bug T231423.

I am going to disable TLS between ATS and eqiad's docker-registry for the time being. cp1075 is also in eqiad, so there is no cross-DC traffic that could be snooped on anyway.

Change 532953 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: temporarily use plain HTTP to access docker-registry

https://gerrit.wikimedia.org/r/532953

Change 532953 merged by Ema:
[operations/puppet@production] ATS: temporarily use plain HTTP to access docker-registry

https://gerrit.wikimedia.org/r/532953

Change 533041 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: temporarily use plain HTTP to access docker-registry"

https://gerrit.wikimedia.org/r/533041

Change 533041 merged by Ema:
[operations/puppet@production] Revert "ATS: temporarily use plain HTTP to access docker-registry"

https://gerrit.wikimedia.org/r/533041

We have managed to generate a proper certificate for the docker-registry origin servers, and cp1075 is now back to using TLS to connect to them.

$ curl -v --resolve docker-registry.wikimedia.org:443:208.80.154.224 https://docker-registry.wikimedia.org/v2/wikimedia/mediawiki-services-kask/manifests/v1.0.3
[...]
> GET /v2/wikimedia/mediawiki-services-kask/manifests/v1.0.3 HTTP/2
> Host: docker-registry.wikimedia.org
[...]
< HTTP/2 200 
< date: Thu, 29 Aug 2019 10:45:21 GMT
< content-type: application/vnd.docker.distribution.manifest.v1+prettyjws
[...]
< x-cache: cp1075 miss, cp1077 miss
[...]
{
   "schemaVersion": 1,
   "name": "wikimedia/mediawiki-services-kask",
   "tag": "v1.0.3",
   "architecture": "amd64",