Docker image on the build host seems to ignore apt priority for Wikimedia packages
Closed, Resolved, Public

Description

I had a problem today when building the Docker images for T265324, where php-igbinary would get installed from the base repository instead of from the WMF one.

The problem seems to be that the apt priority we define in our base image:

root@deneb:/srv/images/production-images# docker run --rm docker-registry.wikimedia.org/wikimedia-buster:latest /bin/cat /etc/apt/preferences.d/wikimedia
Package: *
Pin: release o=Wikimedia
Pin-Priority: 1001

doesn't get honored when running on deneb. This is even more mysterious:

deneb:~$ sudo docker run --rm docker-registry.wikimedia.org/wikimedia-buster:latest /bin/sh -c "echo 'Acquire::http::Proxy \"http://webproxy.codfw.wmnet:8080\";' > /etc/apt/apt.conf.d/80_proxy && apt-get update && apt-cache policy prometheus-php-fpm-exporter"
Get:1 http://security.debian.org buster/updates InRelease [65.4 kB]
Get:2 http://mirrors.wikimedia.org/debian buster InRelease [121 kB]
Get:3 http://apt.wikimedia.org/wikimedia buster-wikimedia InRelease [99.7 kB]
Get:4 http://mirrors.wikimedia.org/debian buster-updates InRelease [51.9 kB]
Get:5 http://security.debian.org buster/updates/main amd64 Packages [315 kB]
Get:6 http://mirrors.wikimedia.org/debian buster-backports InRelease [46.7 kB]
Get:7 http://mirrors.wikimedia.org/debian buster/main amd64 Packages [10.7 MB]
Get:8 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages [63.8 kB]
Get:9 http://mirrors.wikimedia.org/debian buster-updates/main amd64 Packages [8728 B]
Get:10 http://mirrors.wikimedia.org/debian buster-backports/main amd64 Packages [372 kB]
Get:11 http://mirrors.wikimedia.org/debian buster-backports/contrib amd64 Packages [7820 B]
Fetched 11.9 MB in 2s (6640 kB/s)
Reading package lists...
prometheus-php-fpm-exporter:
  Installed: (none)
  Candidate: 0.4.1+git20181018.d0d1837-2
  Version table:
     0.4.1+git20181018.d0d1837-2 500
        500 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages

but it works with the *same image* on my computer, where I am not adding the proxy configuration:

$ sudo docker run --rm docker-registry.wikimedia.org/wikimedia-buster:latest /bin/sh -c "apt-get update && apt-cache policy prometheus-php-fpm-exporter"
Get:1 http://security.debian.org buster/updates InRelease [65.4 kB]
Get:2 http://security.debian.org buster/updates/main amd64 Packages [315 kB]
Get:3 http://apt.wikimedia.org/wikimedia buster-wikimedia InRelease [99.7 kB]
Get:4 http://mirrors.wikimedia.org/debian buster InRelease [121 kB]
Get:5 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages [63.8 kB]
Get:6 http://mirrors.wikimedia.org/debian buster-updates InRelease [51.9 kB]
Get:7 http://mirrors.wikimedia.org/debian buster-backports InRelease [46.7 kB]
Get:8 http://mirrors.wikimedia.org/debian buster/main amd64 Packages [10.7 MB]
Get:9 http://mirrors.wikimedia.org/debian buster-updates/main amd64 Packages [8728 B]
Get:10 http://mirrors.wikimedia.org/debian buster-backports/main amd64 Packages [372 kB]
Get:11 http://mirrors.wikimedia.org/debian buster-backports/contrib amd64 Packages [7820 B]
Fetched 11.9 MB in 3s (3929 kB/s)
Reading package lists...
prometheus-php-fpm-exporter:
  Installed: (none)
  Candidate: 0.4.1+git20181018.d0d1837-2
  Version table:
     0.4.1+git20181018.d0d1837-2 1001
       1001 http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages

Why is this happening? It needs to be investigated ASAP as this can lead to unintended consequences.
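
The check done by eye above can be scripted. This is a minimal sketch, not part of the original report: the function names candidate_priority and check_pin are hypothetical, and it assumes the `apt-cache policy` output layout shown in the two transcripts (a "Candidate:" line followed by a "Version table:" whose version lines end with the priority).

```shell
#!/bin/sh
# Hypothetical helper: read `apt-cache policy <package>` output on stdin
# and print the pin priority listed next to the candidate version.
candidate_priority() {
    awk '
        # Remember the candidate version, e.g. "0.4.1+git20181018.d0d1837-2".
        $1 == "Candidate:" { cand = $2 }
        # Inside the version table, a version line ends with
        # "<version> <priority>" (optionally prefixed by "***").
        in_table && NF >= 2 && $(NF - 1) == cand { print $NF; exit }
        /Version table:/ { in_table = 1 }
    '
}

# Succeed only if the candidate priority is at least the 1001 set in
# /etc/apt/preferences.d/wikimedia.
check_pin() {
    prio=$(candidate_priority)
    [ -n "$prio" ] && [ "$prio" -ge 1001 ]
}
```

Inside the container it could be used as:

    apt-cache policy prometheus-php-fpm-exporter | check_pin || echo "pin not honored" >&2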

Event Timeline

OK, found the problem, and it's slightly embarrassing:

docker-registry.wikimedia.org/wikimedia-buster:latest was 12 months old on deneb, while

docker-registry.discovery.wmnet/wikimedia-buster:latest was not.

This means this issue affects *all* of our images based on buster that reference the public image.
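
A quick way to spot this kind of drift is to compare what the two names point to locally. The following is only a sketch: the function names are hypothetical, and it relies on `docker image inspect --format` with the `.Id` and `.Created` fields.

```shell
#!/bin/sh
# Print "<image id> <creation timestamp>" for a locally stored image.
# Fails if the image is not present on the host.
local_image_info() {
    docker image inspect --format '{{.Id}} {{.Created}}' "$1"
}

# Succeed only when both references resolve to the same local image ID.
# Returns 2 if either image cannot be inspected.
same_local_image() {
    id_a=$(docker image inspect --format '{{.Id}}' "$1") || return 2
    id_b=$(docker image inspect --format '{{.Id}}' "$2") || return 2
    [ "$id_a" = "$id_b" ]
}
```

On the build host this could be run as:

    same_local_image docker-registry.wikimedia.org/wikimedia-buster:latest docker-registry.discovery.wmnet/wikimedia-buster:latest || echo "local copies differ" >&2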

What needs to be done now:

  • modify the build-base-images script to also re-tag the public image, or at least to remove it from the host
  • rebuild all of the buster production images
  • start referencing the registry in the templates with a variable (that might require changes to docker-pkg).
  • add a periodic job on the build host to remove all docker images locally (this might be tricky)
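
The first item in the list above could look roughly like this. It is a sketch only, not the actual change to the build-base-images script: the function and variable names are hypothetical, and it uses nothing beyond `docker tag` and `docker rmi`.

```shell
#!/bin/sh
# Registry names as they appear in this task.
INTERNAL_REGISTRY="docker-registry.discovery.wmnet"
PUBLIC_REGISTRY="docker-registry.wikimedia.org"

# After a base image is built under the internal name, point the public
# name at the same image so the two local tags cannot drift apart.
# $1 is the image name with tag, e.g. "wikimedia-buster:latest".
retag_public() {
    docker tag "${INTERNAL_REGISTRY}/$1" "${PUBLIC_REGISTRY}/$1"
}

# Alternative: drop the public tag from the host, so the next build has
# to pull it again instead of reusing a stale local copy.
remove_public() {
    docker rmi "${PUBLIC_REGISTRY}/$1"
}
```

Either function would be called once per base image right after the build, for example `retag_public wikimedia-buster:latest`.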

Change 643228 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/docker-images/production-images@master] Rebuild all buster-based images

https://gerrit.wikimedia.org/r/643228

Change 643228 merged by Giuseppe Lavagetto:
[operations/docker-images/production-images@master] Rebuild all buster-based images

https://gerrit.wikimedia.org/r/643228

Ouch.

Could we normalize everything to use the public image reference? That would also make local testing easier and more straightforward.
Or do we gain a big benefit by using the internal reference?

The point is consistency. We want to use the same registry when referencing images and saving them.

Anyway, I'll add a couple of safeguards for the future, and that should be enough for now.

Sure. My question is more like: why did we start using both names in the first place, and can we stop doing so? :)

Couple of reasons:

  • To avoid having all Kubernetes nodes go through the edge caches to fetch the images when deploys happen. Aside from the possibility of polluting the edge caches (some misconfiguration could evict a lot of hot content from the cache because of the images' size), there is also the issue of saturation (which we don't run into right now, due to other bottlenecks in our infrastructure, but which we aim to address). T264209
  • To allow easily switching from the eqiad to the codfw registry and vice versa without relying on the edge caching pool/depool logic, which is more involved than a simple confctl command.

Okay, thanks. So it's more like we should not use the external reference then. I might have messed that up in some places...will check on that.

Change 643263 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] docker::images: tweak the build-base-images script

https://gerrit.wikimedia.org/r/643263

Change 643263 merged by Giuseppe Lavagetto:
[operations/puppet@production] docker::images: tweak the build-base-images script

https://gerrit.wikimedia.org/r/643263

Dzahn triaged this task as Medium priority. Nov 24 2020, 6:39 PM

Change 666116 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] deployment_server: Update default envoy image to 1.15.1-4

https://gerrit.wikimedia.org/r/666116

Change 666117 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/puppet@production] deployment_server: Update default prometheus-statsd-exporter image to 0.0.9

https://gerrit.wikimedia.org/r/666117

Change 666116 merged by JMeybohm:
[operations/puppet@production] deployment_server: Update default envoy image to 1.15.1-4

https://gerrit.wikimedia.org/r/666116

Change 666117 merged by JMeybohm:
[operations/puppet@production] deployment_server: Update default prometheus-statsd-exporter image to 0.0.9

https://gerrit.wikimedia.org/r/666117