MW container image build workflow vs docker-registry caching
Closed, Resolved (Public)

Description

Executive summary

  • There is an HTTP caching proxy that sits in front of docker-registry.wikimedia.org.
  • The MW image build process may update tags.
  • HTTP clients (such as docker) which query the list of tags and/or the manifest for a tag may receive out-of-date information due to the cache.

This document is only concerned with mediawiki core/extension/etc commits merged into “train branches” (e.g. wmf/1.37.0-wmf.4), and commits to operations/mediawiki-config.

Workflow:
A change to mediawiki is merged into a train branch (e.g. wmf/1.37.0-wmf.4):

  1. The merge triggers a single-version image build.
  2. The single-version image will be tagged with the train branch (e.g. wmf-1.37.0-wmf.4) and pushed to the registry, probably updating an existing tag.
  3. The multiversion MW image build process is triggered.

Multiversion MW image build process:

  1. Read wikiversions.json from operations/mediawiki-config
  2. Copy in the contents of each unique single-version image mentioned in wikiversions.json.
  3. Push the constructed multiversion image to the registry with tag production, probably updating an existing tag.

Problem:
Let’s say a multiversion image build containing wmf.3 and wmf.4 has already run in the past (e.g. a few hours ago) and now a change has been backported to wmf.4. The new wmf.4 image is built and pushed to the registry. Now the multiversion build process tries to copy the files from the wmf.4 tagged image. Docker makes an HTTP request to the registry to see if a new image needs to be downloaded. The HTTP proxy sees a URL that it has seen before and returns the cached information which points to the old wmf.4 single-version image. The multiversion build process proceeds using the out-of-date image. Bad.
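The failure mode can be reproduced with a toy model. This is only a sketch: the `Registry` and `CachingProxy` classes, the one-hour TTL and the `sha256:old`/`sha256:new` digests are invented for illustration and say nothing about how the real proxy is configured.

```python
class Registry:
    """Toy registry: maps a tag to the digest it currently points at."""

    def __init__(self):
        self.tags = {}

    def push(self, tag, digest):
        # Pushing an existing tag silently repoints it (a mutable tag).
        self.tags[tag] = digest

    def manifest(self, tag):
        return self.tags[tag]


class CachingProxy:
    """Toy HTTP cache keyed on URL only, with a fixed TTL in seconds."""

    def __init__(self, registry, ttl):
        self.registry = registry
        self.ttl = ttl
        self.cache = {}  # url -> (expires_at, digest)

    def get_manifest(self, tag, now):
        url = f"/v2/wikimedia/mediawiki/manifests/{tag}"
        hit = self.cache.get(url)
        if hit and hit[0] > now:
            return hit[1]  # cache hit: the registry is never consulted
        digest = self.registry.manifest(tag)
        self.cache[url] = (now + self.ttl, digest)
        return digest


registry = Registry()
proxy = CachingProxy(registry, ttl=3600)  # TTL value is an assumption

registry.push("wmf-1.37.0-wmf.4", "sha256:old")
first = proxy.get_manifest("wmf-1.37.0-wmf.4", now=0)     # primes the cache
registry.push("wmf-1.37.0-wmf.4", "sha256:new")           # backport rebuild
stale = proxy.get_manifest("wmf-1.37.0-wmf.4", now=60)    # served from cache
fresh = proxy.get_manifest("wmf-1.37.0-wmf.4", now=3601)  # after TTL expiry
```

Here `stale` still points at the old image even though the registry already holds the new one, which is exactly what the multiversion build would consume.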

Possible solutions:

  1. Don’t cache certain accesses to the registry. It looks like https://gerrit.wikimedia.org/r/c/operations/puppet/+/691108 covers this option.
  2. Provide a way to invalidate the cache for tag/manifest URLs.
  3. Make all registry accesses through docker-registry.discovery.wmnet (suggested by @Joe) instead of docker-registry.wikimedia.org.
    1. Credentials are required for read access to docker-registry.discovery.wmnet
    2. The problem of out-of-date information for offsite read-only access still remains.
  4. Change the image build process so that tags are never reused. This means that the single-version image build process will need to communicate new tags to the multiversion build process, because asking the registry for the list of tags might return out-of-date cached info. This should be achievable if the build process is allowed to push and +2 a commit to operations/mediawiki-config.
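Option 4 might look roughly like the following. This is a sketch under assumptions: the timestamp-prefixed tag format, the helper names `unique_tag` and `record_tag`, and the flat version-to-tag JSON layout are all hypothetical. The changes later in this task introduce a train-versions.json file, but its actual schema is not shown here.

```python
import json
from datetime import datetime, timezone


def unique_tag(version, now=None):
    """Build a tag that is never reused by prefixing a UTC timestamp.

    The tag format is an assumption made for this sketch.
    """
    now = now or datetime.now(timezone.utc)
    return f"{now.strftime('%Y-%m-%d-%H%M%S')}-{version}"


def record_tag(mapping_json, version, tag):
    """Return updated JSON text mapping a train version to its newest tag.

    The multiversion build reads this mapping instead of asking the
    (possibly cached) registry which tag is current.
    """
    mapping = json.loads(mapping_json) if mapping_json.strip() else {}
    mapping[version] = tag
    return json.dumps(mapping, indent=2, sort_keys=True)


built_at = datetime(2021, 7, 1, 22, 31, 12, tzinfo=timezone.utc)
tag = unique_tag("wmf-1.37.0-wmf.4", built_at)
updated = record_tag("{}", "1.37.0-wmf.4", tag)
```

Because every build yields a URL the proxy has never seen, a cached manifest for an older tag can no longer be mistaken for the current one.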

Event Timeline

dancy triaged this task as High priority. May 13 2021, 9:58 PM
dancy updated the task description.
dancy added a subscriber: Joe.
dancy lowered the priority of this task from High to Low. May 13 2021, 10:11 PM
dancy updated the task description.

+1 to fixing the general problem. I don't know how much value we get from having ATS/varnish cache image layers vs. just having requests pass through, but the fact that :latest tags don't work reliably is mildly annoying; I've run into it with CI images too.

If we want to keep that caching, I think triggering purges for those URLs after pushing a new image makes the most sense. I guess it would need to be done in docker-pkg and blubber/pipelinelib.

We should do it in the registry itself:
https://docs.docker.com/registry/notifications/
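A receiver for those notifications could turn push events into the URLs that need purging. The sketch below assumes the event envelope has an `events` list whose entries carry `action` and `target.repository` / `target.tag`; check those field names against the linked documentation for the registry version in use. The function name `urls_to_purge` and the default host are illustrative, and the actual purge call to the cache is left out.

```python
def urls_to_purge(envelope, host="docker-registry.wikimedia.org"):
    """Map a registry notification envelope to cacheable URLs to purge.

    Only tagged manifest pushes are considered; the field names used
    here are assumptions about the notification format.
    """
    urls = []
    for event in envelope.get("events", []):
        if event.get("action") != "push":
            continue  # pulls and deletes do not make cached tags stale here
        target = event.get("target", {})
        repo = target.get("repository")
        tag = target.get("tag")
        if not repo or not tag:
            continue  # layer pushes and digest-only pushes carry no tag
        candidates = (
            f"https://{host}/v2/{repo}/manifests/{tag}",
            f"https://{host}/v2/{repo}/tags/list",
        )
        for url in candidates:
            if url not in urls:
                urls.append(url)
    return urls


example = {
    "events": [
        {
            "action": "push",
            "target": {
                "repository": "wikimedia/mediawiki",
                "tag": "wmf-1.37.0-wmf.4",
            },
        },
        {"action": "pull", "target": {"repository": "x", "tag": "y"}},
    ]
}
purge_list = urls_to_purge(example)
```

Doing this next to the registry means docker-pkg, blubber and pipelinelib would not each need their own purge logic.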

Related: T256762: Fix nginx config and caching for docker registry and maybe also T264209: Run stress tests on docker images infrastructure.

We should do it in the registry itself:
https://docs.docker.com/registry/notifications/

Ooh, neat.

dancy raised the priority of this task from Low to High. May 13 2021, 11:32 PM

After changing from docker-registry.wikimedia.org to docker-registry.discovery.wmnet I get the following error during the image build process:

Step 9/19 : COPY --chown=65533:65533 --from=docker-registry.discovery.wmnet/wikimedia/mediawiki:wmf-1.37.0-wmf.4 ["/srv/mediawiki", "/srv/mediawiki/php-1.37.0-wmf.4"]
16:26:21  invalid from flag value docker-registry.discovery.wmnet/wikimedia/mediawiki:wmf-1.37.0-wmf.4: Get https://docker-registry.discovery.wmnet/v2/wikimedia/mediawiki/manifests/wmf-1.37.0-wmf.4: no basic auth credentials

I know /usr/local/bin/docker-pusher handles credentials during pushes, but it seems that the same creds need to be made available to the docker daemon in general for pulls. Or something.

Yes, you need to use the same credentials you use to push the images. The internal registry is private and thus requires auth for both pulling and uploading.

The single-version image will be tagged with the train branch (e.g. wmf-1.37.0-wmf.4) and pushed to the registry, probably updating an existing tag.

Or probably not updating an existing tag. Mutable image tags are a huge problem anyway (see the usual rant in https://vsupalov.com/docker-latest-tag/ ). This is especially true for tags that look well versioned: the caution experienced people have built up around obviously mutable tags like latest or production is lowered there, so mutating them is a recipe for a lot of pain (been there, done that).

Push the constructed multiversion image to the registry with tag production, probably updating an existing tag.

While a mutable tag has some uses (e.g. CI always knowing it pulls the latest/stable image), we should most definitely not use a mutable tag for deployments (unless we want to enter the realm of "did it really update? or not? how do I check?"), which raises the question of what a production tag would be used for.

Change 702225 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] Add empty train-versions.json

https://gerrit.wikimedia.org/r/702225

Change 702225 merged by jenkins-bot:

[operations/mediawiki-config@master] Add empty train-versions.json

https://gerrit.wikimedia.org/r/702225

Change 702468 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/release@master] Initialize state/train-versions.json file

https://gerrit.wikimedia.org/r/702468

Change 702468 merged by jenkins-bot:

[mediawiki/tools/release@master] Initialize state/train-versions.json file

https://gerrit.wikimedia.org/r/702468

Change 702704 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/mediawiki-config@master] Use train-versions.json to map from version to image tag

https://gerrit.wikimedia.org/r/702704

Change 702704 merged by jenkins-bot:

[operations/mediawiki-config@master] Use train-versions.json to map from version to image tag

https://gerrit.wikimedia.org/r/702704

Mentioned in SAL (#wikimedia-operations) [2021-07-01T22:31:12Z] <dancy@deploy1002> Synchronized .pipeline: Config: [[gerrit:702704|Use train-versions.json to map from version to image tag (T282824)]] (duration: 00m 57s)

dancy claimed this task.