docker-registry.wikimedia.org keeps serving bad blobs
Open, High, Public, Bug Report

Description

The manifest for docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-03-27-200753-publish-81 references an image layer blob sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59:

...
      {
         "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
         "size": 1119805497,
         "digest": "sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59"
      },
...

The registry returns a blob of the right length but wrong hash:

$ curl -v -n https://docker-registry.wikimedia.org/v2/restricted/mediawiki-multiversion-debug/blobs/sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59 > sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59

...
< content-type: application/octet-stream
< content-length: 1119805497
< docker-content-digest: sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59
< docker-distribution-api-version: registry/2.0
< etag: "sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59"
...
< age: 0
< x-cache: cp1110 pass, cp1110 pass
< x-cache-status: pass
< server-timing: cache;desc="pass", host;desc="cp1110"
...

$ ls -l sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59
-rw-rw-r-- 1 dancy wikidev 1119805497 Mar 28 00:21 sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59

That is the expected size of the blob from the manifest.

$ sha256sum sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59
da9e1fa86230529d142f8385e1d13f8cbe308bf970a0a25607512de08f47ad29  

But the computed hash doesn't match the digest from the manifest.
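
For reference, the same check can be scripted. A minimal sketch, assuming Python 3 with the requests library available and credentials handled via ~/.netrc (as with curl -n above):

import hashlib
import sys

import requests


def verify_blob(registry, repo, digest):
    """Stream a blob from the registry and compare it against its expected digest."""
    url = f"https://{registry}/v2/{repo}/blobs/{digest}"
    h = hashlib.sha256()
    size = 0
    # requests falls back to ~/.netrc when no explicit auth is given.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=1 << 20):
            h.update(chunk)
            size += len(chunk)
    actual = "sha256:" + h.hexdigest()
    print(f"size={size} expected={digest} actual={actual}")
    return actual == digest


if __name__ == "__main__":
    # e.g.: verify_blob.py docker-registry.wikimedia.org \
    #           restricted/mediawiki-multiversion-debug sha256:e7b2...
    sys.exit(0 if verify_blob(*sys.argv[1:4]) else 1)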

Details

Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/puppet | production | +1 -1
labs/private | master | +16 -16
operations/puppet | production | +1 -1
labs/private | master | +0 -18
operations/puppet | production | +2 -2
operations/puppet | production | +31 -1
operations/puppet | production | +2 -2
operations/puppet | production | +189 -99
operations/puppet | production | +8 -9
operations/puppet | production | +120 -120
operations/puppet | production | +13 -0
labs/private | master | +16 -1
operations/puppet | production | +7 -12
operations/puppet | production | +1 -0
operations/puppet | production | +0 -1
operations/puppet | production | +2 -1
operations/puppet | production | +1 -1
operations/puppet | production | +33 -20
operations/puppet | production | +3 -0
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
build-images.py: Don't sleep after full build in train-dev | repos/releng/release!189 | dancy | main-Ia9831d65ecb031642f8cb47e672a6520eed581ce | main
make-container-image: sleep only after full mediawiki image build | repos/releng/release!165 | swfrench | work/swfrench/T390251-pause-on-full-build | main
build-images.py: Temp sleep for swift consistency | repos/releng/release!164 | cgoubert | T390251 | main

Event Timeline

swfrench opened https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/165

make-container-image: sleep only after full mediawiki image build

swfrench merged https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/165

make-container-image: sleep only after full mediawiki image build

Mentioned in SAL (#wikimedia-operations) [2025-04-16T19:30:25Z] <swfrench@deploy1003> Started scap sync-world: Test stop-before-sync scap run to pick up make-container-image changes - T390251

IIRC it caused more nginx 500 errors due to tmpfs exhaustion, so we decided to roll back since it was making things worse.

serializes only layers per push. Multiple pushes can still happen simultaneously, and IIUC scap does do that. Maybe we could have a flag in scap, e.g. --image-push-concurrency=1, to at least verify or rule out this hypothesis? It would slow down deployment for a couple of weeks but would give us a strong signal to guide us in resolving this.

+1 I like this, @dancy @dduvall what do you think about it?

Last but not least: we are running a very old Docker version on deploy1003, the one on Bullseye. We may gain some improvement simply by moving to Bookworm's version, but I am not sure what the blockers are to moving deploy1003 to Bookworm.

It's 20.10.24+dfsg1-1+deb12u1 vs 20.10.5+dfsg1-1+deb11u4, so at least judging by version numbers there shouldn't be a huge change. That being said, I did scan through https://docs.docker.com/engine/release-notes/20.10/#20105 and the only interesting entry I see is the following.

  • Add retry on image push 5xx errors. moby/buildkit#2043.

But that's about it. Everything else seems unrelated to the pushing of images. I'll see if it is already included in the +deb12u1 Debian patches. Trixie, btw, has 26.1.5+dfsg1-9+b2, but that's not yet released. We could go for it, but I'd like something more concrete before we open that can of worms, since I've already grepped (NOT read, not even skimmed) the release notes for versions 23 to 26.1, and the words "builder" and "push" yielded very few results, none of them promising.

Thanks for the research; indeed it doesn't look like the upgrade by itself will help, but I'd try it anyway. It has happened to me in the past that useful commits were not listed in the main changelogs, so if there is anything in there that improves the current situation, the upgrade would be worth it in my opinion.

Now, I am mildly spitballing here, but we could spin up a secondary registry process with a Ceph+S3 backend and use it just for the /restricted part of the registry. This would move us away from the Swift driver and the possible eventual-consistency issues we are seeing. It would of course move us toward possibly similar problems with the S3 driver, plus Ceph issues. IIUC, all we need is a second instance of the daemon on a different port with a different configuration file, and we might be able to test. Does that sound plausible/sensible?

I thought about the same, and it would be good to also get experience with S3 (which we'll have to use for any Docker registry solution we choose). It is a sizeable amount of work though, so it probably will not happen soon. Let's find volunteers first :D (and/or establish a group of people willing to work on this across teams).

Recap: after Clement's and Scott's patches it seems that we have temporarily fixed the consistency issue while pulling, which in my opinion was the bulk of the pain. While we think about improving the sleep time, it would be really nice if scap implemented its own ad-hoc push serialization logic, to verify that the second issue is indeed related to pushing too many Docker layers (of various sizes, etc.) concurrently to the registry.

As a short-term mitigation, it seems reasonable, and serializing the image pushes in build-images.py should be straightforward.
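
A hypothetical sketch of what that could look like (the function and image names here are illustrative, not the actual build-images.py code): instead of dispatching pushes to a pool, loop over them so only one image's layer uploads hit the registry at a time.

import subprocess

def push_images_serially(images):
    """Push images one at a time instead of concurrently."""
    for image in images:
        # dockerd may retry individual failed layer uploads on its own (see the
        # 20.10.5 release note above); here we only avoid several simultaneous
        # pushes competing against the Swift backend.
        subprocess.run(["docker", "push", image], check=True)

# Illustrative image list; the real one comes from the build step.
push_images_serially([
    "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:TAG",
    "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:TAG",
])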

Yes, we really need to upgrade Docker. I would also like to get docker-buildx-plugin installed so we can make use of BuildKit-only features in our build.

Related to this, there are 2G of completely redundant layers in our PHP 7.4/8.1 "flavored" images due to the independent rsyncs of the same /srv/mediawiki-staging tree for each "flavor", and I'm wondering if eliminating this redundancy would help mitigate the registry storage issue.

Interestingly, the tarball entries as well as the extracted contents of the two layers are identical. However, after writing a small Go program to compare the metadata of the entries in each layer tarball, I found that it's the mtime of /workdir (the bind-mount target of /srv/mediawiki-staging during the rsync) and the mtime of /srv that differ.

 $  tardiff 2025-04-22-001437-publish/c8ff0884fccfb168ee9674a3ef5f1d3681dfc30c80b7d5485aae30c714c552d1/layer.tar 2025-04-22-001437-publish-81/722c607ee26915ac2b6f16b763eb2255002698e8fe1c31d76553b55477c840c2/layer.tar
--- 2025-04-22-001437-publish/c8ff0884fccfb168ee9674a3ef5f1d3681dfc30c80b7d5485aae30c714c552d1/layer.tar:srv/
+++ 2025-04-22-001437-publish-81/722c607ee26915ac2b6f16b763eb2255002698e8fe1c31d76553b55477c840c2/layer.tar:srv/
@@ -8,7 +8,7 @@
 Gid: (int) 0,
 Uname: (string) (len=4) "root",
 Gname: (string) (len=4) "root",
-ModTime: (time.Time) 2025-04-21 06:09:20 -0700 PDT,
+ModTime: (time.Time) 2025-04-21 06:09:18 -0700 PDT,
 AccessTime: (time.Time) 0001-01-01 00:00:00 +0000 UTC,
 ChangeTime: (time.Time) 0001-01-01 00:00:00 +0000 UTC,
 Devmajor: (int64) 0,
--- 2025-04-22-001437-publish/c8ff0884fccfb168ee9674a3ef5f1d3681dfc30c80b7d5485aae30c714c552d1/layer.tar:workdir/
+++ 2025-04-22-001437-publish-81/722c607ee26915ac2b6f16b763eb2255002698e8fe1c31d76553b55477c840c2/layer.tar:workdir/
@@ -8,7 +8,7 @@
 Gid: (int) 0,
 Uname: (string) (len=4) "root",
 Gname: (string) (len=4) "root",
-ModTime: (time.Time) 2025-04-21 06:09:29 -0700 PDT,
+ModTime: (time.Time) 2025-04-21 06:09:22 -0700 PDT,
 AccessTime: (time.Time) 0001-01-01 00:00:00 +0000 UTC,
 ChangeTime: (time.Time) 0001-01-01 00:00:00 +0000 UTC,
 Devmajor: (int64) 0,

If it weren't for these two mtime differences, the MW code layers would be identical, and I'm pretty sure we would save 2G of blob uploads (and downloads for each node). (What's a bit funny is that train-dev produces images without these differences in the MW code layers. I suspect this is because the rsync processes happen to finish during the exact same second, so obviously this is not a reliable way to get identical layers.)
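
(For reference, a rough Python equivalent of that comparison; the tardiff tool used above is a small Go program, and the paths below are just placeholders.)

import tarfile

def mtime_diffs(path_a, path_b):
    """Print entries whose mtimes differ between two layer tarballs."""
    with tarfile.open(path_a) as ta, tarfile.open(path_b) as tb:
        b_members = {m.name: m for m in tb.getmembers()}
        for a in ta.getmembers():
            b = b_members.get(a.name)
            if b is not None and a.mtime != b.mtime:
                print(f"{a.name}: {a.mtime} != {b.mtime}")

# Example (placeholder paths):
# mtime_diffs("publish/<digest>/layer.tar", "publish-81/<digest>/layer.tar")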

IMO the best way to refactor build-images.py for layer efficiency (and possibly to mitigate this issue by reducing total image size) would be:

  1. Build a single "code only" image using the incremental rsync-based process.
  2. Push the code image immediately to allow the registry to start persisting it.
  3. After the code image is pushed, start to build the webserver, web app, cli, and debug images in a single BuildKit context and merge in the code image layers using COPY --link --from=mediawiki-code / / (BuildKit-only feature that achieves a deterministic reuse of existing layers).
  4. Push images (serially, still, if pushing the code image first isn't enough).

I've been working on this refactor a bit already, and I'm hoping to have something to push up for evaluation/review this week. But again, it would be blocked on a newer Docker (with the default BuildKit builder), so whatever we can do to get that upgraded would be great.

I will also see about getting the serial image push change done tomorrow.

In our team meeting yesterday, @dancy shared a fix for the swift storage driver! I'll let him update you all on the details, and I'll hold off on the other workarounds for now.

I'm moving ahead with this idea in T392526: Refactor `build-images.py` to use a common code image and `docker buildx` for other reasons that I explain in the task, but no longer as a mitigation for this issue.

Hi all. I have prepared a sequence of 3 commits against the v2.8.3 tag of https://github.com/distribution/distribution/. They are located in the swift-hacks branch of my fork of that repo at https://github.com/dancysoft/distribution/commits/swift-hacks/. These changes address two problems:

  • Bad data delivered by the registry:

This is the most important problem. My changes address this problem by adding headers to blobs so the registry can reliably determine if it is looking at a consistent blob or not when reading. The registry will now never deliver data from an inconsistent blob.
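
Conceptually (this is a hypothetical illustration at the Swift HTTP API level, not the actual Go patches, and X-Object-Meta-Expected-Size is a made-up metadata key), the pattern is: record what the complete blob should look like as object metadata at write time, and refuse to serve data that does not match that marker at read time.

import requests

def put_blob(storage_url, token, container, name, data):
    # Store a marker describing the complete blob alongside the object.
    requests.put(
        f"{storage_url}/{container}/{name}",
        data=data,
        headers={
            "X-Auth-Token": token,
            "X-Object-Meta-Expected-Size": str(len(data)),
        },
    ).raise_for_status()

def get_blob_checked(storage_url, token, container, name):
    resp = requests.get(f"{storage_url}/{container}/{name}",
                        headers={"X-Auth-Token": token})
    resp.raise_for_status()
    expected = resp.headers.get("X-Object-Meta-Expected-Size")
    if expected is None or int(expected) != len(resp.content):
        # Marker missing or mismatched: the blob is not (yet) consistent,
        # so do not serve it.
        raise IOError(f"{name}: inconsistent blob, refusing to serve")
    return resp.content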

  • Unreliable uploads:

It is possible that an upload created at one moment may not be seen the moment after (Swift does not offer read-after-write consistency). Compensate for this by passing the X-Newest header on non-blob object retrievals from Swift. This ensures that the most up-to-date available information is used, reducing the likelihood of read-after-write inconsistencies. It comes at the cost of increased load on replicas for these types of accesses, which will be something to keep an eye on.
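
For context, X-Newest is a standard Swift proxy feature: on GET/HEAD it asks the proxy to consult all replicas and return the most recent copy rather than the first one that answers. Illustrated at the raw HTTP level (again in Python, not the Go driver code):

import requests

def get_object_newest(storage_url, token, container, name):
    """Fetch a Swift object, asking the proxy for the newest replica."""
    resp = requests.get(
        f"{storage_url}/{container}/{name}",
        headers={
            "X-Auth-Token": token,
            # Query all replicas and return the most recent copy; this is the
            # extra replica load trade-off mentioned above.
            "X-Newest": "true",
        },
    )
    resp.raise_for_status()
    return resp.content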

I recommend that we run a custom build of the registry with these changes to resolve the problems in this ticket for the time being. The ConsistencyTimeout setting should be set to a value that allows plenty of time for replication to complete.

Btw the changes mentioned above have been reviewed by @dduvall and @Scott_French .

I quickly checked the patches and the refactoring work is really nice, and I like the two new options. Really good job :)

I am a bit on the fence about how to deploy this in production, since we may encounter unexpected bugs in other use cases (like non-scap image push/pull, etc.). Upstream will not review the patches since the Swift backend is decommissioned, so we'll have to maintain our own fork for the time being. To safely test the new changes, it would be easy enough to spin up a new registry stack (VM + Redis instance + Swift bucket) and point scap and other tools to it incrementally, to quickly see whether everything is good or not. But on the other hand, we are already planning to move away from Swift in favor of S3, so if we have to create a new stack I'd start with S3 directly.

The alternative is to build the new deb, deploy it in prod, and check how it goes over the next few days, something that I am not totally fond of but which may be necessary (I am aware that Ahmon has probably tested multiple cases locally already, but the risk of unexpected side effects is not zero).

@akosiaris @Scott_French - lemme know if you have a preference, but this is a sign for me that we should start working on a medium term solution for the registry (even before talking about switching to something else). Maybe simply moving to apus/S3 could be enough?

We discussed this yesterday in the ServiceOps meeting. The two paths aren't particularly different, at least in the amount of work required to implement them. Whether we configure the /restricted namespace to go to an instance of the registry with @dancy's patches applied, or to a second instance of the same unpatched software using S3 as a backend, the configuration and testing side of the work is the same. What differs is:

  • the amount of work needed to make sure apus is up to the task (for the S3 solution)
  • the amount of work needed to build and package the patched software.

The latter is admittedly less work; however, it is also the one where we'd just be putting in a stopgap while still having to work on the former to get us on a sustainable path. So we think we should go straight to the apus/S3 solution, gambling that it will pay off more or less immediately, and leave the other approach as a hedge in case it doesn't.

Opened T394476 to see if apus can take over the current load that we have on Swift. After the sign-off we'll be able to reason about concrete next steps.

Change #1154301 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] registry: Minor Puppet cleanups

https://gerrit.wikimedia.org/r/1154301

Change #1154302 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] docker_registry_ha: Refactor to make it docker_registry

https://gerrit.wikimedia.org/r/1154302

Change #1154301 merged by Alexandros Kosiaris:

[operations/puppet@production] registry: Minor Puppet cleanups

https://gerrit.wikimedia.org/r/1154301

Change #1155257 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] docker_registry: Move rsyslog rules from init to web.pp

https://gerrit.wikimedia.org/r/1155257

Change #1155258 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] docker_registry: Refactor to allow >1 instance

https://gerrit.wikimedia.org/r/1155258

Change #1155601 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] Rename docker_registry_ha's occurrences to docker_registry

https://gerrit.wikimedia.org/r/1155601

Change #1156761 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[labs/private@master] registry: Add hiera for the new hierarchy

https://gerrit.wikimedia.org/r/1156761

Change #1156762 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[labs/private@master] Remove old docker_registry_ha hiera keys

https://gerrit.wikimedia.org/r/1156762

Change #1156761 merged by Alexandros Kosiaris:

[labs/private@master] registry: Add hiera for the new hierarchy

https://gerrit.wikimedia.org/r/1156761

Change #1156767 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] pontoon: Add stack registry

https://gerrit.wikimedia.org/r/1156767

Change #1156767 merged by Filippo Giunchedi:

[operations/puppet@production] pontoon: Add stack registry

https://gerrit.wikimedia.org/r/1156767

Change #1154302 merged by Alexandros Kosiaris:

[operations/puppet@production] docker_registry_ha: Refactor to make it docker_registry

https://gerrit.wikimedia.org/r/1154302

Change #1155257 merged by Alexandros Kosiaris:

[operations/puppet@production] docker_registry: Move rsyslog rules from init to web.pp

https://gerrit.wikimedia.org/r/1155257

Change #1155258 merged by Alexandros Kosiaris:

[operations/puppet@production] docker_registry: Refactor to allow >1 instance

https://gerrit.wikimedia.org/r/1155258

Mentioned in SAL (#wikimedia-operations) [2025-06-13T11:41:47Z] <akosiaris> T390251 re-enable puppet on registry1004 after merging puppet refactoring changes.

Change #1156809 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] docker_registry: Make sure ports are in the right format

https://gerrit.wikimedia.org/r/1156809

Change #1156809 merged by Alexandros Kosiaris:

[operations/puppet@production] docker_registry: Make sure ports are in the right format

https://gerrit.wikimedia.org/r/1156809

Mentioned in SAL (#wikimedia-operations) [2025-06-13T12:21:19Z] <akosiaris> T390251 re-enable puppet on all registries.

Change #1156829 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] docker_registry: Instantiate APUs s3 backend instance

https://gerrit.wikimedia.org/r/1156829

Change #1156835 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] docker_registry: Pass defaults to 2 option parameters

https://gerrit.wikimedia.org/r/1156835

Change #1156835 merged by Alexandros Kosiaris:

[operations/puppet@production] docker_registry: Pass defaults to 2 option parameters

https://gerrit.wikimedia.org/r/1156835

Change #1156829 merged by Alexandros Kosiaris:

[operations/puppet@production] docker_registry: Instantiate APUs s3 backend instance

https://gerrit.wikimedia.org/r/1156829

Change #1159291 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Update Cumin alias for Docker registry

https://gerrit.wikimedia.org/r/1159291

Change #1156762 merged by Alexandros Kosiaris:

[labs/private@master] Remove old docker_registry_ha hiera keys

https://gerrit.wikimedia.org/r/1156762

Change #1159291 merged by Alexandros Kosiaris:

[operations/puppet@production] Update Cumin alias for Docker registry

https://gerrit.wikimedia.org/r/1159291

Change #1155601 abandoned by Alexandros Kosiaris:

[labs/private@master] Rename docker_registry_ha's occurrences to docker_registry

Reason:

I think https://gerrit.wikimedia.org/r/c/labs/private/+/1156761/ and https://gerrit.wikimedia.org/r/c/labs/private/+/1156762 cover this as well, so I'll abandon this, but feel free to restore it if I missed something.

https://gerrit.wikimedia.org/r/1155601

bd808 changed the subtype of this task from "Task" to "Bug Report". Aug 12 2025, 11:03 PM

@akosiaris looks like you had a lot of changes for making a new registry, is that registry ready? Or is it still in the testing phase?

The latter. The backend is set up and we've had a couple of successful pushes, but also some weird behavior that we want to reproduce and investigate. The plan is to switch just MediaWiki's /restricted part to the new registry once we are confident.

The /var/lib/nginx path is a separate mount point, using tmpfs, with a size of 4G.

Apparently this is too small for known workloads, as today there was an issue pushing to the registry when it ran out of space.

After the push worked at a later point, this was back to 0 usage (observed on registry2004).

16:45 < cdanis> 2025/09/16 15:56:55 [crit] 1731183#1731183: *1091214 pwrite() "/var/lib/nginx/body/0000003080" failed (28: No space left on device), client: 10.64.16.93, server: , request:  "PATCH /v2/restricted/mediawiki-multiversion/blobs/uploads....

My understanding from the last updates is that we are not actively pushing anything to the new registry with apus as the backend; we just did some quick tests a few months ago. Is that the right understanding?

After T406392 I think that we have really exhausted the amount of engineering hours that is tolerable to spend debugging this issue, and we should really focus on forming a working group that moves us away from Swift. That is not an easy or quick task, so we may want to reconsider the option of patching the current registry's Swift driver as an interim solution (after some testing, of course).

Lemme know your thoughts! Happy to coordinate the efforts for a WG if needed.

Per radosgw-admin user stats --uid=docker-registry there are only 11 objects in that account, which I think means it is not currently being used.

T406392 is a good reminder of the fact that the bandaids we might otherwise fall back on (e.g., sleeps, internal retries) are not available in all contexts, so even though we've largely focused on the MediaWiki image use case here, we really need a more systematic solution.

So yes, +1 to prioritizing either or both of migrating away from Swift (clearly a longer-term effort) and reviving Ahmon's driver improvements (also non-trivial to test and deploy, but still faster), though this will likely need to wait until the new year for any significant progress.

@akosiaris do you think that the idea of forming a dedicated working group for the next couple of quarters could be feasible? I can take care of kicking it off and finding volunteers (sounds like me and Scott are already in :D).

I don't see how else we can solve this without some dedicated and focused time on it. It requires some discovery and testing, and that takes time. Finding that time is the difficult part; I'll raise this at the management level and see how we can resource it.

Change #1225526 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::docker_registry: turn off backend redirects for Swift

https://gerrit.wikimedia.org/r/1225526

Change #1225526 merged by Elukey:

[operations/puppet@production] profile::docker_registry: turn off backend redirects for Swift

https://gerrit.wikimedia.org/r/1225526

Mentioned in SAL (#wikimedia-operations) [2026-01-13T11:03:27Z] <elukey> disable HTTP redirects to the Swift backend for all the Docker registries - T390251

Mentioned in SAL (#wikimedia-operations) [2026-01-13T16:41:40Z] <elukey> roll restart docker-registry-swift daemons on registry* to pick up the new settings (apparently the service refresh issued by puppet didn't work as intended) - T390251

I've merged T412265: Pushing to the docker registry fails with 500 Internal Server Error into this task, as we believe it's another manifestation of the same class of failure modes discussed here.

One key point of note from the investigation on that task:

In T412265#11473181, we identified a correlation between certain Swift operations (i.e., loading of new ms-be hosts), which could plausibly lead to metadata "churn" in the cluster and thus to more pronounced eventual consistency, and both the March 2025 period of issues originally captured here and the December 2025 period reported in T412265.

This is of course not a cause, but is a plausible trigger for the behavior we've seen. Our understanding of the cause remains the combination of eventual consistency in swift and poor design of the swift backend driver in the docker registry, which we intend to address by moving to S3-on-Ceph (T412951).

I've now also merged T406392, for the same reason.

One key point of note from that task is that buildkit, as used in the Gitlab CI image build / push jobs, does not have the same retry behavior as we've seen with dockerd (see e.g., T406392#11252870).

Meaning, although retries can in some cases paper over eventual-consistency related issues for MediaWiki image pushes, we won't generally see that during push failures on the Gitlab CI side of things.

Mentioned in SAL (#wikimedia-operations) [2026-02-26T09:47:50Z] <elukey> move the Docker Registry's /v2/restricted (MediaWiki Docker image prefix) to s3/apus - T390251

dancy opened https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/236

make-container-image/build-images.py: Remove 5 minute sleep after full image build push

dancy merged https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/236

make-container-image/build-images.py: Remove 5 minute sleep after full image build push

To keep the archives happy: we solved the main painful problem, namely the upload/push of MediaWiki Docker images, in T412951. The long-term strategy is to move away from Swift in favor of S3, but we should be mindful of the work tracked in T413080.

Left a note in T413080#11688938 about a possible strategy.