
docker-registry.wikimedia.org keeps serving bad blobs
Open, High · Public · BUG REPORT

Description

The manifest for docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-03-27-200753-publish-81 references an image layer blob sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59:

...
      {
         "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
         "size": 1119805497,
         "digest": "sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59"
      },
...

The registry returns a blob of the right length but wrong hash:

$ curl -v -n https://docker-registry.wikimedia.org/v2/restricted/mediawiki-multiversion-debug/blobs/sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59 > sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59

...
< content-type: application/octet-stream
< content-length: 1119805497
< docker-content-digest: sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59
< docker-distribution-api-version: registry/2.0
< etag: "sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59"
...
< age: 0
< x-cache: cp1110 pass, cp1110 pass
< x-cache-status: pass
< server-timing: cache;desc="pass", host;desc="cp1110"
...

$ ls -l sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59
-rw-rw-r-- 1 dancy wikidev 1119805497 Mar 28 00:21 sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59

That is the expected size of the blob from the manifest.

$ sha256sum sha256:e7b2287766dc2a93ea9014f37470ba45fe8afcfb095221fd6ed3ed2db19c7c59
da9e1fa86230529d142f8385e1d13f8cbe308bf970a0a25607512de08f47ad29  

But the hash doesn't match.
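The digest check above can also be scripted; a minimal sketch (the helper names are illustrative, not part of any existing tooling):

```python
import hashlib

def blob_digest(data: bytes) -> str:
    """Compute a registry-style content digest for blob bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

def verify_blob_file(path: str, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Stream a downloaded blob from disk and compare against the manifest digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return "sha256:" + h.hexdigest() == expected
```

Run against the file fetched above, verify_blob_file would return False: the size matches the manifest but the digest does not.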

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
build-images.py: Don't sleep after full build in train-dev | repos/releng/release!189 | dancy | main-Ia9831d65ecb031642f8cb47e672a6520eed581ce | main
make-container-image: sleep only after full mediawiki image build | repos/releng/release!165 | swfrench | work/swfrench/T390251-pause-on-full-build | main
build-images.py: Temp sleep for swift consistency | repos/releng/release!164 | cgoubert | T390251 | main

Related Objects

Event Timeline


This evening, I tried to pull together a rough timeline of the issues between 14:00 and 17:00 UTC today in https://phabricator.wikimedia.org/P74821 (note: time goes backwards!). That includes a manual merge of "relevant" log lines sampled from a handful of sources, including:

  • scap run start / end logs
  • scap image build-and-push logs, if available (it won't be if the build was interrupted)
  • deploy1003 dockerd logs
  • registry2004 nginx error logs
  • k8s event logs indicating failed image pull due to digest mismatch for sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37

I've not yet tried to merge the above with docker-registry logs.

I did look in the nginx access logs as well, but there's not much of interest specifically dealing with sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37 - e.g., during the failed deployment where the blob was ostensibly "bad" I just see lots of seemingly successful (206) responses served to the set of k8s workers that appear to be pulling the blob (in 15MiB ranges).

In any case, there's a _lot_ going on in there, but here are some points of note:

  • sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37 is observed to be corrupt on both wikikube-worker1256 and wikikube-worker2025; however, they observe different digests.
  • During the parallel uploads that eventually succeed, but appear to result in serving bad blobs, we do see instances of /var/lib/nginx running out of space. As usual, these correlate with the "Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error" errors from dockerd.
  • During the failing pull, we do see some potentially interesting nginx errors: "upstream prematurely closed connection while reading upstream". This affects one of the k8s workers per DC that are pulling the 15MiB ranges (note: these are not the workers where the pods are scheduled), though, interestingly, there is no corresponding error in the access logs.
  • I'm pretty sure the "Not continuing with push after error: context canceled" dockerd logs we've seen today are just the result of our interrupting/aborting the push.

I need to think on this a bit, but it feels like we need to shift focus to the interactions between the registry and dragonfly. We already know that the registry will - even very shortly after "serving a bad blob" - turn around and hand us a perfectly normal one (and then a subsequent deploy a short time later succeeds).

It's unfortunate that there seem to be a number of different things happening that make it challenging to separate cause and effect - e.g., it would be extremely surprising if transient push failures due to /var/lib/nginx space were somehow the reason for the issues we're seeing, but they can certainly be the result of the same underlying trigger (which at this point is pretty clearly large blobs).

I think as well that Dragonfly may need to be checked.

Affected by digest mismatch:

root@wikikube-worker1256:/var/lib/dragonfly-dfdaemon/logs# grep 52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37 dfdaemon.log 
2025-04-09 16:45:59.271 INFO sign:3302045 : start download url:https://docker-registry.discovery.wmnet/v2/restricted/mediawiki-multiversion/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet to c20c03e4-9d8e-48f8-a112-4d6a2bd33d04 in repo
2025-04-09 16:46:24.934 INFO sign:3302045 : dfget url:https://docker-registry.discovery.wmnet/v2/restricted/mediawiki-multiversion/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet [SUCCESS] cost:25.663s

Not affected:

root@wikikube-worker1257:/var/lib/dragonfly-dfdaemon/logs# grep 52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37 dfdaemon.log
2025-04-09 16:35:43.688 INFO sign:3514205 : start download url:https://docker-registry.discovery.wmnet/v2/restricted/mediawiki-multiversion/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet to bab613d8-3520-4c56-b750-01a31d650bfc in repo
2025-04-09 16:36:38.710 INFO sign:3514205 : dfget url:https://docker-registry.discovery.wmnet/v2/restricted/mediawiki-multiversion/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet [SUCCESS] cost:55.022s

root@wikikube-worker1260:/var/lib/dragonfly-dfdaemon/logs# grep 52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37 dfdaemon.log
2025-04-09 16:44:25.616 INFO sign:1608023 : start download url:https://docker-registry.discovery.wmnet/v2/restricted/mediawiki-multiversion/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet to 7769bf33-a360-4a0e-a5f0-8c4103b339aa in repo
2025-04-09 16:44:54.236 INFO sign:1608023 : dfget url:https://docker-registry.discovery.wmnet/v2/restricted/mediawiki-multiversion/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet [SUCCESS] cost:28.620s

Very interesting that I don't see anything on the eqiad supernode:

elukey@dragonfly-supernode1001:~$ sudo grep 52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37 /var/lib/dragonfly-supernode/logs/app-2025-04-09T21-35-37.480.log 
elukey@dragonfly-supernode1001:~$

But the logs start from 2025-04-09 17:18: that is weird; we may have some incorrect log rotation/retention going on.

Anyway, I checked for bug reports in the Dragonfly repo, and I found something that is very close to what we are seeing: https://github.com/dragonflyoss/dragonfly/issues/784

From what I can see, the repository changed name after 1.0.6, and I can't find that tag anymore. But the fix landed after the 2.0.0 release (https://github.com/dragonflyoss/dragonfly/releases/tag/v2.0.0), so there shouldn't be any way to backport it.

I took a closer look at the registry nginx access logs this morning, in hopes that I could understand a bit better (1) what the dragonfly fetch looks like on that side and (2) whether there are any obvious differences between the pulls of /v2/restricted/mediawiki-multiversion-debug/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37 in the failing (~ 16:06) and succeeding (~ 16:24) cases.

In short, I am puzzled, and probably need to spend more time with the source code for the version of dragonfly we're using.

There are a handful of things that make sense. For example, in both cases, we see dragonfly on 2 workers per DC fetching that path in 15 MiB response size chunks. We also see some number of 1012675 byte responses, which is the tail of the blob (2203022275 % 15728640 = 1012675). We also see some requests from the supernodes, though not fetching the full content (presumably to infer size in some way, though interestingly issuing a GET rather than a HEAD).
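The tail-size arithmetic above checks out; a quick sketch of the chunking math:

```python
# Check the chunking arithmetic from the access-log analysis: a
# 2203022275-byte blob fetched in 15 MiB (15728640-byte) ranges.
BLOB_SIZE = 2203022275
CHUNK = 15 * 1024 * 1024  # 15728640 bytes

full_chunks, tail = divmod(BLOB_SIZE, CHUNK)
assert (full_chunks, tail) == (140, 1012675)  # 140 full ranges + 1012675-byte tail
```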

What's interesting is that the total number of bytes fetched varies between the two cases, neither of which appears to be an integer multiple of the total blob size. Similarly, the number of tail fetches varies. This is probably an artifact of how blob "part" ranges are allocated to peers, but ... I don't know as of now.

The one very clear thing is that only the failing fetch appears to have these upstream prematurely closed connection while reading upstream errors reported in the nginx error logs:

# registry2004
2025/04/09 16:06:14 [error] 134692#134692: *546689 upstream prematurely closed connection while reading upstream, client: 10.192.5.25, server: , request: "GET /v2/restricted/mediawiki-multiversion-debug/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet HTTP/1.1", upstream: "http://127.0.0.1:5000/v2/restricted/mediawiki-multiversion-debug/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet", host: "docker-registry.discovery.wmnet"
2025/04/09 16:06:26 [error] 134691#134691: *546643 upstream prematurely closed connection while reading upstream, client: 10.64.16.69, server: , request: "GET /v2/restricted/mediawiki-multiversion-debug/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet HTTP/1.1", upstream: "http://127.0.0.1:5000/v2/restricted/mediawiki-multiversion-debug/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet", host: "docker-registry.discovery.wmnet"
2025/04/09 16:06:27 [error] 134692#134692: *547082 upstream prematurely closed connection while reading upstream, client: 10.64.16.69, server: , request: "GET /v2/restricted/mediawiki-multiversion-debug/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet HTTP/1.1", upstream: "http://127.0.0.1:5000/v2/restricted/mediawiki-multiversion-debug/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet", host: "docker-registry.discovery.wmnet"

# registry2005
2025/04/09 16:06:27 [error] 765393#765393: *21223 upstream prematurely closed connection while reading upstream, client: 10.64.0.81, server: , request: "GET /v2/restricted/mediawiki-multiversion-debug/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet HTTP/1.1", upstream: "http://127.0.0.1:5000/v2/restricted/mediawiki-multiversion-debug/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet", host: "docker-registry.discovery.wmnet"

... and from the access logs, the response sizes are strange: these are 206 responses like everything else, but each corresponds to a response size that is neither 15728640 nor 1012675 - in most cases the size is 0, and in one case it's 6255555.

That's pretty suspicious!

If we look at the corresponding docker registry logs, there are no errors reported - e.g.,

Apr 09 16:06:14 registry2004 docker-registry[655]: time="2025-04-09T16:06:14.813393097Z" level=info msg="response completed" go.version=go1.19.8 http.request.host=docker-registry.discovery.wmnet http.request.id=7ed47b94-a21f-4a18-ba2a-67fed22c2b74 http.request.method=GET http.request.remoteaddr=127.0.0.1 http.request.uri="/v2/restricted/mediawiki-multiversion-debug/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet" http.request.useragent="containerd/1.6.20~ds1" http.response.contenttype=application/octet-stream http.response.duration=155.104955ms http.response.status=206 http.response.written=6255555
Apr 09 16:06:14 registry2004 docker-registry[655]: 127.0.0.1 - - [09/Apr/2025:16:06:14 +0000] "GET /v2/restricted/mediawiki-multiversion-debug/blobs/sha256:52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37?ns=docker-registry.discovery.wmnet HTTP/1.1" 206 6255555 "" "containerd/1.6.20~ds1"

In any case, when one of these issues is happening, it seems like either (1) the registry is returning unexpected sizes for certain ranges and dragonfly is accepting them or (2) dragonfly is explicitly requesting unexpected ranges. Given the correlation with nginx errors, #1 seems more likely.

Edit: Looking again at the docker-registry swift driver, there are some interesting edge cases for range requests here. For example, if swift returns a StatusRequestedRangeNotSatisfiable (e.g., the offset from which we are reading is not (yet) available), the StorageDriver.Reader implementation seems to swallow the error and return an empty bytes.Reader. I wonder if that could explain the 0-length 206 responses. Also, more generally, the way storage.blobServer.ServeBlob works means that if the underlying Reader (i.e., the one returned by the driver, wrapped by a storage.fileReader) encounters an error (e.g., EOF) while content is being copied to the response writer, and this happens before reaching the Content-Length that has already been sent downstream, nginx would emit the error we see here (see http.ServeContent and its use of io.CopyN).
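To illustrate that failure mode: the sketch below mimics (in Python, purely for illustration) what Go's io.CopyN does when the source reader hits EOF before the promised byte count - the copy aborts mid-body, which on the registry side means a truncated response after the Content-Length header has already been sent:

```python
import io

def copy_n(dst, src, n: int, bufsize: int = 64 * 1024) -> int:
    """Copy exactly n bytes from src to dst; raise on a short read, roughly
    mirroring Go's io.CopyN returning io.EOF before n bytes are copied."""
    copied = 0
    while copied < n:
        chunk = src.read(min(bufsize, n - copied))
        if not chunk:
            # In the registry, this is the point where the response body is
            # already shorter than the advertised Content-Length.
            raise EOFError(f"short read: got {copied} of {n} bytes")
        dst.write(chunk)
        copied += len(chunk)
    return copied
```

When this happens under nginx, the upstream connection closes before the declared body length is reached, matching the "upstream prematurely closed connection while reading upstream" errors.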

Thanks a lot for the ton of good details, Scott! Reviewing your comments made me wonder about something high-level - lemme know what you think about it.

We are currently facing two issues:

  1. When we push new images: whether or not we serialize the pushes in Docker, we end up retrying multiple times before the image is accepted by the Registry.
  2. When we pull new images: wikikube workers randomly get incorrect binary blobs when requesting certain layers, and deployments fail as a consequence.

The most problematic issue seems to be 2). Regardless of where the inconsistency originates, it seems to me that:

  1. This happens only with mediawiki images, more specifically when we run deployments via scap.
  2. The inconsistency eventually disappears, and retrying a deployment multiple times is the only thing that bypasses the problem.

Scott's analysis made me wonder why I haven't seen this issue before with other big images, and something came to mind - I have never built an image and then immediately deployed it via helmfile (like scap does). Usually some time passes between the image being uploaded to the registry and the actual helmfile deploy command (usually because we need to file patches for deployment-charts etc.). So what if:

  1. The docker distribution swift driver is buggy (as Scott pointed out) and, under some conditions, like uploading a lot of new layers at the same time, returns partial content until swift reaches consistency between its nodes (so read-after-write is not immediately consistent). If you look for swift reports in the docker distribution upstream repo, there are a lot of people complaining about it being buggy and inconsistent.
  2. Dragonfly propagates the inconsistent binary, which gets fed to containerd and hence to the new pods being created.
  3. After 5 mins of inactivity, Dragonfly's cache is purged automatically (while we write on IRC and wonder what's happening).

So the chance of a failed deployment basically comes down to swift's eventual consistency, and the time that it takes for the content to be propagated correctly.

Quick/easy ideas to test:

  1. We could add an artificial/configurable wait time in scap between building and pushing, like 5 minutes, to see if the issue keeps happening or not. It would be really annoying for deployers, but even the current experience is not great. If we run it for two weeks and no issue comes up, we'll have a better idea of where the problem lies.
  2. We could patch the docker-distribution swift driver (currently deprecated, and removed in 3.x, released last week) to make its timers configurable (like https://github.com/distribution/distribution/blob/release/2.8/registry/storage/driver/swift/swift.go#L56) and see if they play a role in this mess. Maybe 15s is not enough and we need 30/60, and that's all it takes to improve the status quo.

More invasive and expensive tests:

  1. Upgrade Dragonfly, we have an ancient version and we've already seen reports of bad behavior like https://github.com/dragonflyoss/dragonfly/issues/784 (that fits what we are experiencing now).
  2. Move the Docker Distribution config to the S3 driver. This would mean moving away from ms-swift, since we don't have the S3 API enabled there, and it would be a pretty major change. We could target APUS (ceph + S3), but it is not an easy move. On the plus side, we'd need to do it anyway, but if we replace Docker Distribution with something else in the future we may have wasted time for nothing.

Thanks, Luca!

This morning, I checked the swift-proxy access logs on ms-fe2010 and ms-fe2012, which are the two hosts emitting 416 responses in the 16:06 minute per the swift_proxy_server_object_seconds_count metric.

All three cases of zero-size 206 responses from docker-registry to dragonfly, which correlate with the upstream prematurely closed connection while reading upstream nginx errors, also correlate with 416 responses from swift for [...]/v2/blobs/sha256/52/52c8ffea230bf6fd62801737a3713b339a307d6aa39c8f2a2d69c725ad05ea37/data.

Together, this is consistent with the behavior we would expect from the swift StorageDriver + downstream blobServer implementations. While I've not yet more widely queried for a swift-proxy response that might explain the oddly sized partial registry response (6255555 bytes), I think it's safe to bet that swift simply returned fewer bytes than expected.

Which is to say, at least for this one data point, I think we have a pretty consistent picture coming together: for some amount of time after a large layer blob upload ostensibly completes, swift may not yet be able to serve its full extent. Here, this manifests as reads at should-be-valid offsets occasionally returning a 416, or a shorter response than expected.

These corrupt blob "parts" are retained and propagated by dragonfly, at least until the 5 minute idle timeout expires and the cache is cleared. I've not looked closely at the relevant dragonfly code yet, but it is curious that truncated (vs. the expected Range size) "parts" are accepted. Then again, it does seem like at least the early version of dragonfly we're using may be weak on validation.

At the expense of slowing deployments a bit, the idea of an experiment where we pause between the build/push phase and deployment in order for swift to become consistent seems like a solid one. To narrow the scope of the pain, we could limit this to the case where App.run in build_image_incr.py infers that a full build is necessary, since that seems to be the only case where we've run into this issue so far.

It is curious that we don't consistently see log messages suggesting the swift driver's readAfterWriteTimeout expired, but lengthening it may help nonetheless (e.g., it's plausible that the error is masked by some other error in certain cases, though I would need to look at the upload side of the code to know more).

@dancy what do you think about adding a configurable sleep in the use case that Scott pointed out? We could try with 300s and 600s in my opinion, very painful but probably needed at this point.

One additional point of note that was puzzling me before:

When a blob is committed during the terminating PUT of the upload, I've not seen any cases where verification of the digest fails [0], which is surprising if indeed the full extent of the blob object is not (yet) available to read in swift.

I only just now noticed that by default, the registry binary is built with "resumable" digests enabled. This incrementally hashes the blob chunk contents as they are received during the one or more chunk PATCH requests and persists the hash state to the storage layer. On commit, the client-provided digest is simply compared against the incrementally computed hash, reflecting the data "as received" rather than "at the storage layer."

This seems to be a plausible explanation for why we'd be unable to "read our writes" back in practice, despite having seemingly passed verification during commit.

[0] I have seen cases where verification fails due to an incorrect blob length, though these also tend to happen around when we see issues with nginx tmpfs exhaustion.
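A toy illustration of the "resumable digest" behavior described above (assumed semantics, not the registry's actual code): the commit-time comparison is over bytes as received, so it can pass even while the storage layer can only serve a truncated object:

```python
import hashlib

# Chunks as received by the registry across one or more PATCH requests.
received_chunks = [b"layer-part-1", b"layer-part-2"]

# "Resumable digest": hash state is updated per chunk and persisted, never
# re-read from the storage backend.
h = hashlib.sha256()
for chunk in received_chunks:
    h.update(chunk)
computed_digest = h.hexdigest()

# Commit compares the client-provided digest against this running hash, so
# verification passes against the data "as received"...
assert computed_digest == hashlib.sha256(b"".join(received_chunks)).hexdigest()

# ...even if, due to eventual consistency, a read-back right now would only
# see a truncated object.
stored_view = b"".join(received_chunks)[:10]
assert hashlib.sha256(stored_view).hexdigest() != computed_digest
```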

@dancy what do you think about adding a configurable sleep in the use case that Scott pointed out? We could try with 300s and 600s in my opinion, very painful but probably needed at this point.

With that experiment in place what would we be looking for as an outcome? Not seeing a corrupted pull in N deploys/days?

Assuming the delay hypothesis was confirmed, how would we move forward? Keep the artificial delay for the foreseeable future? Iterate on shorter and shorter delays looking for a local minimum? Take it as a sign that we need to move off of the deprecated swift driver that we already know we need to move off of?

Can we test the delay to reach consensus hypothesis in a controlled experiment rather than as part of the live train and backports system by repeatedly publishing and pulling a large container? Maybe even one known to have experienced corruption in the past?

I'm working on something that will attempt to validate that an image is downloadable. If that works, I'll insert it after the build process so that it imparts the minimum necessary delay to deployments.
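A rough sketch of what such a validation poller might look like (entirely hypothetical - the function name, timeouts, and the injected fetch callback are assumptions, not the actual tool):

```python
import hashlib
import time
from typing import Callable, Iterable

def wait_until_blobs_valid(
    digests: Iterable[str],
    fetch: Callable[[str], bytes],  # e.g. an HTTP GET of /v2/<repo>/blobs/<digest>
    timeout: float = 600.0,
    interval: float = 15.0,
) -> bool:
    """Poll until every blob reads back matching its digest, or time out."""
    deadline = time.monotonic() + timeout
    pending = list(digests)
    while True:
        # Keep only the digests whose fetched content still doesn't verify.
        pending = [
            d for d in pending
            if "sha256:" + hashlib.sha256(fetch(d)).hexdigest() != d
        ]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Deployment would proceed only once this returns True, imparting just enough delay for swift to become consistent.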

+1 to @dancy's idea to poll for a usable / valid image, as a vastly preferable workaround to adding a sleep.

Following up on my footnote in T390251#10741310 about blob invalid length errors: Ahmon was able to trigger this in the absence of any tmpfs issues today, simply by executing the terminating PUT quickly after the upload PATCH. In fact, it sounds like this happened surprisingly frequently during the testing Ahmon was doing.

While we don't know exactly what the swift driver returned in response to the Stat call that did not match the running total of bytes uploaded, it does suggest that we can run into eventual consistency issues related to swift DLOs concerningly frequently.


@MatthewVernon - When you get a chance, has anything changed recently - e.g., around the week of 24th of March - that might have made eventual consistency in swift "more apparent"? In case it matters, all of this involves interactions with swift in codfw.

To summarize, since I realize there's a lot going on in this ticket:

  • The docker registry stores image data in swift using dynamic large objects, since in the general case, image layer blobs can be several GiB. In practice, the blobs we're seeing issues with are ~ 2 GiB.
  • Since the 27th of March, we've been seeing intermittent issues where, within "some small number of minutes" of writing a large object to swift (i.e., pushing an image), an attempt to read back that object will return partial content. Specifically, attempts to read valid ranges of the object may return a partial response or a 416 response outright - in either case, it's as if the full extent of the object is not (yet) available.
  • A bit later, say within 10 minutes or so, the object can be read in its entirety without issue.

In practice, the fact that this manifests as serving invalid image layer blobs is more a reflection of code in components other than swift (i.e., dragonfly and docker registry). However, it would be good to know if you're aware of anything that might have exacerbated this behavior recently, or if you might have any tips to make this less painful (e.g., using DLO with fewer-but-larger segments).

@dancy what do you think about adding a configurable sleep in the use case that Scott pointed out? We could try with 300s and 600s in my opinion, very painful but probably needed at this point.

With that experiment in place what would we be looking for as an outcome? Not seeing a corrupted pull in N deploys/days?

Assuming the delay hypothesis was confirmed, how would we move forward? Keep the artificial delay for the foreseeable future? Iterate on shorter and shorter delays looking for a local minimum? Take it as a sign that we need to move off of the deprecated swift driver that we already know we need to move off of?

Hi @bd808, I chose to read the sequence of questions as genuine interest and not as a passive-aggressive comment, even if it very much looks like one. My idea was simply to test one delay to see if our theory of read-after-write consistency was right. I am aware that waiting during a deployment is not a great experience, but we have been investigating this issue for quite some time (weeks at this stage) and new ideas were needed on the table. I am very glad that Scott and Ahmon proposed a better compromise; it will surely improve the deployment experience. Regarding swift deprecation: we know that at some point we'll have to migrate away, but the process will be long and complex; there is no easy migration. Moreover, we'll probably want to understand whether Docker Distribution is our long-term choice, or if we want to move away from it completely. Finding a compromise for the time being is essential, which is why I proposed testing the delay.

I tried a scap sync-world (just to get an image push) and it did manage to push all images in around 15 minutes, but then failed pulling the image for the testserver-stage releases because of a bad blob.

Warning  Failed     8m52s                   kubelet            Failed to pull image "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-04-15-110755-publish-81": rpc error: code = FailedPrecondition desc = failed to pull and unpack image "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-04-15-110755-publish-81": failed commit on ref "layer-sha256:b555465930e038c1c962a1b7a099443d613974183da74bae039c528be687fda9": unexpected commit digest sha256:4569322f339381e74c22d34d3218f8240122bd5f82bbf0342e5092a498c5b636, expected sha256:b555465930e038c1c962a1b7a099443d613974183da74bae039c528be687fda9: failed precondition

I took it upon myself to add that sleep as we are currently unable to deploy.

@MatthewVernon - When you get a chance, has anything changed recently - e.g., around the week of 24th of March - that might have made eventual consistency in swift "more apparent"? In case it matters, all of this involves interactions with swift in codfw.

To summarize, since I realize there's a lot going on in this ticket:

  • The docker registry stores image data in swift using dynamic large objects, since in the general case, image layer blobs can be several GiB. In practice, the blobs we're seeing issues with are ~ 2 GiB.
  • Since the 27th of March, we've been seeing intermittent issues where, within "some small number of minutes" of writing a large object to swift (i.e., pushing an image), an attempt to read back that object will return partial content. Specifically, attempts to read valid ranges of the object may return a partial response or a 416 response outright - in either case, it's as if the full extent of the object is not (yet) available.
  • A bit later, say within 10 minutes or so, the object can be read in its entirety without issue.

In practice, the fact that this manifests as serving invalid image layer blobs is more a reflection of code in components other than swift (i.e., dragonfly and docker registry). However, it would be good to know if you're aware of anything that might have exacerbated this behavior recently, or if you might have any tips to make this less painful (e.g., using DLO with fewer-but-larger segments).

Nothing changed re ms swift around 24 March, no. The main use of ms-swift doesn't use large objects at all (MW does its own split-upload-and-combine, which is itself sometimes buggy) and some of our tooling around it assumes no static large objects (from which flows the <5G restriction in commons, for instance). So you're using an unusual-for-this-cluster path here (which also means I don't have much of a feel of how/if it could be made more reliable). Swift does only promise eventual consistency, but I can see this is a bore for your use case. Sorry this is not a great deal of help :(

[apropos the deprecation of swift, we are bringing some production users to the apus Ceph cluster (which does S3), but that has async replication between DCs and is quite small, so may not be helpful (or at least not helpful yet) depending on your capacity & performance needs]

[apropos the deprecation of swift, we are bringing some production users to the apus Ceph cluster (which does S3), but that has async replication between DCs and is quite small, so may not be helpful (or at least not helpful yet) depending on your capacity & performance needs]

To keep the archives happy - I had a chat with Matthew on IRC to make sure that the registry use case could be onboarded onto apus during the next fiscal year. On paper the answer seems to be "yes", but we'll need to test it of course. I think it could be a good idea to spin up a docker registry VM, configure it for apus, and then start testing it properly (in a separate task though).

Hi @bd808, I chose to read the sequence of questions as genuine interest and not as a passive-aggressive comment, even if it very much looks like one.

Thank you for that assumption of good faith. I certainly did not intend them as angry drive-bys. I apparently have yet again failed in humaning as desired by other humans.

At this point we have two problems.

  • Large image pushes are now unreliable (this seems new for mediawiki deployments). No workaround proposed yet.
  • Large image pulls are unreliable. @Clement_Goubert has added a 5 minute delay after building and pushing images to work around this.

At this point we have two problems.

  • Large image pushes are now unreliable (this seems new for mediawiki deployments). No workaround proposed yet.

I've been trying to keep up on this issue but there's so much going on.

Are the large images in their entirety (manifest + blobs) the problem, or is it specifically a large image layer (blob) that surfaces the issue in swift? Would reducing the max size of any given blob avoid the issue? If so, perhaps we can refactor the image build to sync new files into the image in chunks and keep the blob size down.

At this point we have two problems.

  • Large image pushes are now unreliable (this seems new for mediawiki deployments). No workaround proposed yet.

I've been trying to keep up on this issue but there's so much going on.

Are the entire large images (manifest + blobs) the problem, or is it specifically a large image layer (blob) that surfaces the issue in swift?

I believe it's the total quantity of data needing to be replicated (which means higher likelihood of seeing out-of-date or not-yet-replicated data).

Would reducing the max size of any given blob avoid the issue? If so, perhaps we can refactor the image build to sync new files into the image in chunks and keep the blob size down.

I suspect that this would change the shape of the problem (i.e., instead of having one 1GB object to replicate, there might be four 256MB objects to replicate) but leave us with the same problem (after writing X to Swift, a subsequent read of X may read from a replica which doesn't have X yet).

In the case of uploads, here is the bad sequence:

  • The client (e.g. dockerd) issues POST /v2/<repo>/blobs/uploads/ to initiate an upload. This returns a new URL for subsequent operations (hereafter called the upload URL).
  • The client issues a PATCH to the upload URL to transmit the data.
  • The client issues a PUT to the upload URL to finalize the upload. This is where a 404 is sometimes returned by the registry (basically saying that it doesn't know about this upload). A 404 is more likely to be seen if a prior upload was large (i.e., if the replicator is busy). Retrying this PUT does eventually succeed.

In this case it's not the content of the upload that hasn't made it to the replica, but the existence of the upload state itself.
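A minimal sketch of the client-side workaround this implies: keep retrying the finalizing PUT while the upload state replicates. The `finalize_upload` helper and its parameters are hypothetical names for illustration, not scap or dockerd internals:

```python
import time

def finalize_upload(put_fn, max_attempts=5, backoff=0.5):
    """Retry the finalizing PUT while the registry returns 404.

    put_fn() performs the PUT and returns an HTTP status code; a 404
    here means the upload state has not replicated yet, so we wait a
    bit (linear backoff) and try again.
    """
    for attempt in range(1, max_attempts + 1):
        status = put_fn()
        if status != 404:
            return status
        if attempt < max_attempts:
            time.sleep(backoff * attempt)
    return 404  # gave up; upload state never became visible
```

With a registry that eventually converges, the first non-404 status (typically 201 Created) ends the loop.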

In the case of uploads, here is the bad sequence:

  • The client (e.g. dockerd) issues POST /v2/<repo>/blobs/uploads/ to initiate an upload. This returns a new URL for subsequent operations (hereafter called the upload URL).
  • The client issues a PATCH to the upload URL to transmit the data.
  • The client issues a PUT to the upload URL to finalize the upload. This is where a 404 is sometimes returned by the registry (basically saying that it doesn't know about this upload). A 404 is more likely to be seen if a prior upload was large (i.e., if the replicator is busy). Retrying this PUT does eventually succeed.

In this case it's not the content of the upload that hasn't made it to the replica, but the existence of the upload state itself.

Ah, this makes a lot more sense to me now and jives with my understanding of upload chunking. Thank you!

So it sounds like this is not about image size or image layer size, but about the storage consistency of all chunks in a given blob upload (a given UUID) being in a race with the upload process itself. Is that accurate?

Also, I wonder if there's a way we can force monolithic uploads?

Also, I wonder if there's a way we can force monolithic uploads?

It seems that others have had the same idea to solve this issue, and the comments around it are a bit disheartening.

https://github.com/distribution/distribution/issues/2188#issuecomment-449014392

Change #1133389 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] services: enable ingress for Kartotherian

https://gerrit.wikimedia.org/r/1133389

A couple of ideas:

  1. Another thing to keep in mind: this is the first time that we have observed this issue. We have other Docker images carrying big layers (like the ML images), but I have no memory of errors while uploading an image (with docker-pkg or blubber), so this seems to be very specific to the MW images. Moreover, it seems to me that something changed recently in those images (they only recently started causing trouble), crossing a line that triggered issues on the Swift side. I already brought up this idea, but the only big difference that I can see is that the MW images have tens of layers, while we usually have far fewer. Maybe we have too many layers at the moment, too many for the Docker Registry's Swift driver? Even if this is not the culprit, I think that we should concentrate on what is different from the other use cases (doing so already helped solve part of this issue, the pulling phase).
  2. Keep in mind that nginx sits in front of the Docker Registry, and it runs with a 4GB tmpfs to hold uploads. We have seen errors reported by nginx while debugging this issue, all related to "no space left on tmpfs". It is not the only problem, but it surely exacerbates things. We tried to "serialize" the pushes via Docker settings before, but it didn't work. Should we try to do something at the Scap level as a test?
  3. Last but not least: we are running a very old Docker version on deploy1003, the one from Bullseye. We may gain some improvement simply by moving to Bookworm's version, but I am not sure what the blockers are to moving deploy1003 to Bookworm.

A couple of notes as well:

  • scap is the only use case where multiple images are built and pushed together in short timeframes and then used immediately afterwards. If this is indeed a race condition as we hypothesize, I don't think the other images would trigger it, if only because there is an update to the deployment-charts git repo involved that probably gives enough time for everything to coalesce. Furthermore, there aren't multiple images pushed together.
  • max-concurrent-uploads, which was tested and reverted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135460 (it's not clear how it made things worse, and it was reverted), serializes only layers per push. Multiple pushes can still happen simultaneously, and IIUC scap does do that. Maybe we could have a flag in scap, e.g. --image-push-concurrency=1, to at least verify/rule out this hypothesis? It would slow down deployment for a couple of weeks but would give us a strong signal to guide us toward resolving this.
  • It's not clear to me if there is a correlation between the "no space left on tmpfs" errors and the problems we are seeing. If it happens right before we witness one of these, it could be contributing to sending a corrupted layer. But in that case, I am totally unclear as to why everything would coalesce some time later.

Last but not least: we are running a very old Docker version on deploy1003, the one from Bullseye. We may gain some improvement simply by moving to Bookworm's version, but I am not sure what the blockers are to moving deploy1003 to Bookworm.

It's 20.10.24+dfsg1-1+deb12u1 vs 20.10.5+dfsg1-1+deb11u4, so at least judging by version numbers there shouldn't be any huge change. That being said, I did scan through https://docs.docker.com/engine/release-notes/20.10/#20105 and I see only the following interesting one.

  • Add retry on image push 5xx errors. moby/buildkit#2043.

But that's about it. Everything else seems unrelated to the pushing of images. I'll see if it is already included in the +deb12u1 Debian patches. Trixie btw has 26.1.5+dfsg1-9+b2, but that's not yet released. We could go for it, but I'd like something more concrete before we open that can of worms, since I've already grepped (NOT read, not even skimmed) the release notes for versions 23 to 26.1, and the words "builder" and "push" yielded very few results, none of them promising.

Now, I am mildly spitballing here, but we could spin up a secondary registry process with a Ceph+S3 backend and use it just for the /restricted part of the registry. This would move us away from the Swift driver and the possible eventual-consistency issues we are seeing. It would move us into possibly similar problems with the S3 driver and Ceph issues, of course. IIUC, all we need is a different port for a second instance of the daemon and a different configuration file, and we might be able to test. Does that sound plausible/sensible?
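For reference, pointing a second registry instance at an S3-compatible backend is mostly configuration; a rough sketch of what that second instance's config could look like (the endpoint, bucket, and port below are made-up placeholders for illustration, not the real apus values):

```yaml
version: 0.1
storage:
  s3:
    # hypothetical endpoint/bucket, for illustration only
    regionendpoint: https://apus.example.wmnet
    bucket: docker-registry-restricted
    region: default
    secure: true
http:
  addr: :5001   # second daemon instance on a different port
```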

swfrench opened https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/165

make-container-image: sleep only after full mediawiki image build

swfrench merged https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/165

make-container-image: sleep only after full mediawiki image build

Mentioned in SAL (#wikimedia-operations) [2025-04-16T19:30:25Z] <swfrench@deploy1003> Started scap sync-world: Test stop-before-sync scap run to pick up make-container-image changes - T390251

IIRC it caused more nginx 500 errors due to tmpfs exhaustion; we decided to roll back since it was making things noticeably worse.

serializes only layers per push. Multiple pushes can still happen simultaneously, and IIUC scap does do that. Maybe we could have a flag in scap, e.g. --image-push-concurrency=1, to at least verify/rule out this hypothesis? It would slow down deployment for a couple of weeks but would give us a strong signal to guide us toward resolving this.

+1 I like this, @dancy @dduvall what do you think about it?

Last but not least: we are running a very old Docker version on deploy1003, the one from Bullseye. We may gain some improvement simply by moving to Bookworm's version, but I am not sure what the blockers are to moving deploy1003 to Bookworm.

It's 20.10.24+dfsg1-1+deb12u1 vs 20.10.5+dfsg1-1+deb11u4, so at least judging by version numbers there shouldn't be any huge change. That being said, I did scan through https://docs.docker.com/engine/release-notes/20.10/#20105 and I see only the following interesting one.

  • Add retry on image push 5xx errors. moby/buildkit#2043.

But that's about it. Everything else seems unrelated to the pushing of images. I'll see if it is already included in the +deb12u1 Debian patches. Trixie btw has 26.1.5+dfsg1-9+b2, but that's not yet released. We could go for it, but I'd like something more concrete before we open that can of worms, since I've already grepped (NOT read, not even skimmed) the release notes for versions 23 to 26.1, and the words "builder" and "push" yielded very few results, none of them promising.

Thanks for the research; indeed it doesn't seem useful, but I'd try to upgrade anyway. It has happened to me in the past that useful commits were not listed in the main changelogs, so there may be something that improves the current situation even if it isn't listed, and it would be worth it in my opinion.

Now, I am mildly spitballing here, but we could spin up a secondary registry process with a Ceph+S3 backend and use it just for the /restricted part of the registry. This would move us away from the Swift driver and the possible eventual-consistency issues we are seeing. It would move us into possibly similar problems with the S3 driver and Ceph issues, of course. IIUC, all we need is a different port for a second instance of the daemon and a different configuration file, and we might be able to test. Does that sound plausible/sensible?

I thought about the same thing, and it would be good to also get experience with S3 (which we'll have to use for whatever Docker Registry solution we choose). It is a sizeable amount of work though, so it will probably not happen soon. Let's find volunteers first :D (and/or establish a group of people willing to work on this across teams).

Recap: after Clement's and Scott's patches it seems that we have temporarily fixed the consistency issue while pulling, which in my opinion was the bulk of the pain. While we think about improving the sleep time, it would be really nice if scap implemented its own ad-hoc push serialization logic, to verify that the second issue is indeed related to pushing too many Docker layers (of various sizes, etc.) concurrently to the registry.

serializes only layers per push. Multiple pushes can still happen simultaneously, and IIUC scap does do that. Maybe we could have a flag in scap, e.g. --image-push-concurrency=1, to at least verify/rule out this hypothesis? It would slow down deployment for a couple of weeks but would give us a strong signal to guide us toward resolving this.

+1 I like this, @dancy @dduvall what do you think about it?

As a short-term mitigation, it seems reasonable, and serializing the image pushes in build-images.py should be straightforward.

Last but not least: we are running a very old Docker version on deploy1003, the one from Bullseye. We may gain some improvement simply by moving to Bookworm's version, but I am not sure what the blockers are to moving deploy1003 to Bookworm.

Yes, we really need to upgrade Docker. I would also like to get docker-buildx-plugin installed so we can make use of BuildKit-only features in our build.

Related to this, there are 2G of completely redundant layers in our PHP 7.4/8.1 "flavored" images due to the independent rsyncs of the same /srv/mediawiki-staging tree for each "flavor", and I'm wondering if eliminating this redundancy would help mitigate the registry storage issue.

Interestingly, the tarball entries as well as the extracted contents of the two layers are identical. However, after writing a small Go program to compare the metadata of the entries in each layer tarball, I found that it's the mtime of /workdir (the bind mount target of /srv/mediawiki-staging during the rsync) and the mtime of /srv that differ.

 $  tardiff 2025-04-22-001437-publish/c8ff0884fccfb168ee9674a3ef5f1d3681dfc30c80b7d5485aae30c714c552d1/layer.tar 2025-04-22-001437-publish-81/722c607ee26915ac2b6f16b763eb2255002698e8fe1c31d76553b55477c840c2/layer.tar
--- 2025-04-22-001437-publish/c8ff0884fccfb168ee9674a3ef5f1d3681dfc30c80b7d5485aae30c714c552d1/layer.tar:srv/
+++ 2025-04-22-001437-publish-81/722c607ee26915ac2b6f16b763eb2255002698e8fe1c31d76553b55477c840c2/layer.tar:srv/
@@ -8,7 +8,7 @@
 Gid: (int) 0,
 Uname: (string) (len=4) "root",
 Gname: (string) (len=4) "root",
-ModTime: (time.Time) 2025-04-21 06:09:20 -0700 PDT,
+ModTime: (time.Time) 2025-04-21 06:09:18 -0700 PDT,
 AccessTime: (time.Time) 0001-01-01 00:00:00 +0000 UTC,
 ChangeTime: (time.Time) 0001-01-01 00:00:00 +0000 UTC,
 Devmajor: (int64) 0,
--- 2025-04-22-001437-publish/c8ff0884fccfb168ee9674a3ef5f1d3681dfc30c80b7d5485aae30c714c552d1/layer.tar:workdir/
+++ 2025-04-22-001437-publish-81/722c607ee26915ac2b6f16b763eb2255002698e8fe1c31d76553b55477c840c2/layer.tar:workdir/
@@ -8,7 +8,7 @@
 Gid: (int) 0,
 Uname: (string) (len=4) "root",
 Gname: (string) (len=4) "root",
-ModTime: (time.Time) 2025-04-21 06:09:29 -0700 PDT,
+ModTime: (time.Time) 2025-04-21 06:09:22 -0700 PDT,
 AccessTime: (time.Time) 0001-01-01 00:00:00 +0000 UTC,
 ChangeTime: (time.Time) 0001-01-01 00:00:00 +0000 UTC,
 Devmajor: (int64) 0,

If it weren't for these two mtime differences, the MW code layers would be identical, and I'm pretty sure we would save 2G of blob uploads (and downloads for each node). (What's a bit funny is that train-dev produces images without these differences in the MW code layers. I suspect this is because the rsync processes happen to finish within the same second, so obviously this is not a reliable way to get identical layers.)
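Assuming the goal is byte-identical layers regardless of when rsync finishes, the mtime normalization could also be done post hoc on the layer tarballs; a sketch (normalize_layer is a hypothetical helper, not part of build-images.py):

```python
import io
import tarfile

def normalize_layer(tar_bytes, mtime=0):
    """Rewrite an uncompressed layer tarball with every entry's mtime
    forced to a fixed value, so byte-identical contents always yield
    byte-identical (and thus identically-hashed) layers."""
    src = tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:")
    out = io.BytesIO()
    dst = tarfile.open(fileobj=out, mode="w", format=tarfile.GNU_FORMAT)
    for member in src.getmembers():
        member.mtime = mtime  # the only field observed to differ
        data = src.extractfile(member) if member.isfile() else None
        dst.addfile(member, data)
    dst.close()
    return out.getvalue()
```

Two tarballs that differ only in entry mtimes (like the /srv and /workdir entries above) come out identical after normalization.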

IMO the best way to refactor build-images.py for layer efficiency (and possibly to mitigate this issue by reducing total image size) would be:

  1. Build a single "code only" image using the incremental rsync-based process.
  2. Push the code image immediately to allow the registry to start persisting it.
  3. After the code image is pushed, start to build the webserver, web app, cli, and debug images in a single BuildKit context and merge in the code image layers using COPY --link --from=mediawiki-code / / (a BuildKit-only feature that achieves deterministic reuse of existing layers).
  4. Push images (serially, still, if pushing the code image first isn't enough).

I've been working on this refactor a bit already, I'm hoping to have something to push up for evaluation/review this week. But again, it would be blocked on a newer Docker (with the default BuildKit builder) so whatever we can do to get that upgraded would be great.

I will also see about getting the serial image push change done tomorrow.

serializes only layers per push. Multiple pushes can still happen simultaneously, and IIUC scap does do that. Maybe we could have a flag in scap, e.g. --image-push-concurrency=1, to at least verify/rule out this hypothesis? It would slow down deployment for a couple of weeks but would give us a strong signal to guide us toward resolving this.

+1 I like this, @dancy @dduvall what do you think about it?

As a short-term mitigation, it seems reasonable, and serializing the image pushes in build-images.py should be straightforward.

In our team meeting yesterday, @dancy shared a fix for the swift storage driver! I'll let him update you all on the details, and I'll hold off on the other workarounds for now.

IMO the best way to refactor build-images.py for layer efficiency (and possibly to mitigate this issue by reducing total image size) would be:

  1. Build a single "code only" image using the incremental rsync-based process.
  2. Push the code image immediately to allow the registry to start persisting it.
  3. After the code image is pushed, start to build the webserver, web app, cli, and debug images in a single BuildKit context and merge in the code image layers using COPY --link --from=mediawiki-code / / (a BuildKit-only feature that achieves deterministic reuse of existing layers).
  4. Push images (serially, still, if pushing the code image first isn't enough).

I'm moving ahead with this idea in T392526: Refactor `build-images.py` to use a common code image and `docker buildx` for other reasons that I explain in the task, but no longer as a mitigation for this issue.

Hi all. I have prepared a sequence of 3 commits against the v2.8.3 tag of https://github.com/distribution/distribution/. They are located in the swift-hacks branch of my fork of that repo at https://github.com/dancysoft/distribution/commits/swift-hacks/. These changes address two problems:

  • Bad data delivered by the registry:

This is the most important problem. My changes address it by adding headers to blobs so that the registry can reliably determine whether it is looking at a consistent blob when reading. The registry will now never deliver data from an inconsistent blob.

  • Unreliable uploads:

It is possible that an upload created at one moment may not be seen a moment later (Swift does not offer read-after-write consistency). We compensate for this by passing the X-Newest header on non-blob object retrievals from Swift. This ensures that the most up-to-date available information is used, reducing the likelihood of read-after-write inconsistencies. It comes at the cost of increased load on replicas for these types of accesses, which will be something to keep an eye on.

I recommend that we run a custom build of the registry with these changes to resolve the problems in this ticket for the time being. The ConsistencyTimeout setting should be set to a value that allows plenty of time for replication to complete.
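The pulling-side failure in this task (a blob whose bytes don't match its docker-content-digest) is exactly what content-addressable verification catches. A generic sketch of verifying a blob stream against its expected digest (this is not Ahmon's actual patch, which works with Swift object headers; verify_blob_stream is a hypothetical helper):

```python
import hashlib

def verify_blob_stream(chunks, expected_digest):
    """Yield blob chunks while hashing them; raise at the end if the
    content does not match the digest the manifest promised
    (format "sha256:<hex>"), instead of silently serving bad bytes."""
    algo, _, want = expected_digest.partition(":")
    h = hashlib.new(algo)
    for chunk in chunks:
        h.update(chunk)
        yield chunk
    if h.hexdigest() != want:
        raise IOError(
            "blob digest mismatch: got %s:%s, want %s"
            % (algo, h.hexdigest(), expected_digest)
        )
```

Note the caveat of any streaming check: the mismatch is only detected after the bytes have been sent, so the server-side fix (refusing to serve a blob known to be inconsistent) is strictly better.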

Btw the changes mentioned above have been reviewed by @dduvall and @Scott_French .

I quickly checked the patches and the refactoring work is really nice, and I like the two new options. Really good job :)

I am a bit on the fence about how to deploy this in production, since we may encounter unexpected bugs in other use cases (like non-scap image push/pull, etc.). Upstream will not review the patches since the swift backend is decommissioned, so we'll have to maintain our own fork for the time being. To safely test the new changes, it would be easy enough to spin up a new registry stack (VM + Redis instance + Swift bucket) and point scap and other tools to it incrementally, to quickly test whether everything is good or not. But on the other hand, we are already planning to move away from Swift in favor of S3, so if we have to create a new stack I'd start with S3 directly.

The alternative is to build the new deb, deploy it in prod, and check how it goes over the next few days, something that I am not totally fond of, but it may be necessary (I am aware that Ahmon has probably tested multiple cases locally already, but the risk of unexpected side effects is not zero).

@akosiaris @Scott_French - lemme know if you have a preference, but this is a sign to me that we should start working on a medium-term solution for the registry (even before talking about switching to something else). Maybe simply moving to apus/S3 could be enough?

I quickly checked the patches and the refactoring work is really nice, and I like the two new options. Really good job :)

I am a bit on the fence about how to deploy this in production, since we may encounter unexpected bugs in other use cases (like non-scap image push/pull, etc.). Upstream will not review the patches since the swift backend is decommissioned, so we'll have to maintain our own fork for the time being. To safely test the new changes, it would be easy enough to spin up a new registry stack (VM + Redis instance + Swift bucket) and point scap and other tools to it incrementally, to quickly test whether everything is good or not. But on the other hand, we are already planning to move away from Swift in favor of S3, so if we have to create a new stack I'd start with S3 directly.

The alternative is to build the new deb, deploy it in prod, and check how it goes over the next few days, something that I am not totally fond of, but it may be necessary (I am aware that Ahmon has probably tested multiple cases locally already, but the risk of unexpected side effects is not zero).

@akosiaris @Scott_French - lemme know if you have a preference, but this is a sign to me that we should start working on a medium-term solution for the registry (even before talking about switching to something else). Maybe simply moving to apus/S3 could be enough?

We've discussed this yesterday in the serviceops meeting. The two paths aren't particularly different, at least in the amount of work required to implement them. Whether we configure the /restricted namespace to go to an instance of the registry patched with @dancy's patches, or to a second instance of the same unpatched software using S3 as a backend, the configuration and testing work needed is the same. What differs is:

  • the amount of work making sure APUs is up to the task for the S3 solution
  • the amount of work needed to build and package the software.

The latter is admittedly less; however, it is also the one where we'd just be putting in a stopgap while still having to work on the former to put us on a sustainable path. So we think we should go straight to the apus/S3 solution, gambling that it will pay off more or less immediately, and leaving the other approach as a hedge in case it doesn't.

Opened T394476 to see if apus can take over the current load that we have on Swift. After the sign-off we'll be able to reason about concrete next steps.

Change #1154301 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] registry: Minor Puppet cleanups

https://gerrit.wikimedia.org/r/1154301

Change #1154302 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] docker_registry_ha: Refactor to make it docker_registry

https://gerrit.wikimedia.org/r/1154302

Change #1154301 merged by Alexandros Kosiaris:

[operations/puppet@production] registry: Minor Puppet cleanups

https://gerrit.wikimedia.org/r/1154301

Change #1155257 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] docker_registry: Move rsyslog rules from init to web.pp

https://gerrit.wikimedia.org/r/1155257

Change #1155258 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] docker_registry: Refactor to allow >1 instance

https://gerrit.wikimedia.org/r/1155258

Change #1155601 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] Rename docker_registry_ha's occurrences to docker_registry

https://gerrit.wikimedia.org/r/1155601

Change #1156761 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[labs/private@master] registry: Add hiera for the new hierarchy

https://gerrit.wikimedia.org/r/1156761

Change #1156762 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[labs/private@master] Remove old docker_registry_ha hiera keys

https://gerrit.wikimedia.org/r/1156762

Change #1156761 merged by Alexandros Kosiaris:

[labs/private@master] registry: Add hiera for the new hierarchy

https://gerrit.wikimedia.org/r/1156761

Change #1156767 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] pontoon: Add stack registry

https://gerrit.wikimedia.org/r/1156767

Change #1156767 merged by Filippo Giunchedi:

[operations/puppet@production] pontoon: Add stack registry

https://gerrit.wikimedia.org/r/1156767

Change #1154302 merged by Alexandros Kosiaris:

[operations/puppet@production] docker_registry_ha: Refactor to make it docker_registry

https://gerrit.wikimedia.org/r/1154302

Change #1155257 merged by Alexandros Kosiaris:

[operations/puppet@production] docker_registry: Move rsyslog rules from init to web.pp

https://gerrit.wikimedia.org/r/1155257

Change #1155258 merged by Alexandros Kosiaris:

[operations/puppet@production] docker_registry: Refactor to allow >1 instance

https://gerrit.wikimedia.org/r/1155258

Mentioned in SAL (#wikimedia-operations) [2025-06-13T11:41:47Z] <akosiaris> T390251 re-enable puppet on registry1004 after merging puppet refactoring changes.

Change #1156809 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] docker_registry: Make sure ports are in the right format

https://gerrit.wikimedia.org/r/1156809

Change #1156809 merged by Alexandros Kosiaris:

[operations/puppet@production] docker_registry: Make sure ports are in the right format

https://gerrit.wikimedia.org/r/1156809

Mentioned in SAL (#wikimedia-operations) [2025-06-13T12:21:19Z] <akosiaris> T390251 re-enable puppet on all registries.

Change #1156829 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] docker_registry: Instantiate APUs s3 backend instance

https://gerrit.wikimedia.org/r/1156829

Change #1156835 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] docker_registry: Pass defaults to 2 option parameters

https://gerrit.wikimedia.org/r/1156835

Change #1156835 merged by Alexandros Kosiaris:

[operations/puppet@production] docker_registry: Pass defaults to 2 option parameters

https://gerrit.wikimedia.org/r/1156835

Change #1156829 merged by Alexandros Kosiaris:

[operations/puppet@production] docker_registry: Instantiate APUs s3 backend instance

https://gerrit.wikimedia.org/r/1156829

Change #1159291 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Update Cumin alias for Docker registry

https://gerrit.wikimedia.org/r/1159291

Change #1156762 merged by Alexandros Kosiaris:

[labs/private@master] Remove old docker_registry_ha hiera keys

https://gerrit.wikimedia.org/r/1156762

Change #1159291 merged by Alexandros Kosiaris:

[operations/puppet@production] Update Cumin alias for Docker registry

https://gerrit.wikimedia.org/r/1159291

Change #1155601 abandoned by Alexandros Kosiaris:

[labs/private@master] Rename docker_registry_ha's occurrences to docker_registry

Reason:

I think https://gerrit.wikimedia.org/r/c/labs/private/+/1156761/ and https://gerrit.wikimedia.org/r/c/labs/private/+/1156762 cover this as well, so I'll abandon, but feel free to restore if I missed something.

https://gerrit.wikimedia.org/r/1155601

bd808 changed the subtype of this task from "Task" to "Bug Report".Aug 12 2025, 11:03 PM

@akosiaris looks like you had a lot of changes for making a new registry, is that registry ready? Or is it still in the testing phase?

@akosiaris looks like you had a lot of changes for making a new registry, is that registry ready? Or is it still in the testing phase?

The latter. The backend is set up and we've had a couple of successful pushes, but also some weird behavior that we want to reproduce and investigate. The plan is to switch just mediawiki's /restricted part to the new registry once we are confident.

The /var/lib/nginx path is a separate mount point using tmpfs, with a size of 4G.

Apparently this is too small for known workloads, as today there was an issue pushing to the registry when this ran out of space.

After the push succeeded at a later point, usage was back to 0. (observed on registry2004)

16:45 < cdanis> 2025/09/16 15:56:55 [crit] 1731183#1731183: *1091214 pwrite() "/var/lib/nginx/body/0000003080" failed (28: No space left on device), client: 10.64.16.93, server: , request:  "PATCH /v2/restricted/mediawiki-multiversion/blobs/uploads....

My understanding from the latest updates is that we are not actively pushing anything to the new registry with apus as a backend; we just tested it quickly some months ago. Is that the right understanding?

After T406392 I think that we have really exhausted the amount of engineering hours that is tolerable for debugging an issue, and we should really focus on forming a working group to move us away from Swift. It is not an easy or quick task, so we may want to reconsider the option of patching the current registry's swift driver as an interim solution (after some testing, of course).

Lemme know your thoughts! Happy to coordinate the efforts for a WG if needed.

per radosgw-admin user stats --uid=docker-registry, there are only 11 objects in that account, which I think equates to it not being currently used.

T406392 is a good reminder of the fact that the bandaids we might otherwise fall back on (e.g., sleeps, internal retries) are not available in all contexts, so even though we've largely focused on the MediaWiki image use case here, we really need a more systematic solution.

So yes, +1 to prioritizing either / both of migrating away from Swift (clearly a longer-term effort) and reviving Ahmon's driver improvements (also non-trivial to test / deploy, but still faster), though this will likely need to wait until the new year for any significant progress.

@akosiaris do you think that the idea of forming a dedicated working group for the next couple of quarters could be feasible? I can take care of kicking it off and finding volunteers (sounds like me and Scott are already in :D).