
Pushing to the docker registry fails with 500 Internal Server Error
Open, High, Public

Description

I've attempted several scap deployments to MediaWiki today, specifically deploying r1216867 and r1217235 twice (SAL 1, SAL 2). Those deployments were done via Spiderpig (jobs #1065 and #1066), and they failed with this error:

17:47:12 [mediawiki-publish-83] Running sudo /usr/local/bin/docker-pusher -q docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-12-10-174701-publish-83
18:38:16 [mediawiki-publish-83] received unexpected HTTP status: 500 Internal Server Error
18:38:16 [mediawiki-publish-83] Traceback (most recent call last):
  File "/srv/mwbuilder/release/make-container-image/app.py", line 145, in join
    future.result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 433, in result
    return self.__get_result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/srv/mwbuilder/release/make-container-image/build-images.py", line 548, in build
    mw_mv_image, mw_mv_debug_image, mw_mv_cli_image = app_instance.build_mediawiki_images(
  File "/srv/mwbuilder/release/make-container-image/build-images.py", line 260, in build_mediawiki_images
    image = build_image_incr.App(
  File "/srv/mwbuilder/release/make-container-image/build_image_incr.py", line 171, in run
    self.push_image(report["image"])
  File "/srv/mwbuilder/release/make-container-image/app.py", line 91, in push_image
    self.check_call(["sudo", "/usr/local/bin/docker-pusher", "-q", image_ref])
  File "/srv/mwbuilder/release/make-container-image/app.py", line 72, in check_call
    return subprocess.check_call(
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '/usr/local/bin/docker-pusher', '-q', 'docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-12-10-174701-publish-83']' returned non-zero exit status 1.

Traceback (most recent call last):
  File "/srv/mwbuilder/release/make-container-image/build-images.py", line 619, in <module>
    main()
  File "/srv/mwbuilder/release/make-container-image/build-images.py", line 603, in main
    threaded_app.join()
  File "/srv/mwbuilder/release/make-container-image/app.py", line 151, in join
    raise ThreadedAppJoinException("One or more tasks raised an exception")
app.ThreadedAppJoinException: One or more tasks raised an exception

18:38:16 Finished build-and-push-container-images (duration: 51m 15s)
18:38:16 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.4,1.46.0-wmf.5,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.229.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.229.0) (duration: 51m 32s)
18:38:16 backport failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=urbanecm', 'Backport for [[gerrit:1216867|Confirmation email: further styling adjustments (T411526)]], [[gerrit:1217235|i18n: replace <> to avoid false positive export errors]]']' returned non-zero exit status 1. (scap version: 4.229.0) (duration: 51m 57s)

I pulled a revert of both patches to the deployment server and ran a scap sync-world manually. That seems to be progressing correctly. I'm not sure whether this was caused by the patches I was deploying or by something else, but for the moment it makes me distrust the deployment infrastructure.

Event Timeline

Urbanecm_WMF triaged this task as Unbreak Now! priority. Dec 10 2025, 8:06 PM

Nothing immediately obvious in the registry logs:

Dec 10 18:38:16 registry2005 docker-registry[608]: time="2025-12-10T18:38:16.115983796Z" level=error msg="error canceling upload after error: already committed" go.version=go1.19.8 http.request.host=docker-registry.discovery.wmnet http.request.id=52d055ca-f623-4185-9e0e-d69a2e83f977 http.request.method=PUT http.request.remoteaddr=10.192.32.7 http.request.uri="/v2/restricted/mediawiki-multiversion/blobs/uploads/b01bff22-794f-4341-bbd9-41d53187ffd1?_state=[redacted]&digest=[redacted]" http.request.useragent="docker/20.10.5+dfsg1 go/go1.15.15 git-commit/363e9a8 kernel/5.10.0-36-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.5+dfsg1 \\(linux\\))" vars.name=restricted/mediawiki-multiversion vars.uuid=b01bff22-794f-4341-bbd9-41d53187ffd1
Dec 10 18:38:16 registry2005 docker-registry[608]: time="2025-12-10T18:38:16.145477553Z" level=error msg="response completed with error" err.code=unknown err.detail="timeout expired while waiting for segments of /docker/registry/v2/repositories/restricted/mediawiki-multiversion/_uploads/b01bff22-794f-4341-bbd9-41d53187ffd1/data to show up" err.message="unknown error" go.version=go1.19.8 http.request.host=docker-registry.discovery.wmnet http.request.id=52d055ca-f623-4185-9e0e-d69a2e83f977 http.request.method=PUT http.request.remoteaddr=10.192.32.7 http.request.uri="/v2/restricted/mediawiki-multiversion/blobs/uploads/b01bff22-794f-4341-bbd9-41d53187ffd1?_state=[redacted]&digest=[redacted]" http.request.useragent="docker/20.10.5+dfsg1 go/go1.15.15 git-commit/363e9a8 kernel/5.10.0-36-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.5+dfsg1 \\(linux\\))" http.response.contenttype="application/json; charset=utf-8" http.response.duration=13.175337905s http.response.status=500 http.response.written=70 vars.name=restricted/mediawiki-multiversion vars.uuid=b01bff22-794f-4341-bbd9-41d53187ffd1
Dec 10 18:38:16 registry2005 docker-registry[608]: 127.0.0.1 - - [10/Dec/2025:18:38:02 +0000] "PUT /v2/restricted/mediawiki-multiversion/blobs/uploads/b01bff22-794f-4341-bbd9-41d53187ffd1?_state=[redacted]&digest=[redacted] HTTP/1.1" 500 70 "" "docker/20.10.5+dfsg1 go/go1.15.15 git-commit/363e9a8 kernel/5.10.0-36-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.5+dfsg1 \\(linux\\))"

Grepping for the request ID shows this:

Dec 10 18:38:16 registry2005 docker-registry[608]: time="2025-12-10T18:38:16.115838953Z" level=error msg="unknown error completing upload: timeout expired while waiting for segments of /docker/registry/v2/repositories/restricted/mediawiki-multiversion/_uploads/b01bff22-794f-4341-bbd9-41d53187ffd1/data to show up" go.version=go1.19.8 http.request.host=docker-registry.discovery.wmnet http.request.id=52d055ca-f623-4185-9e0e-d69a2e83f977 http.request.method=PUT http.request.remoteaddr=10.192.32.7 http.request.uri="/v2/restricted/mediawiki-multiversion/blobs/uploads/b01bff22-794f-4341-bbd9-41d53187ffd1?_state=[redacted]&digest=[redacted]" http.request.useragent="docker/20.10.5+dfsg1 go/go1.15.15 git-commit/363e9a8 kernel/5.10.0-36-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.5+dfsg1 \\(linux\\))" vars.name=restricted/mediawiki-multiversion vars.uuid=b01bff22-794f-4341-bbd9-41d53187ffd1
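Correlating entries by `http.request.id` is what ties the three registry log lines above together. A quick sketch of doing that in bulk (a hypothetical helper I'm sketching here, not an existing tool):

```python
import re
from collections import defaultdict

# Hypothetical helper: group docker-registry logfmt lines by their
# http.request.id so every entry for one failed PUT can be read together,
# instead of grepping for each request ID by hand.
REQ_ID = re.compile(r'http\.request\.id=([0-9a-f-]+)')

def group_by_request_id(lines):
    groups = defaultdict(list)
    for line in lines:
        match = REQ_ID.search(line)
        if match:
            groups[match.group(1)].append(line)
    return groups
```

Feeding it the journal output for the failure window would show all three lines above keyed under `52d055ca-f623-4185-9e0e-d69a2e83f977`.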

Okay, so the final sync-world has finished:

21:30 <+logmsgbot> !log urbanecm@deploy2002 Finished scap sync-world: test (duration: 76m 31s)

I'm not sure whether whatever is happening is/was temporary, or whether it is (somehow) related to the patches I was attempting to deploy. Either way, 76 minutes is very long, even for a full image rebuild.

Staying off deployment servers until this gets some clarity.

I kept the (somehow-successful) scap-image-build-and-push log at deploy2002:/home/urbanecm/scap-image-build-and-push-log-T412265, in case it is useful to anyone.

For the record, I had a similar issue running the train last month in T408272#11369913; at that time, however, simply running scap train again worked.

Urbanecm_WMF lowered the priority of this task from Unbreak Now! to High. Dec 11 2025, 12:50 PM

Decreasing to High, as several deployments have completed since. I'll try again. Leaving this open in case serviceops wants to investigate, especially given Andre's comment.

Okay, I tried deploying the two patches (https://gerrit.wikimedia.org/r/1217573, https://gerrit.wikimedia.org/r/1217574) again. Same error. Spiderpig job: https://spiderpig.wikimedia.org/jobs/1073, log archived on deploy2002:/home/urbanecm/scap-image-build-and-push-log-bad-T412265.

This is a blocker for the Growth team.

Let's make this a blocker for the next train. If there is somehow an issue in our code that makes deployments likely to fail (though I cannot see how that could possibly be the case; maybe we have just been unlucky), it would likely also affect the upcoming train, which is going to include those changes.

Here are pretrain image build and push times for the last several days:

Unless otherwise noted, full l10n rebuild occurred for each of these
image builds:

dec 07 13m 07s (confirmed l10n rebuild, despite the fast time)
dec 08 17m 40s
dec 09 17m 50s
dec 10 22m 36s
dec 11 29m 34s
dec 12 17m 42s
dec 13 29m 42s
dec 14 13m 09s (no l10n rebuild)

FWIW, the third deployment attempt succeeded today. But I still have very little idea of what went wrong and why, so I'm leaving this task open for Release-Engineering-Team and serviceops as needed.

This happened again in the UTC late backport window:
https://sal.toolforge.org/log/5jwwKZsBvg159pQrFeSI

https://spiderpig.wikimedia.org/jobs/1104

From the build log:

21:12:36 [mediawiki-publish-83] Running sudo /usr/local/bin/docker-pusher -q docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-12-16-211020-publish-83
22:02:55 [mediawiki-publish-83] received unexpected HTTP status: 500 Internal Server Error
22:02:55 [mediawiki-publish-83] Traceback (most recent call last):
  File "/srv/mwbuilder/release/make-container-image/app.py", line 145, in join
    future.result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 433, in result
    return self.__get_result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/srv/mwbuilder/release/make-container-image/build-images.py", line 548, in build
    mw_mv_image, mw_mv_debug_image, mw_mv_cli_image = app_instance.build_mediawiki_images(
  File "/srv/mwbuilder/release/make-container-image/build-images.py", line 260, in build_mediawiki_images
    image = build_image_incr.App(
  File "/srv/mwbuilder/release/make-container-image/build_image_incr.py", line 171, in run
    self.push_image(report["image"])
  File "/srv/mwbuilder/release/make-container-image/app.py", line 91, in push_image
    self.check_call(["sudo", "/usr/local/bin/docker-pusher", "-q", image_ref])
  File "/srv/mwbuilder/release/make-container-image/app.py", line 72, in check_call
    return subprocess.check_call(
  File "/usr/lib/python3.9/subprocess.py", line 373, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '/usr/local/bin/docker-pusher', '-q', 'docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-12-16-211020-publish-83']' returned non-zero exit status 1.

Traceback (most recent call last):
  File "/srv/mwbuilder/release/make-container-image/build-images.py", line 619, in <module>
    main()
  File "/srv/mwbuilder/release/make-container-image/build-images.py", line 603, in main
    threaded_app.join()
  File "/srv/mwbuilder/release/make-container-image/app.py", line 151, in join
    raise ThreadedAppJoinException("One or more tasks raised an exception")
app.ThreadedAppJoinException: One or more tasks raised an exception

22:02:55 Finished build-and-push-container-images (duration: 52m 35s)
22:02:55 sync-world failed: <CalledProcessError> Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.5,1.46.0-wmf.7,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.229.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-staging/scap/image-build' returned non-zero exit status 1. (scap version: 4.229.0) (duration: 54m 53s)
22:02:55 backport failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=kharlan', 'Backport for [[gerrit:1218828|product_metrics.special_create_account: Collect mediawiki_database (T412866)]]']' returned non-zero exit status 1. (scap version: 4.229.0) (duration: 56m 19s)

Recapping my understanding:

  • We deploy a change that touches a large number of files
  • This results in a "full image build"
    • instead of pushing new layers to the docker registry, we build and push a whole new image
  • Full image builds create big images
  • Increasingly, when we push the big images to the docker registry, the registry gives us 500 errors
  • The registry is returning 500s due to push timeouts
    • Timeouts are a different error from the "bad blob" errors we've seen previously (i.e., T406392).

Assumptions:


Questions

  • Is my understanding correct? Are my assumptions correct?
  • Is network weather between deploy host and registry worse?
  • Are there resource problems on the deploy host? Or registry host? Or swift?
  • Anything else that might explain why deploys/image pushes are failing with more frequency?

I dug through grafana, but didn't see anything too damning.

@thcipriani - Thanks for pulling together T412265#11471277. Indeed, your understanding here is correct.

Cause

We believe this is another manifestation of the same class of failure modes we saw earlier this year in T390251.

To recap, swift is an eventually consistent object store, and the swift driver that ships with the docker registry is fundamentally flawed in that it does not handle this appropriately (i.e., it makes assumptions that do not hold in practice). Indeed, docker has since removed the swift driver from the registry (and closed / declined PRs with proposed fixes).
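A toy model of the flaw (my own sketch, not the actual driver code): the driver effectively assumes read-after-write visibility, and its fallback is to poll for the written segments with a fixed timeout, which is exactly the "timeout expired while waiting for segments ... to show up" error in the registry logs above.

```python
class EventuallyConsistentStore:
    """Toy model of swift: a write only becomes visible to readers after
    a propagation delay, which an eventually consistent store does not bound."""
    def __init__(self, visibility_delay):
        self.visibility_delay = visibility_delay
        self._visible_at = {}  # path -> simulated time the write becomes visible

    def put(self, path, now):
        self._visible_at[path] = now + self.visibility_delay

    def exists(self, path, now):
        return path in self._visible_at and now >= self._visible_at[path]

def wait_for_segments(store, path, now, timeout, poll_interval=1):
    """Poll until the object shows up, as the driver's workaround does;
    give up after `timeout` seconds -- producing the 500 the client sees.
    Uses a simulated clock; real code would sleep between polls."""
    deadline = now + timeout
    while now < deadline:
        if store.exists(path, now):
            return True
        now += poll_interval
    return False
```

When the propagation delay stays below the poll timeout the push succeeds; when cluster "weather" stretches the delay past it, the upload completion fails even though the data was written.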

This is what led to the proposal to replace our use of swift with ceph (T412951), which would allow us to use the well-supported s3 driver by way of apus, as well as @dancy's proposed set of workaround patches to the driver.

In any case, the way this manifests for the client is that a (large) image push fails after dockerd exhausts its internal budget of retries against the registry (which in practice can take up to an hour for a MediaWiki image).
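dockerd's retry policy is internal to the daemon, so the numbers below are assumptions on my part, but a back-of-the-envelope sketch shows how a handful of retries on a large layer can plausibly add up to the ~51-minute failures in the logs above:

```python
def failed_push_duration(attempts, transfer_secs, base_backoff_secs):
    """Rough wall-clock time for one layer that fails on every attempt:
    each attempt re-transfers the layer, then backs off exponentially
    before the next try. All parameters are illustrative assumptions,
    not dockerd's real values."""
    backoff = sum(base_backoff_secs * 2**i for i in range(attempts - 1))
    return attempts * transfer_secs + backoff
```

With, say, 5 attempts, a 10-minute transfer per attempt, and a 5-second base backoff, that is 5*600 + 5*(1+2+4+8) = 3075 seconds, i.e. about 51 minutes; the transfer time, not the backoff, dominates.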

Trigger

This part is trickier, as we've previously been unable to identify what changed to make the effect of eventual consistency in the codfw swift cluster more pronounced. Moreover, this seems to manifest on two timescales:

  • Transient issues where large image pushes are unreliable for hours or days, such as reported in T406392.
  • Longer-duration issues where pushes are unreliable for weeks, such as we saw in T390251 in March / April and are seeing again right now.

The fact that the second timescale is "rare" makes this a bit hard to pinpoint, though now that we're seeing it again, perhaps we'll have better luck identifying it. Ultimately, though, it really is "just" the trigger - i.e., we believe the cause is the poor design of the swift driver.


More speculative thoughts on a trigger

While looking for metrics that might correlate with a higher rate of object metadata "churn" in the codfw swift cluster, which in turn might contribute to the (assumed) ongoing slowness, I happened upon this view of the rate of async pending updates (deferred container updates). It shows two clear periods of elevated background updates that correlate suspiciously closely with the two rounds of eventual-consistency "weather" we've seen this year.

Breaking that down by instance (ms-be host) those appear to correlate with the additions of ms-be2089 and later ms-be2090-2094.

While highly speculative - and again stressing that this may be a trigger, but is clearly not the cause of the problems we're seeing - @MatthewVernon, from your mental model of swift, (1) is it plausible that adding new ms-be hosts to the cluster may temporarily have this effect while they load, and (2) if so, do you have a way to estimate how long until the dust settles for ms-be209[0-4]?

Adding or removing hosts from the swift rings will create more "churn" - you have to make incremental changes to the swift rings, deploy them, and then let swift shuffle data around. We do this with the swift ring manager, which will change device weights in the rings by up to 4% of their full capacity at a time (this seems to be a reasonable trade-off between not being too disruptive to user traffic and not making a load/drain operation take too long; at that rate, a load takes about 3 weeks).
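As a sanity check on the ~3 week figure: at up to 4% of full capacity per ring change, taking a device from zero to full weight needs ceil(100/4) = 25 increments. The cadence below is my assumption, chosen only to show the arithmetic, not a property of the ring manager:

```python
import math

def load_steps(increment_pct=4):
    """Number of incremental ring changes needed to take a device from
    0 to full weight, at `increment_pct` percent of full capacity per change."""
    return math.ceil(100 / increment_pct)

def load_duration_days(increment_pct=4, steps_per_day=1.25):
    """Rough wall-clock duration of a load; steps_per_day is an assumed
    cadence picked so the result lands near the ~3 weeks quoted above."""
    return load_steps(increment_pct) / steps_per_day
```

25 steps at an assumed 1.25 ring changes per day gives 20 days, i.e. roughly the 3 weeks quoted.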

I typically track the impacts on swift using the dispersion graph on the Swift dashboard (e.g.); you can see that it's been quite busy this year, because as well as a couple of rounds of hardware refresh (drain old nodes, load new ones), we've also had to drain and re-load nodes to put them onto new per-rack VLANs (T354872) and to do disk-controller swaps (T400876).

All of which said, a single node would have a relatively modest expected impact, whereas draining and loading a batch at once may be more impactful.