
Move the docker registry's /restricted prefix to Docker Distribution backed up by Ceph
Stalled, High, Public

Description

We have been spending quite a lot of time investigating and finding workarounds for inconsistency issues caused by the Swift backend of Docker Distribution: T390251, T401533, T391935, T406392 (and possibly many more, tracked or not).

After a chat with Alexandros, we came up with the following high-level plan as a proposal to move things forward in Q3:

  1. Do basic testing with Docker Distribution and Ceph. Alexandros has already started this: we have a separate Docker Distribution instance on the registry hosts that uses Ceph as backend (see T394476 for the Data Persistence part). We'd need to pick it up and complete it (reasonable load tests, push/pull of various image sizes, etc.). We'd need to pay attention to bottlenecks in the new storage infrastructure, and to (hopefully few and minor) bugs in the Docker Distribution implementation of the Ceph driver/engine.
  2. Once we are reasonably sure that the new storage infrastructure and configuration are solid, we could proceed with a simple test: we coordinate with Releng and, on a certain day, flip the /restricted Docker Registry prefix to the Docker Distribution instance backed by Ceph. The scap workflow will push the new image, and the Wikikube worker nodes will pull it. This flip is very easy since the main point of contact is still nginx (on the registry nodes), so we can switch the backend in its config very easily, and revert just as quickly if problems arise. The nice part is that no change will be needed on the k8s workers; it will be transparent to them. We considered more conservative approaches (for example, targeting only slices of Wikikube workers), but they could take a huge amount of time and we are not sure we'd gain more reliability.
  3. Move more prefixes/images to the new backend, working with Data Persistence along the way to make sure that we are good capacity-wise, etc.

Just to clarify, this is a short/medium-term solution to hopefully move away from the aforementioned problems and frustrations that we have been experiencing so far. The long-term goal is more challenging, namely: do we want to stick with Docker Distribution? Do we want to move away from it, towards another open source solution? etc. But that will require more time allocated by multiple teams, something that we don't have at the moment. So let's start with something easy enough to achieve :)

Thoughts and opinions are welcome!

Event Timeline

Whatever we do, it should not involve trying to get swift to sync between the two ms clusters :-(

Change #1224091 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::docker_registry: add the ML instance

https://gerrit.wikimedia.org/r/1224091

Change #1224091 merged by Elukey:

[operations/puppet@production] profile::docker_registry: add the ML instance

https://gerrit.wikimedia.org/r/1224091

To keep archives happy - I am working on T394476 to properly onboard ceph apus as a docker registry backend; there seem to be some issues. Once they are solved, we'll be able to progress this task :(

JMeybohm edited projects, added ServiceOps new, Kubernetes, Epic; removed serviceops.
JMeybohm moved this task from Inbox to Backlog on the ServiceOps new board.

@dancy @Scott_French Hi! The apus testing is finally yielding some good results, so we can probably start talking about high level ideas related to the /restricted switch.

TL;DR from T394476:
  • I spent quite a bit of time investigating why docker push actions from the build nodes were hanging indefinitely; it turned out to be a ceph replication problem.
  • I tested various Docker pushes and everything worked.
  • The apus s3 backend can accept S3 "writes" from both eqiad and codfw, but it seems that replication can become messy. The Docker registry is active/passive, so this shouldn't be an issue in the immediate term, but it may become one if we decide to move towards active/active.
  • Replication status in apus/s3 comes without metrics, but it can be verified via ceph-shell commands, so we may be able to add some ad-hoc alerts to work around the limitation. Future ceph releases may expose metrics, but Data Persistence doesn't have a timeline yet.
Plan for the switch

The apus/s3 Docker registry backend is a completely separate instance on the registry[12]* hosts, and we target it via the nginx config. So testing a MW deployment against the new backend should be as easy as deploying an nginx config change, and rolling back if anything goes wrong.
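For illustration, the nginx-level flip might look something like the fragment below. This is a sketch only: the upstream names, ports, and location block are assumptions, not the actual puppet-managed configuration.

```nginx
# Hypothetical sketch of the backend flip (names/ports are assumptions).
upstream registry_swift { server 127.0.0.1:5000; }  # existing Swift-backed instance
upstream registry_s3    { server 127.0.0.1:5001; }  # new Ceph/apus-backed instance

location /v2/restricted/ {
    # Switching (or reverting) the backend is a one-line change here.
    proxy_pass http://registry_s3;
}
```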

Caveats:

  • I haven't load-tested docker pulls from a multitude of nodes like the Wikikube cluster. We have the Dragonfly cache, so I don't think it will be a problem, but let's keep it in mind.
  • When we change the nginx config, the new registry instance will point to an empty s3 bucket. So a MW deployment will push the first Docker images to it, but we'll have nothing to roll back to unless we manually upload images beforehand. Not sure if that will be needed, but raising the point just in case.
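If pre-seeding were ever wanted, one way to sketch it would be copying the current images between registries with skopeo. The snippet below only prints the commands it would run (a dry run); the registry hostnames and image names are hypothetical placeholders, not the real ones.

```shell
# Dry-run sketch: print the skopeo commands that would pre-seed the new
# backend with existing images before the switch. Hostnames and image
# names below are illustrative placeholders.
cmds=$(for img in mediawiki-multiversion mediawiki-multiversion-cli; do
  echo "skopeo copy --all docker://registry-old.example/restricted/${img}:latest docker://registry-new.example/restricted/${img}:latest"
done)
printf '%s\n' "$cmds"
```

`skopeo copy --all` copies every architecture of a multi-arch image, which would matter if the images are multi-platform.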

Let me know if I missed anything, and if you have doubts/concerns and/or tests to do before we decide to proceed.

Thanks for the report @elukey. This sounds very promising!

Change #1229145 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] DNM: docker_registry: move /v2/restricted to the s3 restricted backend

https://gerrit.wikimedia.org/r/1229145

One further note - cluster-wide metrics on sync delay (as opposed to the headers on a particular object) will be in the "Tentacle" release, which will be out some time later this year.

> Thanks for the report @elukey. This sounds very promising!

Thanks! When you have a moment, let me know if what I wrote above is sound, namely whether it is ok to just start clean (basically, whether it is ok not to be able to roll back when we build/deploy for the first time after the switch). If so, I think we can choose when to do the test after the SRE summit (which happens next week). Ideally we could just rebuild all images, try to push/pull them, and see how it goes.

> Thanks! When you have a moment, let me know if what I wrote above is sound, namely whether it is ok to just start clean (basically, whether it is ok not to be able to roll back when we build/deploy for the first time after the switch). If so, I think we can choose when to do the test after the SRE summit (which happens next week). Ideally we could just rebuild all images, try to push/pull them, and see how it goes.

Your plan is sound. After updating or reverting nginx config, running scap sync-world will populate the active registry with the right stuff.

Thank you very much @elukey - that's great news!

+1 to @dancy's assessment that a simple sync-world should be sufficient to ensure the then-latest images are in the then-active registry.

There's a brief window between the switch and the completion of the first push during which a pull (i.e., on a k8s worker) would fail. That shouldn't really be a concern for normal multiversion images, since they'll already be cached on ~all nodes, but for rarer images (e.g., the cli images used by mw-script, etc.) it may be. That's probably fine as long as that window is short and we give folks some warning.

[This assumes there's not some straightforward way to bootstrap the ceph-backed registry ahead of time. Please correct me if I'm wrong there.]

In any case, once we make the switch, we can start with a "noop" (i.e., no MediaWiki changes) sync-world to smoke-test the push side of things. As long as that looks good, we can sync-world again, but now triggering a rebuild via -Dfull_image_build:True, which would stress the pull side of things during the ensuing deployment.
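The two-step test above could be sketched as the following shell session. It is printed as a dry run here rather than executed, and the scap log messages are illustrative:

```shell
# Dry-run sketch of the two-step smoke test (messages are examples):
# 1) a "noop" sync-world exercises the push path on the new backend;
# 2) a forced rebuild exercises the pull path during the deployment.
step1='scap sync-world "noop: smoke-test the push path on the new s3 backend"'
step2='scap sync-world -Dfull_image_build:True "full rebuild: exercise the pull path"'
printf '%s\n%s\n' "$step1" "$step2"
```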

There are three questions / thoughts that come to mind, which I need to think about more:

  • Serving capacity - As you point out, dragonfly should do its job to provide caching / coalescing. IIRC, there's still quite a bit of concurrency (i.e., across layer chunks / ranges), so it will still be interesting to see how the new stack responds and (beforehand) what the key utilization metrics to monitor are.
  • Storage capacity - Given the size of the MediaWiki images and frequency of change (i.e., full image builds a couple of times per week, due to a combination of production-image updates, train presync, and some number of changes that touch l10n files), what are the current constraints on storage capacity and monitoring around runway?
  • Read availability - While worker-level caching (and the ubiquity of MediaWiki image layers in said caches) papers over this to some degree, do we currently understand whether the radosgw issues that led to the push failures can lead to issues on the pull side?

In any case, this is all fantastic and I'm excited to test this out. Also, +1 to scheduling the test for some time after we return from the SRE Summit.

There is currently 3T of apus quota allocated to the docker-registry user cf. wikitech.

> Thank you very much @elukey - that's great news!
>
> +1 to @dancy's assessment that a simple sync-world should be sufficient to ensure the then-latest images are in the then-active registry.

Thanks both for the feedback, it seems relatively easy to schedule maintenance at this point!

> There's a brief window between the switch and the completion of the first push during which a pull (i.e., on a k8s worker) would fail. That shouldn't really be a concern for normal multiversion images, since they'll already be cached on ~all nodes, but for rarer images (e.g., the cli images used by mw-script, etc.) it may be. That's probably fine as long as that window is short and we give folks some warning.

Yeah I think it is a risk that we can accept if we do the maintenance during the infrastructure window or similar.

> [This assumes there's not some straightforward way to bootstrap the ceph-backed registry ahead of time. Please correct me if I'm wrong there.]

In theory we could try to do some hacks to push images beforehand, but I think we can skip this bit and simply use the test that you outlined during an infra maintenance window. Lemme know!

> In any case, once we make the switch, we can start with a "noop" (i.e., no MediaWiki changes) sync-world to smoke-test the push side of things. As long as that looks good, we can sync-world again, but now triggering a rebuild via -Dfull_image_build:True, which would stress the pull side of things during the ensuing deployment.

+1 perfect

There are three questions / thoughts that come to mind, which I need to think about more:

>   • Serving capacity - As you point out, dragonfly should do its job to provide caching / coalescing. IIRC, there's still quite a bit of concurrency (i.e., across layer chunks / ranges), so it will still be interesting to see how the new stack responds and (beforehand) what the key utilization metrics to monitor are.

In theory we should be good, but apus is currently a tiny cluster so we'll have to watch its metrics closely. I am planning to have a chat with Data Persistence during the summit to prioritize, if possible, more hw for this cluster.

>   • Storage capacity - Given the size of the MediaWiki images and frequency of change (i.e., full image builds a couple of times per week, due to a combination of production-image updates, train presync, and some number of changes that touch l10n files), what are the current constraints on storage capacity and monitoring around runway?

Matthew anticipated my reply; we currently have 3TB available, which is more than enough for /restricted (and even for other use cases if needed).

>   • Read availability - While worker-level caching (and the ubiquity of MediaWiki image layers in said caches) papers over this to some degree, do we currently understand whether the radosgw issues that led to the push failures can lead to issues on the pull side?

Sadly no, I think this is one of the known unknowns. I did some simple pulls and didn't see any issues, so if anything I'd expect some stalls due to the cluster being under pressure rather than radosgw itself misbehaving. We don't have a lot of production experience with Ceph yet (namely with many real and demanding use cases), so this is a very good learning experience :)

@dancy @Scott_French I think we are ready to move forward with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1229145, what do you think?

I am ok to proceed with the sync-world and full_image_build during a MediaWiki infrastructure window (tomorrow's could be a good option), but it will be during your night so you'll not be able to join (if needed I can loop in Matthew to check the apus/Ceph side). The alternative is to schedule the switch during a MediaWiki infra window happening during your workday, totally fine for me. Lemme know what you prefer!

> @dancy @Scott_French I think we are ready to move forward with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1229145, what do you think?
>
> I am ok to proceed with the sync-world and full_image_build during a MediaWiki infrastructure window (tomorrow's could be a good option), but it will be during your night so you'll not be able to join (if needed I can loop in Matthew to check the apus/Ceph side). The alternative is to schedule the switch during a MediaWiki infra window happening during your workday, totally fine for me. Lemme know what you prefer!

I have no strong preference. I don't feel the need to be present during the tests. It sounds like you have everything well in hand. Note that @hashar is on train duty this week and in your timezone.

Thank you, @elukey!

No objections to targeting the "UTC mid-day" infra window (although I won't be around), particularly if it means we have more Ceph expertise on-hand during the initial testing vs. the "UTC late" window.

In any case, we'll want to document (1) the relevant dashboards to assess whether we're pushing the apus cluster too hard and (2) the rollback procedure if something goes awry later on (e.g., if neither of us are around).

I think #2 would be pretty straightforward:

... where the same caveats apply about the short window in between (i.e., pulls failing in rare cases where the then-latest image is not cached).

Do we know what #1 might look like? Presumably we'll want some view of both the 3x FEs that host radosgw and the 3x BEs that host the actual OSDs (i.e., since both are potential bottlenecks). Not sure if that already exists in some form in Grafana (I can't seem to find anything, but might not be searching correctly). Is this something you might be able to point us to @MatthewVernon?

Also, if it would be useful to chat about any of these synchronously before proceeding, please let me know and I'm happy to do so.

> Thank you, @elukey!
>
> No objections to targeting the "UTC mid-day" infra window (although I won't be around), particularly if it means we have more Ceph expertise on-hand during the initial testing vs. the "UTC late" window.

To keep archives happy - there will be fewer folks around from Service Ops and Data Persistence, so after a chat with Scott we decided to postpone this by a few days just to be safe.

> In any case, we'll want to document (1) the relevant dashboards to assess whether we're pushing the apus cluster too hard and (2) the rollback procedure if something goes awry later on (e.g., if neither of us are around).
>
> I think #2 would be pretty straightforward:
>
> ... where the same caveats apply about the short window in between (i.e., pulls failing in rare cases where the then-latest image is not cached).

+1 I thought something along those lines as well.

> Do we know what #1 might look like? Presumably we'll want some view of both the 3x FEs that host radosgw and the 3x BEs that host the actual OSDs (i.e., since both are potential bottlenecks). Not sure if that already exists in some form in Grafana (I can't seem to find anything, but might not be searching correctly). Is this something you might be able to point us to @MatthewVernon?

I think:

https://grafana-rw.wikimedia.org/d/WAkugZpiz/rgw-overview
https://grafana-rw.wikimedia.org/d/fad89594-86df-499e-9fa4-1eb612d90dd1/ceph-cluster

And then run scap sync-world plus its variant that rebuilds all the images. I'd also run something like the following on moss-be1001 to check the replication status:

sudo cephadm shell -- radosgw-admin bucket sync status --bucket=registry-restricted

We don't have metrics for it, so we should probably run something periodically on the hosts that executes the cephadm commands and produces Prometheus time series (which we can easily expose).
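A minimal sketch of what such a periodic check could look like, in the style of a node-exporter textfile collector. The "behind on N shards" line format, the metric name, and the sample output below are all assumptions for illustration; in real use the input would come from the radosgw-admin command above.

```shell
# Hypothetical textfile-collector sketch: turn "bucket sync status" output
# into a Prometheus gauge. The input format below is an assumption; real
# input would come from:
#   sudo cephadm shell -- radosgw-admin bucket sync status --bucket=registry-restricted
sync_output='
          realm abc (apus)
  incremental sync on 128 shards
  bucket is behind on 3 shards
'
# Extract the number of lagging shards; default to 0 when fully synced.
behind=$(printf '%s\n' "$sync_output" | sed -n 's/.*behind on \([0-9][0-9]*\) shards.*/\1/p')
: "${behind:=0}"
printf 'docker_registry_bucket_sync_behind_shards %s\n' "$behind"
```

The printed line would be written to the node-exporter textfile directory, and an alert could then fire when the gauge stays above zero for too long.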

@hashar Hi! Would you be available Mon/Tue next week, during the MW Infrastructure window, to assist from the Releng side? Ideally we'll just need to deploy a couple of times and see if everything is smooth and easy, and rollback otherwise. Lemme know!

To keep archives happy - Matthew and I will tentatively schedule the move for Wed 11 at 11:00 UTC (MediaWiki infra window).

The other dashboard that's worth looking at is the sync status one (that looks at the between-DC replication) - https://grafana.wikimedia.org/d/rgw-sync-overview/rgw-sync-overview

Unfortunately, the other useful thing worth keeping an eye on - the sync status - isn't exposed as a metric (I think this is fixed in the upstream Tentacle release), but you can check it on one of the controller nodes, moss-be{1,2}001, cf. wikitech.

Change #1229145 merged by Elukey:

[operations/puppet@production] docker_registry: move /v2/restricted to the s3 restricted backend

https://gerrit.wikimedia.org/r/1229145

Mentioned in SAL (#wikimedia-operations) [2026-02-11T11:04:41Z] <elukey> move the Docker Registry's /v2/restricted prefix (MW Images) to the s3/apus backend - T412951

The full image rebuild deployment went really nicely, a total of 37 mins from beginning to end:

11:16:41 Started scap sync-world: Test the new s3/apus docker registry backend - full reimage rebuild
11:16:43 Started cache_git_info
11:16:44 Finished cache_git_info (duration: 00m 01s)
11:16:44 Started l10n-update
11:16:46 Updating ExtensionMessages-1.46.0-wmf.14.php
11:16:52 Updating LocalisationCache for 1.46.0-wmf.14 using 30 thread(s)
11:16:52 Running rebuildLocalisationCache.php
11:16:55 0 languages rebuilt out of 545
11:16:55 Use --force to rebuild the caches which are still fresh.
11:16:56 Generating JSON versions and md5 files (as www-data)
11:17:00 Updating ExtensionMessages-1.46.0-wmf.15.php
11:17:07 Updating LocalisationCache for 1.46.0-wmf.15 using 30 thread(s)
11:17:07 Running rebuildLocalisationCache.php
11:17:10 0 languages rebuilt out of 545
11:17:10 Use --force to rebuild the caches which are still fresh.
11:17:12 Generating JSON versions and md5 files (as www-data)
11:17:14 Finished l10n-update (duration: 00m 29s)
11:17:14 Checking for new runtime errors locally
11:17:17 Started build-and-push-container-images
11:17:17 K8s images build/push output redirected to /home/elukey/scap-image-build-and-push-log
11:37:13 Finished build-and-push-container-images (duration: 19m 55s)
11:37:14 Started sync-masters
11:37:27 sync-masters: 100% (in-flight: 0; ok: 1; fail: 0; left: 0)             
11:37:27 Finished sync-masters (duration: 00m 12s)
11:37:27 Started sync-testservers-k8s
11:42:00 K8s deployment progress: 100% (ok: 12; fail: 0; left: 0)               
11:42:00 Finished sync-testservers-k8s (duration: 04m 33s)
11:42:00 Started check-testservers
11:42:00 Executing check 'check_testservers_k8s-1_of_2'
11:42:00 Executing check 'check_testservers_k8s-2_of_2'
11:42:16 Finished check-testservers (duration: 00m 16s)
11:42:16 Started sync-canaries-k8s
11:44:57 K8s deployment progress: 100% (ok: 62; fail: 0; left: 0)               
11:44:57 Finished sync-canaries-k8s (duration: 02m 40s)
11:44:57 Waiting 20 seconds for canary traffic...
11:45:18 Logstash checker Counted 0 error(s) in the last 20 seconds. OK.
11:45:18 Started sync-prod-k8s
11:54:03 K8s deployment progress:  94% (ok: 1780; fail: 0; left: 103)           
11:54:03 Finished sync-prod-k8s (duration: 08m 45s)
11:54:03 Started scap-cdb-rebuild-prod
11:54:05 scap-cdb-rebuild: 100% (in-flight: 0; ok: 1; fail: 0; left: 0)         
11:54:05 Finished scap-cdb-rebuild-prod (duration: 00m 01s)
11:54:05 Started sync-wikiversions-prod
11:54:06 sync-wikiversions: 100% (in-flight: 0; ok: 1; fail: 0; left: 0)        
11:54:06 Finished sync-wikiversions-prod (duration: 00m 01s)
11:54:06 Running purgeMessageBlobStore.php
11:54:08 Finished scap sync-world: Test the new s3/apus docker registry backend - full reimage rebuild (duration: 38m 01s)

This includes two 5-minute waits, so it could potentially have completed in ~27 minutes.

elukey@deploy2002:~$ grep "300 seconds" /home/elukey/scap-image-build-and-push-log
11:25:42 [mediawiki-publish-83-next] Waiting 300 seconds for swift after full mediawiki image build (T390251)
11:32:11 [mediawiki-publish-83] Waiting 300 seconds for swift after full mediawiki image build (T390251)

Detailed logs for scap-image-build-and-push-log in deploy2002:/home/elukey/scap-image-build-and-push-log_T412951

I didn't observe anything weird in the registry's or scap's logs, and the apus s3 metrics look good (Matthew confirmed).

The only odd detail to report is that replication with eqiad was lagging a bit, and Matthew had to restart the RGW daemons to speed it up. It is not clear whether it would have converged eventually with a bit more patience; in my opinion we can observe what happens during the next deployments. We need to add an alert that pings us when the S3 bucket gets out of sync between codfw and eqiad, but that can be done anytime.

So far this is a success! Let's see how it goes during the next days!

To roll back (quoting Scott's previous comment):

Mentioned in SAL (#wikimedia-operations) [2026-02-11T14:58:53Z] <Emperor> restart codfw apus frontends T412951

Mentioned in SAL (#wikimedia-operations) [2026-02-11T15:38:52Z] <elukey> [ROLLBACK] move the Docker Registry's /v2/restricted prefix (MW Images) to the s3/apus backend - T412951

> This includes two 5-minute waits, so it could potentially have completed in ~27 minutes.
>
> elukey@deploy2002:~$ grep "300 seconds" /home/elukey/scap-image-build-and-push-log
> 11:25:42 [mediawiki-publish-83-next] Waiting 300 seconds for swift after full mediawiki image build (T390251)
> 11:32:11 [mediawiki-publish-83] Waiting 300 seconds for swift after full mediawiki image build (T390251)

Just a minor FYI: these sleeps happen in parallel, so it's unlikely that we'll get a 10-minute win once they're removed.

Sadly we had to roll back: we hit the small-files problem again and ceph's rgw needed to be restarted. After a chat with Scott, we preferred to roll back to Swift and avoid stalling issues at random times.

What is the issue?

For some reason, after a while a docker push that includes small files fails with 500s from the Registry. The Registry gets stuck trying to push those files via the apus s3 API, which ends with envoy (on the apus side) returning a 504. The culprit seems to be the apus rgw s3 daemon: once restarted, everything works perfectly and the docker push goes through without issues. Why it happens is a mystery; we'll need to figure out a way to reproduce it.

Let me just record some observations from yesterday before they vanish from mind/logfiles.

First, sync pretty much kept up during the initial migration (occasional lag, but that's expected). Just before noon UTC I restarted the frontends, as it seemed like lag hadn't moved in a while, and that rapidly resolved things.

When we got complaints about the train deploy not working, 504s were not obvious on the envoy dashboards for apus, and sync was up-to-date. But they were visible in /var/log/envoy/global_tls.log on the RGWs, and (I think!) reproducible by trying to PUT a small file using s3cmd with the docker-registry credentials.

We only saw 504s on two of the three codfw frontends (and none in eqiad). Throughout the incident, the RGWs continued to handle other requests (including PUTs from the docker-registry user). In case it's useful, all the 504s are in (NDA) P88794.

The 504s are all flagged SI by envoy (stream_idle_timeout), and if I'm reading the logs right, envoy has always sent 14 bytes (and typically received 0, but sometimes has received some bytes at the start of the problem). Grepping the rgw logs (e.g. journalctl -u ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@rgw.rgw.apus.codfw.moss-fe2001.gamapq.service -S "2026-02-11 11:00" -g 'PATH_OF_INTEREST') for the paths in the envoy error logs doesn't get any hits - i.e. it seems that the RGW is not registering the request at all.

I agree that a reproducer would be really useful (and/or the opportunity to poke at a connection in the bad state, but that might be hard through all the layers of TLS / envoy / podman / ...)

elukey changed the task status from Open to Stalled. Thu, Feb 12, 11:33 AM

Let's keep T394476 for the technical work, setting this task to stalled until we find a fix.