Page MenuHomePhabricator

scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw
Closed, ResolvedPublic

Description

Latest instance, job #1687:

# …
14:09:38 [mediawiki-publish-83] sent 6,260,716 bytes  received 19,893 bytes  2,512,243.60 bytes/sec
14:09:38 [mediawiki-publish-83] total size is 9,401,368,757  speedup is 1,496.89
14:09:39 [mediawiki-publish-83] Commit docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2026-04-02-140931-publish-83
14:09:39 [mediawiki-publish-83] Running docker commit -c ENTRYPOINT ["/usr/sbin/php-fpm8.3", "--nodaemonize", "--fpm-config", "/etc/php/8.3/fpm/php-fpm.conf"] -c LABEL vnd.wikimedia.builder.
name="scap" -c LABEL vnd.wikimedia.builder.version="4.243.0" -c LABEL vnd.wikimedia.scap.stage_dir="/srv/mediawiki-staging" -c LABEL vnd.wikimedia.scap.build_state_dir="/srv/mediawiki-stagin
g/scap/image-build" -c LABEL vnd.wikimedia.mediawiki.versions="1.46.0-wmf.21,1.46.0-wmf.22" -c LABEL vnd.wikimedia.build-type=incremental -c LABEL vnd.wikimedia.parent-image=docker-registry.
discovery.wmnet/restricted/mediawiki-multiversion:2026-04-02-135907-publish-83 rsync-2719044684066940925 docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2026-04-02-140931-p
ublish-83
14:09:39 [mediawiki-publish-83] sha256:2daddb008c9fc63de8fba6a5084044250cebac2cd7ea5cbb937fd3206edb96dc
14:09:39 [mediawiki-publish-83] Pushing docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2026-04-02-140931-publish-83
14:09:39 [mediawiki-publish-83] Running sudo /usr/local/bin/docker-pusher -q docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2026-04-02-140931-publish-83
14:10:32 [mediawiki-publish-83] blob upload unknown
14:10:32 [mediawiki-publish-83] Traceback (most recent call last):
# (snip, Python-side traceback is uninteresting I think – it just complains that the command exited nonzero)

Previous SpiderPigs: #1684, #1685, #1686.

This is currently blocking all deployments, including the fix for UBN T422143.


Trigger

Repooling apus.discovery.wmnet in codfw (active DC for the docker registry) on day 8 of the DC switchover led to a situation where the s3 driver in the apus-backed registry instances (such as the one for MediaWiki restricted images) was issuing operations against a mix of apus backends in eqiad and codfw. Since cross-DC replication is asynchronous, this led to persistent consistency issues until the registry instances were restarted, thus clearing cached connections to apus. See T422166#11783170.

Near-term mitigation (this task)

  • The apus service will be excluded from the switchover (i.e., apus.discovery.wmnet will not be touched on day 1 or 8).
  • The manual switchover process for apus (e.g., for DC-wide maintenance or a proper disaster scenario) will be documented and cross-linked in the relevant locations linked in T422166#11808965. Following a change in pooled state on the discovery service, we need to restart docker-registry-restricted.service and docker-registry-ml.service on A:docker-registry hosts in a safe / paced way (e.g., one a time). This could be achieved with a cumin command, or a dedicated cookbook.

Long-term fix (not in scope for this task)

We have reason to believe that the overly aggressive connection caching in the s3 driver should go away in later versions of docker registry. Upgrading is blocked on retaining support for swift, and will likely not see action until next FY.

Event Timeline

Timeline note: this comes hot on the tail of T422130, for which @jasmine_ repooled codfw slightly earlier than scheduled (which seems to have helped with that issue). Not yet clear if this is related or not; @jasmine_ is looking into it (see also #wikimedia-operations IRC log).

dockerd logs on deploy1003 for the above example:

Apr 02 14:09:17 deploy1003 dockerd[1070]: time="2026-04-02T14:09:17.561327804Z" level=info msg="ignoring event" container=c8f32695fd426caa327d6d50d5c623e1013e22022323ec1e3be211a355b55385 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:19 deploy1003 dockerd[1070]: time="2026-04-02T14:09:19.044237792Z" level=info msg="ignoring event" container=8918447ab4fa5f7314db46109e41c2565ef75af75f8f7049411edcc30820d256 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:20 deploy1003 dockerd[1070]: time="2026-04-02T14:09:20.172159547Z" level=info msg="ignoring event" container=75617a5b66e3785c1fb590189ea2017505ee58bf260bce0df7b265dbcd7ffbaa module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:21 deploy1003 dockerd[1070]: time="2026-04-02T14:09:21.534096425Z" level=info msg="ignoring event" container=bb43a6919972142c7096539209df318ff8ab26e3966c59725cb7d3fa562e82b0 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:22 deploy1003 dockerd[1070]: time="2026-04-02T14:09:22.637028676Z" level=info msg="ignoring event" container=2093df2812dc0b4a08638d76079b39489e57891d714f05ee6ef0704ee8881a84 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:23 deploy1003 dockerd[1070]: time="2026-04-02T14:09:23.744295464Z" level=info msg="ignoring event" container=b7c8711f478145c030a3753f726ad5a5aa58d1be9d29b941be9dc7a3617bbe2d module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:25 deploy1003 dockerd[1070]: time="2026-04-02T14:09:25.628415019Z" level=info msg="ignoring event" container=a4f9362b6e1e9a5f3a2646267b0d4f50e384757fe0af9236c26d4f2e60a8b2d1 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:26 deploy1003 dockerd[1070]: time="2026-04-02T14:09:26.741784645Z" level=info msg="ignoring event" container=43ed449ab0803bcf78aae693ca5274603a63f428cf65b37386103ef0e2baa645 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:29 deploy1003 dockerd[1070]: time="2026-04-02T14:09:29.409459877Z" level=info msg="ignoring event" container=514baacfb03a13900f18ea6e514c3cf2276168acd5cd2b84a9d1c71772e16e2f module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:32 deploy1003 dockerd[1070]: time="2026-04-02T14:09:32.538978276Z" level=info msg="ignoring event" container=71d7895eec21f77034dbb229a2dbcc0b5215421d05f0783b73f1644c50248af5 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:33 deploy1003 dockerd[1070]: time="2026-04-02T14:09:33.411668144Z" level=info msg="ignoring event" container=9c4a36721a5504fe4dcc8f442f40d026a0b4b20757a10418f022ffe7dc0b7500 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:35 deploy1003 dockerd[1070]: time="2026-04-02T14:09:35.566025428Z" level=info msg="ignoring event" container=a2b6cbbd5a23224ba22869d5e871dd720b0cca621adf79724c7e8ba1039fa8c6 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:38 deploy1003 dockerd[1070]: time="2026-04-02T14:09:38.729786762Z" level=info msg="ignoring event" container=9145de9c245dfd0089685a0afcee74c1fa45689d2212d9b2fea3d34f2e2b0c70 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr 02 14:09:40 deploy1003 dockerd[1070]: time="2026-04-02T14:09:40.089926591Z" level=error msg="Upload failed, retrying: blob upload unknown"
Apr 02 14:09:40 deploy1003 dockerd[1070]: time="2026-04-02T14:09:40.534389785Z" level=error msg="Upload failed, retrying: blob upload unknown"
Apr 02 14:09:45 deploy1003 dockerd[1070]: time="2026-04-02T14:09:45.458451907Z" level=error msg="Upload failed, retrying: blob upload unknown"
Apr 02 14:09:46 deploy1003 dockerd[1070]: time="2026-04-02T14:09:46.174875827Z" level=error msg="Upload failed, retrying: blob upload unknown"
Apr 02 14:09:56 deploy1003 dockerd[1070]: time="2026-04-02T14:09:56.092401488Z" level=error msg="Upload failed, retrying: blob upload unknown"
Apr 02 14:09:57 deploy1003 dockerd[1070]: time="2026-04-02T14:09:57.088219803Z" level=error msg="Upload failed, retrying: blob upload unknown"
Apr 02 14:10:11 deploy1003 dockerd[1070]: time="2026-04-02T14:10:11.529191666Z" level=error msg="Upload failed, retrying: blob upload unknown"
Apr 02 14:10:12 deploy1003 dockerd[1070]: time="2026-04-02T14:10:12.525714865Z" level=error msg="Upload failed, retrying: blob upload unknown"
Apr 02 14:10:32 deploy1003 dockerd[1070]: time="2026-04-02T14:10:32.427502061Z" level=error msg="Upload failed: blob upload unknown"
Apr 02 14:10:32 deploy1003 dockerd[1070]: time="2026-04-02T14:10:32.427565597Z" level=info msg="Attempting next endpoint for push after error: blob upload unknown"

... which is not terribly exciting on its own - i.e., is just consistent with dockerd repeatedly failing (and retrying) at blog upload.

The more interesting logs are from the (restricted) registry instances in codfw, for example:

Apr 02 15:14:08 registry2005 docker-registry[631]: time="2026-04-02T15:14:08.523901214Z" level=error msg="error resolving upload: s3aws: Path not found: /docker/registry/v2/repositories/restricted/mediawiki-singleversion/_uploads/205e4760-3c31-4751-9e12-4f9dd1ea5a88/data [...]

... which suggest an inconsistent view of state in ceph after the blob upload is ostensibly completed by the client.

(Edit: Updated with logs from IRC)

Mentioned in SAL (#wikimedia-operations) [2026-04-02T15:26:11Z] <swfrench-wmf> restarted docker-registry-restricted.service on registry200[45] - T422166

Mentioned in SAL (#wikimedia-operations) [2026-04-02T15:31:16Z] <swfrench-wmf> restarted docker-registry-ml.service on registry200[45] - T422166

Scott_French lowered the priority of this task from Unbreak Now! to Medium.EditedApr 2 2026, 4:04 PM
Scott_French added subscribers: CDanis, elukey.

This was a curious one. Many thanks to @elukey and @CDanis for the assistance.

tl;dr - Cached connections in the (restricted) docker registry's s3 driver resulted in a form of split-brain, where operations were seemingly performed against a mix of eqiad and codfw apus. Since these are only eventually consistent, this would lead to various read-your-writes violations.

What happened

So, this was the first DC switchover (today was the day codfw was repooled) after moving the restricted registry to s3-on-apus.

Note: The registry itself (docker-registry.discovery.wmnet) never switches. It stays in codfw, and this has not changed.

What has changed is that we've gone from always using DC-local (codfw) swift to following apus.discovery.wmnet, which is active/active and was repooled today around 13:19.

This gave rise to a situation where some cached connections to eqiad apus lingered after the repool of codfw (i.e., at which point codfw registry instances should observe apus.discovery.wmnet resolve to codfw), leading to the inconsistencies referred to above.

To fix this, we simply needed to restart the apus-backed registry instances (restricted and ml).

Next steps

In any case, we'll need to give some thought to how / whether to evolve this.

Per @elukey, it sounds like there are connection-caching improvements in a more recent version of the s3 driver, though we cannot adopt it while still needing swift support.

Questions (not advocating for any specific "fix" at this time):

  • Should we point the restricted instance to DC-local apus? (similar to what is done with swift)
  • Should we exclude apus from the switchover? (as a temporary measure)
  • Should we establish a process around restarts (e.g., of the apus-backed registry instances) when apus.discovery.wmnet pooled state changes?
  • Should we have some sort of check / alert that detects when the registry instances are persistently inconsistent w.r.t. their upstream apus cluster? (this is "easy" as a shell one-liner, but seems fiddly to automate cleanly)
  • Why didn't we see this problem when codfw was depooled last week?

I need to give this some thought.

Scott_French renamed this task from scap can’t deploy: blob upload unknown to scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw.Apr 2 2026, 7:16 PM
Scott_French added a subscriber: Blake.

Moving this into ServiceOps new, since we're probably the right team to figure out how this should be reflected in the switchover process, if at all.

jijiki changed the task status from Open to In Progress.Apr 6 2026, 2:46 PM
jijiki moved this task from Inbox to In Progress on the ServiceOps new board.

Change #1269382 had a related patch set uploaded (by Blake; author: Blake):

[operations/puppet@production] service: exclude apus from the switchover.

https://gerrit.wikimedia.org/r/1269382

@Scott_French It sounds like it might be reasonable to exclude this service from the switchover, add a new Cumin alias for docker-registry-eqiad and docker-registry-codfw, and then add a small cookbook which can restart the relevant systemd service in either DC. Does that seem like an appropriate way to proceed?

Thanks, @Blake. Yes, excluding the apus service from switchover day 1 seems like the right approach for now given the special handling required (by analogy with other services we've excluded).

Since we may need to depool / pool apus anyway (e.g., for maintenance of some sort), we need to document the follow-on steps to restart the relevant services [0].

I guess that would live in https://wikitech.wikimedia.org/wiki/Ceph/Cephadm? Not sure if @elukey or @MatthewVernon might have thoughts on that.

I'm on the fence as to whether we need a custom cookbook for the restarts vs. a well-documented cumin command, given that it's unclear how long we'll need this exclusion.

[0] The heavy-weight solution would be a custom cookbook that does it all - i.e., modifies pooled state and potentially handles the restarts if needed. Even if we did do that, we'd still need to document it somewhere.

I've eyeballed the discussion here - AFAICT apus is behaving as expected? I haven't seen persistent lag between the two clusters, but during bursts of activity replication between the DCs is asynchronous (by design). The problem is the registry (due to caching connections) writing to both clusters at the same time but assuming it's only writing to one, and thus being thrown by asynchronous replication between the clusters.

I'm relaxed about how apus is handled during switchovers (so if you want to change it, go ahead, as long as I know :) ).

I kindof think that https://wikitech.wikimedia.org/wiki/Ceph/Cephadm is where docs about apus (or other cephadm-managed clusters) lives, which could include general client configuration advice; but I'd likewise expect docs about "things you need to do to the docker registry during a switchover" to live elsewhere. I'm not precious about it, though!

I've eyeballed the discussion here - AFAICT apus is behaving as expected? I haven't seen persistent lag between the two clusters, but during bursts of activity replication between the DCs is asynchronous (by design). The problem is the registry (due to caching connections) writing to both clusters at the same time but assuming it's only writing to one, and thus being thrown by asynchronous replication between the clusters.

Yes, that's exactly correct - the problem here is the connection caching across changes in pooled state for apus.discovery.wmnet.

I'm relaxed about how apus is handled during switchovers (so if you want to change it, go ahead, as long as I know :) ).

I kindof think that https://wikitech.wikimedia.org/wiki/Ceph/Cephadm is where docs about apus (or other cephadm-managed clusters) lives, which could include general client configuration advice; but I'd likewise expect docs about "things you need to do to the docker registry during a switchover" to live elsewhere. I'm not precious about it, though!

Great, thank you!

Indeed, the story of where the docs should live is a little muddled - e.g., on the one hand, special instructions related to switchovers probably belong adjacent do the switchover docs, but on the other, a switchover is not the only reason one might depool apus (e.g., for operational reasons that should be adjacent to the Cephadm docs).

Zooming out, the "if you're pooling or depooling apus, do X to the docker registry" should be described in one and called out / referenced in all of:

@Scott_French @Blake can we update the description with the conclusion on what needs to be done here?

And does this need to be done this quarter, or can it wait until next one?

@MLechvien-WMF - I've updated the task description to capture the discussion here. My vote is that we get the "near-term mitigation" work done this quarter, while the issue is fresh in our minds.

I'll merge the exclusion patch and work on updating the docs tomorrow.

I'm inclined to just include a cumin command in the documentation, as a cookbook seems overly-specialized for this use case. Thanks!

@Scott_French Does something like https://wikitech.wikimedia.org/wiki/Ceph/Cephadm#Pooling_/_Depooling seem reasonable? If so, I'll proceed to link it in the docker-registry and switchover documentation.

Thanks, @Blake! Two thoughts:

First, it might be a good idea to highlight that this ideally shouldn't happen during a MediaWiki deployment - i.e., if one can choose a time to do this (not an emergency), then maybe coordinate in -operations to avoid that.

Second, while correct, the command you have is going to restart all of the registry services simultaneously. Given that they're fairly critical, maybe we can make that sequential instead by adding a -b 1, which will also have the effect of aborting the restart if either service fails to start during the restart.

That makes a lot of sense; I've updated the docs with those edits, thanks! I've also linked to these instructions in https://wikitech.wikimedia.org/wiki/Docker-registry/Runbook and https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Special_cases.

The service has been excluded from the switchover, and documentation has been updated.