
failed to push docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-10-03-192132-3003d4328df66a0086a350fdd2ba1dbd80a235c5: unknown: blob upload invalid
Closed, Duplicate · Public

Description

@amastilovic has reported that pushing an airflow-dags image is repeatedly failing:

https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/635769
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/635747
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/635737

ERROR: failed to push docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-10-03-192132-3003d4328df66a0086a350fdd2ba1dbd80a235c5: unknown: blob upload invalid

This smells like T390251 to me.

Event Timeline

In the two cases I spot-checked on registry2004 - around 19:32 and 20:53 UTC respectively - these both appear to be err.code="blob upload invalid" err.detail="blob invalid length" on the PUT of a two-phase monolithic blob upload.
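For context, the two-phase monolithic upload being described is the OCI distribution-spec sequence: a POST opens an upload session (returning a Location with the upload UUID), then a single PUT carries the entire blob plus its digest, and the registry validates the received byte count against Content-Length at that point — a mismatch is what surfaces as "blob upload invalid"/"blob invalid length". A minimal sketch of the two requests as data (hypothetical helper; the upload UUID and blob are illustrative):

```python
import hashlib

REGISTRY = "https://docker-registry.discovery.wmnet"
REPO = "repos/data-engineering/airflow-dags"

def monolithic_upload_requests(blob: bytes, upload_uuid: str):
    """Build the two HTTP requests of a monolithic blob upload.

    Phase 1: POST /v2/<repo>/blobs/uploads/ opens an upload session.
    Phase 2: PUT <location>?digest=<digest> sends the whole blob; the
    registry checks the bytes written against Content-Length and the
    digest at commit time, all within the span of this single PUT.
    """
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    post = {
        "method": "POST",
        "url": f"{REGISTRY}/v2/{REPO}/blobs/uploads/",
    }
    put = {
        "method": "PUT",
        "url": f"{REGISTRY}/v2/{REPO}/blobs/uploads/{upload_uuid}",
        "params": {"digest": digest},
        "headers": {
            "Content-Length": str(len(blob)),
            "Content-Type": "application/octet-stream",
        },
        "body": blob,
    }
    return post, put
```

The key contrast with chunked uploads (discussed further below in this task) is that here the data transfer and the length/digest validation happen in one request.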

Example logs for the 20:53 occurrence: P83596. A couple of notes:

  • I've added comments to annotate what appears to be happening.
  • I've pruned extraneous log lines unrelated to this upload attempt. Notably, there are a number of concurrent blob uploads that appear to be happening as part of the same push.
  • Unfortunately, this does not tell us what blob length was observed by the registry, nor do the nginx access logs (not included) appear to capture request size.

In this particular case, the affected blob is sha256:070addc8e01691d23f89009ae06052bd55421ae270d5dd15414f5ec2cd9f58f0, which is different from 19:32, but unsurprising even if they're logically the "same" (both seem to take around the same time to upload, so are presumably around the same size).

@dancy - Have we changed anything about the buildkit version used in the runners recently? These identify themselves as the buildkit/v0.22-dev user-agent.

> @dancy - Have we changed anything about the buildkit version used in the runners recently? These identify themselves as the buildkit/v0.22-dev user-agent.

I upgraded the gitlab-cloud-runner's buildkit this week, but these jobs ran on the Trusted Runners which I have not upgraded yet.

> • I've pruned extraneous log lines unrelated to this upload attempt. Notably, there are a number of concurrent blob uploads that appear to be happening as part of the same push.

Probably related. Replication load.

> • Unfortunately, this does not tell us what blob length was observed by the registry, nor do the nginx access logs (not included) appear to capture request size.

Indeed. Exactly what we see in T390251 where Swift is seeing an inconsistent view of a blob.

Thanks for the follow-up, @dancy.

FWIW, I just combed the codfw swift-fe logs for the 20:53 minute, and I can't see anything unusual happening per se - i.e., I just know that a series of sensible-looking operations succeeded on the relevant _uploads sub-path for the failed b9ce4c98-1f88-4116-a381-5e2ad9a4fc17 upload.

That said, I do see two hashstates updates, presumably corresponding to the start and end of the upload - hashstates/sha256/0 and hashstates/sha256/469066129 - the latter presumably telling us the size of the blob (i.e., final offset).
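Assuming the final hashstates path component does encode the byte offset reached (an inference here, not something the logs state outright), the arithmetic puts the blob just under 450 MiB:

```python
# Offsets taken from the two observed hashstates updates:
#   hashstates/sha256/0          -> start of upload
#   hashstates/sha256/469066129  -> final offset, i.e. presumed blob size
final_offset = 469_066_129

size_mib = final_offset / 2**20  # bytes -> MiB
print(f"{size_mib:.1f} MiB")     # ≈ 447.3 MiB
```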

If that's true, then it's curious that a < 450 MiB blob would be sufficiently large to trigger the Swift-consistency issue we've seen previously, but only exceptionally so.

In any case, as usual, I find myself wishing blobWriter.validateBlob actually logged what StorageDriver.Stat returned for the size when it does not match the descriptor.
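To make the wish concrete, here is a hypothetical sketch (in Python for brevity; the real blobWriter.validateBlob is Go in docker/distribution) of the kind of logging that would answer the question — emitting the observed Stat size alongside the expected descriptor size on mismatch:

```python
import logging

log = logging.getLogger("registry.blobwriter")

def validate_blob_size(descriptor_size: int, stat_size: int, digest: str) -> bool:
    """Illustrative size check with diagnostic logging.

    The actual validation only reports that the lengths differ
    ("blob invalid length"); also logging the size the storage driver
    observed would tell us whether Swift served a short/stale view.
    """
    if stat_size != descriptor_size:
        log.error(
            "blob invalid length: digest=%s descriptor.Size=%d stat.Size=%d",
            digest, descriptor_size, stat_size,
        )
        return False
    return True
```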

@amastilovic - Have you subsequently been able to successfully build / push the docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags image?

It's a long road to migrating the registry from Swift to apus Ceph as the long-term solution for T390251, and even that is currently only focused on the /restricted prefix of the namespace (though that could change).

Basically, if this wasn't just transient Swift-cluster "weather" resulting in unusually slow metadata consistency convergence, we'll have to explore other tactical solutions (e.g., figuring out a strategy for deploying @dancy's Swift-driver improvements).

> @amastilovic - Have you subsequently been able to successfully build / push the docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags image?

I asked him about this on Slack yesterday and he reported that the problem still persists.

> @amastilovic - Have you subsequently been able to successfully build / push the docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags image?

As @dancy already noted, I tried yesterday with no success. I tried again this morning, however, and the job succeeded: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/638531

Not sure what to make of all this, except maybe to hope that I won't have to wait a couple of days between successful builds. I wonder if the problem would have magically resolved if I simply continued restarting the build job over and over again instead of spreading it out over a couple of days.

@amastilovic - Thanks for the follow-up. Glad to hear this worked eventually, but very much agreed that it should not take days.

@dancy - From the registry logs, it seems like the failing blob upload was attempted only once, before the entire push was failed. Do you know if retries are supported in buildkit, and if so, whether there's a way to enable that here? If I remember correctly, a push via dockerd will retry a failed blob upload a couple of times, which occasionally gets us out of Swift-related trouble when uploading large MediaWiki image layer blobs during deployments.

Two other differences that come to mind vs. the MediaWiki image push-via-dockerd case (mostly noting for completeness, not as workarounds):

  • Upload type - I believe the latter uses what are technically chunked uploads, despite sending only a single chunk PATCH in practice. Notably, that does split the segment writes and the commit (where the size check happens) into two separate phases (i.e., possibly allowing the dust to settle a bit consistency-wise), rather than doing it all in the span of a single PUT as we have here.
  • Concurrency - My vague recollection is that in the latter case, dockerd is only uploading one blob at a time, but that's definitely not the case here.
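If the concurrency difference does matter, one tactical client-side mitigation would be to cap simultaneous blob uploads, dockerd-style. A generic sketch using a semaphore (not buildkit's actual mechanism; `upload_one` is a hypothetical per-blob upload coroutine):

```python
import asyncio

MAX_CONCURRENT_UPLOADS = 1  # dockerd-like serialization; a real pusher may use more

async def push_blobs(blobs, upload_one):
    """Upload blobs with bounded concurrency.

    A semaphore ensures at most MAX_CONCURRENT_UPLOADS uploads are in
    flight, reducing simultaneous write pressure on the backing store.
    """
    sem = asyncio.Semaphore(MAX_CONCURRENT_UPLOADS)

    async def guarded(blob):
        async with sem:
            return await upload_one(blob)

    return await asyncio.gather(*(guarded(b) for b in blobs))
```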

Of course, none of this is an explanation for why the known-faulty Swift storage driver is suddenly behaving worse than before. Had this happened between 9/23 and 10/2, I would have considered pointing at the additional load on Swift in codfw while eqiad was depooled, but that clearly does not align. Swift in codfw is a bit busier in terms of mutating operations now that it's the primary datacenter, but it's hard to quantify how much of an impact that might have.

> @dancy - From the registry logs, it seems like the failing blob upload was attempted only once, before the entire push was failed. Do you know if retries are supported in buildkit, and if so, whether there's a way to enable that here?

Buildkit configuration docs don't show any setting for configurable retries.

https://github.com/moby/buildkit/blob/master/util/push/push.go#L48 seems to be the function used for pushing in buildkit (based on the "pushing layers" message at line 130). Line 108 shows the creation of a retryhandler, which is defined in https://github.com/moby/buildkit/blob/master/util/resolver/retryhandler/retry.go. The retry decision is made in retryError on line 54; it does look like it intends to return true if it gets a 5xx error.
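The decision logic described above can be approximated like this (a sketch, not buildkit's exact code, which also inspects error strings). Notably, if the registry reports "blob upload invalid" with a 4xx status, a 5xx-only policy would never retry it — which would be consistent with the single attempt seen in the registry logs:

```python
import time

def should_retry(status_code: int) -> bool:
    """Retry server-side (5xx) failures; treat 4xx as permanent."""
    return 500 <= status_code <= 599

def push_with_retry(do_push, attempts: int = 3, backoff: float = 1.0):
    """Call do_push() until success, a non-retryable failure,
    or the attempt budget is exhausted; back off exponentially."""
    for attempt in range(attempts):
        status = do_push()
        if status < 400:
            return status
        if not should_retry(status) or attempt == attempts - 1:
            return status
        time.sleep(backoff * 2**attempt)
```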

In case it helps, I've got another "blob upload invalid" error, but from a different repo:
https://gitlab.wikimedia.org/repos/data-engineering/spark/-/jobs/693467

error: failed to solve: failed to push docker-registry.discovery.wmnet/repos/data-engineering/spark:3.5.7-2025-12-03-170322-8b2635bddd041c7d618af2ff173c6666e3d6ace9: unknown: blob upload invalid

> It's a long road to migrating the registry from Swift to apus Ceph as the long-term solution for T390251, and even that is currently only focused on the /restricted prefix of the namespace (though that could change).

Do you know if there have been updates on the testing front? I don't see much in T390251, but my understanding is that we are not yet using apus as the backend for /restricted, right? I'd like to restart that conversation; we really need to get away from Swift.

> Basically, if this wasn't just transient Swift-cluster "weather" resulting in unusually slow metadata consistency convergence, we'll have to explore other tactical solutions (e.g., figuring out a strategy for deploying @dancy's Swift-driver improvements).

I am suggesting that we really focus on tactical solutions, because the cumulative time spent on this problem is already very high. I was in favor of proceeding with the apus solution rather than patching the registry's Swift driver, but at this point we may want to reconsider that option. I'll follow up in the other task; I think we really need a working group for this.

jijiki added a project: ServiceOps new.
jijiki removed a project: serviceops.
jijiki added a project: Kubernetes.
jijiki changed the task status from Open to Stalled. (Thu, Jan 22, 2:06 PM)
jijiki triaged this task as High priority.

Since this is fundamentally the same class of failure mode already tracked in T390251, I am going to close this as a duplicate of that task, treating it as canonical.

I'll carry over some points of note to that task in order to make sure they're not lost, though.