
Image publishing via docker-pkg on build2001 repeatedly failing
Closed, ResolvedPublic

Description

At around 20:35 UTC today, I kicked off production images builds for https://gerrit.wikimedia.org/r/1116827 following the usual procedure in [0].

This progressed as normal through image builds and into the first (1 of 3) publish operation:

# /srv/deployment/docker-pkg/venv/bin/docker-pkg -c /etc/production-images/config.yaml build images/ --select '*php8.1*'               
== Step 0: scanning /srv/images/production-images/images/ ==
Will build the following images:
* docker-registry.discovery.wmnet/php8.1-cli:8.1.34-1-20250203
* docker-registry.discovery.wmnet/php8.1-fpm:8.1.34-1-20250203
* docker-registry.discovery.wmnet/php8.1-fpm-multiversion-base:8.1.34-1-20250203
== Step 1: building images ==
* Built image docker-registry.discovery.wmnet/php8.1-cli:8.1.34-1-20250203
* Built image docker-registry.discovery.wmnet/php8.1-fpm:8.1.34-1-20250203
* Built image docker-registry.discovery.wmnet/php8.1-fpm-multiversion-base:8.1.34-1-20250203
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/php8.1-fpm:8.1.34-1-20250203

Shortly after that, publishing ground to a halt, while dockerd consistently consumed slightly more than 1 CPU-second per second.

Reading through the dockerd journal, we can see that uploads are consistently failing with, e.g.

Feb 03 21:05:52 build2001 dockerd[656]: time="2025-02-03T21:05:52.403720965Z" level=error msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error"

and are retried, with a period of ~ 30m.
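These retries can be tallied straight from the journal on build2001 (e.g. `journalctl -u docker | grep -c 'Upload failed, retrying'` — the unit name is an assumption). As a sketch that runs anywhere, the same filter applied to the sample line above:

```shell
# Sample dockerd journal line from above; in practice the input would come
# from `journalctl -u docker` (the unit name is an assumption).
line='Feb 03 21:05:52 build2001 dockerd[656]: time="2025-02-03T21:05:52.403720965Z" level=error msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error"'
count=$(printf '%s\n' "$line" | grep -c 'Upload failed, retrying')
echo "retried uploads: $count"
```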

Zooming out, we can see the same thing was happening earlier today for about 7h, seemingly following an earlier image build attempt by @elukey.

Looking at that period, we can see the same repeated errors in the journal, and the same CPU usage signal: https://grafana.wikimedia.org/goto/rvH8UvKHg?orgId=1

We can also see that memory utilization is pretty wild, specifically page cache, which suggests(?) we're streaming over a lot of data, and is also consistent with the disk utilization metrics.

In any case, I've stopped my docker-pkg invocation, which will prevent further push retries from proceeding on the dockerd end - i.e., php8.1-cli and php8.1-fpm-multiversion-base remain back at 8.1.34-1-20250202 in the repository.

I don't have a great idea of how to proceed here. I don't have any visibility into why these pushes (or the earlier ones) would be particularly costly and involve so much data. The latter is also concerning, given the very limited scope of the changes that should be picked up with the images (just the new mercurius packages).

Tagging serviceops and Infrastructure-Foundations for thoughts on how to proceed here.

[0] https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Production_images

Event Timeline

I started build-production-images on build2001 in a tmux session and left it running; I found it this morning and it had completed successfully.

I kicked off another run of the same script to publish the php8.1 images, so far it seems that nothing is failing.

One thing worth mentioning is the following:

registry2004's /var/log/nginx - error.log.1

elukey@registry2004:/var/log/nginx$ grep "No space left on device" error.log.1
2025/02/03 17:37:32 [crit] 1493378#1493378: *447502 pwrite() "/var/lib/nginx/body/0000043995" failed (28: No space left on device), client: 10.192.32.77, server: , request: "PATCH /v2/amd-pytorch25/blobs/uploads/9c23a180-5d56-46f7-931b-b72feaa29d1b [..] HTTP/1.1", host: "docker-registry.discovery.wmnet"

2025/02/03 18:05:51 [crit] 1493378#1493378: *448641 pwrite() "/var/lib/nginx/body/0000044068" failed (28: No space left on device) while sending to client, client: 10.192.32.77, server: , request: "PATCH /v2/amd-pytorch25/blobs/uploads/bf44b823-09b6-476d-8886-ba692913b70d [..]  HTTP/1.1", host: "docker-registry.discovery.wmnet"

2025/02/03 21:33:57 [crit] 1493379#1493379: *459700 pwrite() "/var/lib/nginx/body/0000044547" failed (28: No space left on device), client: 10.192.32.77, server: , request: "PATCH /v2/amd-pytorch25/blobs/uploads/bd91a42b-515c-455c-b2aa-b281b3fc2cfa [..]  HTTP/1.1", host: "docker-registry.discovery.wmnet"

2025/02/03 22:02:01 [crit] 1493378#1493378: *464296 pwrite() "/var/lib/nginx/body/0000044687" failed (28: No space left on device), client: 10.192.32.77, server: , request: "PATCH /v2/amd-pytorch25/blobs/uploads/c7b8c4d1-cd82-400c-971a-cc63b250fbf3 [..]  HTTP/1.1", host: "docker-registry.discovery.wmnet"

registry2005's /var/log/nginx - error.log.1

2025/02/03 13:01:23 [crit] 1681892#1681892: *243574 pwrite() "/var/lib/nginx/body/0000027387" failed (28: No space left on device), client: 10.192.32.7, server: , request: "PATCH /v2/restricted/mediawiki-multiversion/blobs/uploads/200485ef-a3e4-4d05-81cb-f17b4126badb [..] HTTP/1.1", host: "docker-registry.discovery.wmnet"

2025/02/03 13:05:42 [crit] 1681892#1681892: *243245 pwrite() "/var/lib/nginx/body/0000027330" failed (28: No space left on device), client: 10.192.32.77, server: , request: "PATCH /v2/amd-pytorch22/blobs/uploads/7b32129c-c449-464c-8e8f-a3e050bb2993 [..] HTTP/1.1", host: "docker-registry.discovery.wmnet"

2025/02/03 14:46:16 [crit] 1681891#1681891: *252710 pwrite() "/var/lib/nginx/body/0000028492" failed (28: No space left on device), client: 10.192.32.77, server: , request: "PATCH /v2/amd-pytorch25/blobs/uploads/85e8ed50-8c28-44a5-b9e0-a17aea3093f7 [..]  HTTP/1.1", host: "docker-registry.discovery.wmnet"

2025/02/03 16:12:26 [crit] 1681892#1681892: *258000 pwrite() "/var/lib/nginx/body/0000028755" failed (28: No space left on device), client: 10.192.32.77, server: , request: "PATCH /v2/amd-pytorch25/blobs/uploads/76cb4e62-4797-45a7-8070-c6507ee42786 [..] HTTP/1.1", host: "docker-registry.discovery.wmnet"

2025/02/03 18:34:10 [crit] 1681892#1681892: *263888 pwrite() "/var/lib/nginx/body/0000029058" failed (28: No space left on device), client: 10.192.32.77, server: , request: "PATCH /v2/amd-pytorch25/blobs/uploads/25790deb-12f8-42d9-b2a9-4298f222857c? [..]  HTTP/1.1", host: "docker-registry.discovery.wmnet"

2025/02/03 22:58:56 [crit] 1681892#1681892: *280160 pwrite() "/var/lib/nginx/body/0000029582" failed (28: No space left on device), client: 10.192.32.77, server: , request: "PATCH /v2/amd-pytorch25/blobs/uploads/8246894d-7255-496e-a739-949dd1e9f13b [..] HTTP/1.1", host: "docker-registry.discovery.wmnet"
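The repository being pushed when each ENOSPC hit is recoverable from the request line of those entries. A small sketch of extracting it (the log line is one of the entries above, with the upload path shortened):

```shell
# One of the nginx error.log entries above (upload UUID shortened to ...);
# the sed expression pulls out the /v2/<repo>/ component of the request.
logline='2025/02/03 17:37:32 [crit] 1493378#1493378: *447502 pwrite() "/var/lib/nginx/body/0000043995" failed (28: No space left on device), client: 10.192.32.77, server: , request: "PATCH /v2/amd-pytorch25/blobs/uploads/... HTTP/1.1", host: "docker-registry.discovery.wmnet"'
repo=$(printf '%s\n' "$logline" | sed -n 's|.*PATCH /v2/\([^/]*\)/blobs.*|\1|p')
echo "$repo"
```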

Background: the Docker registry hosts run nginx as TLS terminator. /var/lib/nginx lives on a 4G tmpfs partition. This means that when a Docker image layer is uploaded to the registry, it is temporarily stored under /var/lib/nginx until the client has sent all the bytes, and is then proxied over to the registry daemon.

In T359067 we worked a lot to accommodate the ML use case: the Pytorch base images need a lot of Python packages, carrying AMD GPU .so libraries that take up a lot of space. We used to have 2GB of space for /var/lib/nginx, and in that task we increased it to 4GB (the current limit). Note that a Docker image layer is compressed before being sent to the registry, so we are effectively capping the maximum compressed layer size at 4GB.
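For reference, a tmpfs of that size is typically declared along these lines (illustrative only — the real hosts are puppet-managed, so the actual stanza may differ):

```shell
# Illustrative fstab-style entry for the nginx staging tmpfs:
#
#   tmpfs  /var/lib/nginx  tmpfs  rw,size=4G  0  0
#
# Usage of the staging area on a registry host can be watched with:
#
#   watch -n 5 df -h /var/lib/nginx
```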

What I think happened is that I kicked off a build that nearly filled the /var/lib/nginx dir, and anything else uploading at the same time filled it up completely, forcing nginx to return errors (hence dockerd on build2001 getting a 500).

I need to double-check the timings with @Scott_French, but I am almost sure this is what is happening.

Next steps:

  • The obvious/easy next step would be to bump the RAM allocated to the registry VMs, and the tmpfs partition with it, but that would just allow even bigger images, and we can't keep doing that forever. It's a measure of last resort that we can apply in case of real trouble, but we can hold off for the moment; see the next point.
  • Narrow down which Docker images/layers are problematic. One is surely amd-pytorch25, so I'll follow up with ML, but there may be others that need checking.

I also added a graph to the registry's dashboard to quickly monitor what's happening:

Screenshot From 2025-02-04 10-23-20.png (964×1 px, 147 KB)

After chatting with @elukey, our current status is that:

  • this will happen again, so look out for other builds being uploaded at the same time
  • we need to find the images with problematic layers

I checked the compressed image size for the updated amd-pytorch25: it is now 4.1 GB in total (related patch from last week).
I don't know about the individual layer sizes, but it is indeed growing; the previous one was 3.8 GB compressed.

4,1G Feb  4 12:17 torch25.gz
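The exact command behind that figure isn't shown in the thread; the usual way to get it is `docker save <image> | gzip`. A sketch, with synthetic input standing in for the image so the pipeline runs anywhere (the image name and tag below come from this task, but the invocation itself is an assumption):

```shell
# Presumed real invocation (needs the docker daemon, not runnable here):
#   docker save docker-registry.discovery.wmnet/amd-pytorch25:2.5.1rocm6.2-1 \
#     | gzip > torch25.gz
# Stand-in input so the compress-and-measure pipeline can be exercised:
head -c 65536 /dev/zero | gzip > /tmp/torch25-standin.gz
ls -l /tmp/torch25-standin.gz
```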

Shall I revert that patch until we figure something better out?

The image was uploaded (https://docker-registry.wikimedia.org/amd-pytorch25/tags/), but I think we are getting really close to the limit. We could temporarily stop rebuilding the pytorch images weekly, to stay on the safe side.

Mentioned in SAL (#wikimedia-operations) [2025-02-04T12:04:51Z] <elukey> manually dropped 2.5.1rocm6.2-1-20250202 on build2001 - T385531

Correction - the last tag/version, 2.5.1rocm6.2-1, caused the tmpfs partition to fill up and nginx to return 500s. I reverted Ilias' patch, removed the 2.5.1rocm6.2-1 image (docker image rm) from build2001, and re-ran build-production-images, which completed successfully.

Summary: the main issue seems to have been the pytorch base image crossing the 4GB maximum compressed Docker layer size; the change was reverted and cleaned up on build2001. As far as I can see all is good now, pending confirmation from Scott :)

Thank you all! I can confirm the 8.1.34-1-20250203 images are now present in the registry, so this seems to have succeeded on a subsequent build attempt :)

Just one note timing-wise:

I did check processes on build2001 before starting my builds around 20:30, and I didn't see anything running at the time (I might have missed something, though!). You can also see a gap in the dockerd journal between ~ 18:30 and 20:30, which also suggests nothing was happening.

That does make me a bit concerned that this can happen even without concurrent image pushes (and these images should be well under 4GiB compressed). Two thoughts:

  • Is it possible that the tmpfs staging area can be "polluted" in some way by earlier failures? (e.g., those visible in the journal through 18:30 UTC)
  • Is cleanup of the tmpfs staging area somehow asynchronous? (i.e., if pushing multiple images, do we need to wait longer between them)

I think the issue wasn't only concurrent docker layers being pushed to the registry (that is a concern, though), but more likely one of the new Pytorch image layers crossing the 4GB limit. If you check the graph I added in T385531#10520075 you'll see incremental use of the tmpfs partition until it reaches 100%, forcing nginx to return 500 to dockerd. It takes a long time to push all that data, and I think build2001's dockerd does one image layer push at a time (so nothing else runs in parallel). The ML team reverted the change, I removed it from the local docker build cache on build2001, and everything worked nicely.
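One way to spot which layer crosses the limit is the per-layer `size` fields in the image manifest (a schema2 manifest, fetched with `curl -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' https://docker-registry.wikimedia.org/v2/amd-pytorch25/manifests/<tag>`). A sketch against a canned fragment, since the real call needs registry access — the sizes below are invented for illustration:

```shell
# Canned schema2-manifest fragment (sizes invented for illustration);
# grep/cut pull out the compressed byte size of each layer.
manifest='{"layers":[{"size":123456789},{"size":4200000000}]}'
sizes=$(printf '%s\n' "$manifest" | grep -o '"size":[0-9]*' | cut -d: -f2)
echo "$sizes"
```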

To answer your questions:

> That does make me a bit concerned that this can happen even without concurrent image pushes (and these images should be well under 4GiB compressed). Two thoughts:

> • Is it possible that the tmpfs staging area can be "polluted" in some way by earlier failures? (e.g., those visible in the journal through 18:30 UTC)

I think that nginx automatically cleans up everything, so it shouldn't be a concern, but I am not 100% sure.

> • Is cleanup of the tmpfs staging area somehow asynchronous? (i.e., if pushing multiple images, do we need to wait longer between them)

Not sure, but from the graphs of tmpfs usage it seems that nginx quickly frees the allocated space once a request finishes (for example, after tmpfs reaches 100% and nginx returns HTTP 500 to dockerd).

Ah, thanks for clarifying @elukey. It didn't occur to me that the layers being pushed (and failing) might not have been the ones for the images I had just built. I wonder if there's some way to surface that better (I'll give that some thought).

elukey claimed this task.

Closing since the problem was related to a new ML image that was rolled back.