At around 20:35 UTC today, I kicked off production image builds for https://gerrit.wikimedia.org/r/1116827 following the usual procedure in [0].
This progressed as normal through image builds and into the first (1 of 3) publish operation:
# /srv/deployment/docker-pkg/venv/bin/docker-pkg -c /etc/production-images/config.yaml build images/ --select '*php8.1*'
== Step 0: scanning /srv/images/production-images/images/ ==
Will build the following images:
* docker-registry.discovery.wmnet/php8.1-cli:8.1.34-1-20250203
* docker-registry.discovery.wmnet/php8.1-fpm:8.1.34-1-20250203
* docker-registry.discovery.wmnet/php8.1-fpm-multiversion-base:8.1.34-1-20250203
== Step 1: building images ==
* Built image docker-registry.discovery.wmnet/php8.1-cli:8.1.34-1-20250203
* Built image docker-registry.discovery.wmnet/php8.1-fpm:8.1.34-1-20250203
* Built image docker-registry.discovery.wmnet/php8.1-fpm-multiversion-base:8.1.34-1-20250203
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/php8.1-fpm:8.1.34-1-20250203
Shortly after that, publishing ground to a halt, while dockerd consistently consumed slightly more than ~1 CPU-second per second.
Reading through the dockerd journal, we can see that uploads are consistently failing with, e.g.
Feb 03 21:05:52 build2001 dockerd[656]: time="2025-02-03T21:05:52.403720965Z" level=error msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error"
and are retried with a period of ~30 minutes.
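To confirm the retry cadence, the journal can be filtered for the failure message and reduced to timestamps. A minimal sketch, using two sample lines in the journal's format (`dockerd.log` is a hypothetical capture of `journalctl -u docker.service --no-pager` output from the build host):

```shell
# Write two sample journal lines in the observed format (stand-ins for a real capture).
printf '%s\n' \
  'Feb 03 21:05:52 build2001 dockerd[656]: time="2025-02-03T21:05:52Z" level=error msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error"' \
  'Feb 03 21:35:54 build2001 dockerd[656]: time="2025-02-03T21:35:54Z" level=error msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error"' \
  > dockerd.log

# Extract only the timestamps of the retry failures; gaps between
# successive lines show the retry period (~30m here).
grep 'Upload failed, retrying' dockerd.log | awk '{print $1, $2, $3}'
```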
Zooming out, we can see the same thing was happening earlier today for about 7h, seemingly following an earlier image build attempt by @elukey.
Looking at that period, we can see the same repeated errors in the journal, and the same CPU usage signal: https://grafana.wikimedia.org/goto/rvH8UvKHg?orgId=1
We can also see that memory utilization is pretty wild, specifically page cache, which suggests(?) we're streaming a lot of data; that is also consistent with the disk utilization metrics.
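One cheap way to corroborate the page-cache signal on the build host itself (as opposed to Grafana) is to spot-check `/proc/meminfo` before and during a push. A sketch, assuming a Linux host; `Cached` is the page cache and `Dirty` is data waiting to be written back:

```shell
# Spot-check page cache and dirty pages; repeat during a push to see growth.
grep -E '^(Cached|Dirty):' /proc/meminfo
```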
In any case, I've stopped my docker-pkg invocation, which will prevent further push retries from proceeding on the dockerd end - i.e., php8.1-cli and php8.1-fpm-multiversion-base remain back at 8.1.34-1-20250202 in the repository.
I don't have a great idea of how to proceed here. I don't have any visibility into why these pushes (or the earlier ones) would be so costly and involve so much data. The volume of data is especially concerning, given the very limited scope of the changes that should be picked up in these images (just the new mercurius packages).
Tagging serviceops and Infrastructure-Foundations for thoughts on how to proceed here.
[0] https://wikitech.wikimedia.org/wiki/Kubernetes/Images#Production_images
