Latest instance, job #1687:
# …
14:09:38 [mediawiki-publish-83] sent 6,260,716 bytes received 19,893 bytes 2,512,243.60 bytes/sec
14:09:38 [mediawiki-publish-83] total size is 9,401,368,757 speedup is 1,496.89
14:09:39 [mediawiki-publish-83] Commit docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2026-04-02-140931-publish-83
14:09:39 [mediawiki-publish-83] Running docker commit -c ENTRYPOINT ["/usr/sbin/php-fpm8.3", "--nodaemonize", "--fpm-config", "/etc/php/8.3/fpm/php-fpm.conf"] -c LABEL vnd.wikimedia.builder.name="scap" -c LABEL vnd.wikimedia.builder.version="4.243.0" -c LABEL vnd.wikimedia.scap.stage_dir="/srv/mediawiki-staging" -c LABEL vnd.wikimedia.scap.build_state_dir="/srv/mediawiki-staging/scap/image-build" -c LABEL vnd.wikimedia.mediawiki.versions="1.46.0-wmf.21,1.46.0-wmf.22" -c LABEL vnd.wikimedia.build-type=incremental -c LABEL vnd.wikimedia.parent-image=docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2026-04-02-135907-publish-83 rsync-2719044684066940925 docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2026-04-02-140931-publish-83
14:09:39 [mediawiki-publish-83] sha256:2daddb008c9fc63de8fba6a5084044250cebac2cd7ea5cbb937fd3206edb96dc
14:09:39 [mediawiki-publish-83] Pushing docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2026-04-02-140931-publish-83
14:09:39 [mediawiki-publish-83] Running sudo /usr/local/bin/docker-pusher -q docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2026-04-02-140931-publish-83
14:10:32 [mediawiki-publish-83] blob upload unknown
14:10:32 [mediawiki-publish-83] Traceback (most recent call last):
# (snip, Python-side traceback is uninteresting I think – it just complains that the command exited nonzero)
Previous SpiderPigs: #1684, #1685, #1686.
This is currently blocking all deployments, including the fix for UBN T422143.
Trigger
Repooling apus.discovery.wmnet in codfw (the active DC for the docker registry) on day 8 of the DC switchover caused the s3 driver in the apus-backed registry instances (such as the one for MediaWiki restricted images) to issue operations against a mix of apus backends in eqiad and codfw. Because cross-DC replication is asynchronous, this caused persistent consistency issues that lasted until the registry instances were restarted, which cleared their cached connections to apus. See T422166#11783170.
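To illustrate the failure mode described above, here is a toy model of a client with cached connections to two sites that replicate asynchronously. It is not the registry's or the s3 driver's actual code: the class names, the round-robin choice of connection, and the explicit `replicate` step are all assumptions made for illustration.

```python
class Backend:
    """Hypothetical object-store site that tracks known upload sessions."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()


def replicate(src, dst):
    """Stand-in for asynchronous cross-DC replication catching up."""
    dst.sessions |= src.sessions


class CachedClient:
    """Client holding a fixed pool of connections, used in rotation.

    The rotation is an assumption; the point is only that consecutive
    operations for one upload can land on different sites.
    """

    def __init__(self, backends):
        self.pool = list(backends)
        self.cursor = 0

    def _pick(self):
        backend = self.pool[self.cursor % len(self.pool)]
        self.cursor += 1
        return backend

    def start_upload(self, upload_id):
        backend = self._pick()
        backend.sessions.add(upload_id)
        return backend.name

    def continue_upload(self, upload_id):
        backend = self._pick()
        if upload_id not in backend.sessions:
            return "blob upload unknown"
        return "ok"


eqiad, codfw = Backend("eqiad"), Backend("codfw")

# Mixed pool: the upload starts on one site and continues on the other
# before replication has caught up.
mixed = CachedClient([eqiad, codfw])
mixed.start_upload("u1")
mixed_result = mixed.continue_upload("u1")

# After a restart the pool only holds connections to a single site.
restarted = CachedClient([codfw])
restarted.start_upload("u2")
restarted_result = restarted.continue_upload("u2")
```

In this model the mixed pool fails with the same message seen in the log, while a pool pinned to one site does not, which matches the observation that restarting the registry instances resolved the issue.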
Near-term mitigation (this task)
- The apus service will be excluded from the switchover (i.e., apus.discovery.wmnet will not be touched on day 1 or 8).
- The manual switchover process for apus (e.g., for DC-wide maintenance or a proper disaster scenario) will be documented and cross-linked in the relevant locations linked in T422166#11808965. Following a change in pooled state on the discovery service, we need to restart docker-registry-restricted.service and docker-registry-ml.service on A:docker-registry hosts in a safe, paced way (e.g., one at a time). This could be achieved with a cumin command or a dedicated cookbook.
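The paced restart in the second item could look roughly like the sketch below. It is not an existing cookbook or cumin invocation: the function names, the SSH-based runner, and the 30-second pause are assumptions, and only the two service names come from this task.

```python
import subprocess
import time

# Service names are taken from the task; everything else is hypothetical.
SERVICES = (
    "docker-registry-restricted.service",
    "docker-registry-ml.service",
)


def ssh_restart(host, service):
    """Hypothetical runner: restart one unit on one host over SSH.

    check=True raises on a nonzero exit, which stops the rollout early.
    """
    subprocess.run(
        ["ssh", host, "sudo", "systemctl", "restart", service],
        check=True,
    )


def rolling_restart(hosts, services=SERVICES, run=ssh_restart,
                    pause_seconds=30.0, sleep=time.sleep):
    """Restart the given services one host at a time, pausing between hosts.

    Returns the (host, service) pairs that were restarted, in order.
    """
    hosts = list(hosts)
    done = []
    for index, host in enumerate(hosts):
        for service in services:
            run(host, service)
            done.append((host, service))
        # Pause between hosts, but not after the last one.
        if index < len(hosts) - 1:
            sleep(pause_seconds)
    return done
```

Passing a different `run` callable (for example, one that only prints the host and service) gives a dry run; the host list itself would come from resolving A:docker-registry, which is left out here.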
Long-term fix (not in scope for this task)
We have reason to believe that the overly aggressive connection caching in the s3 driver should go away in later versions of docker registry. Upgrading is blocked on retaining support for swift, and will likely not see action until next FY.