
Post-merge build failed due to Internal Server Error
Closed, Resolved (Public)


The recommendation-api post-merge build for this patch failed with "unexpected HTTP status: 500 Internal Server Error", as shown in the Jenkins post-merge build log.

We have tested the production image (created by Blubber) that is supposed to be published to the Wikimedia Docker registry, and it builds fine locally:

$ time docker build -t recommendation-api-prod20230718 .


Successfully built cff630cecacf
Successfully tagged recommendation-api-prod20230718:latest

real	6m11.289s
user	0m0.152s
sys	0m0.176s
$ docker images
REPOSITORY                                       TAG              IMAGE ID       CREATED          SIZE
recommendation-api-prod20230718                  latest           cff630cecacf   30 minutes ago   4.6GB

@elukey restarted the CI build, hoping that the issue was transient, but we got the same error, as shown in this log.

What could be causing this Internal Server Error in the post-merge build?

Event Timeline

On registry2003 I can only see this error:

level=error msg="response completed with error" err.code="name unknown" err.detail="map[name:wikimedia/research-recommendation-api]" err.message="repository name not known to registry"

@hashar hi! Are we missing any config in the integration repo by any chance?

I built the image on contint1002 today. The image is 4.6GB, and one of its layers is 4.21GB. I tried pushing to the registry, but it kept failing while pushing the 4.21GB layer.

I browsed through the nginx logs and found:

program:input-file-registry-nginx-access message: - ci-build [21/Jul/2023:16:17:20 +0000] "PATCH /v2/wikimedia/research-recommendation-api/blobs/uploads/d5a44664-d169-4552-b7f9-c08291670bae?_state=3cQcQ3IB-iq_IIZ4rnWKasAMOM6I5vwHzIRR_owNzVp7Ik5hbWUiOiJ3aWtpbWVkaWEvcmVzZWFyY2gtcmVjb21tZW5kYXRpb24tYXBpIiwiVVVJRCI6ImQ1YTQ0NjY0LWQxNjktNDU1Mi1iN2Y5LWMwODI5MTY3MGJhZSIsIk9mZnNldCI6MCwiU3RhcnRlZEF0IjoiMjAyMy0wNy0yMVQxNjowNDoyNy40MDQ0OTM5NzdaIn0%3D HTTP/1.1" 500 193 "-" "docker/20.10.12 go/go1.16.12 git-commit/459d0df kernel/4.19.0-23-amd64 os/linux arch/amd64

program:input-file-registry-nginx-error message:2023/07/21 16:17:19 [crit] 4999#4999: *33472 pwritev() "/var/lib/nginx/body/0000000827" has written only 7884 of 8184, client:, server: , request: "PATCH /v2/wikimedia/research-recommendation-api/blobs/uploads/d5a44664-d169-4552-b7f9-c08291670bae?_state=3cQcQ3IB-iq_IIZ4rnWKasAMOM6I5vwHzIRR_owNzVp7Ik5hbWUiOiJ3aWtpbWVkaWEvcmVzZWFyY2gtcmVjb21tZW5kYXRpb24tYXBpIiwiVVVJRCI6ImQ1YTQ0NjY0LWQxNjktNDU1Mi1iN2Y5LWMwODI5MTY3MGJhZSIsIk9mZnNldCI6MCwiU3RhcnRlZEF0IjoiMjAyMy0wNy0yMVQxNjowNDoyNy40MDQ0OTM5NzdaIn0%3D HTTP/1.1", host: "docker-registry.discovery.wmnet" @timestamp:Jul 21, 2023 @ 16:17:20.139 facility:local0 logsource:registry2004 normalized_message:2023/07/21 16:17:19 [crit] 4999#4999: *33472 pwritev() "/var/lib/nginx/body/0000000827" has written only 7884 of 8184, client:, server: , request: "PATCH /v2/wikimedia/research-recommendation-api/blobs/uploads/d5a44664-d169-4552-b7f9-c08291 @version:1 host:registry2004 type:syslog tags:rsyslog-shipper, kafka, es, syslog, es, normalized_message_trimmed timestamp:2023-07-21T16:17:19.991797+00:00 level:NOTICE _id:okQ9eYkBZQQsSJrNzE-a _type: - _index:logstash-syslog-1-7.0.0-1-2023.07.21 _score: -
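The key detail in the error entry above is the short write: pwritev() wrote only 7884 of 8184 bytes to a file under /var/lib/nginx/body, which usually points at the filesystem filling up. As a minimal sketch (the function name short_writes is made up for illustration, and it assumes the message format shown above), something like this can pull those numbers out of a pile of nginx error lines:

```shell
# Hypothetical helper: read nginx error log lines on stdin and print
# "WRITTEN EXPECTED PATH" for every pwritev() short-write message.
short_writes() {
    # Capture the temp file path, the bytes written, and the bytes expected.
    sed -n 's/.*pwritev() "\([^"]*\)" has written only \([0-9]*\) of \([0-9]*\).*/\2 \3 \1/p'
}

# Example: feed it one line in the format quoted above.
printf '%s\n' '2023/07/21 16:17:19 [crit] 4999#4999: *33472 pwritev() "/var/lib/nginx/body/0000000827" has written only 7884 of 8184, client:' | short_writes
```

Lines that do not match the pattern are dropped, so the helper can be run over a whole error log.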

I'm wondering how much free space is available in /var/lib/nginx/body on the nginx server.

Digging up T288198 reveals that /var/lib/nginx is hosted on a 2GB tmpfs. That would explain the problem: nginx buffers the upload body there, and a 4.21GB layer cannot fit. I'll try to revive the discussion on that ticket.
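For anyone who wants to check this kind of thing on a host they have access to, here is a rough sketch that compares a payload size against the free space on the filesystem backing a directory. The function name fits_in_buffer is hypothetical, and the byte count in the usage comment assumes "4.21GB" means decimal gigabytes:

```shell
# Hypothetical helper: fits_in_buffer BYTES DIR
# Succeeds (exit 0) if the filesystem holding DIR has at least BYTES free.
fits_in_buffer() {
    need_bytes=$1
    dir=$2
    # df -P prints stable POSIX columns; -k reports 1K blocks.
    # Field 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$((avail_kb * 1024))" -ge "$need_bytes" ]
}

# Usage on a registry host (assumed path and size, not run here):
#   fits_in_buffer 4210000000 /var/lib/nginx/body || echo "layer will not fit"
```

On a 2GB tmpfs this check would fail for the 4.21GB layer even when the tmpfs is completely empty.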

Thank you for digging into this, @dancy! Enabling users to push images with layers larger than 2GB (~4.21GB in our case) to the docker-registry will unblock us on this issue.

To reduce the image layer sizes, in T343576 we now store the ~2.8GB recommendation-api embedding in Swift and fetch it from there, as recommended in T288198#9037109. With that change, the post-merge build succeeds.

kevinbazira claimed this task.