kokkuri cannot publish "public" images from WMCS runners due to a lack of a local registry
Closed, Resolved | Public | BUG REPORT

Description

In https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/jobs/536780 I tried switching from the default DigitalOcean runners to the WMCS runners. This seems to have triggered the error "KOKKURI_REGISTRY_PUBLIC must be set to publish an image" and caused the job to fail.

Details

Related Changes in Gerrit:
Related Changes in GitLab:

| Title | Reference | Author | Source Branch | Dest Branch |
| --- | --- | --- | --- | --- |
| reggie-values.yaml.tftpl: Allow write access from anywhere | repos/releng/gitlab-cloud-runner!503 | dancy | main-Ieb7060f5720ba4e08757619cba950c531277cb66 | main |
| Revert "ingress-nginx.yaml.tftpl: Enable proxy protocol again" | repos/releng/gitlab-cloud-runner!502 | dancy | main-Iaa934b8e3b4f34efa845050ee0e8dc6dbdce7ce4 | main |
| ingress-nginx.yaml.tftpl: Enable proxy protocol again | repos/releng/gitlab-cloud-runner!501 | dancy | main-Ie01d309acfea88603c3c86f9318b42a2da699e5c | main |
| ingress-nginx.yaml.tftpl: Disable PROXY protocol stuff | repos/releng/gitlab-cloud-runner!500 | dancy | main-I4af4fa52cde765cbf3c396b82c9dc048e97167ae | main |
| Revert "reggie-values.yaml.tftpl: Enable jwt.debug" | repos/releng/gitlab-cloud-runner!499 | dancy | main-I7bbc03070288d0f1ca9ced6dda62ccf1c8cbe7ed | main |
| ingress-nginx.yaml.tftpl: Drop `use-forwarded-headers: "true"` | repos/releng/gitlab-cloud-runner!498 | dancy | main-I24fb6a020a026999d08bec6a7d0c50b5c8fb7eb7 | main |
| ingress-nginx.yaml.tftpl: Enable proxy protocol | repos/releng/gitlab-cloud-runner!497 | dancy | main-I20b28927f3c4a70fc0d3f2aa0c003ba4504e0746 | main |
| reggie-values.yaml.tftpl: Enable jwt.debug | repos/releng/gitlab-cloud-runner!495 | dancy | main-I5e5b909bb38ec71b4a6aa05cf089209e4ef86c7d | main |
| reggie: Allow POST from arbitrary subnets | repos/releng/gitlab-cloud-runner!494 | bd808 | work/bd808/reggie-from-wmcs | main |

Event Timeline

The only change from the failing https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/jobs/536780 to the passing https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/jobs/536783 was removing the wmcs tag. This reinforces my superstition that there is some problem using kokkuri from the WMCS runners.

Only trusted runners are allowed to publish to the production docker registry. The wmcs runners are not trusted.

To clarify, the gitlab-cloud-runners do have a registry available for hosting temporary images. There is no corresponding thing in wmcs.

bd808 renamed this task from kokkuri doesn't seem to work from WMCS runners to kokkuri cannot publish "public" images from WMCS runners due to a lack of a local registry. (Jun 18 2025, 4:41 PM)
bd808 updated the task description.

In another version of this discussion @dancy wondered if we could just use the registry.cloud.releng.team Reggie instance. I am trying that out in https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/jobs/604835 by setting KOKKURI_REGISTRY_PUBLIC: registry.cloud.releng.team in the job's environment variables:

2025-09-03 22:35:18,446 Command '['buildctl', '--timeout', '3600', '--wait', 'build', '--progress=plain', '--frontend=gateway.v0', '--opt', 'source=docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v1.3.0', '--opt', 'filename=.pipeline/blubber.yaml', '--opt', 'target=deployer', '--opt', 'context=https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning.git#work/bd808/wmcs-runners', '--opt', 'dockerfile=https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning.git#work/bd808/wmcs-runners', '--output', 'type=image,"name=registry.cloud.releng.team/repos/releng/zuul/tofu-provisioning:job-604835",push=true']' returned non-zero exit status 1.

Relevant parts of the log prior to the final buildctl failure:

#18 [downloads] 🖥️ @65533 $ [script@66acb954]
  0     0    0     0    0     0      0      0 --:--:--  0:02:55 --:--:--     0
#18 ...
#18 [downloads] 🖥️ @65533 $ [script@66acb954]
  0     0    0     0    0     0      0      0 --:--:--  0:05:00 --:--:--     0
#18 301.1 curl: (28) SSL connection timeout
#18 ERROR: process "/66acb95481ceee2a9130647abfb0531592242bbf3f3a4fe9b746e4cdb5968c35/script" did not complete successfully: exit code: 28
------
 > [downloads] 🖥️ @65533 $ [script@66acb954]:
  0     0    0     0    0     0      0      0 --:--:--  0:05:00 --:--:--     0
301.1 curl: (28) SSL connection timeout
------
error: failed to solve: process "/66acb95481ceee2a9130647abfb0531592242bbf3f3a4fe9b746e4cdb5968c35/script" did not complete successfully: exit code: 28

I think my build in T396924#11146064 was failing because of GitHub download rate limits. https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/jobs/604860 produced this error, which is more explanatory:

------
 > exporting to image:
------
error: failed to solve: failed to push registry.cloud.releng.team/repos/releng/zuul/tofu-provisioning:job-604860: unexpected status from POST request to https://registry.cloud.releng.team/v2/repos/releng/zuul/tofu-provisioning/blobs/uploads/: 403 Forbidden
2025-09-03 23:06:31,949 Command '['buildctl', '--timeout', '3600', '--wait', 'build', '--progress=plain', '--frontend=gateway.v0', '--opt', 'source=docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v1.3.0', '--opt', 'filename=.pipeline/blubber.yaml', '--opt', 'target=deployer', '--opt', 'context=https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning.git#work/bd808/wmcs-runners', '--opt', 'dockerfile=https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning.git#work/bd808/wmcs-runners', '--output', 'type=image,"name=registry.cloud.releng.team/repos/releng/zuul/tofu-provisioning:job-604860",push=true']' returned non-zero exit status 1.

I think this means that the JWT auth to the registry failed?

@thcipriani found the same bits for me. So I guess we could poke a hole for 185.15.56.1 (nat.cloudgw.eqiad1.wikimediacloud.org), or we could set up a new registry in WMCS somewhere.

Change #1184830 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] buildkitd.toml.erb: Temporarily enable debug

https://gerrit.wikimedia.org/r/1184830

Change #1184830 abandoned by Ahmon Dancy:

[operations/puppet@production] buildkitd.toml.erb: Temporarily enable debug

Reason:

No longer needed

https://gerrit.wikimedia.org/r/1184830

@bd808 Pushes from WMCS to registry.cloud.releng.team and registry.staging.cloud.releng.team should be working now.

Looks like the changes have broken registry.cloud.releng.team. I will revert and try again tomorrow.

Nevermind. Things started working as expected after a while.

Hrm, I got this trying a push:

> exporting to image:
------
error: failed to solve: failed to push registry.cloud.releng.team/thcipriani/catalyst-ci-client:job-606170: failed to do request: Head "https://registry.cloud.releng.team/v2/thcipriani/catalyst-ci-client/blobs/sha256:199ff11a372ecbb2ca40f4415ae15439d82c9c9d8a6f8adc903542a9bcdccc00": http: server gave HTTP response to HTTPS client
2025-09-04 23:40:58,230 Command '['buildctl', '--timeout', '3600', '--wait', 'build', '--progress=plain', '--frontend=gateway.v0', '--opt', 'source=docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v0.21.1', '--opt', 'filename=.pipeline/blubber.yaml', '--opt', 'target=catalyst-client', '--metadata-file', '/tmp/tmp4tegtkbs', '--local', 'context=.', '--local', 'dockerfile=.', '--output', 'type=image,"name=registry.cloud.releng.team/thcipriani/catalyst-ci-client:job-606170",push=true']' returned non-zero exit status 1.

Normal curl seems to confirm:

curl -I -L 'https://registry.cloud.releng.team/v2/_catalog' 
curl: (35) TLS connect error: error:0A0000C6:SSL routines::packet length too long

Seems like it's sending back plain http:

telnet registry.cloud.releng.team 443
Trying <IP>...
Connected to registry.cloud.releng.team.
Escape character is '^]'.
HTTP/1.1 400 Bad Request
Date: Thu, 04 Sep 2025 23:58:09 GMT
Content-Type: text/html
Content-Length: 150
Connection: close

<html>
<head><title>400 Bad Request</title></head>
<body>
<center><h1>400 Bad Request</h1></center>
<hr><center>nginx</center>
</body>
</html>
Connection closed by foreign host.

Settings have been reverted.

Thanks @dancy <3

It works! https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/jobs/607302

#27 exporting to image
#27 exporting layers
#27 exporting layers 40.7s done
#27 exporting manifest sha256:c22163495bf32ee746192e8fd4feac9abe5752aad6d85e86e52e35ac7ea25d61 0.0s done
#27 exporting config sha256:e96ab06f3cbc94bd6c1590e966e286f2b3f507ceb904da0fc13fe1f4cbf8bcc6 0.0s done
#27 pushing layers
#27 pushing layers 6.8s done
#27 pushing manifest for registry.cloud.releng.team/repos/releng/zuul/tofu-provisioning:job-607302@sha256:c22163495bf32ee746192e8fd4feac9abe5752aad6d85e86e52e35ac7ea25d61
#27 pushing manifest for registry.cloud.releng.team/repos/releng/zuul/tofu-provisioning:job-607302@sha256:c22163495bf32ee746192e8fd4feac9abe5752aad6d85e86e52e35ac7ea25d61 0.3s done
#27 DONE 47.9s

At the moment a job will still need to set a KOKKURI_REGISTRY_PUBLIC: registry.cloud.releng.team envvar in the job config to point kokkuri at the registry. That is a relatively simple addition and also something we can push up into the shared config at some point.
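
As a rough sketch of what that job-level setting could look like in a project's .gitlab-ci.yml: the job name publish-image is hypothetical and the rest of the existing job definition is left out; only the wmcs runner tag and the KOKKURI_REGISTRY_PUBLIC value come from this task.

```yaml
# Fragment to merge into an existing kokkuri publish job.
# "publish-image" is a placeholder job name, not taken from any real project.
publish-image:
  tags:
    - wmcs  # run on the WMCS runners instead of the default ones
  variables:
    # Point kokkuri at the Reggie registry that now accepts pushes from WMCS
    KOKKURI_REGISTRY_PUBLIC: registry.cloud.releng.team
```

Once the variable is set in the shared runner configuration, the variables entry here should no longer be needed.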

@Jelto I see that https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/blob/main/gitlab/gitlab-runner-values.yaml.tftpl is where the default KOKKURI_REGISTRY_PUBLIC value gets set for the DO runners. For the WMCS runners I think that it would be set using the profile::gitlab::runner::environment hiera dict. Is the Project Puppet in Horizon for the gitlab-runners project the right place to change that configuration? I don't want to make the setting in a place you and @Dzahn will never find it again if updates are needed.

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/acac1fe16f8c6ac1262e46633b815a9075feef51%5E%21/#F0

diff --git a/gitlab-runners/_.yaml b/gitlab-runners/_.yaml
index c70884c..24b63ab 100644
--- a/gitlab-runners/_.yaml
+++ b/gitlab-runners/_.yaml
@@ -1 +1,4 @@
 profile::gitlab::runner::concurrent: 4
+profile::gitlab::runner::environment:
+  KOKKURI_REGISTRY_CACHE: registry.cloud.releng.team
+  KOKKURI_REGISTRY_PUBLIC: registry.cloud.releng.team
bd808 assigned this task to dancy.