In https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/jobs/536780 I tried switching from the default Digital Ocean runners to the WMCS runners. This seems to have caused the job to fail with the error "KOKKURI_REGISTRY_PUBLIC must be set to publish an image".
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| buildkitd.toml.erb: Temporarily enable debug | operations/puppet | production | +3 -0 |
| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| reggie-values.yaml.tftpl: Allow write access from anywhere | repos/releng/gitlab-cloud-runner!503 | dancy | main-Ieb7060f5720ba4e08757619cba950c531277cb66 | main |
| Revert "ingress-nginx.yaml.tftpl: Enable proxy protocol again" | repos/releng/gitlab-cloud-runner!502 | dancy | main-Iaa934b8e3b4f34efa845050ee0e8dc6dbdce7ce4 | main |
| ingress-nginx.yaml.tftpl: Enable proxy protocol again | repos/releng/gitlab-cloud-runner!501 | dancy | main-Ie01d309acfea88603c3c86f9318b42a2da699e5c | main |
| ingress-nginx.yaml.tftpl: Disable PROXY protocol stuff | repos/releng/gitlab-cloud-runner!500 | dancy | main-I4af4fa52cde765cbf3c396b82c9dc048e97167ae | main |
| Revert "reggie-values.yaml.tftpl: Enable jwt.debug" | repos/releng/gitlab-cloud-runner!499 | dancy | main-I7bbc03070288d0f1ca9ced6dda62ccf1c8cbe7ed | main |
| ingress-nginx.yaml.tftpl: Drop `use-forwarded-headers: "true"` | repos/releng/gitlab-cloud-runner!498 | dancy | main-I24fb6a020a026999d08bec6a7d0c50b5c8fb7eb7 | main |
| ingress-nginx.yaml.tftpl: Enable proxy protocol | repos/releng/gitlab-cloud-runner!497 | dancy | main-I20b28927f3c4a70fc0d3f2aa0c003ba4504e0746 | main |
| reggie-values.yaml.tftpl: Enable jwt.debug | repos/releng/gitlab-cloud-runner!495 | dancy | main-I5e5b909bb38ec71b4a6aa05cf089209e4ef86c7d | main |
| reggie: Allow POST from arbitrary subnets | repos/releng/gitlab-cloud-runner!494 | bd808 | work/bd808/reggie-from-wmcs | main |
Event Timeline
The only change from the failing https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/jobs/536780 to the passing https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/jobs/536783 was removing the wmcs tag. This reinforces my superstition that there is some problem using kokkuri from the WMCS runners.
Only trusted runners are allowed to publish to the production docker registry. The wmcs runners are not trusted.
To clarify, the gitlab-cloud-runners do have a registry available for hosting temporary images. There is no corresponding thing in wmcs.
In another version of this discussion @dancy wondered if we could just use the registry.cloud.releng.team Reggie instance. I am trying that out in https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/jobs/604835 by setting KOKKURI_REGISTRY_PUBLIC: registry.cloud.releng.team in the job's envvars:
2025-09-03 22:35:18,446 Command '['buildctl', '--timeout', '3600', '--wait', 'build', '--progress=plain', '--frontend=gateway.v0', '--opt', 'source=docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v1.3.0', '--opt', 'filename=.pipeline/blubber.yaml', '--opt', 'target=deployer', '--opt', 'context=https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning.git#work/bd808/wmcs-runners', '--opt', 'dockerfile=https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning.git#work/bd808/wmcs-runners', '--output', 'type=image,"name=registry.cloud.releng.team/repos/releng/zuul/tofu-provisioning:job-604835",push=true']' returned non-zero exit status 1.
Relevant parts of the log prior to the final buildctl failure:
#18 [downloads] 🖥️ @65533 $ [script@66acb954] 0 0 0 0 0 0 0 0 --:--:-- 0:02:55 --:--:-- 0
#18 ...
#18 [downloads] 🖥️ @65533 $ [script@66acb954] 0 0 0 0 0 0 0 0 --:--:-- 0:05:00 --:--:-- 0
#18 301.1 curl: (28) SSL connection timeout
#18 ERROR: process "/66acb95481ceee2a9130647abfb0531592242bbf3f3a4fe9b746e4cdb5968c35/script" did not complete successfully: exit code: 28
------
 > [downloads] 🖥️ @65533 $ [script@66acb954]:
0 0 0 0 0 0 0 0 --:--:-- 0:05:00 --:--:-- 0
301.1 curl: (28) SSL connection timeout
------
error: failed to solve: process "/66acb95481ceee2a9130647abfb0531592242bbf3f3a4fe9b746e4cdb5968c35/script" did not complete successfully: exit code: 28
My build in T396924#11146064 was failing because of GitHub download rate limits, I think. https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/jobs/604860 produced this error, which is more explanatory:
------
 > exporting to image:
------
error: failed to solve: failed to push registry.cloud.releng.team/repos/releng/zuul/tofu-provisioning:job-604860: unexpected status from POST request to https://registry.cloud.releng.team/v2/repos/releng/zuul/tofu-provisioning/blobs/uploads/: 403 Forbidden
2025-09-03 23:06:31,949 Command '['buildctl', '--timeout', '3600', '--wait', 'build', '--progress=plain', '--frontend=gateway.v0', '--opt', 'source=docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v1.3.0', '--opt', 'filename=.pipeline/blubber.yaml', '--opt', 'target=deployer', '--opt', 'context=https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning.git#work/bd808/wmcs-runners', '--opt', 'dockerfile=https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning.git#work/bd808/wmcs-runners', '--output', 'type=image,"name=registry.cloud.releng.team/repos/releng/zuul/tofu-provisioning:job-604860",push=true']' returned non-zero exit status 1.
I think this means that the JWT auth to the registry failed?
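For triage, the failing request can be pulled out of the buildctl error text. This is a minimal Python sketch, assuming the message keeps the "unexpected status from METHOD request to URL: CODE REASON" shape seen in the log for job 604860; `parse_push_error` and `PushError` are hypothetical names, not part of kokkuri or buildctl. In plain HTTP terms a 401 normally signals missing or invalid credentials and a 403 a refusal of an understood request, so the status code is a hint about where to look rather than proof of the cause.

```python
import re
from typing import NamedTuple, Optional


class PushError(NamedTuple):
    """Fields extracted from a buildctl push failure (hypothetical helper type)."""
    method: str
    url: str
    status: int
    reason: str


# Matches text such as:
#   unexpected status from POST request to https://host/v2/name/blobs/uploads/: 403 Forbidden
_PUSH_ERROR = re.compile(
    r"unexpected status from (?P<method>[A-Z]+) request to "
    r"(?P<url>\S+): (?P<status>\d{3}) (?P<reason>[A-Za-z ]+)"
)


def parse_push_error(log_text: str) -> Optional[PushError]:
    """Return the first push failure found in log_text, or None if there is none."""
    match = _PUSH_ERROR.search(log_text)
    if match is None:
        return None
    return PushError(
        method=match["method"],
        url=match["url"],
        status=int(match["status"]),
        reason=match["reason"].strip(),
    )
```

Feeding it the error line from job 604860 yields method POST, status 403, and the blobs/uploads/ URL, which is the first write request of an image push.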
It looks like we do have things configured to deny non-GET/HEAD requests to Reggie that do not originate from the gitlab-cloud-runners cluster:
https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/blob/main/ingress/values/ingress-nginx.yaml.tftpl?ref_type=heads#L10
https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/blob/main/reggie/reggie-values.yaml.tftpl?ref_type=heads#L27
@thcipriani found the same bits for me. So I guess we could poke a hole for 185.15.56.1 (nat.cloudgw.eqiad1.wikimediacloud.org) or we set up a new registry in WMCS somewhere.
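The rule described here (reads from anywhere, writes only from known networks) can be sketched in a few lines of Python. This is an illustration of the policy only, not the actual ingress-nginx or Reggie configuration; `is_allowed` is a hypothetical helper and 10.244.0.0/16 is a placeholder standing in for the gitlab-cloud-runners cluster range, which is not given in this task.

```python
from ipaddress import ip_address, ip_network
from typing import Iterable

# Methods that never modify registry contents.
READ_ONLY_METHODS = frozenset({"GET", "HEAD"})


def is_allowed(method: str, client_ip: str, write_networks: Iterable[str]) -> bool:
    """Allow reads from anywhere; allow other methods only from write_networks."""
    if method.upper() in READ_ONLY_METHODS:
        return True
    addr = ip_address(client_ip)
    return any(addr in ip_network(net) for net in write_networks)


# Placeholder cluster range (assumption), plus the WMCS NAT address from this task.
cluster_only = ["10.244.0.0/16"]
with_wmcs_hole = cluster_only + ["185.15.56.1/32"]

print(is_allowed("POST", "185.15.56.1", cluster_only))    # refused before the hole
print(is_allowed("POST", "185.15.56.1", with_wmcs_hole))  # permitted after it
```

A check like this is only as good as the client address the server sees, so it depends on the proxy in front of the registry reporting the real source IP.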
bd808 opened https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/494
reggie: Allow POST from arbitrary subnets
dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/494
reggie: Allow POST from arbitrary subnets
dancy opened https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/495
reggie-values.yaml.tftpl: Enable jwt.debug
dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/495
reggie-values.yaml.tftpl: Enable jwt.debug
Change #1184830 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):
[operations/puppet@production] buildkitd.toml.erb: Temporarily enable debug
Change #1184830 abandoned by Ahmon Dancy:
[operations/puppet@production] buildkitd.toml.erb: Temporarily enable debug
Reason:
No longer needed
dancy opened https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/497
ingress-nginx.yaml.tftpl: Enable proxy protocol
dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/497
ingress-nginx.yaml.tftpl: Enable proxy protocol
dancy opened https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/498
ingress-nginx.yaml.tftpl: Drop use-forwarded-headers: "true"
dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/498
ingress-nginx.yaml.tftpl: Drop use-forwarded-headers: "true"
dancy opened https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/499
Revert "reggie-values.yaml.tftpl: Enable jwt.debug"
@bd808 Pushes from WMCS to registry.cloud.releng.team and registry.staging.cloud.releng.team should be working now.
Looks like the changes have broken registry.cloud.releng.team. I will revert and try again tomorrow.
dancy opened https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/500
ingress-nginx.yaml.tftpl: Disable PROXY protocol stuff
dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/499
Revert "reggie-values.yaml.tftpl: Enable jwt.debug"
dancy closed https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/500
ingress-nginx.yaml.tftpl: Disable PROXY protocol stuff
Hrm, I got this trying a push:
 > exporting to image:
------
error: failed to solve: failed to push registry.cloud.releng.team/thcipriani/catalyst-ci-client:job-606170: failed to do request: Head "https://registry.cloud.releng.team/v2/thcipriani/catalyst-ci-client/blobs/sha256:199ff11a372ecbb2ca40f4415ae15439d82c9c9d8a6f8adc903542a9bcdccc00": http: server gave HTTP response to HTTPS client
2025-09-04 23:40:58,230 Command '['buildctl', '--timeout', '3600', '--wait', 'build', '--progress=plain', '--frontend=gateway.v0', '--opt', 'source=docker-registry.wikimedia.org/repos/releng/blubber/buildkit:v0.21.1', '--opt', 'filename=.pipeline/blubber.yaml', '--opt', 'target=catalyst-client', '--metadata-file', '/tmp/tmp4tegtkbs', '--local', 'context=.', '--local', 'dockerfile=.', '--output', 'type=image,"name=registry.cloud.releng.team/thcipriani/catalyst-ci-client:job-606170",push=true']' returned non-zero exit status 1.
Normal curl seems to confirm:
curl -I -L 'https://registry.cloud.releng.team/v2/_catalog'
curl: (35) TLS connect error: error:0A0000C6:SSL routines::packet length too long
Seems like it's sending back plain http:
telnet registry.cloud.releng.team 443
Trying <IP>...
Connected to registry.cloud.releng.team.
Escape character is '^]'.
HTTP/1.1 400 Bad Request
Date: Thu, 04 Sep 2025 23:58:09 GMT
Content-Type: text/html
Content-Length: 150
Connection: close
<html>
<head><title>400 Bad Request</title></head>
<body>
<center><h1>400 Bad Request</h1></center>
<hr><center>nginx</center>
</body>
</html>
Connection closed by foreign host.
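The same check can be scripted. This is a rough Python sketch, assuming that a server answering a non-TLS request on port 443 with bytes starting "HTTP/" is speaking plaintext, which is consistent with both the curl error and the telnet session here; `classify_first_bytes` and `probe` are hypothetical helpers and the probe needs network access to be useful.

```python
import socket


def classify_first_bytes(data: bytes) -> str:
    """Guess the protocol from the first bytes a server sends back."""
    if data.startswith(b"HTTP/"):
        return "plaintext-http"
    # A TLS record starts with a content type (0x15 alert, 0x16 handshake)
    # followed by the major version byte 0x03.
    if len(data) >= 2 and data[0] in (0x15, 0x16) and data[1] == 0x03:
        return "tls"
    return "unknown"


def probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Send a plain HTTP request and classify whatever comes back."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"GET / HTTP/1.0\r\n\r\n")
        try:
            data = sock.recv(16)
        except (socket.timeout, ConnectionResetError):
            # A TLS endpoint may simply drop or ignore a non-TLS request.
            return "unknown"
    return classify_first_bytes(data)
```

Calling `probe("registry.cloud.releng.team")` while the registry was in the state shown in the telnet session would be expected to report plaintext-http; a healthy TLS endpoint typically answers with a TLS alert or closes the connection.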
dancy reopened https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/500
ingress-nginx.yaml.tftpl: Disable PROXY protocol stuff
dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/500
ingress-nginx.yaml.tftpl: Disable PROXY protocol stuff
dancy opened https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/501
ingress-nginx.yaml.tftpl: Enable proxy protocol again
dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/501
ingress-nginx.yaml.tftpl: Enable proxy protocol again
dancy updated https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/502
Revert "ingress-nginx.yaml.tftpl: Enable proxy protocol again"
dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/502
Revert "ingress-nginx.yaml.tftpl: Enable proxy protocol again"
dancy opened https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/503
reggie-values.yaml.tftpl: Allow write access from anywhere
dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/503
reggie-values.yaml.tftpl: Allow write access from anywhere
It works! https://gitlab.wikimedia.org/repos/releng/zuul/tofu-provisioning/-/jobs/607302
#27 exporting to image
#27 exporting layers
#27 exporting layers 40.7s done
#27 exporting manifest sha256:c22163495bf32ee746192e8fd4feac9abe5752aad6d85e86e52e35ac7ea25d61 0.0s done
#27 exporting config sha256:e96ab06f3cbc94bd6c1590e966e286f2b3f507ceb904da0fc13fe1f4cbf8bcc6 0.0s done
#27 pushing layers
#27 pushing layers 6.8s done
#27 pushing manifest for registry.cloud.releng.team/repos/releng/zuul/tofu-provisioning:job-607302@sha256:c22163495bf32ee746192e8fd4feac9abe5752aad6d85e86e52e35ac7ea25d61
#27 pushing manifest for registry.cloud.releng.team/repos/releng/zuul/tofu-provisioning:job-607302@sha256:c22163495bf32ee746192e8fd4feac9abe5752aad6d85e86e52e35ac7ea25d61 0.3s done
#27 DONE 47.9s
At the moment a job will still need to set a KOKKURI_REGISTRY_PUBLIC: registry.cloud.releng.team envvar in the job config to point kokkuri at the registry. That is a relatively simple addition and also something we can push up into the shared config at some point.
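To make the dependency on that variable concrete, here is a small Python sketch of how the pushed image name relates to it. The naming pattern (registry, project path, then a job-ID tag) is inferred from the buildctl output in this task, and the error text is the one from the original failure; `image_ref` is a hypothetical helper and is not how kokkuri is actually implemented.

```python
from typing import Mapping


def image_ref(env: Mapping[str, str], project_path: str, job_id: int) -> str:
    """Build the image reference a job would push, or fail if no registry is set."""
    registry = env.get("KOKKURI_REGISTRY_PUBLIC", "").strip()
    if not registry:
        # Mirrors the message seen when the WMCS runner had no registry configured.
        raise ValueError("KOKKURI_REGISTRY_PUBLIC must be set to publish an image")
    return f"{registry}/{project_path}:job-{job_id}"


env = {"KOKKURI_REGISTRY_PUBLIC": "registry.cloud.releng.team"}
print(image_ref(env, "repos/releng/zuul/tofu-provisioning", 607302))
```

Whether the value comes from the job config or from a runner-level default, the job only needs the variable to be present in its environment by the time the image is published.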
@Jelto I see that https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/blob/main/gitlab/gitlab-runner-values.yaml.tftpl is where the default KOKKURI_REGISTRY_PUBLIC value gets set for the DO runners. For the WMCS runners I think that it would be set using the profile::gitlab::runner::environment hiera dict. Is the Project Puppet in Horizon for the gitlab-runners project the right place to change that configuration? I don't want to make the setting in a place you and @Dzahn will never find it again if updates are needed.
diff --git a/gitlab-runners/_.yaml b/gitlab-runners/_.yaml
index c70884c..24b63ab 100644
--- a/gitlab-runners/_.yaml
+++ b/gitlab-runners/_.yaml
@@ -1 +1,4 @@
 profile::gitlab::runner::concurrent: 4
+profile::gitlab::runner::environment:
+  KOKKURI_REGISTRY_CACHE: registry.cloud.releng.team
+  KOKKURI_REGISTRY_PUBLIC: registry.cloud.releng.team