This may be PEBKAC or a general misunderstanding of how things are supposed to work. I am running some experiments with GitLab CI and kokkuri for T372498 (Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu), and something that was working in job 343776 has started failing for reasons that are unclear.
- Last pipeline that looks like it pushed to the registry: https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/-/pipelines/71051
- First pipeline that looks like it failed to pull from the registry: https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/-/pipelines/71060
There are jobs like https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/-/jobs/346763#L64 which show the image push failing for unknown reasons (400 BAD REQUEST).
There are jobs like https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/-/jobs/343810#L5 that show the runner suddenly deciding to pull the base image of the Blubber-managed container instead of the image that should have been created by the preceding build-and-publish step. This is really unexpected and confusing.
- Is any of this even supposed to work from a "cloud" runner? The documentation seems to only discuss the "trusted" runner use case for WMF production network usage. I think @dduvall told me it should work though. The cloud runner certainly hits a different config, as WMF prod does not use registry.cloud.releng.team (which I think is a DigitalOcean-hosted service).
- Did some quota bucket fill up to start causing the "400 BAD REQUEST" errors for the image push?
- Is there something I can turn on to get more informative traces of what is going on, especially what is leading to things like image: ${BUILD_DEPLOYER_IMAGE_REF} pulling the base image from the variant declaration in the blubber.yaml instead of the expected image?
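
For reference, the only knob I know of is GitLab's own CI_DEBUG_TRACE variable, which turns on verbose shell tracing and dumps every variable into the job log (secrets included, so it should be used with care). A minimal sketch of what I would try, assuming a throwaway job added to .gitlab-ci.yml; the job name is hypothetical, and I am assuming BUILD_DEPLOYER_IMAGE_REF is supposed to reach this job from the earlier build-and-publish step:

```yaml
# Hypothetical debugging job; the name is made up for illustration.
# It would need to be placed after the build-and-publish step in the pipeline.
debug-image-ref:
  image: ${BUILD_DEPLOYER_IMAGE_REF}
  variables:
    # GitLab CI debug logging: very verbose, and exposes all variables in the job log.
    CI_DEBUG_TRACE: "true"
  script:
    # Print the value the job actually sees. If it is empty here, the variable
    # never reached this job and the runner resolved the image some other way.
    - echo "BUILD_DEPLOYER_IMAGE_REF=${BUILD_DEPLOYER_IMAGE_REF}"
```

I do not know whether this shows anything about how the runner chose the image, since that happens before the script runs.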