
Push by kokkuri to registry.cloud.releng.team/bd808/deployment-prep-opentofu failing after working last week
Closed, Invalid · Public · BUG REPORT

Description

This may be PEBKAC or a general misunderstanding of how things are supposed to work, but I am trying to do some experiments with GitLab CI and kokkuri for T372498: Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu and a thing that was working in job 343776 started failing for reasons that are unclear.

There are jobs like https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/-/jobs/346763#L64 which show the image push failing for unknown reasons (400 BAD REQUEST).

There are jobs like https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/-/jobs/343810#L5 that show the runner suddenly deciding to pull the base image of the blubber-managed container instead of the image that should have been created by the preceding build-and-publish step. This is really unexpected and confusing.

  • Is any of this even supposed to work from a "cloud" runner? The documentation seems to only discuss the "trusted" runner use case for WMF production network usage. I think @dduvall told me it should work, though. The cloud runner certainly hits a different configuration, since WMF prod does not use registry.cloud.releng.team (which I think is a DigitalOcean-hosted service).
  • Did some quota bucket fill up to start causing the "400 BAD REQUEST" errors for the image push?
  • Is there something I can turn on to get more informative traces of what is going on, especially what is leading to things like image: ${BUILD_DEPLOYER_IMAGE_REF} pulling the base image from the variant declaration in the blubber.yaml instead of the expected image?
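For context, the pattern in question follows kokkuri's build-then-consume convention, where a build job publishes an image and exports its reference for later jobs to use in `image:`. The sketch below is a rough reconstruction from memory of kokkuri's documented usage, not the actual pipeline config from this repo; the include path, template name, and job names are assumptions and may not match exactly.

```yaml
# Hypothetical sketch of the kokkuri build-and-consume pattern.
# Include path and template name are assumed from kokkuri's docs.
include:
  - project: repos/releng/kokkuri
    file: includes/images.yaml

build-deployer:
  stage: build
  extends: .kokkuri:build-and-publish-image
  variables:
    BUILD_VARIANT: deployer   # variant defined in blubber.yaml

tofu-plan:
  stage: test
  # Consumes the image reference the build job exports (via dotenv
  # artifacts) as BUILD_<VARIANT>_IMAGE_REF. If that variable is not
  # populated as expected, the image actually pulled here can differ
  # from the freshly built one.
  image: ${BUILD_DEPLOYER_IMAGE_REF}
  script:
    - tofu plan
```

If the dotenv artifact from the build job is missing or the variable expands to something unexpected, the `image:` line is a natural place for the confusing pull behavior described above to originate.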

Event Timeline

This may have all been PEBKAC in some form. I rolled back to the blubber and kokkuri config that worked for https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/-/pipelines/71051 and the push failure went away. I am now slowly building on that change to try to get to the desired end state of a pipeline that can do a tofu apply from a protected ref. This time I am stacking changes rather than force-pushing, so it will hopefully be easier to spot a breaking change if one is introduced.

https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/-/pipelines/71607 shows everything working until secrets are needed in the manually triggered tofu-apply job. That is a different problem (https://docs.gitlab.com/ee/ci/variables/index.html#protect-a-cicd-variable says protected variables aren't passed to MR pipelines, but I swear I've done this before).
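As background on that secrets problem: GitLab only injects protected CI/CD variables into pipelines running against protected branches or tags, and merge request pipelines do not qualify. A hedged sketch of gating the apply job to a protected ref (job name and script are illustrative, not from this repo):

```yaml
tofu-apply:
  stage: deploy
  rules:
    # Protected variables are only injected on pipelines for protected
    # refs; MR pipelines run on an unprotected merge ref, so the
    # secrets never appear there.
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
      when: manual
  script:
    - tofu apply -auto-approve
```

For this to work, the default branch must be marked protected in the project settings and the secret variables flagged as protected in Settings → CI/CD → Variables.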

I'm going to close this as invalid because backing up and going forward again made things work. I still haven't spotted what I did in the stack of force-pushed changes that made reggie (the DO-hosted registry) mad at me, but it seems to be fine now. If @dduvall wants to reopen this because he actually found something in the backend logs worth pursuing, I have no objections.