
gitlab-cloud-runners k8s cluster bootstrapping problem
Closed, Resolved (Public)

Description

Initial creation of the staging cloud runners cluster failed:
https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/jobs/220830

...
module.ingress.helm_release.this: Still creating... [2m40s elapsed]
module.ingress.helm_release.this: Creation complete after 2m48s [id=ingress-nginx]
╷
│ Warning: Helm release "docker-hub-mirror" was created but has a failed status. Use the `helm` command to investigate the error, correct it, then run Terraform again.
│
│   with module.docker-hub-mirror.helm_release.docker-hub-mirror,
│   on docker-hub-mirror/main.tf line 1, in resource "helm_release" "docker-hub-mirror":
│    1: resource "helm_release" "docker-hub-mirror" {
│
╵
╷
│ Warning: Helm release "reggie" was created but has a failed status. Use the `helm` command to investigate the error, correct it, then run Terraform again.
│
│   with module.reggie.helm_release.this,
│   on reggie/main.tf line 1, in resource "helm_release" "this":
│    1: resource "helm_release" "this" {
│
╵
╷
│ Error: unable to build kubernetes objects from release manifest: resource mapping not found for name: "letsencrypt-production" namespace: "" from "": no matches for kind "ClusterIssuer" in version "cert-manager.io/v1"
│ ensure CRDs are installed first
│
│   with module.certs.helm_release.issuers,
│   on certs/main.tf line 25, in resource "helm_release" "issuers":
│   25: resource "helm_release" "issuers" {
│
╵
╷
│ Error: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://ingress-nginx-controller-admission.ingress.svc:443/networking/v1/ingresses?timeout=10s": http: server gave HTTP response to HTTPS client
│
│   with module.docker-hub-mirror.helm_release.docker-hub-mirror,
│   on docker-hub-mirror/main.tf line 1, in resource "helm_release" "docker-hub-mirror":
│    1: resource "helm_release" "docker-hub-mirror" {
│
╵
╷
│ Error: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": failed to call webhook: Post "https://ingress-nginx-controller-admission.ingress.svc:443/networking/v1/ingresses?timeout=10s": http: server gave HTTP response to HTTPS client
│
│   with module.reggie.helm_release.this,
│   on reggie/main.tf line 1, in resource "helm_release" "this":
│    1: resource "helm_release" "this" {
│
╵

The full transcript is in P58695

Notes

The root cause seems to be this bit:

Error: unable to build kubernetes objects from release manifest: resource mapping not found for name: "letsencrypt-production" namespace: "" from "": no matches for kind "ClusterIssuer" in version "cert-manager.io/v1"
ensure CRDs are installed first

with module.certs.helm_release.issuers,
on certs/main.tf line 25, in resource "helm_release" "issuers":
25: resource "helm_release" "issuers" {
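One way to avoid this kind of error is to make the issuers release wait for the release that installs the cert-manager CRDs. The following is only a sketch, not the actual contents of certs/main.tf: the resource name `cert_manager`, the namespace, and the local chart path are hypothetical, and it assumes the cert-manager release is configured to install its CRDs.

```hcl
# Sketch of certs/main.tf. Only helm_release.issuers is named in the error
# output; everything else here is an assumption for illustration.

resource "helm_release" "cert_manager" {
  name             = "cert-manager"
  repository       = "https://charts.jetstack.io"
  chart            = "cert-manager"
  namespace        = "cert-manager" # hypothetical namespace
  create_namespace = true
  # Assumed: this release is configured so that the cert-manager CRDs
  # (including ClusterIssuer) are installed along with the chart.
}

resource "helm_release" "issuers" {
  name  = "issuers"
  chart = "${path.module}/issuers" # hypothetical local chart containing the ClusterIssuer

  # Do not render ClusterIssuer objects until cert-manager has been
  # installed, so the cert-manager.io/v1 API is known to the cluster.
  depends_on = [helm_release.cert_manager]
}
```

Without an explicit dependency, Terraform may create both releases in parallel, since nothing in the issuers release references an attribute of the cert-manager release.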

Both reggie and docker-hub-mirror use an ingress with

annotations:
   cert-manager.io/cluster-issuer: letsencrypt-production

so it makes sense that they would have problems if the letsencrypt-production ClusterIssuer wasn't successfully created.
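The ordering described above could be expressed at the root module level with `depends_on`, roughly as follows. This is a sketch under assumptions: the module names come from the error output, but the `source` paths are guessed from the file paths in that output, and any input variables the real modules take are omitted.

```hcl
# Sketch of the root module. Module names match the error output;
# source paths and the absence of input variables are assumptions.

module "certs" {
  source = "./certs"
}

module "reggie" {
  source = "./reggie"

  # The ingress in this module is annotated with
  # cert-manager.io/cluster-issuer: letsencrypt-production,
  # so create it only after the ClusterIssuer exists.
  depends_on = [module.certs]
}

module "docker-hub-mirror" {
  source = "./docker-hub-mirror"

  # Same annotation, same ordering requirement.
  depends_on = [module.certs]
}
```

Module-level `depends_on` makes every resource in the dependent module wait for every resource in `module.certs`, which is coarse but simple for a one-time bootstrap.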

Details

| Title | Reference | Author | Source Branch | Dest Branch |
| --- | --- | --- | --- | --- |
| Add dependencies to improve cluster bootstrapping | repos/releng/gitlab-cloud-runner!371 | dancy | main-I0403bca4c729281b3dd398dbdb58ac038d443d5e | main |
| Add depends_on settings to improve bootstrapping behavior | repos/releng/gitlab-cloud-runner!352 | dancy | main-Ifd1bc257db0693f2efebe85f1258705a07f4148e | main |

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 8 2024, 6:59 PM
dancy changed the task status from Open to In Progress. Mar 8 2024, 10:38 PM
dancy triaged this task as Medium priority.

Need to retest staging cluster destroy/rebuild.