
gitlab-cloud-runner: Roll back pending helm releases before running terraform apply
Open, In Progress, Medium, Public

Description

If helm is interrupted in the middle of a deployment, it can end up in a state where subsequent deployments fail due to the prior deployment being in 'pending-upgrade' state. Terraform gets confused by this state, for example:

Terraform has been successfully initialized!
module.k8s-pvc-cleaner.helm_release.this: Creating...
╷
│ Error: cannot re-use a name that is still in use
│ 
│   with module.k8s-pvc-cleaner.helm_release.this,
│   on k8s-pvc-cleaner/main.tf line 1, in resource "helm_release" "this":
│    1: resource "helm_release" "this" {
│ 
╵

The way to recover from this state is to "helm rollback" the release in question.

Proposal:
In the .deploy template of https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/blob/main/.gitlab-ci.yml?ref_type=heads, before running terraform-init, etc., run helm -A list --pending to get a list of all pending helm releases in all namespaces, then roll back each of them using helm rollback <release>.
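The proposal above could be sketched roughly as follows. This is a minimal illustration, not the script used in the repository: the function names are hypothetical, it assumes helm is on the PATH with a working kubeconfig, and it assumes the JSON output of helm list carries "name" and "namespace" fields for each release. Rollback is run per namespace, since a release name alone does not identify a release when listing across all namespaces.

```python
import json
import subprocess


def pending_releases(helm_list_json: str) -> list[tuple[str, str]]:
    """Return (namespace, name) pairs parsed from `helm list --output json` text."""
    releases = json.loads(helm_list_json) or []
    return [(r["namespace"], r["name"]) for r in releases]


def rollback_command(namespace: str, name: str) -> list[str]:
    # With no revision argument, helm rolls back to the previous revision.
    return ["helm", "rollback", name, "--namespace", namespace]


def roll_back_pending(dry_run: bool = True) -> list[list[str]]:
    """List pending releases in all namespaces and roll each one back.

    With dry_run=True the commands are only returned, not executed.
    """
    listing = subprocess.run(
        ["helm", "list", "--all-namespaces", "--pending", "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    commands = [rollback_command(ns, name) for ns, name in pending_releases(listing)]
    for cmd in commands:
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```

Calling roll_back_pending() with the default dry_run=True only reports what would be rolled back, which may be a safer first step in a CI job than rolling back unconditionally.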

Update: There's a chicken-and-egg issue with this proposal: helm accesses the Kubernetes cluster, which requires kubeconfig information that is itself established by terraform.
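One possible way around the ordering problem, sketched under assumptions: if a previous run has already stored the cluster's kubeconfig as a Terraform output, it can be read from state before terraform apply and written to a file that helm then uses. The output name "kubeconfig" and both function names are hypothetical, and on a first deployment (no state yet) the terraform output call would fail, so the caller would have to skip the helm check in that case.

```python
import os
import subprocess


def write_kubeconfig(content: str, path: str) -> str:
    """Write kubeconfig text to `path`, readable and writable by the owner only."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(path, 0o600)  # also tighten permissions if the file already existed
    return path


def kubeconfig_from_terraform(output_name: str = "kubeconfig") -> str:
    # `output_name` is an assumption; use whichever output the cluster module defines.
    result = subprocess.run(
        ["terraform", "output", "-raw", output_name],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

A job could then call write_kubeconfig(kubeconfig_from_terraform(), "kubeconfig.yaml") and export the resulting path as KUBECONFIG before invoking helm.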

  • Reproduce the problem state by preparing a helm_release resource with a bad image reference, deploying it, and terminating the deployment job when it hangs.
  • See if we can avoid this situation altogether by passing atomic: true to the helm_release resource.

xref: https://github.com/hashicorp/terraform-provider-helm/issues/425

Details

Reference | Source Branch | Dest Branch | Author | Title
repos/releng/gitlab-cloud-runner!342 | main-I8c5dd376f69a1fdc7d209b620ae16ab19442ee0a | main | sandeeps | added function to check cluster exists and validateTerraformState function.
repos/releng/gitlab-cloud-runner!340 | main-I97956b6e0a545e50c56858adae80c7f7330591ba | main | sandeeps | update output message
repos/releng/gitlab-cloud-runner!339 | main-I538990edb491c83fa78b33b878e1ed0dd15ba588 | main | sandeeps | set kubeconfig file permissions to restrict access
repos/releng/gitlab-cloud-runner!338 | main-Ife6f8bd550d890f24e950d3e1151f21439c14096 | main | sandeeps | update base image reference to include helm
repos/releng/gitlab-cloud-runner!337 | main-I6528e3fdd1fc3080de75e40245dfed6c4a2af82c | main | sandeeps | updating gitlab terraform image reference to include helm
repos/releng/gitlab-cloud-runner!336 | main-Ia6f07f7bcf7a190f190c844f8edf12ecfd7dbb47 | main-I8c5dd376f69a1fdc7d209b620ae16ab19442ee0a | sandeeps | adding invalid image reference in pvc cleaner for testing purpose
repos/releng/gitlab-terraform-images!11 | use-trusted-tag | wmf/stable | sandeeps | add helm installation
repos/releng/gitlab-terraform-images!10 | use-trusted-tag-I0c8d7f8c5c6be031d2421edb4ae077c30cfa6f20 | use-trusted-tag | sandeeps | add helm installation
repos/releng/gitlab-cloud-runner!332 | main-Id30d43bea400d2f04388e28a9b7f16266ae731a9 | main | sandeeps | fix mismatch in kubeconfig output variable reference
repos/releng/gitlab-cloud-runner!331 | main-I087b382d284576c29fb7b9f96db2ac1d338e9463 | main | sandeeps | add outputs for namespace and kubeconfig in cluster configuration, providing necessary data for cicd operations.
repos/releng/gitlab-cloud-runner!317 | main-I255d9f68a857455c8341b50de5a3d1b65451d3a3 | main | sandeeps | helm check script update and added kube_config variale in digital ocean output.tf

Event Timeline

Sandeeps changed the task status from Open to In Progress. Jan 18 2024, 9:59 PM

Hi all, I wanted to give an update on this issue. I tried reproducing the error, and setting atomic = true didn't resolve the problem: the deployment is still stuck in a locked state. I think this needs more investigation.

sandeeps updated https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/317

helm check script update and added kube_config variale in digital ocean output.tf

sandeeps merged https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/317

helm check script update and added kube_config variale in digital ocean output.tf

sandeeps updated https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/331

add outputs for namespace and kubeconfig in cluster configuration, providing necessary data for cicd operations.

sandeeps merged https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/331

add outputs for namespace and kubeconfig in cluster configuration, providing necessary data for cicd operations.

sandeeps updated https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/342

added function to check cluster exists and validateTerraformState function.