If helm is interrupted in the middle of a deployment, the release can be left in the 'pending-upgrade' state, which causes subsequent deployments to fail. Terraform gets confused by this state, for example:
```
Terraform has been successfully initialized!

module.k8s-pvc-cleaner.helm_release.this: Creating...
╷
│ Error: cannot re-use a name that is still in use
│
│   with module.k8s-pvc-cleaner.helm_release.this,
│   on k8s-pvc-cleaner/main.tf line 1, in resource "helm_release" "this":
│    1: resource "helm_release" "this" {
│
╵
```
The way to recover from this state is to "helm rollback" the release in question.
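For example (the release and namespace names here are illustrative, borrowing the module name from the error above):

```
# Inspect the release's revision history; the latest revision will
# show a "pending-upgrade" status.
helm history k8s-pvc-cleaner -n k8s-pvc-cleaner

# Roll back; with no revision argument, helm rolls back to the
# previous release revision.
helm rollback k8s-pvc-cleaner -n k8s-pvc-cleaner
```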
Proposal:
In https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/blob/main/.gitlab-ci.yml?ref_type=heads, in the .deploy template, before running terraform-init etc., run helm list -A --pending to get a list of all pending helm releases in all namespaces, then roll back each of them with helm rollback <release> (see the sketch below).
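A minimal sketch of what that pre-deploy step might look like, assuming jq is available in the runner image (the .namespace and .name fields come from helm list's JSON output):

```
#!/bin/sh
# Roll back every pending helm release in every namespace.
# Assumes a working kubeconfig and jq in the runner image.
helm list -A --pending -o json \
  | jq -r '.[] | "\(.namespace) \(.name)"' \
  | while read -r ns name; do
      echo "Rolling back pending release ${name} in namespace ${ns}"
      helm rollback "${name}" -n "${ns}"
    done
```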
Update: There's a chicken-and-egg problem with this proposal: helm needs to access the kubernetes cluster, which requires the kubeconfig information that is itself established by terraform.
Next steps:
- Reproduce the problem state by preparing a helm_release resource with a bad image reference, deploying it, and terminating the deployment job while it hangs.
- See if we can avoid this situation altogether by setting atomic = true on the helm_release resource (see the sketch below).
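A minimal sketch of what that might look like, assuming the provider's atomic argument behaves like helm upgrade --atomic; the chart details and variables are illustrative, not the actual k8s-pvc-cleaner module:

```
resource "helm_release" "this" {
  name      = "k8s-pvc-cleaner"  # illustrative
  chart     = var.chart          # chart source is an assumption
  namespace = var.namespace

  # atomic makes a failed upgrade roll back automatically (and implies
  # wait), which should keep the release out of pending-upgrade.
  atomic = true

  # cleanup_on_fail deletes new resources created by a failed upgrade.
  cleanup_on_fail = true
}
```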
xref: https://github.com/hashicorp/terraform-provider-helm/issues/425