If helm is interrupted in the middle of a deployment, the release can be left in the 'pending-upgrade' state, which causes subsequent deployments to fail. Terraform gets confused by this state, for example:
```
Terraform has been successfully initialized!

module.k8s-pvc-cleaner.helm_release.this: Creating...
╷
│ Error: cannot re-use a name that is still in use
│
│   with module.k8s-pvc-cleaner.helm_release.this,
│   on k8s-pvc-cleaner/main.tf line 1, in resource "helm_release" "this":
│    1: resource "helm_release" "this" {
│
╵
```
The way to recover from this state is to "helm rollback" the release in question.
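For example (the release and namespace names here are illustrative, borrowing the module name from the error above):

```
# Inspect the release's revision history; the latest revision will
# show a "pending-upgrade" status.
helm history k8s-pvc-cleaner -n k8s-pvc-cleaner

# Roll back; with no revision argument, helm rolls back to the
# previous release revision.
helm rollback k8s-pvc-cleaner -n k8s-pvc-cleaner
```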
Proposal:
In https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/blob/main/.gitlab-ci.yml?ref_type=heads, in the .deploy template, before running terraform-init etc., run helm list -A --pending to get a list of all pending helm releases in all namespaces, then roll back each of them with helm rollback <release> (see the sketch below).
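A minimal sketch of what that pre-deploy step might look like, assuming jq is available in the runner image (the .namespace and .name fields come from helm list's JSON output):

```
#!/bin/sh
# Roll back every pending helm release in every namespace.
# Assumes a working kubeconfig and jq in the runner image.
helm list -A --pending -o json \
  | jq -r '.[] | "\(.namespace) \(.name)"' \
  | while read -r ns name; do
      echo "Rolling back pending release ${name} in namespace ${ns}"
      helm rollback "${name}" -n "${ns}"
    done
```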
Update: There's a chicken-and-egg problem with this proposal: helm needs to access the kubernetes cluster, which requires the kubeconfig information that is itself established by terraform.
Next steps:
- Reproduce the problem state by preparing a helm_release resource with a bad image reference, deploying it, and terminating the deployment job while it hangs.
- See if we can avoid this situation altogether by setting atomic = true on the helm_release resource (see the sketch below).
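A minimal sketch of what that might look like, assuming the provider's atomic argument behaves like helm upgrade --atomic; the chart details and variables are illustrative, not the actual k8s-pvc-cleaner module:

```
resource "helm_release" "this" {
  name      = "k8s-pvc-cleaner"  # illustrative
  chart     = var.chart          # chart source is an assumption
  namespace = var.namespace

  # atomic makes a failed upgrade roll back automatically (and implies
  # wait), which should keep the release out of pending-upgrade.
  atomic = true

  # cleanup_on_fail deletes new resources created by a failed upgrade.
  cleanup_on_fail = true
}
```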
xref: https://github.com/hashicorp/terraform-provider-helm/issues/425