Detect and alert on helm releases in unclean state
Open, Needs Triage, Public

Description

We have had several cases of helm releases ending up in an unclean state, leaving deployers confronted with error messages like:

command "/usr/bin/helm3" exited with non-zero status
STDERR:
   Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

This can happen when helmfile apply is interrupted with ^C, when a connection is terminated, etc. It looks a bit spooky at first because "helm list" returns no releases.

root@deploy1002:~# kube_env admin staging
root@deploy1002:~# helm -n eventstreams list
NAME    NAMESPACE       REVISION        UPDATED STATUS  CHART   APP VERSION
root@deploy1002:~# helm -n eventstreams list --all
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
production      eventstreams    4               2022-03-17 21:26:50.082844301 +0000 UTC pending-upgrade eventstreams-0.4.1                 
root@deploy1002:~# helm -n eventstreams history production
REVISION        UPDATED                         STATUS          CHART                   APP VERSION     DESCRIPTION      
1               Thu Nov  4 13:14:20 2021        superseded      eventstreams-0.3.3                      Install complete 
2               Thu Jan 27 16:27:35 2022        superseded      eventstreams-0.4.0                      Upgrade complete 
3               Wed Mar  2 18:11:54 2022        deployed        eventstreams-0.4.1                      Upgrade complete 
4               Thu Mar 17 21:26:50 2022        pending-upgrade eventstreams-0.4.1                      Preparing upgrade
root@deploy1002:~# kubectl -n eventstreams get secret --field-selector 'type=helm.sh/release.v1'
NAME                               TYPE                 DATA   AGE
sh.helm.release.v1.production.v1   helm.sh/release.v1   1      223d
sh.helm.release.v1.production.v2   helm.sh/release.v1   1      138d
sh.helm.release.v1.production.v3   helm.sh/release.v1   1      104d
sh.helm.release.v1.production.v4   helm.sh/release.v1   1      89d

It would be nice to alert on such cases, which I think could be identified by periodic runs of something like:

root@deploy1002:~# helm list -A --failed --pending 
NAME            NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
main            eventstreams-internal   4               2022-03-17 21:36:11.869687569 +0000 UTC pending-upgrade eventstreams-0.4.1                 
production      eventstreams            4               2022-03-17 21:26:50.082844301 +0000 UTC pending-upgrade eventstreams-0.4.1
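
A periodic check along these lines could be sketched as follows. This is only an illustration, not an existing tool: the function names and the set of statuses treated as "unclean" are assumptions, and it assumes that "helm list -o json" returns a list of objects with "name", "namespace", "revision" and "status" keys.

```python
import json
import subprocess

# Statuses treated as "unclean" in this sketch (an assumption; adjust as needed).
UNCLEAN_STATUSES = {"failed", "pending-install", "pending-upgrade", "pending-rollback"}


def unclean_releases(helm_list_json):
    """Return the releases from `helm list -o json` output that are in an unclean state."""
    releases = json.loads(helm_list_json) or []
    return [r for r in releases if r.get("status") in UNCLEAN_STATUSES]


def fetch_helm_list(helm_bin="helm"):
    """Run the command from the description above, asking for JSON output."""
    result = subprocess.run(
        [helm_bin, "list", "-A", "--failed", "--pending", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def report(helm_list_json):
    """Format one line per unclean release, suitable for a cron mail or a log."""
    return [
        f"{r['namespace']}/{r['name']}: {r['status']} (revision {r['revision']})"
        for r in unclean_releases(helm_list_json)
    ]
```

Calling report(fetch_helm_list()) from a timer or cron job would then yield one line per release that needs attention; an empty list means everything is clean.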

I'm currently not 100% certain that helm does the right thing here, as "helm list -A --failed --pending --superseded" also lists only the pending releases. This might be a bug (or PEBCAK).

If you're coming here because you found yourself in this situation, the way out is to roll back to the last "deployed" state (revision 3 in the example above); see: https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency
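
Finding the revision to roll back to can be sketched as below. This is a hypothetical helper, not part of any existing tooling; it assumes the output of "helm -n <namespace> history <release> -o json" is a list of objects with "revision" and "status" keys. The wikitech page linked above remains the reference for the actual rollback procedure.

```python
import json


def last_deployed_revision(helm_history_json):
    """Return the highest revision whose status is "deployed", or None if there is none."""
    history = json.loads(helm_history_json) or []
    deployed = [
        int(entry["revision"])
        for entry in history
        if entry.get("status") == "deployed"
    ]
    return max(deployed) if deployed else None
```

For the eventstreams history shown in the description, this would pick revision 3, the last entry with status "deployed" before the stuck "pending-upgrade" revision 4.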

Event Timeline

I did some testing and ended up writing a small Prometheus exporter that uses the helm Go library to collect metrics about helm releases (successful and failed). Using that, we could create Prometheus alerting rules for releases that stay in a pending state for >10m.
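
To illustrate the "pending for >10m" condition, here is a minimal sketch. It is not how the exporter works (that is written in Go and uses the helm library); it merely assumes that the "updated" value printed by "helm list" (e.g. "2022-03-17 21:26:50.082844301 +0000 UTC") is a usable proxy for when the release entered its pending state. The function names and the default threshold are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the date, optional fractional seconds and the numeric UTC offset.
_UPDATED_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(?:\.(\d+))? ([+-]\d{4})"
)


def parse_helm_updated(updated):
    """Parse a timestamp like '2022-03-17 21:26:50.082844301 +0000 UTC'."""
    match = _UPDATED_RE.match(updated)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {updated!r}")
    base, fraction, offset = match.groups()
    # strptime only handles microseconds, so truncate the nanoseconds.
    micros = (fraction or "0")[:6].ljust(6, "0")
    return datetime.strptime(f"{base}.{micros} {offset}", "%Y-%m-%d %H:%M:%S.%f %z")


def pending_too_long(release, now, threshold=timedelta(minutes=10)):
    """True if the release is in a pending-* state and was updated more than `threshold` ago."""
    if not release["status"].startswith("pending-"):
        return False
    return now - parse_helm_updated(release["updated"]) > threshold
```

A caller would pass datetime.now(timezone.utc) as "now". The threshold keeps a release that is legitimately mid-upgrade from triggering an alert.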

Change 806870 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Add helm-state-metrics helm chart

https://gerrit.wikimedia.org/r/806870

Change 806871 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Deploy helm-state-metrics to staging-codfw

https://gerrit.wikimedia.org/r/806871

Change 806879 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/docker-images/production-images@master] Add helm-state-metrics image

https://gerrit.wikimedia.org/r/806879

Change 806888 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/helm-state-metrics@master] Initial commit of helm-state-metrics

https://gerrit.wikimedia.org/r/806888

Change 806889 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/software/helm-state-metrics@master] Add vendor dir

https://gerrit.wikimedia.org/r/806889

Change 806888 merged by JMeybohm:

[operations/software/helm-state-metrics@master] Initial commit of helm-state-metrics

https://gerrit.wikimedia.org/r/806888

Change 806889 merged by JMeybohm:

[operations/software/helm-state-metrics@master] Add vendor dir

https://gerrit.wikimedia.org/r/806889

Change 806879 merged by JMeybohm:

[operations/docker-images/production-images@master] Add helm-state-metrics image

https://gerrit.wikimedia.org/r/806879

Mentioned in SAL (#wikimedia-operations) [2022-06-22T15:00:34Z] <jayme> published docker-registry.discovery.wmnet/helm-state-metrics:0.1.0-1 - T310714

Change 806870 merged by jenkins-bot:

[operations/deployment-charts@master] Add helm-state-metrics helm chart

https://gerrit.wikimedia.org/r/806870

Change 806871 merged by jenkins-bot:

[operations/deployment-charts@master] Deploy helm-state-metrics to staging-codfw

https://gerrit.wikimedia.org/r/806871

Change 807913 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] helm-state-metrics fix containerPort protocol

https://gerrit.wikimedia.org/r/807913

Change 807913 merged by jenkins-bot:

[operations/deployment-charts@master] helm-state-metrics fix containerPort protocol

https://gerrit.wikimedia.org/r/807913

Change 807945 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] helm-state-metrics: Enable on all wikikube

https://gerrit.wikimedia.org/r/807945

Deploying this to staging-codfw raised the 5-minute average list request duration from ~8ms to ~100ms (etcd reports its 5-minute average increasing from ~5ms to ~17ms), while the request rate went from ~0.27 to ~0.3 req/s.
While I did expect an increase, I did not expect it to be that big (we have a Prometheus scrape interval of 1m).
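
As a rough back-of-the-envelope check of these numbers, the sketch below estimates what the additional list requests would have to cost on average to move the overall mean that far. It rests on two assumptions that the measurements above do not confirm: that the pre-existing requests kept their ~8ms average, and that the whole rate increase is due to the exporter. The function name is hypothetical.

```python
def implied_added_latency_ms(rate_before, latency_before_ms, rate_after, latency_after_ms):
    """Average latency the added requests would need to explain the new overall mean.

    Solves rate_after * latency_after
         = rate_before * latency_before + added_rate * x   for x.
    """
    added_rate = rate_after - rate_before
    if added_rate <= 0:
        raise ValueError("request rate did not increase")
    return (rate_after * latency_after_ms - rate_before * latency_before_ms) / added_rate


# Figures from the comment above: ~0.27 -> ~0.3 req/s, ~8ms -> ~100ms.
added_latency = implied_added_latency_ms(0.27, 8.0, 0.3, 100.0)

# Extra list requests per 1m Prometheus scrape interval.
added_per_scrape = (0.3 - 0.27) * 60
```

Under those assumptions the added requests would average somewhere around 0.9s each, at roughly two extra list requests per scrape, which would be consistent with a few expensive list calls rather than many cheap ones.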

Change 807945 merged by jenkins-bot:

[operations/deployment-charts@master] helm-state-metrics: Enable on all wikikube

https://gerrit.wikimedia.org/r/807945

Change 808012 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes::master: Double the apiserver latency thresholds

https://gerrit.wikimedia.org/r/808012

Change 808012 merged by JMeybohm:

[operations/puppet@production] kubernetes::master: Double the apiserver latency thresholds

https://gerrit.wikimedia.org/r/808012

Change 808019 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] helm-state-metrics: Resource headroom for bigger clusters

https://gerrit.wikimedia.org/r/808019

Change 808019 merged by jenkins-bot:

[operations/deployment-charts@master] helm-state-metrics: Resource headroom for bigger clusters

https://gerrit.wikimedia.org/r/808019