Make helm upgrades atomic
Closed, Resolved · Public

Description

When a helm upgrade/install fails, it cannot always be fixed with another helm deploy; sometimes it needs a rollback. Unfortunately, helm can end up in situations where it is unable to roll back properly. This happened to me a lot when the deploy meant to fix a broken deploy was broken as well. Helm then ends up with assumptions about the deployed state that differ from reality, and the only way out is purging the chart or tinkering with the tiller configmaps by hand (https://github.com/helm/helm/issues/2437, https://github.com/helm/helm/issues/1844).

Newer versions of helm are able to automatically roll back failed deployments (helm upgrade --atomic). With that, it is much harder to end up in such a situation (it has not happened to me again).

  • Import an up-to-date version of helm
  • Deploy / use new helm version
  • Deploy / use new tiller version (matching helm version)
  • Let helmfile call helm with --atomic
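
As a rough illustration of the last step, a helmfile release can request atomic upgrades so that helmfile passes --atomic to helm. This is a minimal sketch, not the configuration used in operations/deployment-charts: the release name, namespace, chart path and timeout are hypothetical placeholders, and it assumes a helmfile version that supports the release-level `atomic` key.

```yaml
# Hypothetical helmfile.yaml fragment; names and values are placeholders.
releases:
  - name: example-service        # hypothetical release name
    namespace: example           # hypothetical namespace
    chart: charts/example        # hypothetical chart path
    # Ask helmfile to call "helm upgrade" with --atomic, so a failed
    # upgrade is rolled back to the previous release automatically.
    atomic: true
    # Seconds to wait before the upgrade counts as failed (assumed value).
    timeout: 600
```

Called directly, the equivalent would be roughly `helm upgrade --install --atomic example-service charts/example`. Note that --atomic implies waiting for the release to become ready, so the timeout effectively decides when a rollback is triggered.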

Event Timeline

JMeybohm created this task. May 11 2020, 4:50 PM
Restricted Application added a subscriber: Aklapper. May 11 2020, 4:50 PM

Change 595591 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/debs/helm@master] New upstream version 2.16.7

https://gerrit.wikimedia.org/r/595591

Change 595591 merged by JMeybohm:
[operations/debs/helm@master] New upstream version 2.16.7

https://gerrit.wikimedia.org/r/595591

JMeybohm updated the task description. May 13 2020, 7:48 AM

Change 596150 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] raw: Add apiVersion (helm lint), remove appVersion

https://gerrit.wikimedia.org/r/596150

Change 596154 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[integration/config@master] helm-linter: New helm upstream version: v2.16.7

https://gerrit.wikimedia.org/r/596154

JMeybohm claimed this task. May 13 2020, 3:14 PM

Mentioned in SAL (#wikimedia-releng) [2020-05-13T18:43:26Z] <hashar> Building releng/helm-linter:0.2.1 for helm upgrade # T252428

Change 596154 merged by jenkins-bot:
[integration/config@master] helm-linter: New helm upstream version: v2.16.7

https://gerrit.wikimedia.org/r/596154

Change 596266 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Upgrade helm-lint job to use helm 2.16.7

https://gerrit.wikimedia.org/r/596266

Change 596266 merged by jenkins-bot:
[integration/config@master] Upgrade helm-lint job to use helm 2.16.7

https://gerrit.wikimedia.org/r/596266

hashar added a subscriber: hashar. May 13 2020, 7:00 PM

I have updated https://integration.wikimedia.org/ci/job/helm-lint/ to use the new container and hence helm 2.16.7.

Production hosts are still on 2.12.2*: https://debmonitor.wikimedia.org/packages/helm

contint1001 is the CI server used by the Release Pipeline (Blubber) and cannot receive the upgrade since it is still on Jessie. But the service should be migrated to contint2001 (Buster) soonish (T224591).

Change 596150 merged by jenkins-bot:
[operations/deployment-charts@master] raw: Add apiVersion (helm lint)

https://gerrit.wikimedia.org/r/596150

Thanks @hashar!
I've updated helm on all hosts (including contint1001) after verifying that deploys still work okay (even though we still use the older tiller).

JMeybohm updated the task description. May 14 2020, 9:27 AM

Indeed, thank you for taking care of contint1001 / Jessie! :]

Change 597067 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/docker-images/production-images@master] tiller: Upgrade to v2.16.7 on buster

https://gerrit.wikimedia.org/r/597067

Change 597067 merged by JMeybohm:
[operations/docker-images/production-images@master] tiller: Upgrade to v2.16.7 on buster

https://gerrit.wikimedia.org/r/597067

Change 597278 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[integration/config@master] helm-linter: New helm version (2.16.7-2)

https://gerrit.wikimedia.org/r/597278

Change 597278 merged by jenkins-bot:
[integration/config@master] helm-linter: New helm version (2.16.7-2)

https://gerrit.wikimedia.org/r/597278

Change 597307 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] jjb: [helm-lint] Switch to new helm version (2.16.7-2)

https://gerrit.wikimedia.org/r/597307

Mentioned in SAL (#wikimedia-releng) [2020-05-19T16:33:03Z] <James_F> Docker: Publishing helm-linter:0.2.2 T252428

Change 597307 merged by jenkins-bot:
[integration/config@master] jjb: [helm-lint] Switch to new helm version (2.16.7-2)

https://gerrit.wikimedia.org/r/597307

Change 598441 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] admin: update tiller in mathoid namespace to 2.16.7-wmf1

https://gerrit.wikimedia.org/r/598441

Change 598487 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Enable atomic helm upgrades for all service deployments

https://gerrit.wikimedia.org/r/598487

Change 598441 merged by JMeybohm:
[operations/deployment-charts@master] admin: update tiller in mathoid namespace to 2.16.7-wmf1

https://gerrit.wikimedia.org/r/598441

Change 598501 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] admin: update tiller to 2.16.7-wmf1

https://gerrit.wikimedia.org/r/598501

hashar removed a subscriber: hashar. May 25 2020, 6:09 PM

Change 598487 merged by jenkins-bot:
[operations/deployment-charts@master] Enable atomic helm upgrades for all service deployments

https://gerrit.wikimedia.org/r/598487

JMeybohm updated the task description. May 26 2020, 2:31 PM

Change 598501 merged by jenkins-bot:
[operations/deployment-charts@master] admin: update tiller to 2.16.7-wmf1

https://gerrit.wikimedia.org/r/598501

JMeybohm closed this task as Resolved. May 27 2020, 5:57 PM
JMeybohm updated the task description.

Tiller has been updated in all clusters and namespaces, so this is resolved now.

JMeybohm reopened this task as Open. Sep 29 2020, 11:43 AM
JMeybohm triaged this task as Medium priority.
JMeybohm removed a project: Patch-For-Review.

We should use atomic upgrades in helmfile.d/admin as well.

Change 634954 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Enable atomic helm upgrades for admin deployments

https://gerrit.wikimedia.org/r/634954

Change 634954 merged by jenkins-bot:
[operations/deployment-charts@master] Enable atomic helm upgrades for admin deployments

https://gerrit.wikimedia.org/r/634954

JMeybohm closed this task as Resolved. Oct 20 2020, 8:42 AM