
Helm was left in limbo due to interrupted deployment/rollback
Closed, Resolved (Public)

Description

While a deployer was deploying, their network went down, which coincided with a Typha outage. After Calico and Typha were restored, the deployer attempted to run scap, which resulted in P59348:

COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/mw-parsoid-deploy-eqiad.config
  Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

Checking Helm's history, it looks like the release is stuck in pending-rollback.

root@deploy1002:/srv/deployment-charts/helmfile.d/services/mw-web# helm -n mw-parsoid history main
REVISION        UPDATED                         STATUS                  CHART                   APP VERSION     DESCRIPTION
219             Tue Apr  2 08:46:17 2024        superseded              mediawiki-0.6.23                        Upgrade comp
220             Tue Apr  2 09:12:56 2024        superseded              mediawiki-0.6.23                        Upgrade comp
221             Tue Apr  2 13:08:10 2024        superseded              mediawiki-0.6.23                        Upgrade comp
222             Tue Apr  2 13:27:56 2024        superseded              mediawiki-0.6.23                        Upgrade comp
223             Tue Apr  2 15:52:18 2024        superseded              mediawiki-0.6.23                        Upgrade comp
224             Wed Apr  3 08:10:30 2024        superseded              mediawiki-0.6.23                        Upgrade comp
225             Wed Apr  3 08:24:23 2024        superseded              mediawiki-0.6.23                        Upgrade comp
226             Wed Apr  3 09:56:57 2024        superseded              mediawiki-0.6.23                        Upgrade comp
227             Wed Apr  3 10:10:47 2024        deployed                mediawiki-0.6.23                        Upgrade comp
228             Wed Apr  3 13:04:44 2024        failed                  mediawiki-0.6.23                        Upgrade "mai
229             Wed Apr  3 13:14:47 2024        pending-rollback        mediawiki-0.6.23
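
The same history can be inspected programmatically. Below is a minimal sketch, assuming `helm history ... -o json` returns a list of entries with `revision` and `status` keys; the helper names `load_history` and `summarize` are hypothetical and are not taken from scap. It reports the latest revision if it is in a pending-* state, plus the most recent revision still marked `deployed`.

```python
import json
import subprocess

# Helm statuses that indicate an operation never finished.
PENDING_STATES = {"pending-install", "pending-upgrade", "pending-rollback"}


def load_history(namespace, release):
    """Return the release history as a list of dicts (needs helm on PATH)."""
    out = subprocess.run(
        ["helm", "-n", namespace, "history", release, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def summarize(history):
    """Return (stuck_revision, last_deployed_revision); either may be None."""
    ordered = sorted(history, key=lambda entry: entry["revision"])
    latest = ordered[-1] if ordered else None
    stuck = None
    if latest is not None and latest["status"] in PENDING_STATES:
        stuck = latest["revision"]
    deployed = [e["revision"] for e in ordered if e["status"] == "deployed"]
    return stuck, (deployed[-1] if deployed else None)
```

Something like `summarize(load_history("mw-parsoid", "main"))` would then give the stuck revision and a candidate to roll back to. Whether that candidate is actually the right rollback target is still a judgment call for the operator.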

Details

Title                                                       Reference                             Author  Source Branch                                     Dest Branch
kubernetes: fix helm release in all known pending-* states  repos/releng/scap!267                 dancy   master-I598f2f2aa19c0cf21c9a1bfcb9562266eb30c82b  master
helm-check.py: Handle pending-rollback state too            repos/releng/gitlab-cloud-runner!375  dancy   main-Ied9913316a0793ce9370167d1799281b3e62bfe8    main

Event Timeline

We depooled mw-web-ro from eqiad and attempted a rollback:

root@deploy1002:/srv/deployment-charts/helmfile.d/services/mw-web# helm -n mw-web rollback main 2208
Rollback was a success! Happy Helming!

We repooled and went on to roll back all the failed deployments. We then ran scap again to bring everything into a consistent state.

jijiki claimed this task.
jijiki triaged this task as Unbreak Now! priority.
jijiki added a project: serviceops.
dancy subscribed.

Scap already has a check for helm releases in pending-upgrade state but it looks like it needs to handle pending-rollback too.
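
For illustration, here is a rough sketch of what a pre-deploy guard covering every pending-* state could look like. This is not Scap's actual code. The function names are hypothetical, and the input is assumed to be a list of dicts with `name`, `namespace` and `status` keys, as in the JSON output of `helm list`.

```python
# All Helm statuses that mean an install/upgrade/rollback was left unfinished.
PENDING_STATES = frozenset(
    {"pending-install", "pending-upgrade", "pending-rollback"}
)


def stuck_releases(releases):
    """Return (namespace, name, status) for each release in a pending-* state."""
    return [
        (r["namespace"], r["name"], r["status"])
        for r in releases
        if r["status"] in PENDING_STATES
    ]


def assert_deployable(releases):
    """Raise RuntimeError if any release needs repair before deploying."""
    stuck = stuck_releases(releases)
    if stuck:
        details = ", ".join(
            f"{ns}/{name} ({status})" for ns, name, status in stuck
        )
        raise RuntimeError(f"helm releases need repair first: {details}")
```

Matching on the full set of pending states, rather than on pending-upgrade alone, is the point here: an interrupted rollback leaves the release just as locked as an interrupted upgrade.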

dancy lowered the priority of this task from Unbreak Now! to Medium. (Apr 4 2024, 8:19 PM)

Scap 4.75.0 deployed with the fix.