
Improved alerts/awareness if helm deployment of a service fails
Closed, Resolved, Public

Description

(Feel free to close this with "Works as designed", thought I'd file anyway for feedback and as a record of debugging.)

Earlier today, I attempted to deploy a change for the linkrecommendation service (rDEPLOYCHARTS2c400fc0791c: linkrecommendation: Bump version)

We have instructions documenting deployments for the service here https://wikitech.wikimedia.org/wiki/Add_Link#Deployment_2, tl;dr:

  • deploy to staging
    • run the service-checker-swagger command
    • run a diff command comparing output from staging and prod
  • deploy to eqiad
    • service-checker-swagger

What happened:

  • I ran helmfile -e staging -i apply at 14:00 (https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2022-02-28). The command gave no output for a while. This is not uncommon with staging; sometimes it can take a minute or two. However, the delay was long enough that my SSH connection cut out.
  • I ran helmfile -e staging -i apply again, which said there was nothing to do (although it did emit some noise to #wikimedia-operations)
  • I assumed the deployment to staging was successful, and ran service-checker-swagger which told me everything was fine.
  • I ran the diff command (diff <(curl -s "https://linkrecommendation.discovery.wmnet:4005/v1/linkrecommendations/wikipedia/cs/Barack_Obama?threshold=0.5&max_recommendations=15" | jq .) <(curl -s "https://staging.svc.eqiad.wmnet:4005/v1/linkrecommendations/wikipedia/cs/Barack_Obama?threshold=0.5&max_recommendations=15" | jq .)) and at this point realized there was a problem. We embed the "application_version" (the git commit hash of the application) in the JSON response, so after a successful staging deployment this line normally differs between the two responses. Instead I saw that the responses from staging and production were identical.
  • I then checked the service with kube_env linkrecommendation staging and kubectl get pods, and at that point saw "CrashLoopBackOff", which led me to find the issue and file T302716: ImportError: cannot import name 'json' from 'itsdangerous' (/opt/lib/python/site-packages/itsdangerous/__init__.py)
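
The version comparison from the diff step could also be scripted so that identical versions fail loudly. This is only a minimal sketch: app_version and versions_differ are hypothetical helper names, and it assumes "application_version" appears as a quoted string somewhere in the JSON response (its exact position in the payload is not stated above).

```shell
# Extract the first "application_version" value from a JSON document on stdin.
# Assumption: the value is a plain quoted string (e.g. a git commit hash).
app_version() {
  grep -o '"application_version"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 \
    | sed 's/.*"\([^"]*\)"$/\1/'
}

# Succeed (exit 0) only if staging reports a version and it differs from prod.
# $1 = production JSON response, $2 = staging JSON response.
versions_differ() {
  prod_version=$(printf '%s\n' "$1" | app_version)
  staging_version=$(printf '%s\n' "$2" | app_version)
  [ -n "$staging_version" ] && [ "$prod_version" != "$staging_version" ]
}

# Usage (URLs as in the diff command above, fetched with curl -s):
#   versions_differ "$prod_json" "$staging_json" \
#     || echo "staging still serves the production version" >&2
```

Identical versions only indicate a problem right after a staging-only deployment; once production has been updated as well, the two are expected to match again.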

Writing this out, I think the main takeaway is that we should add helmfile -e staging status to our deployment steps. But it would be nice if there was some mechanism to make it more obvious that the deployment failed.
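
One possible shape for such a check, as a rough sketch rather than an existing tool: a small filter over the output of kubectl get pods that prints every pod that is not fully ready and exits non-zero if it finds one. The function name unhealthy_pods is hypothetical, and the column layout (NAME, READY, STATUS, ...) is assumed to match the default kubectl get pods output.

```shell
# Read `kubectl get pods` output on stdin, print "<pod> <status>" for every pod
# that is neither Completed nor Running with all containers ready, and exit 1
# if any such pod was found.
unhealthy_pods() {
  awk 'BEGIN { found = 0 }
       $1 != "NAME" && NF >= 3 {
         split($2, ready, "/")
         healthy = ($3 == "Completed") || ($3 == "Running" && ready[1] == ready[2])
         if (!healthy) { print $1, $3; found = 1 }
       }
       END { exit found }'
}

# Usage, after `kube_env linkrecommendation staging`:
#   kubectl get pods | unhealthy_pods || echo "deployment looks broken" >&2
```

In the situation described above this would have flagged the pod in CrashLoopBackOff even though the old pod was still answering requests.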

Event Timeline

Restricted Application added a subscriber: Aklapper.

I don't seem to be able to deploy a new version of the service now, on staging I see:

COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/linkrecommendation-deploy-staging.config
  Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

This is not uncommon with staging; sometimes it can take a minute or two. However, the delay was long enough that my SSH connection cut out.

I'd assume the main problem lies here.
Helm waits for the new pods to come up and become ready (health-check wise) and times out after 600 seconds (see timeout: at the top of your helmfile.yaml; you can of course lower that value if you want your deployments to fail faster). After the timeout, helm is instructed to roll back to the previous state (atomic: true), which in your case might not have completed successfully because your SSH connection was terminated. In general, I'd strongly recommend running everything (especially long-running processes) in tmux/screen sessions to prevent processes from being terminated like that.
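
As a minimal sketch of the tmux suggestion (the function name deploy_session and the session naming scheme are made up for illustration), a small wrapper can create a named session per environment and reattach to it after a dropped connection:

```shell
# Create a tmux session named "deploy-<environment>", or attach to it if it
# already exists (-A). Commands started inside the session keep running when
# the SSH connection drops.
deploy_session() {
  tmux new-session -A -s "deploy-$1"
}

# Usage:
#   deploy_session staging        # then run: helmfile -e staging -i apply
#   ...SSH connection drops, log in again...
#   deploy_session staging        # reattaches to the still-running apply
```

Because the helm client keeps running inside the session, it can still reach its timeout and perform the rollback that atomic: true asks for.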

I don't seem to be able to deploy a new version of the service now, on staging I see:

COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/linkrecommendation-deploy-staging.config
  Error: UPGRADE FAILED: another operation (install/upgrade/rollback) is in progress

Might be fallout of the unclean termination of the last deployment. I'll take a look.

$  kubectl -n linkrecommendation get deployment,po                                              
NAME                                         READY   UP-TO-DATE   AVAILABLE   AGE                                                                        
deployment.apps/linkrecommendation-staging   1/1     1            1           116d                                                                       

NAME                                              READY   STATUS             RESTARTS   AGE                                                              
pod/linkrecommendation-staging-57546d74f7-tzvc5   2/3     CrashLoopBackOff   252        21h                                                              
pod/linkrecommendation-staging-ddbd7b66c-9jnlt    3/3     Running            0          7d22h

$ helm list -a
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/linkrecommendation-deploy-staging.config
NAME    NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
staging linkrecommendation      13              2022-02-28 14:00:26.45383692 +0000 UTC  pending-upgrade linkrecommendation-0.2.3

$ helm history staging
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/linkrecommendation-deploy-staging.config
REVISION        UPDATED                         STATUS          CHART                           APP VERSION     DESCRIPTION      
4               Tue Jan 25 15:03:42 2022        superseded      linkrecommendation-0.2.0                        Upgrade complete 
5               Wed Jan 26 09:52:52 2022        superseded      linkrecommendation-0.2.0                        Upgrade complete 
6               Wed Jan 26 12:09:45 2022        superseded      linkrecommendation-0.2.0                        Upgrade complete 
7               Wed Jan 26 13:16:39 2022        superseded      linkrecommendation-0.2.0                        Upgrade complete 
8               Tue Feb  1 13:43:31 2022        superseded      linkrecommendation-0.2.2                        Upgrade complete 
9               Thu Feb  3 12:47:38 2022        superseded      linkrecommendation-0.2.2                        Upgrade complete 
10              Thu Feb 10 11:04:05 2022        superseded      linkrecommendation-0.2.2                        Upgrade complete 
11              Thu Feb 10 11:16:43 2022        superseded      linkrecommendation-0.2.3                        Upgrade complete 
12              Mon Feb 21 10:15:30 2022        deployed        linkrecommendation-0.2.3                        Upgrade complete 
13              Mon Feb 28 14:00:26 2022        pending-upgrade linkrecommendation-0.2.3                        Preparing upgrade

From that I'll assume the release was never marked as failed, so the rollback was never triggered (because that's something the client does).

$ helm rollback staging 12
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/linkrecommendation-deploy-staging.config
Rollback was a success! Happy Helming!

$ helm history staging                             
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/linkrecommendation-deploy-staging.config           
REVISION        UPDATED                         STATUS          CHART                           APP VERSION     DESCRIPTION                              
5               Wed Jan 26 09:52:52 2022        superseded      linkrecommendation-0.2.0                        Upgrade complete                         
6               Wed Jan 26 12:09:45 2022        superseded      linkrecommendation-0.2.0                        Upgrade complete                         
7               Wed Jan 26 13:16:39 2022        superseded      linkrecommendation-0.2.0                        Upgrade complete                         
8               Tue Feb  1 13:43:31 2022        superseded      linkrecommendation-0.2.2                        Upgrade complete                         
9               Thu Feb  3 12:47:38 2022        superseded      linkrecommendation-0.2.2                        Upgrade complete                         
10              Thu Feb 10 11:04:05 2022        superseded      linkrecommendation-0.2.2                        Upgrade complete                         
11              Thu Feb 10 11:16:43 2022        superseded      linkrecommendation-0.2.3                        Upgrade complete                         
12              Mon Feb 21 10:15:30 2022        superseded      linkrecommendation-0.2.3                        Upgrade complete                         
13              Mon Feb 28 14:00:26 2022        pending-upgrade linkrecommendation-0.2.3                        Preparing upgrade                        
14              Tue Mar  1 11:12:43 2022        deployed        linkrecommendation-0.2.3                        Rollback to 12

$ kubectl -n linkrecommendation get deployment,po 
NAME                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/linkrecommendation-staging   1/1     1            1           116d

NAME                                             READY   STATUS    RESTARTS   AGE
pod/linkrecommendation-staging-ddbd7b66c-9jnlt   3/3     Running   0          7d22h

@kostajh I think you should be able to deploy again
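
The stuck state seen in the history output above can be detected with a small filter. This is a sketch under assumptions: latest_status and is_stuck are hypothetical names, and the parsing relies on the plain-text helm history layout shown above, where the UPDATED column spans five whitespace-separated fields.

```shell
# Print the STATUS of the newest revision from `helm history` text on stdin.
# Only rows starting with a revision number are considered, so the WARNING and
# header lines are skipped. UPDATED looks like "Mon Feb 28 14:00:26 2022"
# (five fields), which makes STATUS field 7.
latest_status() {
  awk '$1 ~ /^[0-9]+$/ { status = $7 } END { print status }'
}

# Succeed if the newest revision is stuck in a pending-* state
# (pending-install, pending-upgrade or pending-rollback).
is_stuck() {
  case "$(latest_status)" in
    pending-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   helm history staging | is_stuck && echo "release is stuck, consider a rollback"
```

In the first history listing above this reports pending-upgrade for revision 13; after helm rollback staging 12 the newest revision is 14 with status deployed, so is_stuck no longer succeeds.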

Mentioned in SAL (#wikimedia-operations) [2022-03-01T11:17:26Z] <jayme> rolled back linkrecommendation staging helm release to revision 12 - T302744

kostajh claimed this task.

Confirmed that worked. tmux is a good idea, thanks!