
Turn wmcs-k8s-node-upgrade.py into a set of cookbooks
Closed, Resolved (Public)

Description

I'd like to further expand the level of automation on K8s node upgrades, and for that the current script needs to be turned into a cookbook to use all of the tooling we have available.

Event Timeline

Change 947339 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] openstack: wmcs-enc-cli: allow loading data from stdin or file

https://gerrit.wikimedia.org/r/947339

Change 947339 merged by David Caro:

[operations/puppet@production] openstack: wmcs-enc-cli: allow loading data from stdin or file

https://gerrit.wikimedia.org/r/947339

Change 947346 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/wmcs-cookbooks@main] toolforge: k8s: add cookbook to prepare for cluster upgrade

https://gerrit.wikimedia.org/r/947346

Change 951978 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/wmcs-cookbooks@main] toolforge: k8s: add a cookbook to upgrade an individual node

https://gerrit.wikimedia.org/r/951978

Mentioned in SAL (#wikimedia-cloud) [2023-08-30T08:15:00Z] <taavi> updating toolsbeta k8s cluster to 1.23 to test new cookbooks, T298005 T343869

Change 947346 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] toolforge: k8s: add cookbook to prepare for cluster upgrade

https://gerrit.wikimedia.org/r/947346

Change 951978 merged by jenkins-bot:

[cloud/wmcs-cookbooks@main] toolforge: k8s: add a cookbook to upgrade an individual node

https://gerrit.wikimedia.org/r/951978

Change 953581 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/wmcs-cookbooks@main] toolforge: k8s: worker: upgrade: add SAL messages

https://gerrit.wikimedia.org/r/953581

aborrero triaged this task as Medium priority. (Aug 30 2023, 11:22 AM)
aborrero added a project: User-aborrero.
aborrero subscribed.

I just ran the new cookbook from my laptop like this:

$ cookbook wmcs.toolforge.k8s.worker.upgrade --task-id T298005 --cluster-name toolsbeta --hostname toolsbeta-test-k8s-worker-9 --src-version 1.22.17 --dst-version 1.23.17
[..]

and it worked just fine.

The only problem was that puppet had already been disabled on the node with a different reason, so the cookbook couldn't re-enable it.
After re-enabling the puppet agent by hand, the next cookbook run was very smooth.

Maybe we can improve the code to detect this in the first place (and maybe prevent any operations?)
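
A rough sketch of what such a pre-flight check could look like, in plain Python. This is not the cookbook's actual code: `preflight_puppet_action`, `PuppetReasonMismatch` and the returned action names are hypothetical, and how the existing disable reason is read from the host is left out (the caller is assumed to pass it in, or `None` when puppet is enabled).

```python
from typing import Optional


class PuppetReasonMismatch(Exception):
    """Puppet is disabled with a reason this cookbook run does not own."""


def preflight_puppet_action(current_reason: Optional[str], expected_reason: str) -> str:
    """Decide what to do about puppet before draining or upgrading anything.

    current_reason: existing disable message on the host, or None if puppet
                    is currently enabled (hypothetical input, see above).
    expected_reason: the reason this cookbook run would use.
    """
    if current_reason is None:
        # Puppet is enabled: the cookbook can disable it with its own reason.
        return "disable"
    if current_reason == expected_reason:
        # Already disabled by an earlier run with the same reason: safe to go on.
        return "proceed"
    # Someone else disabled puppet: stop before touching the node.
    raise PuppetReasonMismatch(
        f"puppet is disabled with reason {current_reason!r}, "
        f"expected {expected_reason!r}; refusing to continue"
    )
```

Running a check like this before the drain step would make the cookbook fail early, with the node still untouched, instead of failing at the enable step after the node has already been drained.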

Also on the puppet issue: if the first step of the upgrade is to disable puppet fleet-wide, then this definitely needs rethinking.

The complete log of the failure is:

user@laptop:~$ cookbook wmcs.toolforge.k8s.worker.upgrade --task-id T298005 --cluster-name toolsbeta --hostname toolsbeta-test-k8s-worker-9 --src-version 1.22.17 --dst-version 1.23.17
START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node toolsbeta-test-k8s-worker-9 from 1.22.17 to 1.23.17
Using control node toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud
[DOLOGMSG]: toolsbeta-test-k8s-worker-9: upgrading k8s from 1.22.17 to 1.23.17
Draining node toolsbeta-test-k8s-worker-9
----- OUTPUT of 'sudo -i kubectl ...est-k8s-worker-9' -----                                                                                  
node/toolsbeta-test-k8s-worker-9 cordoned                                                                                                    
evicting pod volume-admission/volume-admission-64c68bd9cd-jbq5m                                                                              
evicting pod builds-api/builds-api-6cbb5c7486-hkqnj                                                                                          
evicting pod envvars-api/envvars-api-6cd5ff75bd-d5k2t                                                                                        
evicting pod image-build/test-buildpacks-pipelinerun-dsvsz-build-from-git-pod                                                                
evicting pod image-build/test-buildpacks-pipelinerun-qzlhf-build-from-git-pod                                                                
evicting pod ingress-admission/ingress-admission-bd957fff5-tz5cf                                                                             
evicting pod jobs-emailer/jobs-emailer-f545f54bb-bvmd5                                                                                       
evicting pod kube-system/calico-typha-68bdc84bf9-r889m                                                                                       
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-4cc5d, kube-system/kube-proxy-txm2r, metrics/cadvisor-7lk8v                
evicting pod maintain-kubeusers/maintain-kubeusers-7d5575b78-k7t59                                                                           
evicting pod metrics/kube-state-metrics-6f4c64464d-5hdbc                                                                                     
evicting pod tekton-pipelines/tekton-pipelines-controller-77544bb9dc-76l2p                                                                   
evicting pod tekton-pipelines/tekton-pipelines-webhook-5d899cc8c-vn5zt                                                                       
evicting pod tool-test/test-78c554c857-8whkm                                                                                                 
evicting pod tool-test3/test-scheduler-7bd9947cf7-9kw58                                                                                      
pod/test-buildpacks-pipelinerun-dsvsz-build-from-git-pod evicted                                                                             
pod/test-buildpacks-pipelinerun-qzlhf-build-from-git-pod evicted                                                                             
I0830 11:15:40.335976   32518 request.go:685] Waited for 1.031394338s due to client-side throttling, not priority and fairness, request: GET:https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/envvars-api/pods/envvars-api-6cd5ff75bd-d5k2t                            
pod/envvars-api-6cd5ff75bd-d5k2t evicted                                                                                                     
pod/ingress-admission-bd957fff5-tz5cf evicted                                                                                                
pod/calico-typha-68bdc84bf9-r889m evicted                                                                                                    
pod "tekton-pipelines-controller-77544bb9dc-76l2p" has DeletionTimestamp older than 1 seconds, skipping                                      
pod "tekton-pipelines-webhook-5d899cc8c-vn5zt" has DeletionTimestamp older than 1 seconds, skipping                                          
pod "test-78c554c857-8whkm" has DeletionTimestamp older than 1 seconds, skipping                                                             
pod "builds-api-6cbb5c7486-hkqnj" has DeletionTimestamp older than 1 seconds, skipping                                                       
pod "jobs-emailer-f545f54bb-bvmd5" has DeletionTimestamp older than 1 seconds, skipping                                                      
pod "kube-state-metrics-6f4c64464d-5hdbc" has DeletionTimestamp older than 1 seconds, skipping                                               
pod "maintain-kubeusers-7d5575b78-k7t59" has DeletionTimestamp older than 1 seconds, skipping                                                
pod "test-scheduler-7bd9947cf7-9kw58" has DeletionTimestamp older than 1 seconds, skipping                                                   
pod/volume-admission-64c68bd9cd-jbq5m evicted                                                                                                
node/toolsbeta-test-k8s-worker-9 drained                                                                                                     
================                                                                                                                             
PASS |███████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:08<00:00,  8.51s/hosts]
FAIL |                                                                                                       |   0% (0/1) [00:08<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i kubectl ...est-k8s-worker-9'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Running Puppet on toolsbeta-test-k8s-worker-9 to pick up updated Apt components
Enabling Puppet with reason "kubernetes upgrade to 1.23.17 - arturo@nostromo" on 1 hosts: toolsbeta-test-k8s-worker-9.toolsbeta.eqiad1.wikimedia.cloud
----- OUTPUT of 'sudo -i enable-p...arturo@nostromo"' -----                                                                                  
Mismatched message, not enabling puppet.                                                                                                     
================                                                                                                                             
PASS |                                                                                                       |   0% (0/1) [00:07<?, ?hosts/s]
FAIL |███████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:07<00:00,  7.27s/hosts]
100.0% (1/1) of nodes failed to execute command 'sudo -i enable-p...arturo@nostromo"': toolsbeta-test-k8s-worker-9.toolsbeta.eqiad1.wikimedia.cloud
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'sudo -i enable-p...arturo@nostromo"'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Exception raised while executing cookbook wmcs.toolforge.k8s.worker.upgrade:
Traceback (most recent call last):
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/_menu.py", line 212, in run
    raw_ret = runner.run()
              ^^^^^^^^^^^^
  File "/home/arturo/git/wmf/cloud/wmcs-cookbooks/wmcs_libs/common.py", line 781, in _wrapped_run
    return object.__getattribute__(self, __name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arturo/git/wmf/cloud/wmcs-cookbooks/cookbooks/wmcs/toolforge/k8s/worker/upgrade.py", line 182, in run
    puppet.enable(self.spicerack.admin_reason(f"kubernetes upgrade to {self.dst_version}"))
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/puppet.py", line 137, in enable
    self._remote_hosts.run_sync("enable-puppet " + self._puppet_reason(reason, verbatim_reason))
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/remote.py", line 496, in run_sync
    return self._execute(
           ^^^^^^^^^^^^^^
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/remote.py", line 702, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed", worker.get_results())
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node toolsbeta-test-k8s-worker-9 from 1.22.17 to 1.23.17

Change 953588 had a related patch set uploaded (by Majavah; author: Majavah):

[cloud/wmcs-cookbooks@main] toolforge: k8s: use verbatism reasons for puppet disables

https://gerrit.wikimedia.org/r/953588

Change 953588 merged by Arturo Borrero Gonzalez:

[cloud/wmcs-cookbooks@main] toolforge: k8s: use verbatism reasons for puppet disables

https://gerrit.wikimedia.org/r/953588

Change 953581 abandoned by Majavah:

[cloud/wmcs-cookbooks@main] toolforge: k8s: worker: upgrade: add SAL messages

Reason:

https://gerrit.wikimedia.org/r/953581

Change 966864 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] kubeadm: drop version upgrade script

https://gerrit.wikimedia.org/r/966864

Change 966864 merged by Majavah:

[operations/puppet@production] kubeadm: drop version upgrade script

https://gerrit.wikimedia.org/r/966864