I'd like to further expand the level of automation on K8s node upgrades, and for that the current script needs to be turned into a cookbook to use all of the tooling we have available.
Related Objects
Mentioned In
- rCCKB07dbc323937b: toolforge: k8s: use verbatism reasons for puppet disables
- rCCKBb0266542cef5: toolforge: k8s: add a cookbook to upgrade an individual node
- rCCKBdae4e7fa9cf8: toolforge: k8s: add cookbook to prepare for cluster upgrade
- T298005: Upgrade Toolforge Kubernetes to version 1.23
- T343330: WMCS cookbooks: provide shared hosts for people without global root privileges
Mentioned Here
- T298005: Upgrade Toolforge Kubernetes to version 1.23
Event Timeline
Change 947339 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] openstack: wmcs-enc-cli: allow loading data from stdin or file
Change 947339 merged by David Caro:
[operations/puppet@production] openstack: wmcs-enc-cli: allow loading data from stdin or file
Change 947346 had a related patch set uploaded (by Majavah; author: Majavah):
[cloud/wmcs-cookbooks@main] toolforge: k8s: add cookbook to prepare for cluster upgrade
Change 951978 had a related patch set uploaded (by Majavah; author: Majavah):
[cloud/wmcs-cookbooks@main] toolforge: k8s: add a cookbook to upgrade an individual node
Mentioned in SAL (#wikimedia-cloud) [2023-08-30T08:15:00Z] <taavi> updating toolsbeta k8s cluster to 1.23 to test new cookbooks, T298005 T343869
Change 947346 merged by jenkins-bot:
[cloud/wmcs-cookbooks@main] toolforge: k8s: add cookbook to prepare for cluster upgrade
Change 951978 merged by jenkins-bot:
[cloud/wmcs-cookbooks@main] toolforge: k8s: add a cookbook to upgrade an individual node
Change 953581 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[cloud/wmcs-cookbooks@main] toolforge: k8s: worker: upgrade: add SAL messages
I just ran the new cookbook from my laptop like this:
$ cookbook wmcs.toolforge.k8s.worker.upgrade --task-id T298005 --cluster-name toolsbeta --hostname toolsbeta-test-k8s-worker-9 --src-version 1.22.17 --dst-version 1.23.17 [..]
and it worked just fine.
The only problem was that puppet had already been disabled with a different reason, so the cookbook could not re-enable it.
After re-enabling the puppet agent by hand, the next cookbook run was very smooth.
Maybe we can improve the code to detect this up front (and perhaps refuse to start any operations?)
Moreover, on the puppet issue: if the first step is to disable puppet fleet-wide, then this definitely needs rethinking.
The complete log of the failure is:
user@laptop:~$ cookbook wmcs.toolforge.k8s.worker.upgrade --task-id T298005 --cluster-name toolsbeta --hostname toolsbeta-test-k8s-worker-9 --src-version 1.22.17 --dst-version 1.23.17
START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node toolsbeta-test-k8s-worker-9 from 1.22.17 to 1.23.17
Using control node toolsbeta-test-k8s-control-4.toolsbeta.eqiad1.wikimedia.cloud
[DOLOGMSG]: toolsbeta-test-k8s-worker-9: upgrading k8s from 1.22.17 to 1.23.17
Draining node toolsbeta-test-k8s-worker-9
----- OUTPUT of 'sudo -i kubectl ...est-k8s-worker-9' -----
node/toolsbeta-test-k8s-worker-9 cordoned
evicting pod volume-admission/volume-admission-64c68bd9cd-jbq5m
evicting pod builds-api/builds-api-6cbb5c7486-hkqnj
evicting pod envvars-api/envvars-api-6cd5ff75bd-d5k2t
evicting pod image-build/test-buildpacks-pipelinerun-dsvsz-build-from-git-pod
evicting pod image-build/test-buildpacks-pipelinerun-qzlhf-build-from-git-pod
evicting pod ingress-admission/ingress-admission-bd957fff5-tz5cf
evicting pod jobs-emailer/jobs-emailer-f545f54bb-bvmd5
evicting pod kube-system/calico-typha-68bdc84bf9-r889m
WARNING: ignoring DaemonSet-managed Pods: kube-system/calico-node-4cc5d, kube-system/kube-proxy-txm2r, metrics/cadvisor-7lk8v
evicting pod maintain-kubeusers/maintain-kubeusers-7d5575b78-k7t59
evicting pod metrics/kube-state-metrics-6f4c64464d-5hdbc
evicting pod tekton-pipelines/tekton-pipelines-controller-77544bb9dc-76l2p
evicting pod tekton-pipelines/tekton-pipelines-webhook-5d899cc8c-vn5zt
evicting pod tool-test/test-78c554c857-8whkm
evicting pod tool-test3/test-scheduler-7bd9947cf7-9kw58
pod/test-buildpacks-pipelinerun-dsvsz-build-from-git-pod evicted
pod/test-buildpacks-pipelinerun-qzlhf-build-from-git-pod evicted
I0830 11:15:40.335976 32518 request.go:685] Waited for 1.031394338s due to client-side throttling, not priority and fairness, request: GET:https://k8s.toolsbeta.eqiad1.wikimedia.cloud:6443/api/v1/namespaces/envvars-api/pods/envvars-api-6cd5ff75bd-d5k2t
pod/envvars-api-6cd5ff75bd-d5k2t evicted
pod/ingress-admission-bd957fff5-tz5cf evicted
pod/calico-typha-68bdc84bf9-r889m evicted
pod "tekton-pipelines-controller-77544bb9dc-76l2p" has DeletionTimestamp older than 1 seconds, skipping
pod "tekton-pipelines-webhook-5d899cc8c-vn5zt" has DeletionTimestamp older than 1 seconds, skipping
pod "test-78c554c857-8whkm" has DeletionTimestamp older than 1 seconds, skipping
pod "builds-api-6cbb5c7486-hkqnj" has DeletionTimestamp older than 1 seconds, skipping
pod "jobs-emailer-f545f54bb-bvmd5" has DeletionTimestamp older than 1 seconds, skipping
pod "kube-state-metrics-6f4c64464d-5hdbc" has DeletionTimestamp older than 1 seconds, skipping
pod "maintain-kubeusers-7d5575b78-k7t59" has DeletionTimestamp older than 1 seconds, skipping
pod "test-scheduler-7bd9947cf7-9kw58" has DeletionTimestamp older than 1 seconds, skipping
pod/volume-admission-64c68bd9cd-jbq5m evicted
node/toolsbeta-test-k8s-worker-9 drained
================
PASS |███████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:08<00:00, 8.51s/hosts]
FAIL | | 0% (0/1) [00:08<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i kubectl ...est-k8s-worker-9'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Running Puppet on toolsbeta-test-k8s-worker-9 to pick up updated Apt components
Enabling Puppet with reason "kubernetes upgrade to 1.23.17 - arturo@nostromo" on 1 hosts: toolsbeta-test-k8s-worker-9.toolsbeta.eqiad1.wikimedia.cloud
----- OUTPUT of 'sudo -i enable-p...arturo@nostromo"' -----
Mismatched message, not enabling puppet.
================
PASS | | 0% (0/1) [00:07<?, ?hosts/s]
FAIL |███████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:07<00:00, 7.27s/hosts]
100.0% (1/1) of nodes failed to execute command 'sudo -i enable-p...arturo@nostromo"': toolsbeta-test-k8s-worker-9.toolsbeta.eqiad1.wikimedia.cloud
0.0% (0/1) success ratio (< 100.0% threshold) for command: 'sudo -i enable-p...arturo@nostromo"'. Aborting.
0.0% (0/1) success ratio (< 100.0% threshold) of nodes successfully executed all commands. Aborting.
Exception raised while executing cookbook wmcs.toolforge.k8s.worker.upgrade:
Traceback (most recent call last):
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/_menu.py", line 212, in run
    raw_ret = runner.run()
              ^^^^^^^^^^^^
  File "/home/arturo/git/wmf/cloud/wmcs-cookbooks/wmcs_libs/common.py", line 781, in _wrapped_run
    return object.__getattribute__(self, __name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arturo/git/wmf/cloud/wmcs-cookbooks/cookbooks/wmcs/toolforge/k8s/worker/upgrade.py", line 182, in run
    puppet.enable(self.spicerack.admin_reason(f"kubernetes upgrade to {self.dst_version}"))
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/puppet.py", line 137, in enable
    self._remote_hosts.run_sync("enable-puppet " + self._puppet_reason(reason, verbatim_reason))
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/remote.py", line 496, in run_sync
    return self._execute(
           ^^^^^^^^^^^^^^
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/remote.py", line 702, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed", worker.get_results())
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=99) for node toolsbeta-test-k8s-worker-9 from 1.22.17 to 1.23.17
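To illustrate the "detect this in the first place" idea, here is a minimal sketch of a pre-flight check that could run before the node is drained. It is not taken from wmcs-cookbooks or spicerack: the names `PuppetReasonMismatch`, `parse_disable_reason` and `preflight_check` are hypothetical, and the sketch assumes the content of the Puppet agent's disable lock file has already been fetched from the host and is JSON with a `disabled_message` key.

```python
import json
from typing import Optional


class PuppetReasonMismatch(Exception):
    """Puppet is already disabled with a reason the cookbook did not set."""


def parse_disable_reason(lockfile_content: Optional[str]) -> Optional[str]:
    """Return the current disable message, or None if the agent is enabled.

    Assumption: the lock file content is JSON such as
    {"disabled_message": "..."}; None or an empty string means that the
    lock file does not exist, i.e. the agent is enabled.
    """
    if lockfile_content is None or not lockfile_content.strip():
        return None
    return json.loads(lockfile_content).get("disabled_message", "")


def preflight_check(lockfile_content: Optional[str], expected_reason: str) -> None:
    """Fail before touching the node if Puppet was disabled by someone else."""
    current = parse_disable_reason(lockfile_content)
    if current is not None and current != expected_reason:
        raise PuppetReasonMismatch(
            f"puppet is disabled with reason {current!r}, "
            f"expected {expected_reason!r}; refusing to continue"
        )


if __name__ == "__main__":
    reason = "kubernetes upgrade to 1.23.17 - arturo@nostromo"
    # Agent enabled: the check passes silently.
    preflight_check(None, reason)
    # Agent disabled with some other (made-up) reason: the check aborts early.
    try:
        preflight_check('{"disabled_message": "some other reason"}', reason)
    except PuppetReasonMismatch as error:
        print(f"aborting: {error}")
```

Running a check like this as the first step would make the cookbook fail before the node is cordoned and drained, rather than partway through the upgrade.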
Change 953588 had a related patch set uploaded (by Majavah; author: Majavah):
[cloud/wmcs-cookbooks@main] toolforge: k8s: use verbatism reasons for puppet disables
Change 953588 merged by Arturo Borrero Gonzalez:
[cloud/wmcs-cookbooks@main] toolforge: k8s: use verbatism reasons for puppet disables
Change 953581 abandoned by Majavah:
[cloud/wmcs-cookbooks@main] toolforge: k8s: worker: upgrade: add SAL messages
Reason:
Change 966864 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] kubeadm: drop version upgrade script
Change 966864 merged by Majavah:
[operations/puppet@production] kubeadm: drop version upgrade script