
[k8s,infra] Upgrade tools to Uwubernetes 1.30
Closed, Resolved, Public

Description

Upgrade procedure: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Upgrading_Kubernetes

Refer to the link above for the detailed procedure, and update the checkboxes as you complete them.

If multiple people are working on the upgrade, you can copy the checklist to an Etherpad for easier collaborative editing.

Use this command from a Toolforge control node to quickly generate the list of nodes:

for node in $(kubectl get nodes -o json | jq '.items[].metadata.name' -r); do echo "  - [] $node"; done
  • Run functional tests
  • Add a silence in alertmanager - fb2711f1-8745-4b2a-88c8-1d380fae61f6
  • (only for "tools" cluster) Update IRC topic
  • Run prepare_upgrade cookbook
  • Upgrade control nodes
    • tools-k8s-control-7
    • tools-k8s-control-8
    • tools-k8s-control-9
  • Upgrade worker nodes
    • tools-k8s-worker-102
    • tools-k8s-worker-103
    • tools-k8s-worker-105
    • tools-k8s-worker-106
    • tools-k8s-worker-107
    • tools-k8s-worker-108
    • tools-k8s-worker-109
    • tools-k8s-worker-110
    • tools-k8s-worker-111
    • tools-k8s-worker-112
    • tools-k8s-worker-nfs-1
    • tools-k8s-worker-nfs-10
    • tools-k8s-worker-nfs-11
    • tools-k8s-worker-nfs-12
    • tools-k8s-worker-nfs-13
    • tools-k8s-worker-nfs-14
    • tools-k8s-worker-nfs-16
    • tools-k8s-worker-nfs-17
    • tools-k8s-worker-nfs-19
    • tools-k8s-worker-nfs-2
    • tools-k8s-worker-nfs-21
    • tools-k8s-worker-nfs-22
    • tools-k8s-worker-nfs-23
    • tools-k8s-worker-nfs-24
    • tools-k8s-worker-nfs-26
    • tools-k8s-worker-nfs-27
    • tools-k8s-worker-nfs-3
    • tools-k8s-worker-nfs-32
    • tools-k8s-worker-nfs-33
    • tools-k8s-worker-nfs-34
    • tools-k8s-worker-nfs-35
    • tools-k8s-worker-nfs-36
    • tools-k8s-worker-nfs-37
    • tools-k8s-worker-nfs-38
    • tools-k8s-worker-nfs-39
    • tools-k8s-worker-nfs-40
    • tools-k8s-worker-nfs-41
    • tools-k8s-worker-nfs-42
    • tools-k8s-worker-nfs-43
    • tools-k8s-worker-nfs-44
    • tools-k8s-worker-nfs-45
    • tools-k8s-worker-nfs-46
    • tools-k8s-worker-nfs-47
    • tools-k8s-worker-nfs-48
    • tools-k8s-worker-nfs-5
    • tools-k8s-worker-nfs-50
    • tools-k8s-worker-nfs-53
    • tools-k8s-worker-nfs-54
    • tools-k8s-worker-nfs-55
    • tools-k8s-worker-nfs-57
    • tools-k8s-worker-nfs-58
    • tools-k8s-worker-nfs-61
    • tools-k8s-worker-nfs-65
    • tools-k8s-worker-nfs-66
    • tools-k8s-worker-nfs-67
    • tools-k8s-worker-nfs-68
    • tools-k8s-worker-nfs-69
    • tools-k8s-worker-nfs-7
    • tools-k8s-worker-nfs-70
    • tools-k8s-worker-nfs-71
    • tools-k8s-worker-nfs-72
    • tools-k8s-worker-nfs-73
    • tools-k8s-worker-nfs-74
    • tools-k8s-worker-nfs-75
    • tools-k8s-worker-nfs-76
    • tools-k8s-worker-nfs-77
    • tools-k8s-worker-nfs-78
    • tools-k8s-worker-nfs-79
    • tools-k8s-worker-nfs-8
    • tools-k8s-worker-nfs-80
    • tools-k8s-worker-nfs-81
    • tools-k8s-worker-nfs-82
    • tools-k8s-worker-nfs-9
  • Upgrade ingress nodes
    • tools-k8s-ingress-7
    • tools-k8s-ingress-8
    • tools-k8s-ingress-9
  • Upgrade kubectl on bastions
  • Check everything looks good
  • Remove the silence in alertmanager
  • (only for "tools" cluster) Revert IRC topic change
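
The checklist above groups nodes by role. A minimal sketch of how the one-liner could be extended to produce that grouping is shown below. The function name `checklist_by_role` is hypothetical, and the role is guessed from the `tools-k8s-<role>-` naming pattern seen in the list, which is an assumption rather than something the cookbooks guarantee.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print checklist lines for node names read from stdin,
# grouped by role. The role is inferred from the node name
# (tools-k8s-<role>-...), an assumption based on the list in this task.
checklist_by_role() {
  local nodes role
  nodes=$(cat)   # keep the input order as given
  for role in control worker ingress; do
    echo "- [] Upgrade $role nodes"
    # index() does a plain substring match; worker-nfs nodes count as workers
    printf '%s\n' "$nodes" |
      awk -v r="-k8s-$role-" 'index($0, r) { print "  - [] " $0 }'
  done
}

# Usage on a control node (needs kubectl and jq, like the command above):
#   kubectl get nodes -o json | jq -r '.items[].metadata.name' | checklist_by_role
```

The function only formats names, so it can be tried with a hand-written list before running it against a real cluster.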

Event Timeline

dcaro triaged this task as High priority. Aug 20 2025, 2:50 PM
dcaro changed the task status from Open to In Progress. Aug 21 2025, 3:29 PM
dcaro moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 23) board.
dcaro moved this task from In Progress to Next Up on the Toolforge (Toolforge iteration 23) board.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T08:11:52Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.prepare_upgrade for cluster tools upgrade from 1.29.15 to 1.30.14 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T08:13:31Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.prepare_upgrade (exit_code=0) for cluster tools upgrade from 1.29.15 to 1.30.14 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T08:16:16Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-control-7 from 1.29.15 to 1.30.14 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T08:26:56Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-control-7 from 1.29.15 to 1.30.14 (T402378)

Maybe we should create a small set of tests to run during upgrades instead of running the whole suite.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T08:28:35Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-control-8 from 1.29.15 to 1.30.14 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T08:37:37Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-control-8 from 1.29.15 to 1.30.14 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T08:37:43Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-control-9 from 1.29.15 to 1.30.14 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T08:47:03Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-control-9 from 1.29.15 to 1.30.14 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T08:52:29Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade_workers for tools-k8s-worker-102, tools-k8s-worker-103, tools-k8s-worker-105, tools-k8s-worker-106, tools-k8s-worker-107, tools-k8s-worker-108, tools-k8s-worker-109, tools-k8s-worker-110, tools-k8s-worker-111, tools-k8s-worker-112 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T08:58:43Z] <dcaro@cloudcumin1001> END (FAIL) - Cookbook wmcs.toolforge.k8s.worker.upgrade_workers (exit_code=99) for tools-k8s-worker-102, tools-k8s-worker-103, tools-k8s-worker-105, tools-k8s-worker-106, tools-k8s-worker-107, tools-k8s-worker-108, tools-k8s-worker-109, tools-k8s-worker-110, tools-k8s-worker-111, tools-k8s-worker-112 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T09:10:26Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade_workers for tools-k8s-worker-102, tools-k8s-worker-103, tools-k8s-worker-105, tools-k8s-worker-106, tools-k8s-worker-107, tools-k8s-worker-108, tools-k8s-worker-109, tools-k8s-worker-110, tools-k8s-worker-111, tools-k8s-worker-112 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T09:18:12Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade_workers (exit_code=0) for tools-k8s-worker-102, tools-k8s-worker-103, tools-k8s-worker-105, tools-k8s-worker-106, tools-k8s-worker-107, tools-k8s-worker-108, tools-k8s-worker-109, tools-k8s-worker-110, tools-k8s-worker-111, tools-k8s-worker-112 (T402378)

Just noticed that the container restarts graph in toolsbeta is empty, and in tools it started to be empty after the upgrade. The metric probably went away, looking:
https://grafana-rw.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&var-cluster=toolsbeta&from=now-6h&to=now&timezone=utc&var-cluster_datasource=P6466A70779AF0C39&forceLogin=true&editPanel=4

Note: one of the loki backends failed to drain. It was erroring with timeouts while trying to save the latest writes. We might want to drain the non-nfs nodes more gently, will investigate.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T09:58:54Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-68: (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T09:58:57Z] <wmbot~dcaro@acme> END (ERROR) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=97) for tools-k8s-worker-nfs-68: (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T09:59:00Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-68 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T10:05:50Z] <wmbot~dcaro@acme> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-68 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T11:08:10Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade_ingresses for tools-k8s-ingress-7, tools-k8s-ingress-8, tools-k8s-ingress-9 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T11:12:45Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade_ingresses (exit_code=0) for tools-k8s-ingress-7, tools-k8s-ingress-8, tools-k8s-ingress-9 (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T11:13:53Z] <dcaro@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade_bastions for tools-bastion-12.tools.eqiad1.wikimedia.cloud, tools-bastion-13.tools.eqiad1.wikimedia.cloud (T402378)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-08T11:14:13Z] <dcaro@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade_bastions (exit_code=0) for tools-bastion-12.tools.eqiad1.wikimedia.cloud, tools-bastion-13.tools.eqiad1.wikimedia.cloud (T402378)

Regarding the empty container restarts graph mentioned above: this is actually working. The issue is that we were not applying irate to it and were just plotting the raw value, so when the workers restart it gets reset to 0.
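
The reset behaviour can be illustrated with a small sketch. It assumes a made-up series of counter samples and a hypothetical helper `counter_increase`. The helper treats any decrease as a restart from 0, similar in spirit to how Prometheus rate functions adjust for counter resets. It is not the dashboard query itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total increase over a series of counter samples.
# A sample lower than the previous one is treated as a counter reset,
# i.e. the counter restarted from 0 and then grew to the new value.
counter_increase() {
  local prev="" total=0 v
  for v in "$@"; do
    if [ -n "$prev" ]; then
      if [ "$v" -ge "$prev" ]; then
        total=$(( total + v - prev ))
      else
        total=$(( total + v ))   # reset: count from 0 up to v
      fi
    fi
    prev=$v
  done
  echo "$total"
}

# Made-up samples: the counter drops from 8 to 0 when the worker restarts.
samples="5 6 8 0 1 3"
echo "raw last value: ${samples##* }"
# $samples is left unquoted on purpose so it splits into separate arguments
echo "increase:       $(counter_increase $samples)"
```

With these samples the raw value ends at 3, while the reset-aware increase is 6, which is why plotting the raw counter hides restarts that happened before the reset.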

Added the irate, but now the issue is that there are so few restarts per second that the per-hour value loses precision (irate(..[10m]) * 3600 gives a multiple of 60). Better than nothing I guess.
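
The multiple-of-60 effect can be reproduced with a little arithmetic. This is a minimal sketch, assuming the last two samples are 60 seconds apart (an assumption, since the task does not state the scrape interval) and relying on the fact that irate only uses the last two data points in the range. The function name `irate_per_hour` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: irate-style per-hour rate from the last two samples.
# Arguments: previous value, last value, seconds between the two samples.
# Uses integer arithmetic; the 60s spacing used below is an assumption.
irate_per_hour() {
  local prev=$1 last=$2 interval=$3
  echo $(( (last - prev) * 3600 / interval ))
}

# Zero to three restarts between two consecutive samples taken 60s apart.
for delta in 0 1 2 3; do
  echo "restarts=$delta per_hour=$(irate_per_hour 10 $((10 + delta)) 60)"
done
```

Each whole restart between the two samples adds 3600 / 60 = 60 to the per-hour value, so under this assumption the graph can only show steps of 60.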

dcaro moved this task from In Progress to Done on the Toolforge (Toolforge iteration 24) board.