[infra,k8s] Upgrade Toolforge Kubernetes to version 1.28
Closed, Resolved (Public)

Description

K8s

https://v1-28.docs.kubernetes.io/blog/2023/08/15/kubernetes-v1-28-release/

Working etherpad: https://etherpad.wikimedia.org/p/k8s-1.27-to-1.28-upgrade
Persistent wiki page: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Upgrading_Kubernetes/1.27_to_1.28_notes

Workgroup page: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Ongoing_Efforts/Toolforge_Upgrade_Workgroup/Upgrades_Overview

Components

Pre-k8s upgrade

can be upgraded (potentially not blocking, tests pass without them upgrading)

Post-k8s upgrade

need upgrading
can be upgraded
  • volume-admission, k8s.io deps to 0.28.X
  • registry-admission, k8s.io deps to 0.28.X
  • ingress-admission, k8s.io deps to 0.28.X
  • envvars-api, k8s.io deps to 0.28.X
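
One way to confirm that a component's k8s.io dependencies are on the 0.28.X line is to scan its go.mod. The following is a minimal sketch, not the workflow used in the merge requests: the tracked module list, the function name, and the sample go.mod are all assumptions made for illustration. Only modules released in lockstep with Kubernetes (such as k8s.io/api) are checked, since others such as k8s.io/klog follow their own versioning.

```python
import re

# Modules assumed to be versioned in lockstep with Kubernetes (0.X.Y for 1.X.Y).
TRACKED = {"k8s.io/api", "k8s.io/apimachinery", "k8s.io/client-go"}

# Matches require lines such as "k8s.io/api v0.28.14" in a go.mod file.
K8S_DEP = re.compile(r"^\s*(k8s\.io/[\w./-]+)\s+v(\d+)\.(\d+)\.(\d+)", re.MULTILINE)


def k8s_deps_off_target(go_mod_text: str, minor: int = 28) -> dict[str, str]:
    """Return tracked k8s.io modules whose version is not 0.<minor>.x."""
    off = {}
    for name, major, found_minor, patch in K8S_DEP.findall(go_mod_text):
        if name not in TRACKED:
            continue  # independently versioned module, skip it
        if (int(major), int(found_minor)) != (0, minor):
            off[name] = f"v{major}.{found_minor}.{patch}"
    return off


# Hypothetical go.mod contents, for illustration only.
SAMPLE_GO_MOD = """\
module example.org/hypothetical-admission

require (
    k8s.io/api v0.28.14
    k8s.io/apimachinery v0.28.14
    k8s.io/client-go v0.27.16
    k8s.io/klog/v2 v2.100.1
)
"""

print(k8s_deps_off_target(SAMPLE_GO_MOD))
```

An empty result would suggest the tracked modules are all on the expected minor version; anything listed still needs a bump.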

Details

Other Assignee
fnegri
Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
volume-admission: bump to 0.0.59-20241120000415-71c39564 | repos/cloud/toolforge/toolforge-deploy!610 | ghost | bump_volume-admission | main
[volume-admission] k8s.io deps to 0.28.14 | repos/cloud/toolforge/volume-admission!21 | raymond-ndibe | k8sio_dep_update | main
envvars-api: bump to 0.0.62-20241119190711-285eda5c | repos/cloud/toolforge/toolforge-deploy!609 | ghost | bump_envvars-api | main
registry-admission: bump to 0.0.54-20241119190632-ec34b7a8 | repos/cloud/toolforge/toolforge-deploy!608 | ghost | bump_registry-admission | main
ingress-admission: bump to 0.0.54-20241119090358-300b9ae5 | repos/cloud/toolforge/toolforge-deploy!602 | ghost | bump_ingress-admission | main
[ingress-admission] k8s.io deps to 0.28.14 | repos/cloud/toolforge/ingress-admission!14 | raymond-ndibe | k8sio_dep_update | main
[registry-admission] k8s.io deps to 0.28.14 | repos/cloud/toolforge/registry-admission!17 | raymond-ndibe | k8sio_dep_update | main
[envvars-api] k8s.io deps to 0.28.14 | repos/cloud/toolforge/envvars-api!48 | raymond-ndibe | k8sio_dep_update | main
[toolforge-deploy] kube-state-metrics v5.16.4 --> v5.18.0 | repos/cloud/toolforge/toolforge-deploy!586 | raymond-ndibe | update_kube_state_metrics | main
[toolforge-deploy] update calico kubeVersion | repos/cloud/toolforge/toolforge-deploy!585 | raymond-ndibe | fix_calico_kube_version | main
[lima-kilo] test k8s 1.28 upgrade | repos/cloud/toolforge/lima-kilo!193 | raymond-ndibe | test_1.28 | main

Event Timeline

There are a very large number of changes, so older changes are hidden.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:37:47Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-74 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:37:52Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-75 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:38:58Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-75 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:39:01Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-76 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:40:06Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-76 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:40:09Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-8 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:41:17Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-8 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:41:20Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-9 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:42:31Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-9 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:44:28Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-ingress-7 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:45:28Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-ingress-7 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:45:31Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-ingress-8 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:46:27Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-ingress-8 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:46:30Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-ingress-9 from 1.27.16 to 1.28.14 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-04T14:47:30Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-ingress-9 from 1.27.16 to 1.28.14 (T362867)
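
The worker upgrade entries above follow a regular START/END pattern, so per-node run times can be pulled out of them mechanically. The sketch below is an illustration only: the regular expression is inferred from the lines quoted in this task rather than from any SAL specification, the function name is made up, and it handles only the "for node" entries, not the "for component" ones.

```python
import re
from datetime import datetime

# Pattern inferred from the SAL lines quoted in this task (an assumption).
SAL_LINE = re.compile(
    r"\[(?P<ts>[\dT:-]+)Z\] <[^>]+> (?P<phase>START|END)\b"
    r"(?: \((?P<result>\w+)\))? - Cookbook (?P<cookbook>\S+)"
    r".*? for node (?P<node>\S+)"
)


def upgrade_durations(lines):
    """Pair START/END entries per node and return {node: (result, seconds)}."""
    started = {}
    finished = {}
    for line in lines:
        match = SAL_LINE.search(line)
        if not match:
            continue  # not a per-node cookbook entry
        when = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S")
        node = match["node"]
        if match["phase"] == "START":
            started[node] = when
        elif node in started:
            elapsed = (when - started.pop(node)).total_seconds()
            finished[node] = (match["result"], int(elapsed))
    return finished


# Four entries copied from the timeline above (prefix trimmed).
SAMPLE = [
    "[2024-11-04T14:37:52Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-worker-nfs-75 from 1.27.16 to 1.28.14 (T362867)",
    "[2024-11-04T14:38:58Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-worker-nfs-75 from 1.27.16 to 1.28.14 (T362867)",
    "[2024-11-04T14:44:28Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.worker.upgrade for node tools-k8s-ingress-7 from 1.27.16 to 1.28.14 (T362867)",
    "[2024-11-04T14:45:28Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.worker.upgrade (exit_code=0) for node tools-k8s-ingress-7 from 1.27.16 to 1.28.14 (T362867)",
]

print(upgrade_durations(SAMPLE))
# {'tools-k8s-worker-nfs-75': ('PASS', 66), 'tools-k8s-ingress-7': ('PASS', 60)}
```

Keying on the node name works here because each node is upgraded once; entries that overlap for the same target would need a different pairing rule.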

In T378976: Lexeme-forms on Toolforge returns error, a webservice was dead for a few hours until I manually stopped and started it; might be related to this upgrade? (The startup probe continuously got connect: connection refused, apparently.)

> might be related to this upgrade?

Potentially yes. Looking at the logs you pasted in T378976, my very vague and unhelpful theory is that it was failing to connect to NFS and access the /data/project/ folder. Manually stopping and starting it probably moved the pod to a different worker node.

Unfortunately I don't think we retain the info on which nodes a pod was scheduled on in the past. If it happens again, you can try kubectl get pods -o wide to get the name of the worker node, and I can check if there's any NFS issue on that node.
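
Since past scheduling is not retained, one option is to record the pod-to-node mapping while the problem is happening. The sketch below is an illustration, assuming the output of kubectl get pods -o json has been saved to a string; the function name and the sample pod and node names are hypothetical, while spec.nodeName is the standard Pod field that holds the node.

```python
import json


def pods_by_node(pod_list_json: str) -> dict[str, list[str]]:
    """Group pod names by spec.nodeName from `kubectl get pods -o json` output."""
    nodes: dict[str, list[str]] = {}
    for item in json.loads(pod_list_json).get("items", []):
        # Pods that have not been scheduled yet carry no nodeName.
        node = item.get("spec", {}).get("nodeName", "<unscheduled>")
        nodes.setdefault(node, []).append(item["metadata"]["name"])
    return nodes


# Hypothetical, heavily trimmed pod list, for illustration only.
SAMPLE = json.dumps(
    {
        "items": [
            {"metadata": {"name": "web-abc"}, "spec": {"nodeName": "worker-a"}},
            {"metadata": {"name": "job-def"}, "spec": {"nodeName": "worker-b"}},
            {"metadata": {"name": "job-ghi"}, "spec": {}},
        ]
    }
)

print(pods_by_node(SAMPLE))
```

Saving this mapping next to the error logs would make it possible to name the affected worker node even after the pod has been restarted elsewhere.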

Found a stray k8s worker that was having trouble with NFS but was not showing up in the Prometheus stats (or the alerts) because it was in 'confirm migration' status; see T379139: [infra,k8s] node tools-k8s-worker-nfs-24 stopped reporting processes in D state. That might have been one of the reasons you were seeing jobs get stuck. Let's keep an eye out in case there are more, but it might be sorted now.

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-10T02:47:11Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.image.copy_to_registry (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-10T02:47:16Z] <raymond-ndibe@cloudcumin1001> Updating container image docker-registry.tools.wmflabs.org/kube-state-metrics:v2.11.0 (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-10T02:47:22Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.image.copy_to_registry (exit_code=0) (T362867)

group_203_bot_4866fc124f4b41659f667468a6115cf3 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/602

ingress-admission: bump to 0.0.54-20241119090358-300b9ae5

group_203_bot_4866fc124f4b41659f667468a6115cf3 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/608

registry-admission: bump to 0.0.54-20241119190632-ec34b7a8

group_203_bot_4866fc124f4b41659f667468a6115cf3 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/609

envvars-api: bump to 0.0.62-20241119190711-285eda5c

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:16:23Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:17:32Z] <raymond-ndibe@cloudcumin1001> END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component ingress-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:17:45Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:23:19Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component ingress-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:23:58Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component ingress-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:24:13Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component registry-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:30:12Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component registry-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:30:46Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component ingress-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:32:16Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component registry-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:32:52Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component registry-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:37:45Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component registry-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:38:12Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component registry-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:41:41Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component registry-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:47:22Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component registry-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T19:54:31Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component wmcs-k8s-metrics (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:00:17Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component wmcs-k8s-metrics (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:01:40Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component wmcs-k8s-metrics (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:04:05Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component calico (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:07:52Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component wmcs-k8s-metrics (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:09:45Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component calico (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:12:15Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component calico (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:14:23Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component envvars-api (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:17:07Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component calico (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:20:24Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component envvars-api (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:28:02Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component envvars-api (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:30:03Z] <raymond-ndibe@cloudcumin1001> END (FAIL) - Cookbook wmcs.toolforge.component.deploy (exit_code=99) for component envvars-api (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:31:14Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component envvars-api (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-19T20:31:18Z] <raymond-ndibe@cloudcumin1001> END (ERROR) - Cookbook wmcs.toolforge.component.deploy (exit_code=97) for component envvars-api (T362867)

group_203_bot_4866fc124f4b41659f667468a6115cf3 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/610

volume-admission: bump to 0.0.59-20241120000415-71c39564

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-20T00:09:39Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component volume-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-20T00:15:31Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component volume-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-20T00:16:18Z] <raymond-ndibe@cloudcumin1001> START - Cookbook wmcs.toolforge.component.deploy for component volume-admission (T362867)

Mentioned in SAL (#wikimedia-cloud-feed) [2024-11-20T00:22:49Z] <raymond-ndibe@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.component.deploy (exit_code=0) for component volume-admission (T362867)

fnegri moved this task from In Progress to Done on the Toolforge (Toolforge iteration 17) board.

We've been running 1.28 for a while and all subtasks are now resolved, so I'm resolving this one as well.