Add support for K8s 1.23 on Trixie
Closed, Resolved · Public

Description

In the parent task we discovered that the new ML GPU hosts run well on Trixie, but the team is not yet ready to move to k8s 1.31 (which has Trixie support).

In the last k8s SIG meeting we decided to simply copy the Bookworm packages to trixie-wikimedia and try to install a Trixie worker (the control plane is not needed).

Event Timeline

These should be the packages to copy over to trixie-wikimedia:

elukey@ml-serve1009:~$ dpkg -l | egrep 'kube|istio|cni'
ii  calico-cni                           3.23.3-1                             amd64        Calico CNI binaries (calico, calico-ipam)
ii  istio-cni                            1.15.7-1                             amd64        Istio CNI binaries (istio-cni, istio-iptables)
ii  kubernetes-node                      1.23.14-5                            amd64        Kubernetes node binaries (kubelet, kube-proxy)

Note: from Bookworm onward we have been using rsyslog-kubernetes from Debian upstream.
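
To compare the package set on the Bookworm node with what ends up on the Trixie worker, the `dpkg -l` listing above can be reduced to `package=version` lines. This is only a sketch: `pkg_versions` is a hypothetical helper name, not something from the task or from puppet, and unlike the `egrep` above it matches on the package name only, not on the description.

```shell
#!/bin/sh
# Hypothetical helper (not part of the task): read `dpkg -l` output on stdin
# and print "package=version" for installed packages whose NAME matches the
# extended regex given as the first argument.
pkg_versions() {
    pattern="$1"
    # Keep only rows in state "ii" (installed); $2 is the name, $3 the version.
    awk -v pat="$pattern" '$1 == "ii" && $2 ~ pat { print $2 "=" $3 }'
}
```

It could be used as `dpkg -l | pkg_versions 'kube|istio|cni'` on both hosts, diffing the two outputs afterwards.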

Change #1192087 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] aptrepo: add kubernetes 1.23 support to Trixie Wikimedia

https://gerrit.wikimedia.org/r/1192087

Change #1192087 merged by Elukey:

[operations/puppet@production] aptrepo: add kubernetes 1.23 support to Trixie Wikimedia

https://gerrit.wikimedia.org/r/1192087

Packages copied to the new components in trixie-wikimedia; the next step is to test a Kubernetes worker :)

Change #1192856 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Assign the ML K8s worker role to ml-serve1012

https://gerrit.wikimedia.org/r/1192856

Change #1192856 merged by Elukey:

[operations/puppet@production] Assign the ML K8s worker role to ml-serve1012

https://gerrit.wikimedia.org/r/1192856

Change #1192894 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set ml-serve1012 as GPU k8s worker

https://gerrit.wikimedia.org/r/1192894

Change #1192894 merged by Elukey:

[operations/puppet@production] Set ml-serve1012 as GPU k8s worker

https://gerrit.wikimedia.org/r/1192894

I had to copy over some extra packages:

  • calicoctl
  • wikimedia-lvs-server
  • dragonfly-*
  • nerdctl
  • crictl

Everything seems to work, but the most notable issue is that cpufrequtils is not available in Trixie; it has been replaced by linux-cpupower. The swap is not straightforward, since the new package doesn't provide a systemd unit (see https://bugs-devel.debian.org/cgi-bin/bugreport.cgi?bug=894906 for more info).
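
Since linux-cpupower ships no unit that applies the governor at boot, one way to confirm that the desired governor is actually in effect is to read it back from sysfs. The following is a minimal sketch, not what the puppet patches do: `governor_mismatches` is a hypothetical helper, and the `performance` governor in the usage note is an assumption (the task does not say which governor is configured).

```shell
#!/bin/sh
# Hypothetical helper (not from the task): print every CPU whose scaling
# governor differs from the expected one.
# Usage: governor_mismatches EXPECTED [SYSFS_CPU_ROOT]
governor_mismatches() {
    expected="$1"
    root="${2:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip when the glob did not match or cpufreq is not exposed.
        [ -r "$f" ] || continue
        actual=$(cat "$f")
        if [ "$actual" != "$expected" ]; then
            cpu=${f#"$root"/}
            printf '%s: %s\n' "${cpu%%/*}" "$actual"
        fi
    done
}
```

Running `governor_mismatches performance` prints nothing when every CPU already uses that governor, and one `cpuN: governor` line per CPU otherwise.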

Change #1193023 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cpufrequtils: add support for Trixie

https://gerrit.wikimedia.org/r/1193023

Change #1193023 merged by Elukey:

[operations/puppet@production] cpufrequtils: add support for Trixie

https://gerrit.wikimedia.org/r/1193023

Change #1193053 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cpufrequtils: improve cpupower's config

https://gerrit.wikimedia.org/r/1193053

Change #1193053 merged by Elukey:

[operations/puppet@production] cpufrequtils: improve cpupower's config

https://gerrit.wikimedia.org/r/1193053

elukey claimed this task.

The ML node is up and running, and it seems to be working fine. I am going to keep testing in the parent task, but for the moment this task can be marked as done!

Change #1193318 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cpufrequtils: use restart for cpupower

https://gerrit.wikimedia.org/r/1193318

Change #1193318 merged by Elukey:

[operations/puppet@production] cpufrequtils: use restart for cpupower

https://gerrit.wikimedia.org/r/1193318

Change #1203500 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] containerd: add cni bin directory config on Trixie

https://gerrit.wikimedia.org/r/1203500

Change #1203500 merged by Elukey:

[operations/puppet@production] containerd: add cni bin directory config on Trixie

https://gerrit.wikimedia.org/r/1203500
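
When containerd is pointed at a CNI bin directory, a quick sanity check is that the plugin binaries are actually present there. A rough sketch follows, with the caveat that `missing_cni_plugins` is a hypothetical helper and that the directory in the usage note is an assumption (the path set by the patch is not stated in this task). The plugin names come from the package descriptions in the `dpkg -l` output earlier in the timeline.

```shell
#!/bin/sh
# Hypothetical helper (not from the task): print the names of CNI plugins
# that are not present as executables in the given directory.
# Usage: missing_cni_plugins DIR PLUGIN...
missing_cni_plugins() {
    dir="$1"
    shift
    for name in "$@"; do
        [ -x "$dir/$name" ] || printf '%s\n' "$name"
    done
}
```

For example, `missing_cni_plugins /opt/cni/bin calico calico-ipam istio-cni` (directory assumed) prints nothing when all three binaries are found.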