Hi folks!
After a long investigation with Janis and Tobias for https://github.com/ROCm/k8s-device-plugin/issues/65, we found that the root cause is the runc version available for Bullseye: syscalls like access(2), used by PyTorch and other tools when initializing a GPU, end up returning EPERM when they shouldn't, so the Python code errors out and we are not able to use the GPU.
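To make the failure mode visible, here is a minimal sketch (assuming a glibc host; /dev/kfd is the ROCm GPU device node) that calls access(2) directly via ctypes and reports the errno. Inside an affected container the call comes back with EPERM from the seccomp filter, instead of the expected success or ENOENT:

```python
import ctypes
import errno
import os

def check_access(path: str) -> tuple[int, int]:
    """Call access(2) via libc and return (return code, errno)."""
    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    ctypes.set_errno(0)
    rc = libc.access(path.encode(), os.F_OK)
    return rc, ctypes.get_errno()

if __name__ == "__main__":
    rc, err = check_access("/dev/kfd")
    if rc != 0:
        # EPERM here points at the seccomp filter; ENOENT just means
        # the device node is absent on this host.
        print(f"access(/dev/kfd) failed: {errno.errorcode.get(err, err)}")
```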
I opened https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1071269 to ask if Debian upstream could release a new version for Bullseye, but migrating to Bookworm would also allow ML to progress T363191, so I started checking what is needed to support Bookworm on Kubernetes workers.
Packages
The following are not available for bookworm-wikimedia:
- kubernetes-node
- calico-cni
- istio-cni
- calico
- calicoctl
- dragonfly-dfdaemon
- dragonfly-dget
Meanwhile rsyslog-kubernetes is now shipped by Debian upstream, so we are good on that side. Everything is written in Go and statically built, but I ran ldd on the binaries to verify anyway:
- The kubelet binary links against libc
- kube-proxy and the Calico and Istio binaries are reported as "not a dynamic executable"
- The dragonfly binaries have a longer list of dynamic libraries:
linux-vdso.so.1 (0x00007ffd5c99d000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ff1f87d4000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff1f85f3000)
/lib64/ld-linux-x86-64.so.2 (0x00007ff1f87e0000)
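For reference, the classification above can be scripted; this is a small sketch (assuming a Linux host with ldd available, and hypothetical binary paths) of how to tell statically linked binaries apart:

```python
import subprocess

def is_static(path: str) -> bool:
    """Return True if ldd reports the binary as not dynamically linked."""
    proc = subprocess.run(["ldd", path], capture_output=True, text=True)
    # For statically linked binaries ldd prints "not a dynamic executable"
    # (on some versions to stderr), so check both streams.
    return "not a dynamic executable" in (proc.stdout + proc.stderr)

# Example (path is an assumption about where the binary is installed):
# is_static("/usr/bin/kubelet")  -> False, since kubelet links against libc
```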
If I am not missing any package, I could simply reprepro copy all of them to bookworm-wikimedia and test on ml-staging2001, rebuilding the kubernetes-node or dragonfly packages if needed (e.g. in case of libc incompatibilities).
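To make the copy step concrete, a sketch that prints the reprepro invocations, one per package from the list above (assuming reprepro's `copy <target> <source> <package>` syntax and that the source distribution is named bullseye-wikimedia):

```python
# Packages from the list above; the distribution names are assumptions
# about our reprepro setup.
PACKAGES = [
    "kubernetes-node",
    "calico-cni",
    "istio-cni",
    "calico",
    "calicoctl",
    "dragonfly-dfdaemon",
    "dragonfly-dget",
]

def copy_commands(src: str = "bullseye-wikimedia",
                  dst: str = "bookworm-wikimedia") -> list[str]:
    """Build the `reprepro copy` command line for each package."""
    return [f"reprepro copy {dst} {src} {pkg}" for pkg in PACKAGES]

if __name__ == "__main__":
    for cmd in copy_commands():
        print(cmd)
```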
Docker would move to a different version of course:
20.10.5+dfsg1-1+deb11u2 to 20.10.24+dfsg1-1+b3
And runc as well:
1.0.0~rc93+ds1-5+deb11u3 to 1.1.5+ds1-1+deb12u1 (this version contains the fix that ML needs)
Puppet
I checked profile::kubernetes::node and related classes like k8s::kubelet: from a quick glance I don't see any Bullseye-specific bits, so in theory no changes are needed.
Kernel/OS/Misc
The kernel would go from 5.x to 6.x. I can't think of anything specific that would cause trouble, except of course that all containers may behave differently on 6.x, but we can't do much about that in advance.
Anything else that I am missing? Does the plan look ok?