Upgrade the AMD GPU plugin for k8s to support MI300 GPUs
Closed, Resolved (Public)

Description

We should upgrade the AMD GPU plugin to pick up new patches such as https://github.com/ROCm/k8s-device-plugin/pull/117, and to support the MI300's native partitioning.

Info about the upgrade is in https://gerrit.wikimedia.org/g/operations/debs/amd-k8s-device-plugin/+/refs/heads/master

Bonus points: we should think about adding the node labeller as well, so that helmfile deployments can target specific GPU details (amount of VRAM, model, etc.) instead of just asking for a generic GPU. This may help once the MI300X is available, because we'll likely have different VRAM partitions, etc.
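A rough illustration of what label-based targeting could look like, written as the pod-spec fragment a chart would need to render. The resource name `amd.com/gpu` is the one the device plugin advertises; the label key `amd.com/gpu.vram` and the value `64G` are assumptions used only for illustration and should be checked against what the node labeller actually sets on our nodes. The helper name is hypothetical.

```python
def gpu_pod_spec_fragment(gpu_count, node_labels=None, container_name="predictor"):
    """Build the GPU-related parts of a pod spec as a plain dict.

    node_labels is an optional mapping of node label -> required value.
    Without it the pod just asks for "any GPU", which is today's behaviour.
    """
    fragment = {
        "containers": [
            {
                "name": container_name,  # hypothetical container name
                "resources": {"limits": {"amd.com/gpu": gpu_count}},
            }
        ]
    }
    if node_labels:
        # nodeSelector restricts scheduling to nodes carrying all these labels.
        fragment["nodeSelector"] = dict(node_labels)
    return fragment


# Generic request: any node with a free GPU.
generic = gpu_pod_spec_fragment(1)

# Targeted request: ASSUMED label key/value, verify against the labeller.
targeted = gpu_pod_spec_fragment(1, {"amd.com/gpu.vram": "64G"})
print(targeted)
```

The point of the sketch is only that the GPU request itself stays the same and the extra targeting lives in `nodeSelector`, so existing deployments would keep working without labels.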

Event Timeline

Today I used gbp import-orig -v --merge-mode=replace --pristine-tar ../v1.31.0.8.tar.gz on build2002 to import the latest upstream version, and I built the Debian package without many problems. I had to use --merge-mode=replace because the regular merge mode hit a merge issue, but IIUC from the diff between commits it did the right thing.

Next steps:

  1. Test the deb on ml-serve1012 and see if the GPU is recognized.
  2. Upgrade the repository, rebuild the package and push it to our internal APT.

I tried to test on ml-serve1012 (manual install of the deb) and this is what I gathered:

  1. hwloc and libhwloc-dev in the build depends should not be constrained to 2.9.0, so that higher versions can be picked up.
  2. the runtime dep libhwloc15 should be installed from Debian backports (we can probably do it via puppet).
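To make point 1 concrete, here is a small sketch of relaxing an exact pin into a minimum-version constraint in a Build-Depends field. It assumes the pins are written as `(= 2.9.0)`; the actual wording in debian/control may differ, and the sample field below is made up for illustration.

```python
import re

# Matches a single dependency entry with an exact pin, e.g. "hwloc (= 2.9.0)".
PIN_RE = re.compile(r"^(?P<name>\S+)\s*\(\s*=\s*(?P<version>[^)\s]+)\s*\)$")


def relax_exact_pins(build_depends, packages):
    """Turn "pkg (= X)" into "pkg (>= X)" for the listed packages only."""
    relaxed = []
    for entry in build_depends.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma
        match = PIN_RE.match(entry)
        if match and match.group("name") in packages:
            entry = f"{match.group('name')} (>= {match.group('version')})"
        relaxed.append(entry)
    return ", ".join(relaxed)


# ASSUMED sample field, not copied from the real debian/control.
sample = "debhelper-compat (= 13), hwloc (= 2.9.0), libhwloc-dev (= 2.9.0)"
print(relax_exact_pins(sample, {"hwloc", "libhwloc-dev"}))
```

Only the named packages are touched, so pins that are meant to be exact (such as the debhelper compat level in the sample) stay as they are.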

The plugin didn't start because of the absence of the kubelet (of course).

Change #1184388 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/debs/amd-k8s-device-plugin@master] Release upstream version 1.31.0.8

https://gerrit.wikimedia.org/r/1184388

Change #1184388 merged by Elukey:

[operations/debs/amd-k8s-device-plugin@master] Release upstream version 1.31.0.8

https://gerrit.wikimedia.org/r/1184388

root@apt1002:/srv/wikimedia# sudo -i reprepro lsbycomponent amd-k8s-device-plugin
amd-k8s-device-plugin | 1.25.2.8-1 | bullseye-wikimedia | main | amd64, source
amd-k8s-device-plugin | 1.31.0.8-1 | bookworm-wikimedia | main | amd64, source
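If this check ever needs to be scripted, the output above is easy to parse. The sketch below assumes the five pipe-separated columns shown here (package, version, distribution, component, architectures); the type and function names are made up for illustration.

```python
from typing import NamedTuple, Tuple


class RepoEntry(NamedTuple):
    package: str
    version: str
    distribution: str
    component: str
    architectures: Tuple[str, ...]


def parse_lsbycomponent(output):
    """Parse `reprepro lsbycomponent` output into RepoEntry records."""
    entries = []
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 5:
            continue  # skip prompts, blank lines and anything unexpected
        package, version, distribution, component, archs = fields
        architectures = tuple(arch.strip() for arch in archs.split(","))
        entries.append(
            RepoEntry(package, version, distribution, component, architectures)
        )
    return entries


SAMPLE = """\
amd-k8s-device-plugin | 1.25.2.8-1 | bullseye-wikimedia | main | amd64, source
amd-k8s-device-plugin | 1.31.0.8-1 | bookworm-wikimedia | main | amd64, source
"""

version_by_dist = {e.distribution: e.version for e in parse_lsbycomponent(SAMPLE)}
print(version_by_dist)
```

Here the resulting mapping shows that the new 1.31.0.8-1 package is available only for bookworm-wikimedia, while bullseye-wikimedia still carries 1.25.2.8-1.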

Change #1185865 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::amd_gpu: support the new AMD GPU k8s plugin

https://gerrit.wikimedia.org/r/1185865

Change #1185865 merged by Elukey:

[operations/puppet@production] profile::amd_gpu: support the new AMD GPU k8s plugin

https://gerrit.wikimedia.org/r/1185865

Deployed on ml-staging, everything looks good. Next steps:

  • Check on staging that scheduling pods with a GPU works as expected, and no horrors are logged in the device plugin's logs.
  • Roll out to prod.

@klausman do you have time to take care of the above?

The ML team deployed edit check in staging with a GPU requirement. The pod got scheduled, and I checked that the GPU device nodes are visible inside it:

root@deploy1003:~# kubectl exec edit-check-predictor-00010-deployment-8699cf6dc7-5xz98 -n edit-check -- ls /dev/dri
card1
renderD128
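For context, `cardN` entries are the DRM primary nodes and `renderDN` entries are the render nodes, so seeing one of each means the GPU was passed into the container. A minimal sketch of that check, which takes the directory listing as input (the function name is made up):

```python
import re

CARD_NODE = re.compile(r"^card\d+$")
RENDER_NODE = re.compile(r"^renderD\d+$")


def classify_dri_entries(names):
    """Group /dev/dri entry names into card nodes, render nodes and the rest."""
    result = {"card": [], "render": [], "other": []}
    for name in names:
        if CARD_NODE.match(name):
            result["card"].append(name)
        elif RENDER_NODE.match(name):
            result["render"].append(name)
        else:
            result["other"].append(name)
    return result


# The listing from the pod above.
found = classify_dri_entries(["card1", "renderD128"])
gpu_visible = bool(found["card"]) and bool(found["render"])
print(found, gpu_visible)
```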

On ml-staging2001 I don't see any weird logs for the plugin (except expected errors in finding partitions for cpu/memory), and after the deploy I saw:

Sep 09 08:30:37 ml-staging2001 amd-k8s-device-plugin[2970840]: I0909 08:30:37.473099 2970840 plugin.go:373] Allocating device ID: 0000:da:00.0
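When checking more hosts during the rollout, the allocated PCI addresses can be pulled out of the journal lines with a small regex. This sketch assumes the message text stays exactly as in the line above ("Allocating device ID: <domain:bus:device.function>"); other plugin versions may word it differently.

```python
import re

# PCI address in domain:bus:device.function form, e.g. 0000:da:00.0
ALLOC_RE = re.compile(
    r"Allocating device ID: "
    r"(?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def allocated_devices(log_lines):
    """Return the PCI addresses from 'Allocating device ID' log lines, in order."""
    devices = []
    for line in log_lines:
        match = ALLOC_RE.search(line)
        if match:
            devices.append(match.group("pci"))
    return devices


LOG = [
    "Sep 09 08:30:37 ml-staging2001 amd-k8s-device-plugin[2970840]: "
    "I0909 08:30:37.473099 2970840 plugin.go:373] Allocating device ID: 0000:da:00.0",
]
print(allocated_devices(LOG))
```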

I think we are ready to deploy this everywhere!

Change #1191699 had a related patch set uploaded (by Klausman; author: Klausman):

[operations/puppet@production] profile::amd_gpu: roll out new AMD GPU plugin to all LiftWing workers

https://gerrit.wikimedia.org/r/1191699

Change #1191699 merged by Klausman:

[operations/puppet@production] profile::amd_gpu: roll out new AMD GPU plugin to all LiftWing workers

https://gerrit.wikimedia.org/r/1191699

This has been rolled out to both the eqiad and codfw GPU machines, and I restarted our one prod pod that uses GPUs (editcheck). Everything looks good.