Experiment with amd-smi and the new AMD GPUs MI300x
Closed, Resolved, Public

Description

As described in this doc, the amd-smi tool can be used to partition an MI300X GPU. We should experiment with it and make sure that it works for our use cases.

High level questions:

  1. What are the available configs that we can set?
  2. Does the available amd-smi version work for us? Note that Debian upstream packages it only for Trixie, so we don't have it for Bookworm.
  3. Are the settings preserved after a reboot?

Event Timeline

I tried to follow this guide on ml-serve1012, where Debian Trixie is running and we have amd-smi from Debian upstream.

elukey@ml-serve1012:~$ sudo amd-smi set --memory-partition NPS4 --gpu 0
amdsmi.amdsmi_exception.AmdSmiLibraryException: Error code:
	2 | AMDSMI_STATUS_NOT_SUPPORTED - Feature not supported

The above exception was the direct cause of the following exception:

ValueError: Unable to set memory partition to NPS4 on GPU ID: 0 BDF:0000:05:00.0

elukey@ml-serve1012:~$ sudo amd-smi set --memory-partition NPS8 --gpu 0
amdsmi.amdsmi_exception.AmdSmiLibraryException: Error code:
	2 | AMDSMI_STATUS_NOT_SUPPORTED - Feature not supported

The above exception was the direct cause of the following exception:

ValueError: Unable to set memory partition to NPS8 on GPU ID: 0 BDF:0000:05:00.0
elukey@ml-serve1012:~$ sudo amd-smi set --memory-partition NPS2 --gpu 0
amdsmi.amdsmi_exception.AmdSmiLibraryException: Error code:
	2 | AMDSMI_STATUS_NOT_SUPPORTED - Feature not supported

The above exception was the direct cause of the following exception:

ValueError: Unable to set memory partition to NPS2 on GPU ID: 0 BDF:0000:05:00.0
elukey@ml-serve1012:~$ sudo amd-smi set --memory-partition NPS1 --gpu 0
GPU: 0
    MEMORYPARTITION: Successfully set memory partition to NPS1

elukey@ml-serve1012:~$ sudo amd-smi set --compute-partition CPX --gpu 0
GPU: 0
    COMPUTEPARTITION: Successfully set compute partition to CPX

And overall status:

elukey@ml-serve1012:~$ sudo amd-smi static --partition
GPU: 0
    PARTITION:
        COMPUTE_PARTITION: CPX
        MEMORY_PARTITION: NPS1

GPU: 1
    PARTITION:
        COMPUTE_PARTITION: CPX
        MEMORY_PARTITION: NPS1

GPU: 2
    PARTITION:
        COMPUTE_PARTITION: CPX
        MEMORY_PARTITION: NPS1

GPU: 3
    PARTITION:
        COMPUTE_PARTITION: CPX
        MEMORY_PARTITION: NPS1

GPU: 4
    PARTITION:
        COMPUTE_PARTITION: CPX
        MEMORY_PARTITION: NPS1

GPU: 5
    PARTITION:
        COMPUTE_PARTITION: CPX
        MEMORY_PARTITION: NPS1

GPU: 6
    PARTITION:
        COMPUTE_PARTITION: CPX
        MEMORY_PARTITION: NPS1

GPU: 7
    PARTITION:
        COMPUTE_PARTITION: CPX
        MEMORY_PARTITION: NPS1

GPU: 8
    PARTITION:
        COMPUTE_PARTITION: SPX
        MEMORY_PARTITION: NPS1

GPU: 9
    PARTITION:
        COMPUTE_PARTITION: SPX
        MEMORY_PARTITION: NPS1

GPU: 10
    PARTITION:
        COMPUTE_PARTITION: SPX
        MEMORY_PARTITION: NPS1

GPU: 11
    PARTITION:
        COMPUTE_PARTITION: SPX
        MEMORY_PARTITION: NPS1

GPU: 12
    PARTITION:
        COMPUTE_PARTITION: SPX
        MEMORY_PARTITION: NPS1

GPU: 13
    PARTITION:
        COMPUTE_PARTITION: SPX
        MEMORY_PARTITION: NPS1

GPU: 14
    PARTITION:
        COMPUTE_PARTITION: SPX
        MEMORY_PARTITION: NPS1

I am a bit puzzled about the NPS4 memory partition not being available, but maybe I missed something. I need to figure out whether AMDSMI_STATUS_NOT_SUPPORTED comes from the GPU or from the amd-smi version (maybe too old?).

According to the amd-smi changelog, the 6.1.2 version that we have in Debian is many changes behind the latest release, so we should probably try a more up-to-date version and see how it goes.

Installed amd-smi-lib 6.3 manually on ml-serve1013 (we had it in our repos) and this is the result:

elukey@ml-serve1013:~$ sudo /opt/rocm-6.3.0/bin/amd-smi set --memory-partition NPS4

          ****** WARNING ******

          Setting Dynamic Memory (NPS) partition modes require users to quit all GPU workloads.
          AMD SMI will then attempt to change memory (NPS) partition mode.
          Upon a successful set, AMD SMI will then initiate an action to restart amdgpu driver.
          This action will change all GPU's in the hive to the requested memory (NPS) partition mode.

          Please use this utility with caution.
          
Do you accept these terms? [Y/N] Y

amdsmi.amdsmi_exception.AmdSmiLibraryException: Error code:
        2 | AMDSMI_STATUS_NOT_SUPPORTED - Feature not supported

The above exception was the direct cause of the following exception:

ValueError: Unable to set memory partition to NPS4 on GPU ID: 0 BDF:0000:05:00.0

The strange thing is that now I cannot use the -g parameter to select a specific GPU:

elukey@ml-serve1013:~$ sudo /opt/rocm-6.3.0/bin/amd-smi set --memory-partition NPS4 -g 1
amdsmi.amdsmi_exception.AmdSmiParameterException: Invalid parameter:
Actual type: <class 'NoneType'>
Expected type: <class 'ctypes.c_void_p'>

Downloaded amd-smi-lib and rocm-core 6.4.3 from upstream, installed them on ml-serve1013 but no luck:

elukey@ml-serve1013:~$ sudo /opt/rocm-6.4.3/bin/amd-smi set --memory-partition NPS4

          ****** WARNING ******

          Setting Dynamic Memory (NPS) partition modes require users to quit all GPU workloads.
          AMD SMI will then attempt to change memory (NPS) partition mode.
          Upon a successful set, AMD SMI will then initiate an action to restart AMD GPU driver.
          This action will change all GPU's in the hive to the requested memory (NPS) partition mode.

          Please use this utility with caution.
           
Do you accept these terms? [Y/N] Y

GPU: 0
    MEMORY_PARTITION: [AMDSMI_STATUS_NOT_SUPPORTED] Device does not support setting memory partition to NPS4

GPU: 1
    MEMORY_PARTITION: [AMDSMI_STATUS_NOT_SUPPORTED] Device does not support setting memory partition to NPS4

GPU: 2
    MEMORY_PARTITION: [AMDSMI_STATUS_NOT_SUPPORTED] Device does not support setting memory partition to NPS4

GPU: 3
    MEMORY_PARTITION: [AMDSMI_STATUS_NOT_SUPPORTED] Device does not support setting memory partition to NPS4

GPU: 4
    MEMORY_PARTITION: [AMDSMI_STATUS_NOT_SUPPORTED] Device does not support setting memory partition to NPS4

GPU: 5
    MEMORY_PARTITION: [AMDSMI_STATUS_NOT_SUPPORTED] Device does not support setting memory partition to NPS4

GPU: 6
    MEMORY_PARTITION: [AMDSMI_STATUS_NOT_SUPPORTED] Device does not support setting memory partition to NPS4

GPU: 7
    MEMORY_PARTITION: [AMDSMI_STATUS_NOT_SUPPORTED] Device does not support setting memory partition to NPS4

I am now wondering if the Linux driver that we use, included in the 6.12 kernel, is recent enough to allow memory partitioning.

I found https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/gpu-partitioning/mi300x/requirements.html that lists a series of requirements, and the only one that doesn't seem to match (or where it is not clear whether it matches) is the driver version:

AMD GPU Driver: amdgpu-build 2120656 (>= 6.12.12)

elukey@ml-serve1013:~$ sudo /opt/rocm-6.4.3/bin/amd-smi version | grep -o 'amdgpu version: [^|]*'
amdgpu version: Linuxversion6.12.38+deb12-amd64(debian-kernel@lists.debian.org)(x86_64-linux-gnu-gcc-12(Debian12.2.0-14+deb12u1)12.2.0,GNUld(GNUBinutilsforDebian)2.40)#1SMPPREEMPT_DYNAMICDebian6.12.38-1~bpo12+1(2025-07-27)

I found http://repo.radeon.com/amdgpu/6.4.3/ubuntu/pool/main/a/amdgpu-dkms/ that could help.

I tried to install the packages but I ended up in https://github.com/ROCm/ROCm/issues/3036 / https://github.com/ROCm/ROCm/issues/5111:

Consult /var/lib/dkms/amdgpu/6.12.12-2194681.22.04/build/make.log for more information.

checking for module configuration... done
configure: creating ./config.status
config.status: creating config/config.h
Makefile:54: *** dma_resv->seq is missing. exit....  Stop.

So I checked https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.3/reference/system-requirements.html and it seems that for Debian 12 only the 6.1 kernel is supported (we are using 6.12 from backports to get newer drivers). At this point I think we could try to downgrade the kernel to 6.1 on ml-serve1013, install amdgpu-dkms and see if the new driver works as expected.

Edit: 6.1 doesn't work, and as explained in https://github.com/ROCm/ROCm/issues/3036 the 6.12 kernel seems not supported by amdgpu-dkms, so it will be difficult to test it properly :(

I tried to strace amd-smi and it gave some good insights:

newfstatat(AT_FDCWD, "/sys/class/drm/renderD128/device/current_memory_partition", {st_mode=S_IFREG|0444, st_size=4096, ...}, 0) = 0
openat(AT_FDCWD, "/sys/class/drm/renderD128/device/current_memory_partition", O_RDONLY) = 3
read(3, "NPS1\n", 8191)                 = 5
close(3)                                = 0
newfstatat(AT_FDCWD, "/sys/class/drm/renderD128/device/available_memory_partition", 0x7ffcc5cb7070, 0) = -1 ENOENT (No such file or directory)   <=================================
newfstatat(AT_FDCWD, "/sys/class/drm/renderD128/device/current_memory_partition", {st_mode=S_IFREG|0444, st_size=4096, ...}, 0) = 0
openat(AT_FDCWD, "/sys/class/drm/renderD128/device/current_memory_partition", O_WRONLY|O_CREAT|O_TRUNC, 0666) = -1 EACCES (Permission denied)  <=================================
newfstatat(AT_FDCWD, "/sys/class/drm/renderD128/device/current_memory_partition", {st_mode=S_IFREG|0444, st_size=4096, ...}, 0) = 0 
openat(AT_FDCWD, "/sys/class/drm/renderD128/device/current_memory_partition", O_RDONLY) = 3
read(3, "NPS1\n", 8191)                 = 5
close(3)                                = 0
openat(AT_FDCWD, "/dev/null", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 3
newfstatat(3, "", {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}, AT_EMPTY_PATH) = 0
ioctl(3, TCGETS, 0x7ffcc5cb8430)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
write(3, "\n\n\r", 3)                   = 3
write(1, "GPU: 0\n    MEMORY_PARTITION: [AM"..., 116GPU: 0
    MEMORY_PARTITION: [AMDSMI_STATUS_NOT_SUPPORTED] Device does not support setting memory partition to NPS4
elukey@ml-serve1013:/sys/class/drm/renderD128/device$ ls -l *partition*
-r--r--r-- 1 root root 4096 Sep 10 16:11 available_compute_partition
-rw-r--r-- 1 root root 4096 Sep 10 16:11 current_compute_partition
-r--r--r-- 1 root root 4096 Sep 10 16:11 current_memory_partition

The available_memory_partition sysfs entry seems to be added from the 6.13 kernel onward: https://github.com/torvalds/linux/commit/012be6f22c01e25c995c30f1f178ac11820dfb65

It also seems to be present in ROCm 6.4.3, so finding a way to build amdgpu-dkms might solve the problem.
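The sysfs check done by hand above can be scripted. This is only a minimal sketch: the attribute names are the ones visible in the strace and ls output, while the function name and the default path in the usage block are my own choices for illustration.

```python
from pathlib import Path

# sysfs attribute names taken from the strace and ls output above.
PARTITION_ATTRS = (
    "available_compute_partition",
    "current_compute_partition",
    "available_memory_partition",
    "current_memory_partition",
)


def partition_attrs(device_dir):
    """Return {attribute: content}, with None for entries that do not exist."""
    device = Path(device_dir)
    result = {}
    for name in PARTITION_ATTRS:
        entry = device / name
        result[name] = entry.read_text().strip() if entry.is_file() else None
    return result


if __name__ == "__main__":
    # renderD128 is the first render node on the hosts in this task.
    for name, value in partition_attrs("/sys/class/drm/renderD128/device").items():
        print(f"{name}: {value if value is not None else 'MISSING'}")
```

On a kernel without the commit linked above, available_memory_partition should show up as MISSING, which matches the ENOENT seen in the strace.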

After a chat with Moritz I installed the 6.16 linux kernel on ml-serve1012 that runs Trixie, using the sid repository (and I also did the same with firmware-amd-graphics for consistency). I then installed amd-smi manually using the ROCm 6.4.3 version, and:

elukey@ml-serve1012:~$ sudo /opt/rocm-6.4.3/bin/amd-smi set --memory-partition NPS4 -g all

          ****** WARNING ******

          Setting Dynamic Memory (NPS) partition modes require users to quit all GPU workloads.
          AMD SMI will then attempt to change memory (NPS) partition mode.
          Upon a successful set, AMD SMI will then initiate an action to restart AMD GPU driver.
          This action will change all GPU's in the hive to the requested memory (NPS) partition mode.

          Please use this utility with caution.
          
Do you accept these terms? [Y/N] y

Updating memory partition for GPU: 0 
[█████████████████.......................] 61/140 secs remain
GPU: 0
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 1
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 2
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 3
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 4
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 5
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 6
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 7
    MEMORY_PARTITION: Successfully set memory partition to NPS4
elukey@ml-serve1012:~$ ls /dev/dri/renderD1*
/dev/dri/renderD128  /dev/dri/renderD133  /dev/dri/renderD138  /dev/dri/renderD143  /dev/dri/renderD148  /dev/dri/renderD153  /dev/dri/renderD158  /dev/dri/renderD163  /dev/dri/renderD168  /dev/dri/renderD173  /dev/dri/renderD178  /dev/dri/renderD183  /dev/dri/renderD188
/dev/dri/renderD129  /dev/dri/renderD134  /dev/dri/renderD139  /dev/dri/renderD144  /dev/dri/renderD149  /dev/dri/renderD154  /dev/dri/renderD159  /dev/dri/renderD164  /dev/dri/renderD169  /dev/dri/renderD174  /dev/dri/renderD179  /dev/dri/renderD184  /dev/dri/renderD189
/dev/dri/renderD130  /dev/dri/renderD135  /dev/dri/renderD140  /dev/dri/renderD145  /dev/dri/renderD150  /dev/dri/renderD155  /dev/dri/renderD160  /dev/dri/renderD165  /dev/dri/renderD170  /dev/dri/renderD175  /dev/dri/renderD180  /dev/dri/renderD185  /dev/dri/renderD190
/dev/dri/renderD131  /dev/dri/renderD136  /dev/dri/renderD141  /dev/dri/renderD146  /dev/dri/renderD151  /dev/dri/renderD156  /dev/dri/renderD161  /dev/dri/renderD166  /dev/dri/renderD171  /dev/dri/renderD176  /dev/dri/renderD181  /dev/dri/renderD186  /dev/dri/renderD191
/dev/dri/renderD132  /dev/dri/renderD137  /dev/dri/renderD142  /dev/dri/renderD147  /dev/dri/renderD152  /dev/dri/renderD157  /dev/dri/renderD162  /dev/dri/renderD167  /dev/dri/renderD172  /dev/dri/renderD177  /dev/dri/renderD182  /dev/dri/renderD187
elukey@ml-serve1012:~$ ls /dev/dri/renderD1* | wc -l
64

It is encouraging, but other amd-smi subcommands don't show all 64 GPUs with their state as expected. It may only be a display issue, so more tests are needed, but it looks promising!
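The render node count above can be compared against the expected number of partitions with a few lines of Python. This is a rough sketch that assumes every partition shows up as one renderD* node and that no other render-capable device is present on the host; both function names are hypothetical.

```python
import glob


def count_render_nodes(dri_dir="/dev/dri"):
    """Count the renderD* device nodes under the given directory."""
    return len(glob.glob(f"{dri_dir}/renderD*"))


def expected_render_nodes(physical_gpus, partitions_per_gpu):
    """Assumption: each partition is exposed as its own render node."""
    return physical_gpus * partitions_per_gpu


if __name__ == "__main__":
    found = count_render_nodes()
    wanted = expected_render_nodes(physical_gpus=8, partitions_per_gpu=8)
    print(f"render nodes: found {found}, expected {wanted}")
```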

Seeing the partial successes above, I tried playing around a bit today, running rocm-smi and nvtop (the latter was originally nvidia-only, but current versions support AMD/ROCm as well). Unfortunately, both of them hang. Worse, the kernel is extremely unhappy about doing that: dmesg is full of breakage messages (log is attached). The only additional tool that I managed to get to work was this:

$ rocm_agent_enumerator 
gfx000
gfx942
gfx942
gfx942
gfx942
gfx942
gfx942
gfx942
gfx942
gfx942
gfx942
gfx942
gfx942
gfx942
gfx942
gfx942
gfx942
$

I will reboot the machine to see if it maybe was only nvtop that did bad stuff, and rocminfo would have been fine.

@klausman ml-serve1012 is up and running with 6.16 from backports, and nvtop seems to work without horrors in the dmesg. Also please note that rocm-smi is now /opt/rocm-6.4.3/bin/amd-smi, please use that instead of the Debian one. I haven't tried to partition the GPU yet, so the horrors may come afterwards. If you want to do some extra checks before the partitioning lemme know so we can assess if everything works beforehand :)

My current hypothesis is that nvtop is fine with the normal state of things, but can't deal with partitioned GPUs (and then the kernel becomes confused etc). Should be easy enough to prod it.

You also mentioned that 1012 was hanging after a reimage, any idea what was going on there?

My current hypothesis is that nvtop is fine with the normal state of things, but can't deal with partitioned GPUs (and then the kernel becomes confused etc). Should be easy enough to prod it.

Ack so I'll try to partition and run nvtop again :)

You also mentioned that 1012 was hanging after a reimage, any idea what was going on there?

I've seen it happening in the past, it may be that sometimes it takes a huge amount of time to boot, because I found the host up and running after a while when I checked.

With the new settings, on ml-serve1012:

elukey@ml-serve1012:~$ sudo /opt/rocm-6.4.3/bin/amd-smi set --memory-partition NPS4  

          ****** WARNING ******

          Setting Dynamic Memory (NPS) partition modes require users to quit all GPU workloads.
          AMD SMI will then attempt to change memory (NPS) partition mode.
          Upon a successful set, AMD SMI will then initiate an action to restart AMD GPU driver.
          This action will change all GPU's in the hive to the requested memory (NPS) partition mode.

          Please use this utility with caution.
          
Do you accept these terms? [Y/N] y

Updating memory partition for GPU: 0 
[█████████████████.......................] 62/140 secs remain
GPU: 0
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 1
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 2
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 3
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 4
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 5
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 6
    MEMORY_PARTITION: Successfully set memory partition to NPS4

GPU: 7
    MEMORY_PARTITION: Successfully set memory partition to NPS4

I tried amd-smi static and I found 64 GPUs with the following:

VRAM:
    TYPE: HBM
    VENDOR: UNKNOWN
    SIZE: 24568 MB
    BIT_WIDTH: 8192
    MAX_BANDWIDTH: N/A

And nvtop seems to not lead to kernel errors!

Next steps:

  1. IIUC the GPU can work either in SPX mode (single partition for all cores and memory) or in NPS4/CPX mode (8 partitions of GPU compute and memory, 24GB of VRAM each). Are other modes available? For example, would it be possible to have fewer partitions with more VRAM each?
  2. Is the partitioning preserved after a reboot?
  3. We should talk with the K8s-sig folks to copy/test the K8s 1.23 bookworm packages in trixie, so we don't have to wait for 1.31's upgrade to test these hosts as k8s workers.
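The 24GB figure in item 1 can be sanity-checked against the amd-smi static output. The sketch below assumes 192 GB of HBM3 per MI300X (from AMD's published specification, not from this task) and treats the MB reported by amd-smi as MiB, which is also an assumption.

```python
# Reported by `amd-smi static` for each partition in the output above.
REPORTED_MB_PER_PARTITION = 24568
PARTITIONS_PER_GPU = 8  # 64 logical GPUs / 8 physical GPUs

# Assumption: 192 GB of HBM3 per MI300X, expressed in MiB.
NOMINAL_MB_PER_GPU = 192 * 1024

total_reported = REPORTED_MB_PER_PARTITION * PARTITIONS_PER_GPU
nominal_share = NOMINAL_MB_PER_GPU // PARTITIONS_PER_GPU

print(f"reported total per GPU: {total_reported} MB")
print(f"nominal share per partition: {nominal_share} MB")
print(f"difference per partition: {nominal_share - REPORTED_MB_PER_PARTITION} MB")
```

Under those assumptions the reported size is within a few MB of an even 1/8 split of the whole GPU memory.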

Next steps:

  1. IIUC the GPU can work either in SPX mode (single partition for all cores and memory) or in NPS4/CPX mode (8 partitions of GPU compute and memory, 24GB of VRAM each). Are other modes available? For example, would it be possible to have fewer partitions with more VRAM each?

It seems that there are no other options to use/test via amd-smi. Maybe something will change in the future as ROCm upgrades, but at the moment the 6.4.x series limits the working modes to two.

  2. Is the partitioning preserved after a reboot?

It is not preserved :(

  3. We should talk with the K8s-sig folks to copy/test the K8s 1.23 bookworm packages in trixie, so we don't have to wait for 1.31's upgrade to test these hosts as k8s workers.

Change #1189807 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add support for AMD MI300X GPUs (kernel and firmwares)

https://gerrit.wikimedia.org/r/1189807

Change #1189815 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::amd_gpu: add support for amd-smi from ROCm 6.4.3

https://gerrit.wikimedia.org/r/1189815

Change #1189816 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Enable ROCm 6.4.3 amd-smi on ml-serve{1012,1013}

https://gerrit.wikimedia.org/r/1189816

Change #1189807 merged by Elukey:

[operations/puppet@production] Add support for AMD MI300X GPUs (kernel and firmwares)

https://gerrit.wikimedia.org/r/1189807

Change #1189815 merged by Elukey:

[operations/puppet@production] profile::amd_gpu: add support for amd-smi from ROCm 6.4.3

https://gerrit.wikimedia.org/r/1189815

Change #1190201 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::amd_gpu: refactor the class to allow more use cases

https://gerrit.wikimedia.org/r/1190201

Change #1190585 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add a new insetup role for ML GPU hosts

https://gerrit.wikimedia.org/r/1190585

Change #1190201 merged by Elukey:

[operations/puppet@production] profile::amd_gpu: refactor the class to allow more use cases

https://gerrit.wikimedia.org/r/1190201

Change #1190585 merged by Elukey:

[operations/puppet@production] Add a new insetup role for ML GPU hosts

https://gerrit.wikimedia.org/r/1190585

Change #1189816 merged by Elukey:

[operations/puppet@production] Enable ROCm 6.4.3 amd-smi on ml-serve{1012,1013}

https://gerrit.wikimedia.org/r/1189816

Change #1190989 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::amd_gpu: add libdrm-amdgpu1 when running Trixie

https://gerrit.wikimedia.org/r/1190989

Today we found that amd-smi is not a drop-in replacement for rocm-smi when it comes to exporting metrics to Prometheus. We use our own Python wrapper to convert the output of rocm-smi to metrics that we then dump into node-exporter. The command-line parameters etc. have changed massively between the two tools, so we will have to adapt it.

I also took a look at upstream's own metrics exporter, but that is a very big and chunky piece of software that also looks like it depends on upstream's amdgpu (DKMS) driver package.

I have also found the much smaller amd_smi_exporter and will give that a go next.
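To give an idea of what adapting the wrapper involves, here is a minimal sketch that parses the text format of `amd-smi static --partition` shown earlier in this task and renders it in the Prometheus text format. It is not our actual exporter: the metric name amd_gpu_partition_info and both function names are made up for illustration.

```python
SAMPLE = """\
GPU: 0
    PARTITION:
        COMPUTE_PARTITION: CPX
        MEMORY_PARTITION: NPS1

GPU: 8
    PARTITION:
        COMPUTE_PARTITION: SPX
        MEMORY_PARTITION: NPS1
"""


def parse_partition_report(text):
    """Parse the indented `KEY: VALUE` report into {gpu_id: {key: value}}."""
    gpus = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "GPU":
            current = gpus.setdefault(int(value), {})
        elif current is not None and value:
            # Section headers such as "PARTITION:" carry no value and are skipped.
            current[key] = value
    return gpus


def to_prometheus(gpus, metric="amd_gpu_partition_info"):
    """Render one info-style sample per GPU (hypothetical metric name)."""
    lines = [f"# TYPE {metric} gauge"]
    for gpu_id in sorted(gpus):
        fields = gpus[gpu_id]
        labels = (
            f'gpu="{gpu_id}",'
            f'compute="{fields.get("COMPUTE_PARTITION", "unknown")}",'
            f'memory="{fields.get("MEMORY_PARTITION", "unknown")}"'
        )
        lines.append(f"{metric}{{{labels}}} 1")
    return "\n".join(lines) + "\n"


print(to_prometheus(parse_partition_report(SAMPLE)), end="")
```

Parsing the human-readable output is fragile across ROCm releases, so a real exporter would need to be re-checked on every upgrade.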

Change #1190989 merged by Elukey:

[operations/puppet@production] profile::amd_gpu: add libdrm-amdgpu1 when running Trixie

https://gerrit.wikimedia.org/r/1190989

ml-serve1012 and 1013 are now running Trixie, a 6.16 kernel and up-to-date GPU firmware. We are also using amd-smi from ROCm 6.4.3 to support GPU partitioning (so sadly we can't rely on Trixie's version). I tried a reimage of ml-serve1013 and everything was configured nicely, all GPUs are recognized, etc., so the hosts are in a fully working and puppetized state. The hosts are also available to all ml-admins.

Next steps:

  • Add support for the new amd-smi tool's format to the Prometheus GPU exporter.
  • Copy the Kubernetes bookworm packages to trixie-wikimedia, and assign the k8s role to the hosts to see what works and what doesn't. The amd-device-plugin should already support the new GPUs.

Change #1193133 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] prometheus: update the amd-rocm exporter

https://gerrit.wikimedia.org/r/1193133

Both next steps listed above (exporter support for the new amd-smi format, and the Kubernetes packages for trixie-wikimedia with the k8s role assigned) have been completed. I think it is now a matter of testing the GPUs with a real workload :)

Change #1193133 merged by Elukey:

[operations/puppet@production] prometheus: update the amd-rocm exporter

https://gerrit.wikimedia.org/r/1193133

After a chat with the AMD folks, it seems that amd-smi also supports DPX partitioning for compute:

elukey@ml-serve1012:/opt/rocm$ sudo /opt/rocm/bin/amd-smi set -C DPX
GPU: 0
    ACCELERATOR_PARTITION: Successfully set accelerator partition to DPX (profile #1)

GPU: 1
    ACCELERATOR_PARTITION: Successfully set accelerator partition to DPX (profile #1)

GPU: 2
    ACCELERATOR_PARTITION: Successfully set accelerator partition to DPX (profile #1)

GPU: 3
    ACCELERATOR_PARTITION: Successfully set accelerator partition to DPX (profile #1)

GPU: 4
    ACCELERATOR_PARTITION: Successfully set accelerator partition to DPX (profile #1)

GPU: 5
    ACCELERATOR_PARTITION: Successfully set accelerator partition to DPX (profile #1)

GPU: 6
    ACCELERATOR_PARTITION: Successfully set accelerator partition to DPX (profile #1)

GPU: 7
    ACCELERATOR_PARTITION: Successfully set accelerator partition to DPX (profile #1)

I had to manually fix amd-smi's Python parser due to https://github.com/ROCm/amdsmi/issues/132

In this way, I got 16 partitions:

elukey@ml-serve1012:/opt/rocm$ sudo /opt/rocm/bin/amd-smi list
GPU: 0
    BDF: 0000:05:00.0
    UUID: abff74a1-0000-1000-80d1-3c23a9580558
    KFD_ID: 10931
    NODE_ID: 2
    PARTITION_ID: 0

GPU: 1
    BDF: 0000:05:00.1
    UUID: abff74a1-0000-1000-80d1-3c23a9580558
    KFD_ID: 6834
    NODE_ID: 3
    PARTITION_ID: 1

[..]

GPU: 15
    BDF: 0000:e5:00.1
    UUID: 87ff74a1-0000-1000-8000-232c018c9f54
    KFD_ID: 37627
    NODE_ID: 17
    PARTITION_ID: 1

After reading https://instinct.docs.amd.com/projects/amdgpu-docs/en/latest/gpu-partitioning/mi300a/overview.html#iii-cpx-core-partitioned-x-celerator, IIUC the compute partitions will use NPS1, so there is no memory partitioning at all. I'll ask for clarification about what this means in terms of memory sharing etc.

Really interesting:

NAME                       STATUS                     ROLES    AGE   VERSION    LABELS
ml-serve1012.eqiad.wmnet   Ready,SchedulingDisabled   <none>   15d   v1.23.14   amd.com/gpu.cu-count=152,amd.com/gpu.device-id=74a1,amd.com/gpu.simd-count=608,amd.com/gpu.vram=96G,beta.amd.com/gpu.cu-count.152=16,beta.amd.com/gpu.cu-count=152,beta.amd.com/gpu.device-id.74a1=8,beta.amd.com/gpu.device-id=74a1,beta.amd.com/gpu.simd-count.608=16,beta.amd.com/gpu.simd-count=608,beta.amd.com/gpu.vram.96G=16,beta.amd.com/gpu.vram=96G,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=ml-serve1012.eqiad.wmnet,kubernetes.io/os=linux,node.kubernetes.io/disk-type=ssd,topology.kubernetes.io/region=eqiad,topology.kubernetes.io/zone=row-e9

So I see 16x96GB reported by the node labeller, that is as if we had GPUs properly segmented (also memory wise). It may be an artifact of the compute partitions overlay, with the shared memory being used anyway (so multiple processes acting on the same VRAM could affect each other).
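The relevant numbers can be pulled out of the label string programmatically. A small sketch, with the label keys copied from the kubectl output above and hypothetical function names; the 192 GB per physical GPU used in the last comparison is an assumption taken from AMD's published MI300X specification.

```python
# Subset of the node labels from the kubectl output above.
SAMPLE_LABELS = (
    "amd.com/gpu.cu-count=152,amd.com/gpu.device-id=74a1,"
    "amd.com/gpu.vram=96G,beta.amd.com/gpu.vram.96G=16,"
    "beta.amd.com/gpu.device-id.74a1=8"
)


def parse_labels(label_string):
    """Split a comma-separated `key=value` label string into a dict."""
    labels = {}
    for item in label_string.split(","):
        key, sep, value = item.partition("=")
        if sep:
            labels[key.strip()] = value.strip()
    return labels


def gpu_summary(labels):
    """Return (vram label, number of devices reporting that VRAM size)."""
    vram = labels.get("amd.com/gpu.vram")
    count = labels.get(f"beta.amd.com/gpu.vram.{vram}") if vram else None
    return vram, int(count) if count is not None else None


vram, devices = gpu_summary(parse_labels(SAMPLE_LABELS))
labelled_total_gb = devices * int(vram.rstrip("G"))
physical_total_gb = 8 * 192  # assumption: 8 physical GPUs with 192 GB each
print(vram, devices, labelled_total_gb, physical_total_gb)
```

With these inputs the labelled total equals the assumed physical total, so the labels alone cannot tell us whether the two DPX partitions of a GPU are isolated memory-wise or share the same VRAM.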

Change #1197602 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::amd_gpu: upgrade trixie hosts to ROCm 7.0.2 repos

https://gerrit.wikimedia.org/r/1197602

Change #1197602 merged by Elukey:

[operations/puppet@production] profile::amd_gpu: upgrade trixie hosts to ROCm 7.0.2 repos

https://gerrit.wikimedia.org/r/1197602

For the record: with ROCm 7.0.2, I needed to add the following symlink:

elukey@ml-serve1012:/usr/lib/x86_64-linux-gnu$ ls -l libdrm_amdgpu.so
lrwxrwxrwx 2 root root 24 Apr  1  2025 libdrm_amdgpu.so -> libdrm_amdgpu.so.1.124.0

This fixes dlopen errors when running the tool. I'm not sure if there is a better fix since those are Ubuntu packages; we may need to add the symlink via puppet.
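Until the symlink is managed by puppet, the manual step can be made idempotent with something like the sketch below. The helper name is hypothetical; it refuses to touch an existing file or a link that points somewhere else, which is a conservative choice on my side rather than a requirement.

```python
import os


def ensure_symlink(link_path, target):
    """Create link_path -> target if missing. Return True when a link was created."""
    if os.path.islink(link_path) and os.readlink(link_path) == target:
        return False  # already in the desired state
    if os.path.lexists(link_path):
        raise FileExistsError(f"{link_path} exists and is not the expected link")
    os.symlink(target, link_path)
    return True


# Example call (needs root), mirroring the ls output above:
# ensure_symlink("/usr/lib/x86_64-linux-gnu/libdrm_amdgpu.so",
#                "libdrm_amdgpu.so.1.124.0")
```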

Change #1198428 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::amd_gpu: add link for libdrm_amdgpu with ROCm 7.0.2

https://gerrit.wikimedia.org/r/1198428

Change #1198428 merged by Elukey:

[operations/puppet@production] profile::amd_gpu: add link for libdrm_amdgpu with ROCm 7.0.2

https://gerrit.wikimedia.org/r/1198428

Change #1198470 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add MI300X node taints to ml-serve1012

https://gerrit.wikimedia.org/r/1198470

Change #1198470 merged by Elukey:

[operations/puppet@production] Add MI300X node taints to ml-serve1012

https://gerrit.wikimedia.org/r/1198470

I tried to add taints via https://gerrit.wikimedia.org/r/1198470, but IIRC this doesn't work once the kubelet has registered with the k8s API. I executed the following manually:

root@deploy2002:~# kubectl taint nodes ml-serve1012.eqiad.wmnet dedicated=mi300x-experiments:NoExecute
node/ml-serve1012.eqiad.wmnet tainted
root@deploy2002:~# kubectl taint nodes ml-serve1012.eqiad.wmnet dedicated=mi300x-experiments:NoSchedule
node/ml-serve1012.eqiad.wmnet tainted
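With both taints in place, a pod needs matching tolerations for NoSchedule and NoExecute to land and stay on the node. The sketch below is a simplified model of the Kubernetes matching rules (Equal/Exists operators, empty effect matching every effect), written only to illustrate the idea; the dicts mirror the pod spec fields, and the function name is hypothetical.

```python
def tolerates(toleration, taint):
    """Return True if a pod toleration matches a node taint (simplified model)."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if operator == "Exists":
        # An empty key combined with Exists tolerates every taint.
        return key == "" or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


# The two taints applied with kubectl above.
NODE_TAINTS = [
    {"key": "dedicated", "value": "mi300x-experiments", "effect": "NoExecute"},
    {"key": "dedicated", "value": "mi300x-experiments", "effect": "NoSchedule"},
]

# A toleration without an effect covers both taints at once.
pod_tolerations = [
    {"key": "dedicated", "operator": "Equal", "value": "mi300x-experiments"},
]

schedulable = all(
    any(tolerates(tol, taint) for tol in pod_tolerations) for taint in NODE_TAINTS
)
print("pod tolerates all node taints:", schedulable)
```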

Change #1199465 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] prometheus-amd-rocm: fix exporter for ROCm 7.0.2

https://gerrit.wikimedia.org/r/1199465

I also created https://github.com/ROCm/amdsmi/issues/134, since ROCm 7.0.2 seems to use a different GPU usage format when partitions are used.

Change #1199465 merged by Elukey:

[operations/puppet@production] prometheus-amd-rocm: fix exporter for ROCm 7.0.2

https://gerrit.wikimedia.org/r/1199465

Change #1203396 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):

[operations/deployment-charts@master] aya-llm: fix tolerations and affinity

https://gerrit.wikimedia.org/r/1203396

Change #1203396 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: fix tolerations and affinity for aya-llm

https://gerrit.wikimedia.org/r/1203396

Change #1203453 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):

[operations/deployment-charts@master] ml-serve: tweak aya llm mem limits

https://gerrit.wikimedia.org/r/1203453

Change #1203453 merged by Dpogorzelski:

[operations/deployment-charts@master] ml-serve: tweak aya llm mem limits

https://gerrit.wikimedia.org/r/1203453

Change #1215088 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move ml-serve1013 to a ML k8s worker

https://gerrit.wikimedia.org/r/1215088

Change #1215088 merged by Elukey:

[operations/puppet@production] Move ml-serve1013 to a ML k8s worker

https://gerrit.wikimedia.org/r/1215088

ml-serve1013 has been added as a k8s worker, with the necessary taints to prevent regular pods from running on it by accident.

I think that the last step is to figure out how to handle GPU partitioning settings, especially since they are wiped when the host is rebooted.
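One possible direction is a boot-time check that compares the current sysfs state with the desired partitioning and works out which amd-smi commands would be needed. The sketch below only builds the commands and does not run them, since changing the memory partition asks for confirmation and restarts the driver. The function names, the desired CPX/NPS4 values and the memory-before-compute ordering are assumptions for illustration; the sysfs attributes and amd-smi flags are the ones used earlier in this task.

```python
from pathlib import Path

AMD_SMI = "/opt/rocm/bin/amd-smi"  # path used earlier in this task


def read_attr(device_dir, name):
    """Read a sysfs attribute, returning None if it does not exist."""
    entry = Path(device_dir) / name
    return entry.read_text().strip() if entry.is_file() else None


def commands_to_apply(device_dir, want_compute, want_memory):
    """Return the amd-smi invocations needed to reach the desired state."""
    cmds = []
    if read_attr(device_dir, "current_memory_partition") != want_memory:
        cmds.append([AMD_SMI, "set", "--memory-partition", want_memory])
    if read_attr(device_dir, "current_compute_partition") != want_compute:
        cmds.append([AMD_SMI, "set", "--compute-partition", want_compute])
    return cmds


if __name__ == "__main__":
    device = "/sys/class/drm/renderD128/device"
    for cmd in commands_to_apply(device, want_compute="CPX", want_memory="NPS4"):
        print(" ".join(cmd))
```

An empty result means the host already matches the desired state, which would make the check safe to run on every boot or puppet run.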

Change #1216761 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add missing k8s config for ml-serve1013 to ml-serve-eqiad

https://gerrit.wikimedia.org/r/1216761

Change #1216761 merged by Elukey:

[operations/puppet@production] Add missing k8s config for ml-serve1013 to ml-serve-eqiad

https://gerrit.wikimedia.org/r/1216761

I'll close the task as we currently have ML workloads using these GPUs.
If any follow-up is required I'll open a specific task for it.
I will track taint removal in another task.