
Setup & experiments for MI300x GPUs used for LiftWing
Closed, Resolved (Public)

Description

This task is to collect all the extra setup (over the existing GPU machines) and experiments we've done with the MI300X machines that arrived in mid-2025.

Event Timeline

ml-serve1012 and ml-serve1013 are the first two eqiad hosts available for testing. Some high-level thoughts/notes:

  1. From the provisioning and puppet config perspective, all the work is done, so the hosts are ready to be used.
  2. They are currently targeting Debian Bookworm, but we had to use the kernel and GPU firmware from backports to better support the MI300X.
  3. The main issue with Bookworm is that it doesn't ship amd-smi, the tool that allows a user to configure/partition the MI300X GPUs. We should test how the configs are applied, whether they are kept across reboots, how the OS sees them, etc. (a rough sketch of these checks follows this list). Probably testing on Debian Trixie would be good.
  4. Why don't we target Debian Trixie directly? All the k8s packaging for the current version, 1.23, is for Bookworm, and Service Ops hasn't started the Trixie rebuild/migration yet; they are focusing on upgrading to the new k8s version, 1.31, which will have Trixie support. The ML team will likely have to upgrade their k8s clusters during the next quarters, but it will be a long process, and in the meantime it would be great to start testing the MI300X GPUs. We could simply copy the 1.24 k8s packages from bookworm-wikimedia to trixie-wikimedia and adjust puppet for ml-serve1012/13, but it would require some time.
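
A rough idea of the checks from point 3, as a sketch only: the exact amd-smi subcommands and flags vary between ROCm releases, so they should be verified against amd-smi --help before use.

# list the GPUs, then show the current compute/memory partition setup
amd-smi list
amd-smi partition
# switch GPU 0 to CPX (8 partitions per GPU); DPX/SPX work the same way
amd-smi set --gpu 0 --compute-partition CPX
# reboot, then re-run `amd-smi partition` to check whether the mode survived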

Updates:

  • The two hosts, ml-serve101[2,3], now run Debian Trixie and the GPUs are recognized and usable.
  • They currently run amd-smi from ROCm 7.0.2, and we are able to set the various partitioning modes (DPX, CPX, SPX: 2, 8, and 1 partitions respectively).
  • Some metrics are published to Prometheus, but it seems that we cannot get much granularity when using partitions (some metrics like "usage" are reported by amd-smi as N/A).
  • The ml-serve1012 host is a k8s node for ml-serve-eqiad, cordoned from all traffic as a precaution (we may want to use Taints to deploy only specific pods to it for testing).
  • We added the node-labeller daemon to all GPU hosts, so a deployment/pod can now be assigned to a specific node based on the GPUs that it offers. It works well with partitioning, at least at a quick first glance; see the sketch after this list.
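
A quick way to see what the node-labeller exposes and how the node is kept isolated, as a sketch: the exact amd.com/* label keys depend on the node-labeller version, so the grep pattern is an assumption.

# show the amd.com/* labels the node-labeller attached to the node
kubectl get node ml-serve1012 --show-labels | tr ',' '\n' | grep 'amd.com'
# show any taints used to keep generic workloads off the host
kubectl describe node ml-serve1012 | grep -i 'taints'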

Change #1202194 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):

[operations/deployment-charts@master] knative-serving: add podspec features
Why: To allow pods to be scheduled on specific nodes

https://gerrit.wikimedia.org/r/1202194

Change #1202665 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):

[operations/deployment-charts@master] ml-services: add aya-llm

https://gerrit.wikimedia.org/r/1202665

Change #1202194 merged by jenkins-bot:

[operations/deployment-charts@master] knative-serving: add podspec features

https://gerrit.wikimedia.org/r/1202194

Change #1202665 merged by Dpogorzelski:

[operations/deployment-charts@master] ml-services: add aya-llm

https://gerrit.wikimedia.org/r/1202665

Change #1205138 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] llm: bump transformers to v4.51.0

https://gerrit.wikimedia.org/r/1205138

Change #1205138 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] llm: bump transformers to v4.51.0

https://gerrit.wikimedia.org/r/1205138

Change #1205163 had a related patch set uploaded (by AikoChou; author: AikoChou):

[operations/deployment-charts@master] ml-services: update llm image

https://gerrit.wikimedia.org/r/1205163

Change #1205163 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update llm image

https://gerrit.wikimedia.org/r/1205163

achou removed DPogorzelski-WMF as the assignee of this task.
achou moved this task from Unsorted to In Progress on the Machine-Learning-Team board.
achou added a subscriber: DPogorzelski-WMF.

This is a great milestone! Thanks a lot for the work Kevin :)

After the last chat on Slack, I'd do another quick test to see whether the AMD GPU plugin works as expected. We'd need to add amd.com/gpu: 1 to the isvc's resource limits and see whether the number of available GPUs at the k8s scheduler level decreases accordingly (with the current settings there are 64 GPUs on ml-serve1012, so I'd expect the count to drop to 63); something like the sketch below could be used to check.
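
A minimal sketch for the check; it assumes the service is the aya-llm isvc in the llm namespace, as deployed above.

# confirm the isvc actually carries the amd.com/gpu limit
kubectl -n llm get isvc aya-llm -o yaml | grep -B2 -A2 'amd.com/gpu'
# compare the node's allocatable GPUs with what is currently allocated
kubectl describe node ml-serve1012 | grep -A12 'Allocated resources'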

hmmm doesn't seem like it.

Allocated resources:

(Total limits may be over 100 percent, i.e., overcommitted.)
Resource           Requests     Limits
--------           --------     ------
cpu                7450m (2%)   9 (2%)
memory             8870Mi (0%)  9766Mi (0%)
ephemeral-storage  0 (0%)       0 (0%)
hugepages-1Gi      0 (0%)       0 (0%)
hugepages-2Mi      0 (0%)       0 (0%)
amd.com/gpu        1            1

Capacity:

amd.com/gpu:        64
cpu:                384
ephemeral-storage:  143071388Ki
hugepages-1Gi:      0
hugepages-2Mi:      0
memory:             1584856464Ki
pods:               110

Allocatable:

amd.com/gpu:        64
cpu:                338700m
ephemeral-storage:  131854590963
hugepages-1Gi:      0
hugepages-2Mi:      0
memory:             1568014722222080m
pods:               110

Yeah, I think amd.com/gpu: 1 wasn't added when deploying aya, only tolerations; that would explain the result. Though at this point I am not 100% sure how a GPU was added without it. Have we checked whether the pod has a GPU mounted as a device?

My bad, the GPU is there:

root@deploy2002:~# kubectl exec aya-llm-predictor-00015-deployment-65b4577748-6wh2c -n llm  -- ls /dev/dri
card1
renderD128

And the limits setting too:

root@deploy2002:~# kubectl describe pod aya-llm-predictor-00015-deployment-65b4577748-6wh2c -n llm | grep gpu
      amd.com/gpu:  1
      amd.com/gpu:  1

So something may be off in the AMD GPU plugin with the MI300X?

Most likely. I'm currently looking at the builder machine, so I'll come back to this.

perhaps this is relevant:

journalctl -u kubelet --since "10 days ago" | grep -i "amd.com/gpu\|allocate\|device"

Dec 02 13:38:38 ml-serve1012 kubelet[252692]: W1202 13:38:38.185483  252692 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/device-plugins/amd.com_gpu /var/lib/kubelet/device-plugins/amd.com_gpu <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/amd.com_gpu: connect: connection refused". Reconnecting...

Everything else seems correct to me; the pod requested it and seemingly got it:

cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint

"PodDeviceEntries":
        [
          {
            "PodUID": "0f9fde45-55c0-4aa5-9e27-68fbf29c1516",
            "ContainerName": "kserve-container",
            "ResourceName": "amd.com/gpu",
            "DeviceIDs": { "0": ["0000:05:00.0"] },
            "AllocResp": "GhgKCC9kZXYva2ZkEggvZGV2L2tmZBoCcncaJAoOL2Rldi9kcmkvY2FyZDESDi9kZXYvZHJpL2NhcmQxGgJydxouChMvZGV2L2RyaS9yZW5kZXJEMTI4EhMvZGV2L2RyaS9yZW5kZXJEMTI4GgJydw==",
          },
          {
            "PodUID": "58793edd-f673-4b92-98d9-4fcb964988f0",
            "ContainerName": "kserve-container",
            "ResourceName": "amd.com/gpu",
            "DeviceIDs": { "0": ["amdgpu_xcp_0"] },
            "AllocResp": "GhgKCC9kZXYva2ZkEggvZGV2L2tmZBoCcncaJAoOL2Rldi9kcmkvY2FyZDISDi9kZXYvZHJpL2NhcmQyGgJydxouChMvZGV2L2RyaS9yZW5kZXJEMTI5EhMvZGV2L2RyaS9yZW5kZXJEMTI5GgJydw==",
          },
        ],
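
For what it's worth, the AllocResp blobs in the checkpoint appear to be base64-encoded AllocateResponse protobufs, so a rough way to see which device nodes were handed to the container is to decode one (a sketch; strings only shows the printable runs):

echo 'GhgKCC9kZXYva2ZkEggvZGV2L2tmZBoCcncaJAoOL2Rldi9kcmkvY2FyZDESDi9kZXYvZHJpL2NhcmQxGgJydxouChMvZGV2L2RyaS9yZW5kZXJEMTI4EhMvZGV2L2RyaS9yZW5kZXJEMTI4GgJydw==' | base64 -d | strings
# expected to show /dev/kfd, /dev/dri/card1 and /dev/dri/renderD128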

The above could be a false positive; it might happen when the plugin is restarted.

I checked the Allocated resources for ml-serve1009, where we run the revise-tone-task pod on a GPU, and I see the following:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                57450m (65%)       96500m (110%)
  memory             72898759168 (18%)  92259666432 (23%)
  ephemeral-storage  0 (0%)             0 (0%)
  hugepages-1Gi      0 (0%)             0 (0%)
  hugepages-2Mi      0 (0%)             0 (0%)
  amd.com/gpu        2                  2

We have two GPUs on it, and two pods are using both of them (edit-check and revise-tone-task). So Allocated works fine (same as in the MI300X case), but Allocatable doesn't decrease there either:

Allocatable:
  amd.com/gpu:        2
  cpu:                87400m
  ephemeral-storage:  117299717950
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             387030360770560m
  pods:               110

Maybe this is only a weird way for the scheduler to show the GPU values? In theory, scheduling a new pod requesting a GPU on ml-serve1009 should not work; I am pretty sure we have seen deployments stuck in limbo in the past due to the absence of a GPU.

Mmm, but is Allocatable something that varies dynamically? Probably not; if so, everything seems to be working fine. Or am I missing something?
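
For reference, Allocatable is a static figure: capacity minus system reservations, with extended resources like amd.com/gpu advertised by the device plugin, so it does not shrink as pods claim GPUs; the scheduler subtracts the sum of pod requests from it internally. A rough way to compare the two on a node (a sketch; assumes jq is available where kubectl runs):

NODE=ml-serve1009
# what the device plugin advertises
kubectl get node "$NODE" -o json | jq '.status.allocatable["amd.com/gpu"]'
# what the pods on the node currently request
kubectl get pods -A --field-selector spec.nodeName="$NODE" -o json \
  | jq '[.items[].spec.containers[].resources.requests["amd.com/gpu"] // "0" | tonumber] | add'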