This task is to collect all the extra setup (over the existing GPU machines) and experiments we've done with the MI300X machines that arrived in mid-2025.
Description
Details
| Status | Assigned | Task |
|---|---|---|
| Resolved | DPogorzelski-WMF | T403599 Setup & experiments for MI300x GPUs used for LiftWing |
| Resolved | klausman | T398600 Upgrade the AMD GPU plugin for k8s to support MI300 GPUs |
| Resolved | DPogorzelski-WMF | T403697 Experiment with amd-smi and the new AMD GPUs MI300x |
| Resolved | elukey | T405891 Add support for K8s 1.23 on Trixie |
| Resolved | klausman | T373806 Investigate Label functionality of AMD GPU device plugin on k8s |
| Resolved | kevinbazira | T410906 Update Aya LLM model-server to run on LiftWing GPUs |
Event Timeline
ml-serve1012 and ml-serve1013 are the first two eqiad hosts available for testing. Some high-level thoughts/notes:
- From the provisioning and puppet config perspective, all the work is done so the hosts are ready to be used.
- They are currently targeting Debian Bookworm, but we had to use kernel and gpu firmware from backports in order to better support the MI300X.
- The main issue with Bookworm is that it doesn't ship amd-smi, the tool that allows a user to configure/partition the MI300X GPUs. We should test how the configs are applied, whether they are kept across reboots, how the OS sees them, etc. Testing on Debian Trixie would probably be a good idea.
- Why don't we target Debian Trixie directly? All the k8s packaging for the current version, 1.23, is for Bookworm, and Service Ops hasn't started the Trixie rebuild/migration yet; they are focusing on upgrading to the new k8s version, 1.31, which will have Trixie support. The ML team will likely have to upgrade its k8s clusters over the next quarters, but that will be a long process, and in the meantime it would be great to start testing the MI300X GPUs. We could simply copy the 1.23 k8s packages from bookworm-wikimedia to trixie-wikimedia and adjust puppet for ml-serve1012/13, but that would require some time.
Updates:
- The two hosts, ml-serve101[2,3], now run Debian Trixie, and the GPUs are recognized and usable.
- They currently run amd-smi from ROCm 7.0.2, and we are able to set various partitioning modes (DPX, CPX, SPX - respectively, 2, 8, 1 partitions total).
- Some metrics are published to Prometheus, but it seems we cannot get much granularity when using partitions (some metrics, like "usage", are reported by amd-smi as N/A).
- The ml-serve1012 host is a k8s node for ml-serve-eqiad, cordoned off from all traffic as a precaution (we may want to use taints to deploy only specific pods to it for testing).
- We added the node-labeller daemon to all GPU hosts, so a deployment/pod can now be assigned to a specific node based on the GPUs it offers. It works well with partitioning, at least at first glance.
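For reference, the partition modes multiply the number of devices a node exposes to k8s. A small sketch of the arithmetic, assuming 8 physical MI300X GPUs per host (our assumption, not a value from the hosts; it is consistent with the amd.com/gpu: 64 capacity reported later in CPX mode):

```shell
# Devices exposed per host for each compute-partition mode
# (SPX = 1, DPX = 2, CPX = 8 partitions per physical GPU).
# physical_gpus=8 is an assumption, not a value read from the host.
physical_gpus=8
for mode in SPX:1 DPX:2 CPX:8; do
  name=${mode%%:*}
  parts=${mode##*:}
  echo "$name: $((physical_gpus * parts)) devices"
done
```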
Change #1202194 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):
[operations/deployment-charts@master] knative-serving: add podspec features (to allow pods to be scheduled on specific nodes)
Change #1202665 had a related patch set uploaded (by Dpogorzelski; author: Dpogorzelski):
[operations/deployment-charts@master] ml-services: add aya-llm
Change #1202194 merged by jenkins-bot:
[operations/deployment-charts@master] knative-serving: add podspec features
Change #1202665 merged by Dpogorzelski:
[operations/deployment-charts@master] ml-services: add aya-llm
Change #1205138 had a related patch set uploaded (by AikoChou; author: AikoChou):
[machinelearning/liftwing/inference-services@main] llm: bump transformers to v4.51.0
Change #1205138 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] llm: bump transformers to v4.51.0
Change #1205163 had a related patch set uploaded (by AikoChou; author: AikoChou):
[operations/deployment-charts@master] ml-services: update llm image
Change #1205163 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update llm image
In T410906#11415517, we successfully tested the llm model-server on LiftWing with MI300X GPU.
This is a great milestone! Thanks a lot for the work Kevin :)
After the last chat on Slack, I'd do another quick test to see whether the AMD GPU plugin works as expected. We'd need to add amd.com/gpu: 1 among the isvc's resource limits and check whether the number of available GPUs at the k8s scheduler level decreases accordingly (with the current settings there are 64 GPUs on ml-serve1012, so I'd expect the count to drop to 63).
hmmm doesn't seem like it.
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                7450m (2%)   9 (2%)
  memory             8870Mi (0%)  9766Mi (0%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  amd.com/gpu        1            1
Capacity:
  amd.com/gpu:        64
  cpu:                384
  ephemeral-storage:  143071388Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1584856464Ki
  pods:               110
Allocatable:
  amd.com/gpu:        64
  cpu:                338700m
  ephemeral-storage:  131854590963
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1568014722222080m
  pods:               110
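One thing worth noting about the output above: Capacity and Allocatable in the node status are static values; the scheduler never decrements them, it subtracts the sum of already-allocated requests from Allocatable when deciding whether a new pod fits. A minimal sketch of that bookkeeping:

```shell
# Allocatable stays at 64 in `kubectl describe node`; what the scheduler
# actually considers free is Allocatable minus the requests already
# allocated (1 here, per the "Allocated resources" section).
allocatable=64
allocated_requests=1
echo "schedulable amd.com/gpu remaining: $((allocatable - allocated_requests))"
```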
yeah, I think amd.com/gpu: 1 wasn't added when deploying aya, only tolerations; that would explain the result. Even so, at this point I'm not 100% sure how a GPU was attached without it. Have we checked whether the pod has a GPU mounted as a device?
My bad, the GPU is there:
root@deploy2002:~# kubectl exec aya-llm-predictor-00015-deployment-65b4577748-6wh2c -n llm -- ls /dev/dri
card1
renderD128
And the limits setting too:
root@deploy2002:~# kubectl describe pod aya-llm-predictor-00015-deployment-65b4577748-6wh2c -n llm | grep gpu
      amd.com/gpu:  1
      amd.com/gpu:  1

So something may be off in the AMD plugin with MI300X?
perhaps this is relevant:
journalctl -u kubelet --since "10 days ago" | grep -i "amd.com/gpu\|allocate\|device"
Dec 02 13:38:38 ml-serve1012 kubelet[252692]: W1202 13:38:38.185483 252692 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/device-plugins/amd.com_gpu /var/lib/kubelet/device-plugins/amd.com_gpu <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/device-plugins/amd.com_gpu: connect: connection refused". Reconnecting...

Everything else seems correct to me; the pod requested it and seemingly got it:
cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint
"PodDeviceEntries":
[
{
"PodUID": "0f9fde45-55c0-4aa5-9e27-68fbf29c1516",
"ContainerName": "kserve-container",
"ResourceName": "amd.com/gpu",
"DeviceIDs": { "0": ["0000:05:00.0"] },
"AllocResp": "GhgKCC9kZXYva2ZkEggvZGV2L2tmZBoCcncaJAoOL2Rldi9kcmkvY2FyZDESDi9kZXYvZHJpL2NhcmQxGgJydxouChMvZGV2L2RyaS9yZW5kZXJEMTI4EhMvZGV2L2RyaS9yZW5kZXJEMTI4GgJydw==",
},
{
"PodUID": "58793edd-f673-4b92-98d9-4fcb964988f0",
"ContainerName": "kserve-container",
"ResourceName": "amd.com/gpu",
"DeviceIDs": { "0": ["amdgpu_xcp_0"] },
"AllocResp": "GhgKCC9kZXYva2ZkEggvZGV2L2tmZBoCcncaJAoOL2Rldi9kcmkvY2FyZDISDi9kZXYvZHJpL2NhcmQyGgJydxouChMvZGV2L2RyaS9yZW5kZXJEMTI5EhMvZGV2L2RyaS9yZW5kZXJEMTI5GgJydw==",
},
],

I checked the Allocated resources for ml-serve1009, where we run the revise-tone-task pod on a GPU, and I see the following:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                57450m (65%)       96500m (110%)
  memory             72898759168 (18%)  92259666432 (23%)
  ephemeral-storage  0 (0%)             0 (0%)
  hugepages-1Gi      0 (0%)             0 (0%)
  hugepages-2Mi      0 (0%)             0 (0%)
  amd.com/gpu        2                  2
We have two GPUs on it, and two pods are using both of them (edit-check and revise-tone-task). So the allocated count works fine (same as in the MI300X case), but Allocatable doesn't decrease either:
Allocatable:
  amd.com/gpu:        2
  cpu:                87400m
  ephemeral-storage:  117299717950
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             387030360770560m
  pods:               110
Maybe this is just how the scheduler displays the GPU values? In theory, scheduling a new pod that requests a GPU on ml-serve1009 should not work; I'm pretty sure we have seen deployments stuck in limbo in the past for lack of a GPU.
Mmm, but is Allocatable something that varies dynamically? Probably not; if that's the case, everything seems to be working fine. Or am I missing something?
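As a final cross-check of the device-plugin behavior, the AllocResp blobs in the kubelet checkpoint above can be decoded: they are base64-encoded protobuf AllocateResponse messages, so the mounted device paths show up as plain strings. A sketch using only coreutils (no protobuf tooling needed):

```shell
# Decode the first AllocResp from kubelet_internal_checkpoint; replacing
# non-printable bytes with newlines is enough to surface the embedded
# device paths (/dev/kfd plus the DRI card/render nodes).
blob='GhgKCC9kZXYva2ZkEggvZGV2L2tmZBoCcncaJAoOL2Rldi9kcmkvY2FyZDESDi9kZXYvZHJpL3JlbmRlckQxMjgaAnJ3'
blob='GhgKCC9kZXYva2ZkEggvZGV2L2tmZBoCcncaJAoOL2Rldi9kcmkvY2FyZDESDi9kZXYvZHJpL2NhcmQxGgJydxouChMvZGV2L2RyaS9yZW5kZXJEMTI4EhMvZGV2L2RyaS9yZW5kZXJEMTI4GgJydw=='
echo "$blob" | base64 -d | tr -c '[:print:]' '\n' | grep '^/dev'
```

The paths match the kubectl exec listing earlier in the thread (card1, renderD128), which suggests the device plugin did mount the partition devices into the container.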