Paste P86741

embeddings isvc deployment in experimental ns failing because of insufficient GPUs

Authored by kevinbazira on Dec 22 2025, 9:41 AM.
$ kubectl get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
embeddings-predictor-00001-deployment-846d4b7ddd-zq8p5   0/3     Pending   0          4m1s
$ kubectl describe pod embeddings-predictor-00001-deployment-846d4b7ddd-zq8p5
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 4m7s default-scheduler 0/5 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate, 3 Insufficient amd.com/gpu.
Warning FailedScheduling 2m (x1 over 3m) default-scheduler 0/5 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate, 3 Insufficient amd.com/gpu.
Warning FailedScheduling 47s default-scheduler 0/5 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate, 3 Insufficient amd.com/gpu.
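The event message above is the scheduler's filter summary: every node is rejected either by an untolerated taint or because its allocatable `amd.com/gpu` count, minus what running pods already request, is less than the pod's request. A minimal sketch of that tally, with hypothetical node data (2 tainted control-plane nodes, 3 workers of which only 2 carry a GPU and both GPUs are already claimed), not real cluster state:

```python
# Illustrative model of how kube-scheduler filters nodes on an
# extended resource such as amd.com/gpu. Node names, allocatable
# counts, and usage below are made up for the example.

GPU = "amd.com/gpu"

nodes = [
    {"name": "ctrl-1",   "tainted": True,  "allocatable": 0},
    {"name": "ctrl-2",   "tainted": True,  "allocatable": 0},
    {"name": "worker-1", "tainted": False, "allocatable": 1},
    {"name": "worker-2", "tainted": False, "allocatable": 1},
    {"name": "worker-3", "tainted": False, "allocatable": 0},
]

# GPUs already requested by running pods, per node.
used = {"worker-1": 1, "worker-2": 1, "worker-3": 0}


def filter_nodes(pod_gpu_request):
    """Return (node, rejection reason or 'fits') for each node."""
    results = []
    for n in nodes:
        if n["tainted"]:
            results.append((n["name"], "untolerated taint"))
        elif n["allocatable"] - used.get(n["name"], 0) < pod_gpu_request:
            results.append((n["name"], f"Insufficient {GPU}"))
        else:
            results.append((n["name"], "fits"))
    return results


results = filter_nodes(1)
tainted = sum(1 for _, r in results if "taint" in r)
insufficient = sum(1 for _, r in results if r.startswith("Insufficient"))
print(f"0/{len(nodes)} nodes are available: "
      f"{tainted} tainted, {insufficient} Insufficient {GPU}")
# → 0/5 nodes are available: 2 tainted, 3 Insufficient amd.com/gpu
```

With both MI210s claimed, freeing a single GPU (as Daniel suggests below in the thread) is enough to flip one worker to "fits".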

Event Timeline

Hi @DPogorzelski-WMF, in 1220313, we deployed the embeddings inference service to the LiftWing experimental namespace. However, as shown in the paste above, the pod appears to be stuck in a pending state indefinitely. Upon reviewing the events, we found that the failure is due to insufficient GPUs.

IIUC, the experimental namespace is hosted in codfw, which should have access to MI210 GPUs, while the MI300X GPUs are in eqiad. If this understanding is correct, could you help clarify what might be causing this issue or if there's something we might have missed?

I think you can try to remove `amd.com/gpu: "1"`
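For context, that GPU request typically lives in the predictor's resource spec of the InferenceService manifest. A hypothetical sketch (KServe-style field layout; names and values are illustrative, not the actual embeddings manifest):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: embeddings
spec:
  predictor:
    containers:
      - name: kserve-container
        resources:
          requests:
            amd.com/gpu: "1"   # dropping this (and the limit) lets the pod schedule on CPU-only nodes
          limits:
            amd.com/gpu: "1"
```

Note that for extended resources like `amd.com/gpu`, Kubernetes requires requests and limits to be equal if specified, so both lines would need to be removed together.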

Ignore my suggestion above; it seems that the MI210 GPUs are both taken by revise-tone-task, one running in revise-tone-task-generator and one in the experimental namespace.
I would suggest removing revise-tone-task-generator from the experimental namespace, since we also have it in its own namespace on staging. That should free up 1 GPU.


Thanks @DPogorzelski-WMF for the help! The embeddings isvc pod is now running successfully in the experimental namespace:

$ kubectl get pods
NAME                                                              READY   STATUS    RESTARTS   AGE
embeddings-predictor-00001-deployment-846d4b7ddd-zq8p5            3/3     Running   0          174m