Event Timeline
Hi @DPogorzelski-WMF, in 1220313, we deployed the embeddings inference service to the LiftWing experimental namespace. However, as shown in the paste above, the pod appears to be stuck in a pending state indefinitely. Upon reviewing the events, we found that the failure is due to insufficient GPUs.
IIUC, the experimental namespace is hosted in codfw, which should have access to MI210 GPUs, while the MI300X GPUs are in eqiad. If this understanding is correct, could you help clarify what might be causing this issue or if there's something we might have missed?
Ignore my suggestion above. It seems that both MI210 GPUs are taken by revise-tone-task: one running in revise-tone-task-generator and one in the experimental namespace.
I would suggest removing revise-tone-task-generator from the experimental namespace, since we also have it in its own namespace on staging. That should free up one GPU.
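For anyone hitting the same "insufficient GPUs" event, here is a minimal sketch of how one might tally which scheduled pods hold GPUs. It assumes the GPUs are exposed under the `amd.com/gpu` resource name (the AMD device plugin's usual name, not confirmed for this cluster), and the pod names in the sample data are hypothetical.

```python
from collections import defaultdict

# Assumption: GPUs are advertised under the AMD device plugin resource name.
GPU_RESOURCE = "amd.com/gpu"


def gpu_holders(pod_list: dict, resource: str = GPU_RESOURCE) -> dict:
    """Return {(namespace, pod): gpu_count} for pods already placed on a node.

    `pod_list` is expected to have the shape of a Kubernetes PodList in JSON.
    Pods without spec.nodeName (for example, Pending ones) are skipped because
    they have not been allocated anything yet.
    """
    totals = defaultdict(int)
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        if not spec.get("nodeName"):
            continue
        key = (pod["metadata"]["namespace"], pod["metadata"]["name"])
        for container in spec.get("containers", []):
            limits = container.get("resources", {}).get("limits", {})
            count = int(limits.get(resource, 0))
            if count:
                totals[key] += count
    return dict(totals)


# Hypothetical sample mirroring the situation in this thread:
# two scheduled pods each hold one GPU, and a third pod is still unscheduled.
sample = {
    "items": [
        {
            "metadata": {"namespace": "revise-tone-task-generator", "name": "pod-a"},
            "spec": {
                "nodeName": "node-1",
                "containers": [{"resources": {"limits": {GPU_RESOURCE: "1"}}}],
            },
        },
        {
            "metadata": {"namespace": "experimental", "name": "pod-b"},
            "spec": {
                "nodeName": "node-1",
                "containers": [{"resources": {"limits": {GPU_RESOURCE: "1"}}}],
            },
        },
        {
            "metadata": {"namespace": "experimental", "name": "pod-c"},
            "spec": {
                "containers": [{"resources": {"limits": {GPU_RESOURCE: "1"}}}],
            },
        },
    ]
}

holders = gpu_holders(sample)
for (namespace, name), count in sorted(holders.items()):
    print(f"{namespace}/{name}: {count}")
```

On a real cluster the input could come from the JSON output of `kubectl get pods --all-namespaces -o json`, parsed with `json.load`.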
Thanks @DPogorzelski-WMF for the help! The embeddings isvc pod is now running successfully in the experimental namespace:
$ kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
embeddings-predictor-00001-deployment-846d4b7ddd-zq8p5  3/3     Running   0          174m