dbreps job pending to start for 2d16h on Toolforge
Closed, Resolved · Public

Description

tools.dbreps@tools-sgebastion-10:~$ toolforge jobs list
Job name:    Job type:            Status:
-----------  -------------------  -------------------
rusty        schedule: 0 * * * *  Running for 2d16h6m

Turns out it hasn't actually started yet:

tools.dbreps@tools-sgebastion-10:~$ kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP               NODE                      NOMINATED NODE   READINESS GATES
dbreps-84b58fdd58-7blq4   1/1     Running   0          43h     192.168.46.116   tools-k8s-worker-nfs-31   <none>           <none>
rusty-28472400-92lw6      0/1     Pending   0          2d16h   <none>           tools-k8s-worker-nfs-38   <none>           <none>
tools.dbreps@tools-sgebastion-10:~$ kubectl logs rusty-28472400-92lw6
Error from server (BadRequest): container "job" in pod "rusty-28472400-92lw6" is waiting to start:

Why has it been waiting 2 and a half days to start? Note: I've left this as-is, in the broken pending state, in case it's useful for debugging.
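
For anyone debugging something similar, kubectl describe on the stuck pod usually surfaces the scheduling and sandbox events that explain why it never started; a sketch using the pod name from above:

tools.dbreps@tools-sgebastion-10:~$ kubectl describe pod rusty-28472400-92lw6
tools.dbreps@tools-sgebastion-10:~$ kubectl get events --field-selector involvedObject.name=rusty-28472400-92lw6

The Events section there typically points at the node-side problem without needing access to the kubelet logs.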

Event Timeline

Legoktm renamed this task from "dbreps job pending to start for 2d16h" to "dbreps job pending to start for 2d16h on Toolforge". Feb 22 2024, 4:15 AM
taavi subscribed.
Feb 19 12:14:11 tools-k8s-worker-nfs-38 kubelet[3990]: E0219 12:14:11.504588    3990 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"dc967ea3-c6f4-4ca2-bf06-66b497e405a3\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\"1de0a6c361bd7b7a5762ee3b22c3edf98617adf793e40961e2e38d32d9990282\\\": plugin type=\\\"loopback\\\" failed (delete): failed to find plugin \\\"loopback\\\" in path [/usr/lib/cni]\"" pod="tool-dbreps/rusty-28472400-92lw6" podUID=dc967ea3-c6f4-4ca2-bf06-66b497e405a3

So the CNI path there is wrong, and our containerd config Puppetization is supposed to change that. That node was affected by T358179: [wmcs-cookbooks] wmcs.toolforge.add_k8s_node occasionally fails to setup custom Puppetmaster, so I think your pod was affected because I failed to drain and reboot that node after fixing the certificates.
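
For context, "drain + reboot" is roughly the standard Kubernetes node remediation; a generic sketch with plain kubectl, run with cluster-admin access (the actual Toolforge procedure goes through the wmcs-cookbooks automation referenced above):

# evict workloads from the affected worker and mark it unschedulable
$ kubectl drain tools-k8s-worker-nfs-38 --ignore-daemonsets --delete-emptydir-data
# reboot the node so the puppetized containerd/CNI configuration takes effect,
# then allow scheduling on it again
$ kubectl uncordon tools-k8s-worker-nfs-38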

I'll see if we can alert on pods stuck in Pending for a while.

taavi changed the task status from Open to In Progress. Feb 22 2024, 8:12 AM
taavi triaged this task as High priority.
taavi moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 06) board.

> I'll see if we can alert on pods stuck in Pending for a while.

Not easily, the same Pending status as reported by kube-state-metrics seems to also include pods where the configured image does not exist and other user errors.

> Not easily, the same Pending status as reported by kube-state-metrics seems to also include pods where the configured image does not exist and other user errors.

Does this happen a lot? I would've thought webservice/toolforge jobs would prevent that from happening.

Also, one other thing: toolforge jobs incorrectly reports the job as "Running for ..." when the pod is actually Pending, which you have to use kubectl to see.

> Does this happen a lot? I would've thought webservice/toolforge jobs would prevent that from happening.

There are a few tools currently like this, mostly webservices. The Jobs API does validate it; webservice at the moment does not, but that'll be fixed with T348755. And of course people can use kubectl directly to create jobs with non-existent images, but hopefully those people know how to check whether the image exists.

But at least at the moment there are too many of those to get a meaningful alert that isn't mostly triggered by user error.
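
To illustrate the noise problem: an alert on the kube-state-metrics Pending phase is essentially an alert on this list, which mixes infrastructure problems with user errors; a sketch (needs cluster-wide read access):

$ kubectl get pods --all-namespaces --field-selector=status.phase=Pending

A pod whose image cannot be pulled also stays in phase Pending (with an ErrImagePull/ImagePullBackOff container status), so it lands in the same bucket as a pod stuck on a broken node.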

> Also, one other thing: toolforge jobs incorrectly reports the job as "Running for ..." when the pod is actually Pending, which you have to use kubectl to see.

As far as I can tell the job did run successfully, and the error happened while the pod was being removed. But yes, it should report something better in this case. Too bad I didn't save the job YAML before draining the nodes.
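
Until that is improved, one way to cross-check from the tool account is to ask Kubernetes for the pod phase directly; a sketch using the pod name from the description:

tools.dbreps@tools-sgebastion-10:~$ kubectl get pod rusty-28472400-92lw6 -o jsonpath='{.status.phase}'

In this case that would have printed Pending, while toolforge jobs list was still showing "Running for 2d16h6m".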