dbreps job pending to start for 2d16h on Toolforge
Closed, Resolved · Public

Description

tools.dbreps@tools-sgebastion-10:~$ toolforge jobs list
Job name:    Job type:            Status:
-----------  -------------------  -------------------
rusty        schedule: 0 * * * *  Running for 2d16h6m

Turns out it hasn't actually started yet:

tools.dbreps@tools-sgebastion-10:~$ kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP               NODE                      NOMINATED NODE   READINESS GATES
dbreps-84b58fdd58-7blq4   1/1     Running   0          43h     192.168.46.116   tools-k8s-worker-nfs-31   <none>           <none>
rusty-28472400-92lw6      0/1     Pending   0          2d16h   <none>           tools-k8s-worker-nfs-38   <none>           <none>
tools.dbreps@tools-sgebastion-10:~$ kubectl logs rusty-28472400-92lw6
Error from server (BadRequest): container "job" in pod "rusty-28472400-92lw6" is waiting to start:

Why has it been waiting 2 and a half days to start? Note: I've left this as-is, in the broken pending state, in case it's useful for debugging.
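
For anyone debugging something similar, kubectl describe on the stuck pod usually surfaces the scheduling and sandbox events that explain why it never started; a sketch using the pod name from above:

tools.dbreps@tools-sgebastion-10:~$ kubectl describe pod rusty-28472400-92lw6
tools.dbreps@tools-sgebastion-10:~$ kubectl get events --field-selector involvedObject.name=rusty-28472400-92lw6

The Events section there typically points at the node-side problem without needing access to the kubelet logs.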

Event Timeline

Legoktm renamed this task from "dbreps job pending to start for 2d16h" to "dbreps job pending to start for 2d16h on Toolforge". Feb 22 2024, 4:15 AM
taavi subscribed.
Feb 19 12:14:11 tools-k8s-worker-nfs-38 kubelet[3990]: E0219 12:14:11.504588    3990 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"KillPodSandbox\" for \"dc967ea3-c6f4-4ca2-bf06-66b497e405a3\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\"1de0a6c361bd7b7a5762ee3b22c3edf98617adf793e40961e2e38d32d9990282\\\": plugin type=\\\"loopback\\\" failed (delete): failed to find plugin \\\"loopback\\\" in path [/usr/lib/cni]\"" pod="tool-dbreps/rusty-28472400-92lw6" podUID=dc967ea3-c6f4-4ca2-bf06-66b497e405a3

So the CNI path there is wrong, and our containerd config Puppetization is supposed to change that. That node was affected by T358179: [wmcs-cookbooks] wmcs.toolforge.add_k8s_node occasionally fails to setup custom Puppetmaster, so I think your pod was affected because I failed to drain and reboot that node after fixing the certificates.
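
For context, "drain + reboot" is roughly the standard Kubernetes node remediation; a generic sketch with plain kubectl, run with cluster-admin access (the actual Toolforge procedure goes through the wmcs-cookbooks automation referenced above):

# evict workloads from the affected worker and mark it unschedulable
$ kubectl drain tools-k8s-worker-nfs-38 --ignore-daemonsets --delete-emptydir-data
# reboot the node so the puppetized containerd/CNI configuration takes effect,
# then allow scheduling on it again
$ kubectl uncordon tools-k8s-worker-nfs-38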

I'll see if we can alert on pods stuck in Pending for a while.

taavi changed the task status from Open to In Progress. Feb 22 2024, 8:12 AM
taavi triaged this task as High priority.
taavi moved this task from Next Up to In Progress on the Toolforge (Toolforge iteration 06) board.

> I'll see if we can alert on pods stuck in Pending for a while.

Not easily, the same Pending status as reported by kube-state-metrics seems to also include pods where the configured image does not exist and other user errors.

> Not easily, the same Pending status as reported by kube-state-metrics seems to also include pods where the configured image does not exist and other user errors.

Does this happen a lot? I would've thought webservice/toolforge jobs would prevent that from happening.

Also, one other thing: toolforge jobs incorrectly reports the job as "Running for ..." when the pod is actually Pending, which you have to use kubectl to see.

> Does this happen a lot? I would've thought webservice/toolforge jobs would prevent that from happening.

There are a few tools currently like this, mostly webservices. The Jobs API does validate it; webservice at the moment does not, but that'll be fixed with T348755. And of course people can use kubectl directly to create jobs with non-existent images, but hopefully those people know how to check whether the image exists.

But at least at the moment there are too many of those to get a meaningful alert that isn't mostly triggered by user error.
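
To illustrate the noise problem: an alert on the kube-state-metrics Pending phase is essentially an alert on this list, which mixes infrastructure problems with user errors; a sketch (needs cluster-wide read access):

$ kubectl get pods --all-namespaces --field-selector=status.phase=Pending

A pod whose image cannot be pulled also stays in phase Pending (with an ErrImagePull/ImagePullBackOff container status), so it lands in the same bucket as a pod stuck on a broken node.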

> Also, one other thing: toolforge jobs incorrectly reports the job as "Running for ..." when the pod is actually Pending, which you have to use kubectl to see.

As far as I can tell the job did run successfully, and the error happened while the pod was being removed. But yes, it should report something better in this case. Too bad I didn't save the job YAML before draining the nodes.
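
Until that is improved, one way to cross-check from the tool account is to ask Kubernetes for the pod phase directly; a sketch using the pod name from the description:

tools.dbreps@tools-sgebastion-10:~$ kubectl get pod rusty-28472400-92lw6 -o jsonpath='{.status.phase}'

In this case that would have printed Pending, while toolforge jobs list was still showing "Running for 2d16h6m".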