
Job not restarting despite liveness probe failures
Closed, Resolved · Public

Description

The continuous job "itwiki-draftbot-continuous" is currently stuck, and I can't work out why. It isn't being restarted either.

I ran the following commands in the tool-itwiki namespace:

$ toolforge jobs list
        Job name:          |        Job type:        |                 Status:
itwiki-draftbot-continuous |       continuous        |                 Running
$ kubectl get pods
NAME                                          READY   STATUS    RESTARTS      AGE
itwiki-draftbot-continuous-66bc59fc69-xfr65   1/1     Running   2 (15d ago)   21d
$ kubectl top pods
NAME                                          CPU(cores)   MEMORY(bytes)
itwiki-draftbot-continuous-66bc59fc69-xfr65   0m           49Mi
$ kubectl events
LAST SEEN                TYPE      REASON             OBJECT                                            MESSAGE
75s (x1282 over 15d)     Normal    Killing            Pod/itwiki-draftbot-continuous-66bc59fc69-xfr65   Container job failed liveness probe, will be restarted

I'm going to force a rerun tomorrow. Can someone please investigate what happened? Thanks.
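For whoever picks this up, the probe configuration and the container restart counter can be pulled in one go (a sketch; it assumes kubectl is configured as in the outputs above, and reuses the pod name from them):

```shell
# Sketch: inspect the liveness probe spec and the container restart count
# for the stuck pod (name copied from the kubectl output above).
POD=itwiki-draftbot-continuous-66bc59fc69-xfr65
if command -v kubectl >/dev/null 2>&1; then
  kubectl get pod "$POD" -o jsonpath='{.spec.containers[0].livenessProbe}'; echo
  kubectl get pod "$POD" -o jsonpath='{.status.containerStatuses[0].restartCount}'; echo
else
  echo "kubectl not available in this shell" >&2
fi
```

If the restart count stays flat while "Killing" events keep accumulating, the kubelet is trying and failing to restart the container rather than ignoring the probe.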

Event Timeline

Restricted Application added a subscriber: Aklapper.

@Sakretsu sorry for the late reply. Are you still experiencing this issue?

@fnegri no I'm not, but what should I do if this happens again? Do you want me to run some other commands? I may also run a new job with a different name and keep the old one stuck so that you have more time to inspect it.

A reproducer would be great. Failing that, yes: if it happens again and you can leave it in place for us to inspect, that would help. Otherwise, output from something like kubectl get deployment/itwiki-draftbot-continuous -o yaml, kubectl describe deployment/itwiki-draftbot-continuous and kubectl get events -o yaml might be helpful for post-debugging.
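The suggested commands could be bundled into a small snapshot helper (a hypothetical sketch; the deployment name is the one from this task, and it assumes kubectl works from the tool's bastion):

```shell
# Sketch: snapshot the state of a stuck deployment for post-debugging.
# DEPLOY is an example name; run from a shell where kubectl is configured.
DEPLOY=itwiki-draftbot-continuous
OUT="debug-snapshot"
mkdir -p "$OUT"
if command -v kubectl >/dev/null 2>&1; then
  kubectl get deployment "$DEPLOY" -o yaml > "$OUT/deployment.yaml"
  kubectl describe deployment "$DEPLOY" > "$OUT/deployment-describe.txt"
  kubectl get events -o yaml > "$OUT/events.yaml"
else
  echo "kubectl not found; run this on the tool's bastion" >&2
fi
```

Capturing the events at the time of the hang matters most, since Kubernetes garbage-collects events after about an hour by default.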

Alright, I'll try leaving it in the namespace until you manage to inspect it then. I think this task can be closed for now. I'll reopen it if the issue reoccurs. Thanks!

fnegri claimed this task.

@fnegri @dcaro I have bad news, the job is stuck again. I'll leave it there for you. You can inspect it when you want.

It got unstuck?

tools.itwiki@tools-bastion-15:~$ tail /data/project/itwiki/draftbot/logs/job-cont.err
[2025-09-17T13:19:54Z] b8bcbd30 2025-09-17T13:19:54.308Z Categoria:Controllare - Campania
[2025-09-17T13:19:54Z] a4904f54 2025-09-17T13:19:54.329Z Categoria:Controllare - settembre 2025
[2025-09-17T13:19:55Z] 0ce1c7e5 2025-09-17T13:19:55.745Z Football League Two 2025-2026
[2025-09-17T13:19:57Z] 9d35d3f5 2025-09-17T13:19:57.317Z Grande Dizionario Enciclopedico
[2025-09-17T13:20:04Z] a72ffc8a 2025-09-17T13:20:04.256Z Jacob Karlstrøm
[2025-09-17T13:20:25Z] 0c051c1c 2025-09-17T13:20:25.586Z Brani musicali di Laura Pausini
[2025-09-17T13:20:41Z] 2e56f190 2025-09-17T13:20:41.578Z Utente:Centoventisei/Sandbox/Cristian Vitali
[2025-09-17T13:20:42Z] 0db37d36 2025-09-17T13:20:42.165Z Discussioni progetto:Popular music
[2025-09-17T13:20:47Z] f1a2df86 2025-09-17T13:20:47.765Z Ohlenrode
[2025-09-17T13:20:47Z] 453d61cd 2025-09-17T13:20:47.904Z Template:Calcio Molde rosa

(and that log file is still growing as of now)

There were a few workers having NFS issues, though, and they were restarted, so that might have forced it to restart somewhere else (see T404584: [tools,nfs,infra] Address tools NFS getting stuck with processes in D state if you are interested in the low-level details). We are looking into ways to avoid that source of locks and/or automate recovery.
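For reference, processes blocked on a hung NFS mount sit in uninterruptible sleep (state "D" in ps) and cannot be killed until the I/O call returns; a quick way to spot them on a worker is:

```shell
# List processes in uninterruptible sleep ("D" state), typically blocked
# on I/O such as a hung NFS mount; the ps header row is kept for context.
ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
```

An empty result (header only) means no process is currently wedged that way on the host.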

Please do keep notifying us if it happens again until we get it sorted for everyone, though; it might be a different thing.

The job got stuck again around 2025-09-22T21:20:59Z. I can't say if it's related to the NFS issue.

tools.itwiki@tools-bastion-15:~/draftbot$ kubectl get pod itwiki-draftbot-continuous-76fcff44b5-5q6wc
NAME                                          READY   STATUS        RESTARTS        AGE
itwiki-draftbot-continuous-76fcff44b5-5q6wc   1/1     Terminating   5 (4d19h ago)   16d
tools.itwiki@tools-bastion-15:~/draftbot$ kubectl top pod itwiki-draftbot-continuous-76fcff44b5-5q6wc
NAME                                          CPU(cores)   MEMORY(bytes)
itwiki-draftbot-continuous-76fcff44b5-5q6wc   0m           45Mi

This time k8s events show a different error:

tools.itwiki@tools-bastion-15:~/draftbot/logs$ kubectl events --for pod/itwiki-draftbot-continuous-76fcff44b5-5q6wc
LAST SEEN               TYPE      REASON          OBJECT                                            MESSAGE
16m                     Warning   FailedKillPod   Pod/itwiki-draftbot-continuous-76fcff44b5-5q6wc   error killing pod: [failed to "KillContainer" for "job" with KillContainerError: "rpc error: code = DeadlineExceeded desc = an error occurs during waiting for container \"f3f3cbd653ae0acc5d120d03209220160108642056df75cd23736c09741420ba\" to be killed: wait container \"f3f3cbd653ae0acc5d120d03209220160108642056df75cd23736c09741420ba\": context deadline exceeded", failed to "KillPodSandbox" for "09c0cc8a-812b-45dd-b1c4-6e3b489ca961" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
3m31s (x31 over 139m)   Warning   FailedKillPod   Pod/itwiki-draftbot-continuous-76fcff44b5-5q6wc   error killing pod: [failed to "KillContainer" for "job" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "09c0cc8a-812b-45dd-b1c4-6e3b489ca961" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

I've created another job named itwiki-draftbot-continuous-temp to restart the bot.
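As a stopgap when a pod is wedged in Terminating because the kubelet cannot kill the container, a force delete at least removes it from the API server (a sketch; tool accounts may lack the permission, and the underlying D-state process stays stuck on the node until an admin reboots it, as happened below):

```shell
# Sketch: force-remove a pod stuck in Terminating from the API server.
# This does NOT unstick the underlying process on the worker node.
POD=itwiki-draftbot-continuous-76fcff44b5-5q6wc
if command -v kubectl >/dev/null 2>&1; then
  kubectl delete pod "$POD" --grace-period=0 --force
else
  echo "kubectl not available in this shell" >&2
fi
```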

https://k8s-status.toolforge.org/namespaces/tool-itwiki/pods/itwiki-draftbot-continuous-76fcff44b5-5q6wc/ shows the pod on tools-k8s-worker-nfs-73, which doesn't appear generally stuck on NFS, at least according to the graphs... but k8s is failing to execute on that node for some reason, which will require an admin.

Yep, the process got stuck on NFS. Restarting the node, which will move the pod somewhere else.

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-24T16:50:40Z] <wmbot~dcaro@acme> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-73 (T400957)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-09-24T16:57:20Z] <wmbot~dcaro@acme> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-73 (T400957)

taavi subscribed.

David says this is fixed.