```
tools.whois-referral@tools-sgebastion-11:~$ kubectl get po
NAME                              READY   STATUS        RESTARTS   AGE
whois-referral-7c7858b4f5-qbwfb   1/1     Terminating   0          17d
whois-referral-7c7858b4f5-w8fcs   1/1     Terminating   0          10m
tools.whois-referral@tools-sgebastion-11:~$ kubectl describe po whois-referral-7c7858b4f5-qbwfb
(... snip ...)
Events:
  Type     Reason         Age                   From     Message
  ----     ------         ----                  ----     -------
  Normal   Killing        6m9s (x13 over 15m)   kubelet  Stopping container webservice
  Warning  FailedKillPod  4m41s (x14 over 14m)  kubelet  error killing pod: failed to "KillContainer" for "webservice" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: cannot stop container: bfeef739d3d3dc83272ff6fa8352cb64be47b49f1526aeec5c4bd01415d37ba4: tried to kill container, but did not receive an exit event"
```
Description
Related Objects
- Mentioned Here
- T335336: [toolschecker] jobs mtime check is flapping
Event Timeline
According to htop on the worker host, the uwsgi processes seem to be blocked on disk I/O.
(When I checked maybe ten minutes ago, there were two more processes that were runnable and consuming about 25% CPU; it looks like they exited in the meantime.) Something similar had happened the other day with wd-image-positions too (IRC log).
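For reference, the equivalent check from a plain shell on the worker would be something along these lines (a sketch of the general technique, not a command logged during this incident):

```
# Processes in uninterruptible sleep show state "D" in ps/htop; the wchan column
# hints at the kernel function they are blocked in (for NFS hangs, typically
# nfs_*/rpc_* waits). Keep the header row for readability.
ps -eo pid,stat,wchan:30,cmd | awk 'NR == 1 || $2 ~ /^D/'
```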
Mentioned in SAL (#wikimedia-cloud) [2023-04-27T23:59:33Z] <bd808> kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-67 (T335543)
Mentioned in SAL (#wikimedia-cloud) [2023-04-28T00:04:01Z] <bd808> Rebooting tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud (T335543)
Mentioned in SAL (#wikimedia-cloud) [2023-04-28T00:07:03Z] <bd808> Hard reboot tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud via horizon (T335543)
Mentioned in SAL (#wikimedia-cloud) [2023-04-28T00:09:41Z] <bd808> kubectl uncordon tools-k8s-worker-67 (T335543)
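Putting the SAL entries together, the recovery sequence was roughly the following (the drain and uncordon commands are the ones logged above; the hard reboot was done through the Horizon UI rather than from a shell):

```
# Evict all pods from the affected worker; daemonset pods are ignored,
# emptyDir data is discarded, and --force covers unmanaged/stuck pods.
kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-67

# A soft reboot of tools-k8s-worker-67 did not complete (stuck D-state
# processes), so the instance was hard-rebooted via Horizon.

# Once the node was back, allow it to schedule pods again.
kubectl uncordon tools-k8s-worker-67
```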
When draining tools-k8s-worker-67, the whois-referral pods stayed stuck, and their processes also seemed to prevent the soft reboot from taking effect even after being hit with the kill -9 hammer. The hard instance reboot via Horizon, predictably, ended the stuck processes.
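As a side note on why kill -9 has no effect here: a process in uninterruptible sleep cannot act on SIGKILL until the blocking I/O returns, which can be confirmed like this (`<PID>` is a placeholder, not taken from this task; reading the kernel stack needs root):

```
# "D (disk sleep)" confirms uninterruptible sleep; the pending SIGKILL only
# takes effect once the blocking operation (here presumably NFS) completes.
grep '^State' /proc/<PID>/status
# The kernel stack usually shows where the process is stuck; for an NFS hang
# this tends to be nfs_*/rpc_* wait functions.
sudo cat /proc/<PID>/stack
```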
This might have the same root cause as T335336 (NFS went away at some point and did not recover, leaving stuck processes in an uninterruptible state).
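If this is indeed the T335336 pattern, a couple of quick checks on a worker can confirm that NFS has gone away; the mount point below is an assumed example, not a path taken from this task:

```
# The kernel logs "nfs: server ... not responding, still trying" when an NFS
# server stops answering; grepping dmesg does not touch the mount, so this
# check cannot itself get stuck on a hung filesystem (may need sudo).
dmesg -T | grep -i 'nfs.*not responding' | tail -n 5

# Direct probe of the mount (path is an assumption). On a badly hung hard
# mount even this stat can block in D state, hence the timeout and kill-after.
timeout -k 1 5 stat /data/project >/dev/null \
  && echo "NFS mount responding" \
  || echo "NFS mount appears hung (no response within 5s)"
```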