
Pods getting stuck in "Terminating" status
Closed, Resolved · Public · BUG REPORT

Description

tools.whois-referral@tools-sgebastion-11:~$ kubectl get po
NAME                              READY   STATUS        RESTARTS   AGE
whois-referral-7c7858b4f5-qbwfb   1/1     Terminating   0          17d
whois-referral-7c7858b4f5-w8fcs   1/1     Terminating   0          10m
tools.whois-referral@tools-sgebastion-11:~$ kubectl describe po whois-referral-7c7858b4f5-qbwfb
(... snip ...)
Events:
  Type     Reason         Age                   From     Message
  ----     ------         ----                  ----     -------
  Normal   Killing        6m9s (x13 over 15m)   kubelet  Stopping container webservice
  Warning  FailedKillPod  4m41s (x14 over 14m)  kubelet  error killing pod: failed to "KillContainer" for "webservice" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: cannot stop container: bfeef739d3d3dc83272ff6fa8352cb64be47b49f1526aeec5c4bd01415d37ba4: tried to kill container, but did not receive an exit event"
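A common stopgap for pods stuck like this is to force-delete the pod object so the ReplicaSet can schedule a replacement elsewhere; note that this only removes the pod from the API server and does not kill processes that are stuck in the kernel (illustrative command, not something that was run as part of this report):

tools.whois-referral@tools-sgebastion-11:~$ kubectl delete pod whois-referral-7c7858b4f5-qbwfb --grace-period=0 --force

kubectl will warn that immediate deletion does not wait for confirmation that the resource has been terminated, which matches what happened here: the underlying container kept running until the node was rebooted.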

Event Timeline

According to htop on the worker host, the uwsgi processes seem to be blocked on disk I/O:

[image.png: htop screenshot of the uwsgi processes, 109 KB]

(When I checked maybe ten minutes ago, there were two more processes that were runnable and consuming about 25% CPU; it looks like they exited in the meantime.) Something similar had happened the other day with wd-image-positions too (IRC log).
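For completeness, the D (uninterruptible sleep) state that htop shows can also be confirmed from a plain shell on the worker, along with the kernel wait channel each process is blocked in (illustrative command, assuming shell access to the worker):

user@tools-k8s-worker-67:~$ ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/'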

Mentioned in SAL (#wikimedia-cloud) [2023-04-27T23:59:33Z] <bd808> kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-67 (T335543)

Mentioned in SAL (#wikimedia-cloud) [2023-04-28T00:04:01Z] <bd808> Rebooting tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud (T335543)

Mentioned in SAL (#wikimedia-cloud) [2023-04-28T00:07:03Z] <bd808> Hard reboot tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud via horizon (T335543)

Mentioned in SAL (#wikimedia-cloud) [2023-04-28T00:09:41Z] <bd808> kubectl uncordon tools-k8s-worker-67 (T335543)

When draining tools-k8s-worker-67, the whois-referral pods stayed stuck and also seemed to prevent the soft reboot from completing, even after the processes were hit with kill -9. The hard instance reboot via Horizon did, unsurprisingly, end the stuck processes.

This might have the same root cause as T335336 (NFS went away at some point and did not recover, leaving the stuck processes in uninterruptible sleep).
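If that is the case, the kernel stacks of the stuck processes should show NFS/RPC frames, and touching the mount itself may hang. A rough way to check (sketch only; <stuck-pid> is a placeholder, and /data/project/whois-referral is assumed to be the tool's NFS-backed home directory):

root@tools-k8s-worker-67:~# cat /proc/<stuck-pid>/stack
root@tools-k8s-worker-67:~# timeout 10 ls /data/project/whois-referral || echo 'mount not responding'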