```
tools.whois-referral@tools-sgebastion-11:~$ kubectl get po
NAME                              READY   STATUS        RESTARTS   AGE
whois-referral-7c7858b4f5-qbwfb   1/1     Terminating   0          17d
whois-referral-7c7858b4f5-w8fcs   1/1     Terminating   0          10m
tools.whois-referral@tools-sgebastion-11:~$ kubectl describe po whois-referral-7c7858b4f5-qbwfb
(... snip ...)
Events:
  Type     Reason         Age                   From     Message
  ----     ------         ----                  ----     -------
  Normal   Killing        6m9s (x13 over 15m)   kubelet  Stopping container webservice
  Warning  FailedKillPod  4m41s (x14 over 14m)  kubelet  error killing pod: failed to "KillContainer" for "webservice" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: cannot stop container: bfeef739d3d3dc83272ff6fa8352cb64be47b49f1526aeec5c4bd01415d37ba4: tried to kill container, but did not receive an exit event"
```
Description
Related Objects
- Mentioned Here
- T335336: [toolschecker] jobs mtime check is flapping
Event Timeline
According to htop on the worker host, the uwsgi processes seem to be blocked on disk I/O.
(When I checked maybe ten minutes ago, there were two more processes that were runnable and consuming about 25% CPU; it looks like they exited in the meantime.) Something similar had happened the other day with wd-image-positions too (IRC log).
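For reference, the equivalent check from a plain shell on the worker would be something along these lines (a sketch of the general technique, not a command logged during this incident):

```
# Processes in uninterruptible sleep show state "D" in ps/htop; the wchan column
# hints at the kernel function they are blocked in (for NFS hangs, typically
# nfs_*/rpc_* waits). Keep the header row for readability.
ps -eo pid,stat,wchan:30,cmd | awk 'NR == 1 || $2 ~ /^D/'
```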
Mentioned in SAL (#wikimedia-cloud) [2023-04-27T23:59:33Z] <bd808> kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-67 (T335543)
Mentioned in SAL (#wikimedia-cloud) [2023-04-28T00:04:01Z] <bd808> Rebooting tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud (T335543)
Mentioned in SAL (#wikimedia-cloud) [2023-04-28T00:07:03Z] <bd808> Hard reboot tools-k8s-worker-67.tools.eqiad1.wikimedia.cloud via horizon (T335543)
Mentioned in SAL (#wikimedia-cloud) [2023-04-28T00:09:41Z] <bd808> kubectl uncordon tools-k8s-worker-67 (T335543)
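Putting the SAL entries together, the recovery sequence was roughly the following (the drain and uncordon commands are the ones logged above; the hard reboot was done through the Horizon UI rather than from a shell):

```
# Evict all pods from the affected worker; daemonset pods are ignored,
# emptyDir data is discarded, and --force covers unmanaged/stuck pods.
kubectl drain --ignore-daemonsets --delete-emptydir-data --force tools-k8s-worker-67

# A soft reboot of tools-k8s-worker-67 did not complete (stuck D-state
# processes), so the instance was hard-rebooted via Horizon.

# Once the node was back, allow it to schedule pods again.
kubectl uncordon tools-k8s-worker-67
```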
When draining tools-k8s-worker-67, the whois-referral pods stayed stuck, and their processes also seemed to prevent the soft reboot from taking effect even after being hit with the kill -9 hammer. The hard instance reboot via Horizon, predictably, ended the stuck processes.
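As a side note on why kill -9 has no effect here: a process in uninterruptible sleep cannot act on SIGKILL until the blocking I/O returns, which can be confirmed like this (`<PID>` is a placeholder, not taken from this task; reading the kernel stack needs root):

```
# "D (disk sleep)" confirms uninterruptible sleep; the pending SIGKILL only
# takes effect once the blocking operation (here presumably NFS) completes.
grep '^State' /proc/<PID>/status
# The kernel stack usually shows where the process is stuck; for an NFS hang
# this tends to be nfs_*/rpc_* wait functions.
sudo cat /proc/<PID>/stack
```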
This might have the same root cause as T335336 (NFS went away at some point and did not recover, leaving stuck processes in an uninterruptible state).
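If this is indeed the T335336 pattern, a couple of quick checks on a worker can confirm that NFS has gone away; the mount point below is an assumed example, not a path taken from this task:

```
# The kernel logs "nfs: server ... not responding, still trying" when an NFS
# server stops answering; grepping dmesg does not touch the mount, so this
# check cannot itself get stuck on a hung filesystem (may need sudo).
dmesg -T | grep -i 'nfs.*not responding' | tail -n 5

# Direct probe of the mount (path is an assumption). On a badly hung hard
# mount even this stat can block in D state, hence the timeout and kill-after.
timeout -k 1 5 stat /data/project >/dev/null \
  && echo "NFS mount responding" \
  || echo "NFS mount appears hung (no response within 5s)"
```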