kubernetes1014 unresponsive
kubernetes1014 has been failing multiple Icinga checks since around 2022-02-06 16:40Z.

I was unable to SSH in, but from the k8s point of view the node was still Ready, so I drained it at 06:10Z this morning.
The kubectl process is still running, with all the Pods on kubernetes1014 listed as Terminated (but not yet evicted).

Node conditions as reported by the API:

Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
----                 ------  -----------------                 ------------------                ------                       -------
NetworkUnavailable   False   Thu, 09 Dec 2021 14:34:47 +0000   Thu, 09 Dec 2021 14:34:47 +0000   CalicoIsUp                   Calico is running on this node
MemoryPressure       False   Mon, 07 Feb 2022 07:50:56 +0000   Tue, 23 Mar 2021 09:53:07 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
DiskPressure         False   Mon, 07 Feb 2022 07:50:56 +0000   Wed, 29 Sep 2021 07:28:27 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
PIDPressure          False   Mon, 07 Feb 2022 07:50:56 +0000   Tue, 23 Mar 2021 09:53:07 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
Ready                False   Mon, 07 Feb 2022 07:50:56 +0000   Mon, 07 Feb 2022 06:15:04 +0000   KubeletNotReady              PLEG is not healthy: pleg was last seen active 1h38m54.037383153s ago; threshold is 3m0s
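The Ready condition above encodes when the PLEG was last active, relative to the last heartbeat. A minimal sketch of that arithmetic follows, assuming the message was generated at about the LastHeartbeatTime shown in the table. The helper name parse_go_duration is hypothetical and only handles the simple Go duration forms seen here.

```python
import re
from datetime import datetime, timedelta, timezone

# Seconds per unit for the Go duration units that may appear in kubelet messages.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|ms|h|m|s)")


def parse_go_duration(text: str) -> timedelta:
    """Parse a Go-style duration such as '1h38m54.037383153s' (hypothetical helper)."""
    tokens = _TOKEN.findall(text)
    # Reject input that is not made up entirely of <number><unit> tokens.
    if not tokens or "".join(value + unit for value, unit in tokens) != text:
        raise ValueError(f"not a recognised duration: {text!r}")
    return timedelta(seconds=sum(float(v) * _UNIT_SECONDS[u] for v, u in tokens))


# Values copied from the Ready condition in the table above.
message = ("PLEG is not healthy: pleg was last seen active "
           "1h38m54.037383153s ago; threshold is 3m0s")
# Assumption: the message reflects the state at LastHeartbeatTime.
heartbeat = datetime(2022, 2, 7, 7, 50, 56, tzinfo=timezone.utc)

match = re.search(r"last seen active (\S+) ago; threshold is (\S+)", message)
inactive = parse_go_duration(match.group(1))
threshold = parse_go_duration(match.group(2))
last_active = heartbeat - inactive

print("PLEG last active:", last_active.isoformat(timespec="seconds"))
print("over threshold:", inactive > threshold)
```

Under that assumption, the PLEG was last seen active at roughly 06:12Z on 2022-02-07, a few minutes before the Ready condition transitioned at 06:15:04Z.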

Connection via mgmt works, but there is nothing in the racadm logs. The console connects but stays blank without a prompt. The Virtual Console does not react to input.

Kubernetes events show a bunch of ImageGCFailed events ("failed to get image stats: rpc error: code = DeadlineExceeded desc = context deadline exceeded")[1], and Prometheus metrics show a sharp increase in system load and iowait before the metrics go dark.


Find dumps of my tmux sessions (containing multiple kubectl describe node outputs, catching different states) below for the record:

Nothing suspicious in the kernel log or syslog, apart from the fact that logging stops with some random garbage at around 2022-02-06 16:31:12Z.

I removed the downtime but have not yet uncordoned the node.

As there were no visible errors and the node seemed fine after reboot, I'll resolve this for now.

