kubernetes1014 has been failing multiple Icinga checks since around 2022-02-06 16:40Z.
I was unable to SSH in, but from the k8s point of view the node was still Ready, so I drained it at 06:10Z this morning.
The kubectl process is still running, with all the Pods on kubernetes1014 listed as Terminated (but not yet evicted).
Node conditions as reported by the API:
  Type                Status  LastHeartbeatTime                LastTransitionTime               Reason                      Message
  ----                ------  -----------------                ------------------               ------                      -------
  NetworkUnavailable  False   Thu, 09 Dec 2021 14:34:47 +0000  Thu, 09 Dec 2021 14:34:47 +0000  CalicoIsUp                  Calico is running on this node
  MemoryPressure      False   Mon, 07 Feb 2022 07:50:56 +0000  Tue, 23 Mar 2021 09:53:07 +0000  KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure        False   Mon, 07 Feb 2022 07:50:56 +0000  Wed, 29 Sep 2021 07:28:27 +0000  KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure         False   Mon, 07 Feb 2022 07:50:56 +0000  Tue, 23 Mar 2021 09:53:07 +0000  KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready               False   Mon, 07 Feb 2022 07:50:56 +0000  Mon, 07 Feb 2022 06:15:04 +0000  KubeletNotReady             PLEG is not healthy: pleg was last seen active 1h38m54.037383153s ago; threshold is 3m0s
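To put the PLEG message into wall-clock terms, here is a minimal Python sketch that parses the Go-style durations from the Ready condition and subtracts the reported age from the LastHeartbeatTime. It assumes the message was generated at about the time of that heartbeat, which the API output does not guarantee; the helper name parse_go_duration is my own and not part of any Kubernetes tooling.

```python
import re
from datetime import datetime, timedelta

# Seconds per unit for the duration suffixes handled here (a subset of Go's).
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def parse_go_duration(text):
    """Parse a duration such as '1h38m54.037383153s' into a timedelta."""
    pos = 0
    seconds = 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"unexpected characters in duration: {text!r}")
        seconds += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"cannot parse duration: {text!r}")
    return timedelta(seconds=seconds)


# Values copied from the Ready condition above.
heartbeat = datetime.strptime(
    "Mon, 07 Feb 2022 07:50:56 +0000", "%a, %d %b %Y %H:%M:%S %z"
)
age = parse_go_duration("1h38m54.037383153s")
threshold = parse_go_duration("3m0s")

# Assumption: the "last seen active" age is relative to the heartbeat time.
last_active = heartbeat - age

print("PLEG over threshold:", age > threshold)
print("PLEG last seen active at about:", last_active.isoformat())
```

Under that assumption the PLEG was last active at roughly 06:12Z, shortly after the drain started at 06:10Z, and the Ready transition at 06:15:04 follows about three minutes later, which fits the 3m0s threshold.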
Connection via mgmt works, and there is nothing in the racadm logs. The console connects but stays blank without a prompt. The Virtual Console does not react to input.
Kubernetes events show a number of ImageGCFailed events ("failed to get image stats: rpc error: code = DeadlineExceeded desc = context deadline exceeded")[1], and Prometheus metrics show a sharp increase in system load and iowait before the metrics go dark.
For the record, dumps of my tmux sessions (containing multiple kubectl describe node outputs that capture different states) are below: