
kubernetes1014 unresponsive
Closed, Resolved (Public)

Description

kubernetes1014 failed multiple icinga checks since around 2022-02-06 16:40Z

I was unable to SSH in, but from the k8s point of view the node was still Ready, so I drained it at 06:10Z this morning (2022-02-07).
The kubectl process is still running, with all the Pods on kubernetes1014 listed as Terminated (but not yet evicted).
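
To see which Pods a stalled drain is still waiting on, one option is to filter the Pod list for objects that are marked for deletion but still present on the node. This is a minimal sketch, not the procedure used in this task. It assumes the input is the parsed JSON from `kubectl get pods --all-namespaces -o json`, and the function name `pods_pending_deletion` is hypothetical:

```python
def pods_pending_deletion(pod_list, node_name):
    """Return 'namespace/name' for pods on node_name marked for deletion.

    pod_list is assumed to be the parsed JSON of a Kubernetes PodList.
    A pod that has metadata.deletionTimestamp set but is still returned by
    the API has been asked to terminate and has not been removed yet.
    """
    pending = []
    for pod in pod_list.get("items", []):
        metadata = pod.get("metadata", {})
        spec = pod.get("spec", {})
        # Skip pods scheduled on other nodes.
        if spec.get("nodeName") != node_name:
            continue
        # Keep only pods that carry a deletion timestamp.
        if metadata.get("deletionTimestamp"):
            pending.append(f"{metadata.get('namespace')}/{metadata.get('name')}")
    return sorted(pending)
```

On an unresponsive node the kubelet cannot confirm termination, so such Pods would be expected to stay in this list until the node recovers or the Pods are force-deleted.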

Node conditions as reported by the API:

Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
----                 ------  -----------------                 ------------------                ------                       -------
NetworkUnavailable   False   Thu, 09 Dec 2021 14:34:47 +0000   Thu, 09 Dec 2021 14:34:47 +0000   CalicoIsUp                   Calico is running on this node
MemoryPressure       False   Mon, 07 Feb 2022 07:50:56 +0000   Tue, 23 Mar 2021 09:53:07 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
DiskPressure         False   Mon, 07 Feb 2022 07:50:56 +0000   Wed, 29 Sep 2021 07:28:27 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
PIDPressure          False   Mon, 07 Feb 2022 07:50:56 +0000   Tue, 23 Mar 2021 09:53:07 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
Ready                False   Mon, 07 Feb 2022 07:50:56 +0000   Mon, 07 Feb 2022 06:15:04 +0000   KubeletNotReady              PLEG is not healthy: pleg was last seen active 1h38m54.037383153s ago; threshold is 3m0s
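
The PLEG message in the Ready condition encodes both the observed staleness and the threshold as Go-style durations. A minimal sketch for extracting and comparing them follows. It assumes the message format shown in the table above, and `parse_go_duration` and `pleg_staleness` are hypothetical helper names, not part of any Kubernetes tooling:

```python
import re

# Seconds per unit for the Go duration units handled by this sketch.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3}

# "ms" is listed first so it is tried before the single-letter "m".
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_PLEG_MESSAGE = re.compile(
    r"pleg was last seen active (\S+) ago; threshold is (\S+)"
)


def parse_go_duration(text):
    """Convert a Go-style duration such as '1h38m54.03s' to seconds."""
    parts = _DURATION_PART.findall(text)
    # Reject input that contains anything besides number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(float(n) * _UNIT_SECONDS[u] for n, u in parts)


def pleg_staleness(message):
    """Return (seconds_since_active, threshold_seconds), or None if no match."""
    match = _PLEG_MESSAGE.search(message)
    if match is None:
        return None
    return parse_go_duration(match.group(1)), parse_go_duration(match.group(2))
```

Applied to the message above, this yields roughly 5934 seconds against a threshold of 180 seconds, i.e. about 33 times the allowed staleness.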

Connection via mgmt works, and there is nothing in the racadm logs. The console connects but stays blank without a prompt. The Virtual Console does not react to input.

Kubernetes events show a bunch of ImageGCFailed events ("failed to get image stats: rpc error: code = DeadlineExceeded desc = context deadline exceeded")[1] and prometheus metrics show a sharp increase in system load and iowait before the metrics go dark.

[1] https://logstash.wikimedia.org/app/dashboards#/doc/logstash-*/logstash-syslog-2022.02.06?id=DhT3z34BoAyk87sqsdPt

For the record, dumps of my tmux sessions (containing multiple kubectl describe node outputs, capturing different states) are below:

Event Timeline

JMeybohm triaged this task as High priority. Feb 7 2022, 7:53 AM
JMeybohm created this task.

Nothing suspicious in the kernel log or syslog, apart from the fact that logging stops with some random garbage at around 2022-02-06 16:31:12Z.

I removed the downtime but have not yet uncordoned the node.

JMeybohm claimed this task.

As there were no visible errors and the node seemed fine after reboot, I'll resolve this for now.

JMeybohm renamed this task from kubernetes1014 unreachable to kubernetes1014 unresponsive. Feb 8 2022, 10:00 AM
JMeybohm updated the task description.