
etcd hosts hanging with kernel hang
Closed, ResolvedPublic

Description

[8755446.531733] INFO: task etcd:619 blocked for more than 120 seconds.

Causes the entire node to be unresponsive to ssh; requires a restart to bring it back.
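
(Background for anyone reading along: the kernel's hung-task watchdog prints that message for tasks stuck in uninterruptible (D) sleep for longer than hung_task_timeout_secs, 120 seconds by default. A minimal diagnostic sketch, assuming the host still answers at all, that lists D-state tasks from /proc:)

#!/usr/bin/env python3
# Rough sketch: list tasks in uninterruptible (D) sleep, the state the
# hung-task watchdog is complaining about. Only useful while the host
# is still reachable in some way.
import os

def blocked_tasks():
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/stat' % pid) as f:
                data = f.read()
        except OSError:
            continue  # task exited while we were scanning
        comm = data[data.index('(') + 1:data.rindex(')')]   # process name
        state = data[data.rindex(')') + 2]                   # single-letter state
        if state == 'D':
            yield int(pid), comm

if __name__ == '__main__':
    for pid, comm in blocked_tasks():
        print('%6d  %s' % (pid, comm))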

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.

I've rebooted the host since I couldn't reach it in any other way. This has been dead for weeks at this point...

It's been clearly stuck for a long time...

The last Puppet run was at Sun Jul  3 03:44:20 UTC 2016 (15160 minutes ago).

From kern.log:

Jul  3 04:00:58 tools-k8s-etcd-02 etcd[515]: start to snapshot (applied: 39044422, lastsnap: 39034420)
Jul  3 04:00:59 tools-k8s-etcd-02 etcd[515]: saved snapshot at index 39044422
Jul  3 04:00:59 tools-k8s-etcd-02 etcd[515]: compacted raft log at 39039422
Jul  3 04:01:19 tools-k8s-etcd-02 etcd[515]: purged file /var/lib/etcd/tools-k8s/member/snap/000000000000000d-00000000025301f0.snap successfully
Jul 13 16:24:25 tools-k8s-etcd-02 kernel: [    0.000000] Initializing cgroup subsys cpuset

That's a different time period than when the other nodes died out...

I'm just going to close this out now - we have monitoring for it in T140246. Will re-open if it happens again...

yuvipanda claimed this task.

Monitoring triggered, same node again!

tools-k8s-etcd-02 login: [278880.136697] INFO: task jbd2/vda3-8:109 blocked for more than 120 seconds.
[278880.147173]       Not tainted 3.19.0-2-amd64 #1
[278880.149057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[278880.151112] INFO: task etcd:4073 blocked for more than 120 seconds.
[278880.152341]       Not tainted 3.19.0-2-amd64 #1
[278880.153296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Mentioned in SAL [2016-07-18T12:41:03Z] <YuviPanda> reboot tools-k8s-etcd-02 for T140256

Mentioned in SAL [2016-07-18T12:48:19Z] <YuviPanda> reboot tools-flannel-etcd-03 for T140256

So tools-k8s-etcd-02 came back up fine after the reboot, but tools-flannel-etcd-03 did not stay healthy - it came back and kept hanging, with various other processes entering the same state. I'm going to move it to a different labvirt node now to see if that makes it better - there's a suspicion this is I/O starvation related (tools-k8s-etcd-02 shows a massive spike in iowait before it crashed, but nothing like that for flannel-etcd-03). The next step would be to upgrade the kernel as well, but I'm not doing both at the same time so as not to confound causes.
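
For the I/O-starvation theory, one rough way to see whether the guest's virtual disk is making any progress at all is to sample /proc/diskstats. A minimal sketch (assuming the root disk is vda, as in the jbd2/vda3 traces above; the interval and output format are arbitrary):

#!/usr/bin/env python3
# Sketch: watch whether a virtio disk is making I/O progress. A stalled
# disk tends to show I/Os in flight but no completions between samples.
import time

DEVICE = 'vda'      # assumption: the guest's root disk, as in the traces
INTERVAL = 5        # seconds between samples

def read_diskstats(device):
    with open('/proc/diskstats') as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return {
                    'completed': int(fields[3]) + int(fields[7]),  # reads + writes completed
                    'in_flight': int(fields[11]),                  # I/Os currently queued
                    'io_ms': int(fields[12]),                      # time spent doing I/O (ms)
                }
    raise ValueError('device %s not found in /proc/diskstats' % device)

if __name__ == '__main__':
    prev = read_diskstats(DEVICE)
    while True:
        time.sleep(INTERVAL)
        cur = read_diskstats(DEVICE)
        print('%s: %d I/Os completed, %d in flight, %d ms doing I/O'
              % (DEVICE, cur['completed'] - prev['completed'],
                 cur['in_flight'], cur['io_ms'] - prev['io_ms']))
        prev = cur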

Ok, the cold migrate of tools-flannel-etcd-03 seems to have failed - I still can't ssh into it, and

Current status is SHUTOFF; waiting for it to change to ACTIVE.

is where it's at. I'm going to wait for @Andrew to show up to see if the instance is completely hosed, and if so rebuild it.

I had a similar kernel lock on a single instance in May: T134981

Since at least mid-June, a Jenkins job has been stuck forever, stalled while attempting to write to the virtual disk (T138281). There is even a trace at P3278. It seems to be a kernel soft lock while trying to do disk I/O.

This task is probably a dupe of T124133: NFS overload is causing instances to freeze

Closing again until it recurs.

Hmm, this happened with similar symptoms on tools-worker-1004, which has no etcd.

chasemp subscribed.

tools-k8s-etcd-01 seems to be in the same state again today. No time to look into it deeply at the moment, so we are rebooting it to kick the can down the road.

tools-worker-1018 hit this just now.

Split off tracking worker nodes into T141017

I'm upgrading all etcd hosts to a newer kernel (4.4).

Change 300334 had a related patch set uploaded (by Yuvipanda):
tools: Add check for high iowait

https://gerrit.wikimedia.org/r/300334

Change 300334 merged by Yuvipanda:
tools: Add check for high iowait

https://gerrit.wikimedia.org/r/300334
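
The change itself isn't quoted here, so purely as an illustration of the idea: a hypothetical Nagios-style check that derives the iowait percentage from two /proc/stat samples and alerts above made-up thresholds (the real thresholds and wiring live in the Puppet change linked above):

#!/usr/bin/env python3
# Hypothetical sketch of a high-iowait check. The actual check merged in
# Gerrit change 300334 is not quoted in this task; the thresholds and
# sampling interval here are assumptions for illustration only.
import sys
import time

WARN = 30.0   # % iowait, hypothetical warning threshold
CRIT = 60.0   # % iowait, hypothetical critical threshold

def cpu_times():
    """Return (iowait, total) jiffies from the aggregate 'cpu' line."""
    with open('/proc/stat') as f:
        fields = f.readline().split()[1:]   # first line is the 'cpu' summary
    values = [int(v) for v in fields]
    return values[4], sum(values)           # 5th value on the cpu line is iowait

def iowait_percent(interval=5):
    io1, total1 = cpu_times()
    time.sleep(interval)
    io2, total2 = cpu_times()
    delta_total = (total2 - total1) or 1
    return 100.0 * (io2 - io1) / delta_total

if __name__ == '__main__':
    pct = iowait_percent()
    if pct >= CRIT:
        print('CRITICAL: iowait %.1f%%' % pct)
        sys.exit(2)
    if pct >= WARN:
        print('WARNING: iowait %.1f%%' % pct)
        sys.exit(1)
    print('OK: iowait %.1f%%' % pct)
    sys.exit(0)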

Happened to tools-k8s-etcd-02 again.

[140280.197047] INFO: task sshd:20819 blocked for more than 120 seconds.
[140280.198328]       Not tainted 4.4.0-1-amd64 #1
[140280.198794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140280.199829] INFO: task ntpq:20820 blocked for more than 120 seconds.
[140280.200635]       Not tainted 4.4.0-1-amd64 #1
[140280.201117] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140400.202881] INFO: task jbd2/vda3-8:119 blocked for more than 120 seconds.
[140400.212058]       Not tainted 4.4.0-1-amd64 #1
[140400.212833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140400.215025] INFO: task etcd:20794 blocked for more than 120 seconds.
[140400.216151]       Not tainted 4.4.0-1-amd64 #1
[140400.216926] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

It also stopped reporting stats to graphite.

I'm migrating it to labvirt1013 to see if that helps, since this node has died previously too.

I see

[Sat Jul 23 21:44:05 2016] HTB: quantum of class 10002 is big. Consider r2q change.

repeated a bunch of times in dmesg, though it doesn't seem relevant. And there are no spikes anywhere before it (some spikes in iostat after the restart).

I haven't seen that kernel soft lock occurring for a while. I guess it was a bug in the kernel that ran on labvirt hosts.

That makes sense, the labvirt hosts were upgraded from 3.13 to 4.4 about a month ago.

CI had the same issue with jbd2/vda blocking (T138281), and I am pretty sure it was due to a kernel soft lock: T138281#2395843.

I have closed T129891 (phlogiston-2 hangs every week) as well.