[8755446.531733] INFO: task etcd:619 blocked for more than 120 seconds.
Causes entire node to be unresponsive to ssh, requires restart to bring back.
Subject | Repo | Branch | Lines +/-
---|---|---|---
tools: Add check for high iowait | operations/puppet | production | +8 -0
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | hashar | T141673 Track labs instances hanging
Resolved | | hashar | T152599 Instance deadlocking in June 2016
Resolved | | yuvipanda | T140256 etcd hosts hanging with kernel hang
I've rebooted the host since I couldn't reach it in any other way. This has been dead for weeks at this point...
It's been clearly stuck for a long time...
The last Puppet run was at Sun Jul 3 03:44:20 UTC 2016 (15160 minutes ago).
From kern.log:
Jul 3 04:00:58 tools-k8s-etcd-02 etcd[515]: start to snapshot (applied: 39044422, lastsnap: 39034420)
Jul 3 04:00:59 tools-k8s-etcd-02 etcd[515]: saved snapshot at index 39044422
Jul 3 04:00:59 tools-k8s-etcd-02 etcd[515]: compacted raft log at 39039422
Jul 3 04:01:19 tools-k8s-etcd-02 etcd[515]: purged file /var/lib/etcd/tools-k8s/member/snap/000000000000000d-00000000025301f0.snap successfully
Jul 13 16:24:25 tools-k8s-etcd-02 kernel: [ 0.000000] Initializing cgroup subsys cpuset
Monitoring triggered, same node again!
tools-k8s-etcd-02 login: [278880.136697] INFO: task jbd2/vda3-8:109 blocked for more than 120 seconds.
[278880.147173] Not tainted 3.19.0-2-amd64 #1
[278880.149057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[278880.151112] INFO: task etcd:4073 blocked for more than 120 seconds.
[278880.152341] Not tainted 3.19.0-2-amd64 #1
[278880.153296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
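For reference, the watchdog above fires when a task sits in uninterruptible sleep ("D" state, usually blocked on disk I/O) for longer than `hung_task_timeout_secs`. A quick way to see which tasks are currently in that state (a generic diagnostic, not something from this task; assumes Linux procps `ps`):

```shell
# Show the header plus any tasks in uninterruptible sleep ("D..."),
# the state the hung-task watchdog complains about.
ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
```

On a healthy host this usually prints only the header; during one of these hangs the blocked tasks (jbd2, etcd, sshd) would show up with a `D` state.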
Mentioned in SAL [2016-07-18T12:41:03Z] <YuviPanda> reboot tools-k8s-etcd-02 for T140256
Mentioned in SAL [2016-07-18T12:48:19Z] <YuviPanda> reboot tools-flannel-etcd-03 for T140256
So tools-k8s-etcd-02 came back up after reboot, but tools-flannel-etcd-03 did not: it came back and kept hanging, with various other processes entering the same state. I'm going to move it to a different labvirt node now to see if that makes it better. There's a suspicion this is IO starvation related (tools-k8s-etcd-02 shows a massive spike in iowait before it crashed, but nothing like that for tools-flannel-etcd-03). The next step would be to upgrade the kernel as well, but I'm not doing both at the same time, to avoid confounding causes.
Ok, the cold migrate of tools-flannel-etcd-03 seems to have failed. I still can't ssh into it, and
Current status is SHUTOFF; waiting for it to change to ACTIVE.
is where it's at. I'm going to wait for @Andrew to show up to see whether the instance is completely hosed, and if so rebuild it.
I had a similar kernel lock on a single instance in May: T134981.
Since at least mid-June, Jenkins jobs have been getting stuck forever, stalled while attempting to write to the virtual disk (T138281). There is even a trace at P3278. It seems to be a kernel softlock while doing disk I/O.
This task is probably a dupe of T124133: NFS overload is causing instances to freeze
tools-k8s-etcd-01 seems the same again today. No time to look into it deeply atm, so we are rebooting it to kick the can down the road.
Change 300334 had a related patch set uploaded (by Yuvipanda):
tools: Add check for high iowait
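The actual contents of change 300334 aren't shown here, but the core of a high-iowait check can be sketched as shell. This is a hypothetical illustration, not the real Puppet change: `iowait_pct` and the 25% threshold are made up; on a live host the two counters would come from the aggregate `cpu` line of /proc/stat (iowait is the 5th numeric field, the total is the sum of all fields), sampled twice.

```shell
#!/bin/sh
# Hypothetical sketch of a high-iowait check (NOT the real change 300334).
# iowait_pct OLD_IOWAIT OLD_TOTAL NEW_IOWAIT NEW_TOTAL
# prints the integer percentage of jiffies spent in iowait between samples.
iowait_pct() {
    dw=$(( $3 - $1 ))   # iowait jiffies elapsed
    dt=$(( $4 - $2 ))   # total jiffies elapsed
    [ "$dt" -gt 0 ] && echo $(( 100 * dw / dt )) || echo 0
}

# Made-up counters: 40 of 100 jiffies were iowait since the last sample.
iowait_pct 50 1000 90 1100   # prints 40
```

A monitoring check would compare that value against a threshold (say 25%) and raise an alert, which is presumably the intent of the patch: catch the iowait spike that preceded the tools-k8s-etcd-02 hang before the host locks up.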
[140280.197047] INFO: task sshd:20819 blocked for more than 120 seconds.
[140280.198328] Not tainted 4.4.0-1-amd64 #1
[140280.198794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140280.199829] INFO: task ntpq:20820 blocked for more than 120 seconds.
[140280.200635] Not tainted 4.4.0-1-amd64 #1
[140280.201117] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140400.202881] INFO: task jbd2/vda3-8:119 blocked for more than 120 seconds.
[140400.212058] Not tainted 4.4.0-1-amd64 #1
[140400.212833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140400.215025] INFO: task etcd:20794 blocked for more than 120 seconds.
[140400.216151] Not tainted 4.4.0-1-amd64 #1
[140400.216926] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
It also stopped reporting stats to graphite.
I'm migrating it to labvirt1013 to see if that helps, since this node has died previously too.
I see
[Sat Jul 23 21:44:05 2016] HTB: quantum of class 10002 is big. Consider r2q change.
repeated a bunch of times in dmesg, but it doesn't seem relevant. And there are no spikes anywhere before the hang (some spikes in iostat after the restart).
I haven't seen that kernel soft lock occurring for a while. I guess it was a bug in the kernel that ran on the labvirt hosts.
That makes sense, the labvirt hosts were upgraded from 3.13 to 4.4 about a month ago.
CI had the same issue with jbd2/vda blocking (T138281) and I am pretty sure it was due to a kernel soft lock T138281#2395843 .
I have closed T129891: phlogiston-2 hangs every week as well.