
etcd hosts hanging with kernel hang
Closed, Resolved · Public

Description

[8755446.531733] INFO: task etcd:619 blocked for more than 120 seconds.

Causes the entire node to become unresponsive to SSH; a restart is required to bring it back.
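
The 120-second threshold comes from the kernel's hung-task watchdog (the message quoted above). A minimal sketch of how to inspect it and spot stuck tasks on an affected host, assuming the sysctl interface is still reachable:

# Current timeout before a D-state task is reported (seconds)
sysctl kernel.hung_task_timeout_secs

# Setting it to 0 only silences the warning; it does not fix the underlying stall
echo 0 > /proc/sys/kernel/hung_task_timeout_secs

# List tasks currently stuck in uninterruptible sleep (D state)
ps -eo pid,stat,comm | awk '$2 ~ /D/'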

Details

Related Gerrit Patches:
operations/puppet : production · tools: Add check for high iowait

Event Timeline

Restricted Application added a project: Cloud-Services. · View Herald Transcript · Jul 13 2016, 3:57 PM
Restricted Application added subscribers: Zppix, Aklapper. · View Herald Transcript
Joe added a subscriber: Joe. · Jul 13 2016, 3:59 PM

I've rebooted the host since I couldn't reach it in any other way. This has been dead for weeks at this point...

It's been clearly stuck for a long time...

The last Puppet run was at Sun Jul  3 03:44:20 UTC 2016 (15160 minutes ago).

From kern.log:

Jul  3 04:00:58 tools-k8s-etcd-02 etcd[515]: start to snapshot (applied: 39044422, lastsnap: 39034420)
Jul  3 04:00:59 tools-k8s-etcd-02 etcd[515]: saved snapshot at index 39044422
Jul  3 04:00:59 tools-k8s-etcd-02 etcd[515]: compacted raft log at 39039422
Jul  3 04:01:19 tools-k8s-etcd-02 etcd[515]: purged file /var/lib/etcd/tools-k8s/member/snap/000000000000000d-00000000025301f0.snap successfully
Jul 13 16:24:25 tools-k8s-etcd-02 kernel: [    0.000000] Initializing cgroup subsys cpuset

That's a different time period than when the other nodes died out...

I'm just going to close this out now - we have monitoring for it in T140246. Will re-open if it happens again...

yuvipanda closed this task as Resolved. · Jul 13 2016, 4:31 PM
yuvipanda claimed this task.
yuvipanda reopened this task as Open. · Jul 18 2016, 12:40 PM

Monitoring triggered, same node again!

tools-k8s-etcd-02 login: [278880.136697] INFO: task jbd2/vda3-8:109 blocked for more than 120 seconds.
[278880.147173]       Not tainted 3.19.0-2-amd64 #1
[278880.149057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[278880.151112] INFO: task etcd:4073 blocked for more than 120 seconds.
[278880.152341]       Not tainted 3.19.0-2-amd64 #1
[278880.153296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
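
If the console is still responsive when this fires, the blocked tasks' kernel stacks can show whether it's the ext4 journal thread (jbd2) waiting on the virtual disk. A rough sketch, assuming SysRq is enabled on these instances:

# Dump all blocked (D-state) tasks and their stacks to the kernel log
echo w > /proc/sysrq-trigger

# Or inspect a single task's kernel stack directly (PID taken from the hung-task message above)
cat /proc/4073/stack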

Mentioned in SAL [2016-07-18T12:41:03Z] <YuviPanda> reboot tools-k8s-etcd-02 for T140256

Mentioned in SAL [2016-07-18T12:48:19Z] <YuviPanda> reboot tools-flannel-etcd-03 for T140256

So tools-k8s-etcd-02 came back up after reboot, but tools-flannel-etcd-03 did not recover cleanly - it came back and kept hanging, with various other processes entering the same state. I'm going to move it to a different labvirt node now to see if that makes it better - there's a suspicion this is IO-starvation related (tools-k8s-etcd-02 shows a massive spike in iowait before it crashed, but nothing like that for tools-flannel-etcd-03). The next step would be to upgrade the kernel as well, but I'm not doing both at the same time, to avoid confounding causes.
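
For reference, a sketch of the kind of checks behind the IO-starvation theory, assuming sysstat is installed on the instance:

# Per-device utilisation and wait times, sampled every 5 seconds
iostat -x 5 3

# The wa column shows CPU time spent waiting on IO
vmstat 5 3

# Historical CPU/iowait figures, if sar data collection is enabled
sar -u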

Ok, the cold migration of tools-flannel-etcd-03 seems to have failed - I still can't ssh into it, and

Current status is SHUTOFF; waiting for it to change to ACTIVE.

is where it's at. I'm going to wait for @Andrew to show up to see if the instance is completely hosed, and if so rebuild it.
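
For the record, the instance state can also be followed from the OpenStack side; a sketch with the standard client (project credentials assumed, and these exact commands are an assumption, not necessarily what was run here):

# Show the current state (should move from SHUTOFF to ACTIVE once the migration completes)
openstack server show tools-flannel-etcd-03 -c status

# If it stays SHUTOFF, try starting it explicitly
openstack server start tools-flannel-etcd-03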

hashar added a subscriber: hashar. · Jul 18 2016, 2:44 PM

I had a similar kernel lock on a single instance in May: T134981

Since at least mid-June, Jenkins jobs have been getting stuck forever, stalled while attempting to write to the virtual disk (T138281). There is even a trace at P3278. It seems to be a kernel soft lock while doing disk I/O.

This task is probably a dupe of T124133: NFS overload is causing instances to freeze

yuvipanda closed this task as Resolved. · Jul 19 2016, 5:35 PM

Closing again until it recurs.

Hmm, happened with similar symptoms on tools-worker-1004, which has no etcd.

chasemp reopened this task as Open. · Jul 20 2016, 9:28 PM
chasemp added a subscriber: chasemp.

tools-k8s-etcd-01 seems to be in the same state again today. No time to look into it deeply atm, so we are rebooting it to kick the can down the road.

yuvipanda triaged this task as High priority. · Jul 21 2016, 5:09 PM

tools-worker-1018 hit this just now.

Split off tracking worker nodes into T141017

I'm upgrading all etcd hosts to a newer kernel (4.4).
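
Rough sketch of that upgrade on a jessie instance; the repository the 4.4 packages come from is an assumption and isn't stated in this task:

# Install the newer kernel metapackage (assumption: a backports/newer suite is configured)
apt-get install -t jessie-backports linux-image-amd64

# After rebooting, confirm the running kernel
uname -r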

Change 300334 had a related patch set uploaded (by Yuvipanda):
tools: Add check for high iowait

https://gerrit.wikimedia.org/r/300334

Change 300334 merged by Yuvipanda:
tools: Add check for high iowait

https://gerrit.wikimedia.org/r/300334
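
The patch content isn't quoted here, but an iowait check along these lines can be sketched from the cpu line of /proc/stat (the script name and threshold are hypothetical, not the actual Puppet change):

#!/bin/sh
# check_iowait.sh (hypothetical): warn when iowait exceeds a threshold
# over a short sampling window.
THRESHOLD=30    # percent, arbitrary for illustration

# Print "total iowait" jiffies from the aggregate cpu line
read_cpu() { awk '/^cpu /{print $2+$3+$4+$5+$6+$7+$8, $6}' /proc/stat; }

set -- $(read_cpu); total1=$1; iowait1=$2
sleep 5
set -- $(read_cpu); total2=$1; iowait2=$2

pct=$(( (iowait2 - iowait1) * 100 / (total2 - total1) ))
if [ "$pct" -ge "$THRESHOLD" ]; then
    echo "WARNING: iowait at ${pct}% over the last 5s"
    exit 1
fi
echo "OK: iowait at ${pct}%"
exit 0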

Happened to tools-k8s-etcd-02 again.

[140280.197047] INFO: task sshd:20819 blocked for more than 120 seconds.
[140280.198328]       Not tainted 4.4.0-1-amd64 #1
[140280.198794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140280.199829] INFO: task ntpq:20820 blocked for more than 120 seconds.
[140280.200635]       Not tainted 4.4.0-1-amd64 #1
[140280.201117] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140400.202881] INFO: task jbd2/vda3-8:119 blocked for more than 120 seconds.
[140400.212058]       Not tainted 4.4.0-1-amd64 #1
[140400.212833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140400.215025] INFO: task etcd:20794 blocked for more than 120 seconds.
[140400.216151]       Not tainted 4.4.0-1-amd64 #1
[140400.216926] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

It also stopped reporting stats to graphite.

It's in labvirt1001 now.

I'm migrating it to labvirt1013 to see if that helps, since this node has died previously too.

I see

[Sat Jul 23 21:44:05 2016] HTB: quantum of class 10002 is big. Consider r2q change.

repeated a bunch of times in dmesg; it doesn't seem relevant though. And there are no spikes anywhere before it (some spikes in iostat after the restart).
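
For completeness, the HTB warning is about the traffic-shaping qdisc on the instance's interface: in HTB, a class's quantum is derived from rate/r2q, and an oversized quantum just produces that log line. A sketch of how to look at it (the interface name is an assumption):

# Show the classes with their configured rate and quantum (interface assumed to be eth0)
tc -s class show dev eth0

# Show the qdisc itself, including the r2q value if HTB is attached
tc qdisc show dev eth0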

scfc moved this task from Triage to Backlog on the Toolforge board.

I haven't seen that kernel soft lock occurring for a while. I guess it was a bug in the kernel that ran on labvirt hosts.

hashar removed a subscriber: hashar. · Nov 27 2016, 8:25 PM
hashar added a subscriber: hashar. · Nov 28 2016, 11:42 AM

> I haven't seen that kernel soft lock occurring for a while. I guess it was a bug in the kernel that ran on labvirt hosts.

That makes sense; the labvirt hosts were upgraded from 3.13 to 4.4 about a month ago.

hashar closed this task as Resolved. · Dec 7 2016, 12:44 PM

CI had the same issue with jbd2/vda blocking (T138281), and I am pretty sure it was due to a kernel soft lock (T138281#2395843).

I have closed T129891: phlogiston-2 hangs every week as well.