
etcd hosts hanging with kernel hang
Closed, ResolvedPublic

Description

[8755446.531733] INFO: task etcd:619 blocked for more than 120 seconds.

Causes the entire node to be unresponsive to ssh; requires a restart to bring it back.
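
(Background for anyone reading along: the kernel's hung-task watchdog prints that message for tasks stuck in uninterruptible (D) sleep for longer than hung_task_timeout_secs, 120 seconds by default. A minimal diagnostic sketch, assuming the host still answers at all, that lists D-state tasks from /proc:)

#!/usr/bin/env python3
# Rough sketch: list tasks in uninterruptible (D) sleep, the state the
# hung-task watchdog is complaining about. Only useful while the host
# is still reachable in some way.
import os

def blocked_tasks():
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open('/proc/%s/stat' % pid) as f:
                data = f.read()
        except OSError:
            continue  # task exited while we were scanning
        comm = data[data.index('(') + 1:data.rindex(')')]   # process name
        state = data[data.rindex(')') + 2]                   # single-letter state
        if state == 'D':
            yield int(pid), comm

if __name__ == '__main__':
    for pid, comm in blocked_tasks():
        print('%6d  %s' % (pid, comm))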

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.

I've rebooted the host since I couldn't reach it in any other way. This has been dead for weeks at this point...

It's been clearly stuck for a long time...

The last Puppet run was at Sun Jul  3 03:44:20 UTC 2016 (15160 minutes ago).

From kern.log:

Jul  3 04:00:58 tools-k8s-etcd-02 etcd[515]: start to snapshot (applied: 39044422, lastsnap: 39034420)
Jul  3 04:00:59 tools-k8s-etcd-02 etcd[515]: saved snapshot at index 39044422
Jul  3 04:00:59 tools-k8s-etcd-02 etcd[515]: compacted raft log at 39039422
Jul  3 04:01:19 tools-k8s-etcd-02 etcd[515]: purged file /var/lib/etcd/tools-k8s/member/snap/000000000000000d-00000000025301f0.snap successfully
Jul 13 16:24:25 tools-k8s-etcd-02 kernel: [    0.000000] Initializing cgroup subsys cpuset

That's a different time period than when the other nodes died out...

I'm just going to close this out now - we have monitoring for it in T140246. Will re-open if it happens again...

yuvipanda claimed this task.

Monitoring triggered, same node again!

tools-k8s-etcd-02 login: [278880.136697] INFO: task jbd2/vda3-8:109 blocked for more than 120 seconds.
[278880.147173]       Not tainted 3.19.0-2-amd64 #1
[278880.149057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[278880.151112] INFO: task etcd:4073 blocked for more than 120 seconds.
[278880.152341]       Not tainted 3.19.0-2-amd64 #1
[278880.153296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Mentioned in SAL [2016-07-18T12:41:03Z] <YuviPanda> reboot tools-k8s-etcd-02 for T140256

Mentioned in SAL [2016-07-18T12:48:19Z] <YuviPanda> reboot tools-flannel-etcd-03 for T140256

So tools-k8s-etcd-02 came back up fine after the reboot, but tools-flannel-etcd-03 did not stay healthy - it came back and kept hanging, with various other processes entering the same state. I'm going to move it to a different labvirt node now to see if that makes it better - there's a suspicion this is I/O starvation related (tools-k8s-etcd-02 shows a massive spike in iowait before it crashed, but nothing like that for flannel-etcd-03). The next step would be to upgrade the kernel as well, but I'm not doing both at the same time so as not to confound causes.
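
For the I/O-starvation theory, one rough way to see whether the guest's virtual disk is making any progress at all is to sample /proc/diskstats. A minimal sketch (assuming the root disk is vda, as in the jbd2/vda3 traces above; the interval and output format are arbitrary):

#!/usr/bin/env python3
# Sketch: watch whether a virtio disk is making I/O progress. A stalled
# disk tends to show I/Os in flight but no completions between samples.
import time

DEVICE = 'vda'      # assumption: the guest's root disk, as in the traces
INTERVAL = 5        # seconds between samples

def read_diskstats(device):
    with open('/proc/diskstats') as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return {
                    'completed': int(fields[3]) + int(fields[7]),  # reads + writes completed
                    'in_flight': int(fields[11]),                  # I/Os currently queued
                    'io_ms': int(fields[12]),                      # time spent doing I/O (ms)
                }
    raise ValueError('device %s not found in /proc/diskstats' % device)

if __name__ == '__main__':
    prev = read_diskstats(DEVICE)
    while True:
        time.sleep(INTERVAL)
        cur = read_diskstats(DEVICE)
        print('%s: %d I/Os completed, %d in flight, %d ms doing I/O'
              % (DEVICE, cur['completed'] - prev['completed'],
                 cur['in_flight'], cur['io_ms'] - prev['io_ms']))
        prev = cur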

Ok, the cold migrate of tools-flannel-etcd-03 seems to have failed - I still can't ssh into it, and

Current status is SHUTOFF; waiting for it to change to ACTIVE.

is where it's at. I'm going to wait for @Andrew to show up to see if the instance is completely hosed, and if so rebuild it.

I had a similar kernel lock on a single instance in May: T134981

Since at least mid-June, a Jenkins job has been stuck forever, stalled while attempting to write to the virtual disk (T138281). There is even a trace at P3278. It seems to be a kernel soft lock while trying to do disk I/O.

This task is probably a dupe of T124133: NFS overload is causing instances to freeze

Closing again until it recurs.

Hmm, this happened with similar symptoms on tools-worker-1004, which has no etcd.

chasemp subscribed.

tools-k8s-etcd-01 seems to be in the same state again today. No time to look into it deeply at the moment, so we are rebooting it to kick the can down the road.

tools-worker-1018 hit this just now.

Split off tracking worker nodes into T141017

I'm upgrading all etcd hosts to a newer kernel (4.4).

Change 300334 had a related patch set uploaded (by Yuvipanda):
tools: Add check for high iowait

https://gerrit.wikimedia.org/r/300334

Change 300334 merged by Yuvipanda:
tools: Add check for high iowait

https://gerrit.wikimedia.org/r/300334
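
The change itself isn't quoted here, so purely as an illustration of the idea: a hypothetical Nagios-style check that derives the iowait percentage from two /proc/stat samples and alerts above made-up thresholds (the real thresholds and wiring live in the Puppet change linked above):

#!/usr/bin/env python3
# Hypothetical sketch of a high-iowait check. The actual check merged in
# Gerrit change 300334 is not quoted in this task; the thresholds and
# sampling interval here are assumptions for illustration only.
import sys
import time

WARN = 30.0   # % iowait, hypothetical warning threshold
CRIT = 60.0   # % iowait, hypothetical critical threshold

def cpu_times():
    """Return (iowait, total) jiffies from the aggregate 'cpu' line."""
    with open('/proc/stat') as f:
        fields = f.readline().split()[1:]   # first line is the 'cpu' summary
    values = [int(v) for v in fields]
    return values[4], sum(values)           # 5th value on the cpu line is iowait

def iowait_percent(interval=5):
    io1, total1 = cpu_times()
    time.sleep(interval)
    io2, total2 = cpu_times()
    delta_total = (total2 - total1) or 1
    return 100.0 * (io2 - io1) / delta_total

if __name__ == '__main__':
    pct = iowait_percent()
    if pct >= CRIT:
        print('CRITICAL: iowait %.1f%%' % pct)
        sys.exit(2)
    if pct >= WARN:
        print('WARNING: iowait %.1f%%' % pct)
        sys.exit(1)
    print('OK: iowait %.1f%%' % pct)
    sys.exit(0)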

Happened to tools-k8s-etcd-02 again.

[140280.197047] INFO: task sshd:20819 blocked for more than 120 seconds.
[140280.198328]       Not tainted 4.4.0-1-amd64 #1
[140280.198794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140280.199829] INFO: task ntpq:20820 blocked for more than 120 seconds.
[140280.200635]       Not tainted 4.4.0-1-amd64 #1
[140280.201117] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140400.202881] INFO: task jbd2/vda3-8:119 blocked for more than 120 seconds.
[140400.212058]       Not tainted 4.4.0-1-amd64 #1
[140400.212833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[140400.215025] INFO: task etcd:20794 blocked for more than 120 seconds.
[140400.216151]       Not tainted 4.4.0-1-amd64 #1
[140400.216926] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

It also stopped reporting stats to graphite.

I'm migrating it to labvirt1013 to see if that helps, since this node has died previously too.

I see

[Sat Jul 23 21:44:05 2016] HTB: quantum of class 10002 is big. Consider r2q change.

repeated a bunch of times in dmesg, though it doesn't seem relevant. And there are no spikes anywhere before it (some spikes in iostat after the restart).

I haven't seen that kernel soft lock occurring for a while. I guess it was a bug in the kernel that ran on labvirt hosts.

That makes sense, the labvirt hosts were upgraded from 3.13 to 4.4 about a month ago.

CI had the same issue with jbd2/vda blocking (T138281), and I am pretty sure it was due to a kernel soft lock: T138281#2395843.

I have closed T129891 (phlogiston-2 hangs every week) as well.