
Track labs instances hanging
Closed, Resolved · Public

Description

This is purely to track labs instances hanging with messages about hung tasks in the kernel / console log. There are subtasks: T140256 for tools k8s and T141017 for k8s worker nodes. T124133 also has more information.

Instance | OS | virt host | time | ssh | iowait | io spike | graphite data | nfs | recovery
deployment-logstash2 | Jessie | labvirt1010 | 27 Jul 09:00Z | no | no | no | no | no | ?
tools-merlbot-proxy | Jessie | labvirt1010 | 31 Jul 01:00Z | no | yes | regular spikes to 300 (typically 50) | no | yes? | rebooted
novaproxy-01 | Jessie | labvirt1001 | 2 Aug ~21:15Z | no | no | no | no | yes | reboot
librarybase-reston-01 | Jessie | labvirt1006 | 07 Jul ~0200Z | no | no | no | no | no | reboot
tools-grid-master | trusty | labvirt1010 | 2 Aug ~14:00Z | no | yes | no | no | yes | slow and drawn out reboot
tools-webgrid-lighttpd-1202 | precise | labvirt1019 | 9 Aug ~05:47Z | no | no | no | no | yes | reboot
extdist-01 | jessie | labvirt1009 | Aug 17 10:19:37 (a) | no response | yes | yes (regular, still happening) | yes | no | reboot
extdist-02 | jessie | labvirt1010 | Jul 23 03:19:11 (a) | hangs after handshake | no | yes (single) | no | no | reboot
tools-mail | precise | labvirt1009 | Aug 19 00:35:01 (a) | hangs after handshake | no (max 1) | yes (jumps to steady 15k) | no | yes | slow reboot
tools-exec-1217 | precise | labvirt1001 | 24 Aug ~04 AM UTC | nope | nope | no | no | yes | ???

(a) last readable entry in /var/log/syslog before reboot

tools-puppetmaster-01 | jessie | labvirt1001 | Sep 20 13:54:01 | yes | ?? | ?? | ?? | no | reboot
Legend

Instance: name of the instance
virt host: the virt host the instance runs on (you can find this on wikitech.wikimedia.org/wiki/Nova_Resource:$fqdn)
time: time at which the host hung
ssh: whether you could still ssh into it
iowait: whether the host had high iowait (you can find this on https://graphite-labs.wikimedia.org/), e.g. tools.tools-merlbot-proxy.cpu.total.iowait
--> https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1469982906.471&target=tools.tools-merlbot-proxy.cpu.total.iowait

io spike: whether the host had io spikes (you can find this on https://graphite-labs.wikimedia.org/), e.g. tools.tools-merlbot-proxy.iostat.vda.io (vdb, ..., might also exist?)
--> https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1469983050.092&target=tools.tools-merlbot-proxy.iostat.vda.io

graphite data: whether the host kept sending graphite data even after it hung
nfs: whether the host had NFS mounted
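
A quick way to answer the iowait / io spike / graphite data questions after the fact is to hit the render API directly. A rough sketch, assuming the tools.<instance>.… metric naming shown above and the standard Graphite format=json parameter:

instance=tools-merlbot-proxy
curl -s "https://graphite-labs.wikimedia.org/render/?target=tools.${instance}.cpu.total.iowait&from=-24h&format=json" \
  | python3 -m json.tool | tail -n 20   # trailing nulls usually mean the host stopped reporting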

Event Timeline

Before deployment-logstash2 was restarted for the 2016-07-27 hang it had this console output via wikitech:

[10847049.612128] INFO: task jbd2/vda3-8:159 blocked for more than 120 seconds.
[10847049.618465] Not tainted 3.16.0-4-amd64 #1
[10847049.619349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.621800] INFO: task jbd2/dm-0-8:308 blocked for more than 120 seconds.
[10847049.623079] Not tainted 3.16.0-4-amd64 #1
[10847049.623978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.625684] INFO: task nscd:15726 blocked for more than 120 seconds.
[10847049.626891] Not tainted 3.16.0-4-amd64 #1
[10847049.628953] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.629787] INFO: task java:10739 blocked for more than 120 seconds.
[10847049.630349] Not tainted 3.16.0-4-amd64 #1
[10847049.630765] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.631580] INFO: task java:10747 blocked for more than 120 seconds.
[10847049.632347] Not tainted 3.16.0-4-amd64 #1
[10847049.632751] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.633550] INFO: task java:10833 blocked for more than 120 seconds.
[10847049.634100] Not tainted 3.16.0-4-amd64 #1
[10847049.634505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.635257] INFO: task java:10848 blocked for more than 120 seconds.
[10847049.635833] Not tainted 3.16.0-4-amd64 #1
[10847049.636426] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.637165] INFO: task java:11608 blocked for more than 120 seconds.
[10847049.637719] Not tainted 3.16.0-4-amd64 #1
[10847049.638121] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.638944] INFO: task java:15282 blocked for more than 120 seconds.
[10847049.639505] Not tainted 3.16.0-4-amd64 #1
[10847049.640721] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.641516] INFO: task <tcp:15292 blocked for more than 120 seconds.
[10847049.642078] Not tainted 3.16.0-4-amd64 #1
[10847049.642489] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

valhallasw subscribed.

Can you clarify what the columns mean? tools-merlbot-proxy is currently hanging with

https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&instanceid=7b12cce8-b045-4fed-96a8-f5fe1a11e0f9&project=tools&region=eqiad

[4518960.281832] INFO: task jbd2/vda3-8:107 blocked for more than 120 seconds.
[4518960.282912]       Not tainted 3.19.0-2-amd64 #1
[4518960.283470] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

This issue seems to be prevalent, and maybe limited to Jessie hosts, though not to Jessie hosts built from the same exact image. It has seemed to affect k8s infrastructure, but it's entirely possible that a) k8s is the majority of Jessie hosts in Tools and b) we notice Tools hosts far more proactively than Jessie hosts anywhere else. We really need T130806, since most periods of unavailability also come with a lack of atop, syslog, and access, which makes all of this more difficult still. We had set up a basic log aggregation node for k8s (T141270) because the problem seemed so focused there, but with novaproxy-01 going down today and displaying such similar symptoms it may be wider (and we missed it by not having all of Tools sending in logs). It's also interesting that CI had (what I take to be) really similar Jessie-related issues that mysteriously disappeared(?) in T138281.

Console from novaproxy-01 at the time of unavailability

[11409409.150117] INFO: task cron:14372 blocked for more than 120 seconds.
[11409409.151003]       Not tainted 3.16.0-4-amd64 #1
[11409409.151647] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11409409.152879] INFO: task cron:14373 blocked for more than 120 seconds.
[11409409.153825]       Not tainted 3.16.0-4-amd64 #1
[11409409.155729] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11409409.156902] INFO: task cron:14374 blocked for more than 120 seconds.
[11409409.157905]       Not tainted 3.16.0-4-amd64 #1
[11409409.159943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

tools-grid-master: the first trusty instance to die! It also saw super spikes in load and iowait before it hung, *but* the processes on it are still functioning, with the grid still operating normally.

deployment-cache-upload04, still successfully running upload.beta.wmflabs.org on HTTP/HTTPS and responding to pings, but doesn't respond to SSH or Salt.

So I dug in and grepped all the logs for all active instances from openstack, and the following 33 instances have some form or other of:

[12156856.840147] INFO: task log-log.tcl:7702 blocked for more than 120 seconds.

for various tasks in them.

uploadwizard-osm-equiad
etcd03
tools-bastion-03
phabricator
tools-exec-1207
extdist-02
tools-exec-1212
tools-exec-1219
tools-exec-1217
otrs-memoryleak
openvas
filippo-test-precise2
tools-exec-1216
test-prometheus5
jessie-bastion-01
confd-precise
filippo-test-jessie
tools-exec-1204
deployment-cache-upload04
wpx-prod-01
toolsbeta-proxy-02
udpmx-01
mdc2002
db2003
ifttt-staging
tools-exec-1211
tools-k8s-etcd-03
ores-web-04
mc2001
diff-forge1
druid103
tools-exec-1220
filippo-test-jessie2

I'm going to try to figure out how many of them have working ssh now...

I can only ssh with my root key into the following instances:

  1. druid103.eqiad.wmflabs
  2. tools-exec-1220.eqiad.wmflabs
  3. tools-bastion-03.eqiad.wmflabs
  4. tools-exec-1207.eqiad.wmflabs
  5. tools-exec-1212.eqiad.wmflabs
  6. tools-exec-1216.eqiad.wmflabs
  7. tools-exec-1217.eqiad.wmflabs
  8. tools-exec-1219.eqiad.wmflabs
  9. tools-exec-1219.eqiad.wmflabs
  10. tools-exec-1220.eqiad.wmflabs
  11. tools-bastion-03.eqiad.wmflabs
  12. tools-exec-1207.eqiad.wmflabs
  13. tools-exec-1212.eqiad.wmflabs
  14. tools-exec-1216.eqiad.wmflabs
  15. tools-exec-1217.eqiad.wmflabs
  16. tools-exec-1219.eqiad.wmflabs

so almost exactly half?!

Looking at distribution of labvirts...

root@labcontrol1001:/home/yuvipanda/20160811# cat hosts | sort | uniq -c
      3 labvirt1001.eqiad.wmnet
      1 labvirt1002.eqiad.wmnet
      2 labvirt1004.eqiad.wmnet
      2 labvirt1006.eqiad.wmnet
      3 labvirt1007.eqiad.wmnet
      8 labvirt1008.eqiad.wmnet
      1 labvirt1009.eqiad.wmnet
     10 labvirt1010.eqiad.wmnet
      3 labvirt1011.eqiad.wmnet

And for images / OS...

16 debian-8.*-jessie
11 ubuntu-12.04-precise
 6 ubuntu-14.04-trusty
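
(Roughly how counts like these can be tallied, assuming a per-instance file whose last column is the image name; the file name here is hypothetical:)

awk '{print $NF}' hung-instances.txt | sed -E 's/^debian-8\.[0-9]+/debian-8.*/' | sort | uniq -c | sort -rn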

Mentioned in SAL [2016-08-12T19:14:49Z] <Krenair> rebooting deployment-cache-upload04, it's stuck in https://phabricator.wikimedia.org/T141673 and varnish is no longer working there afaict, so trying to bring upload.beta.wmflabs.org back up

Actual ssh failing just now in:

(clush)yuvipanda@fearless:~/code/clush$ clush -l root --hostfile=files -b 'echo hi'
extdist-02.eqiad.wmflabs: ssh_exchange_identification: Connection closed by remote host
confd-precise.eqiad.wmflabs: Connection timed out during banner exchange
db2003.eqiad.wmflabs: Connection timed out during banner exchange
etcd03.eqiad.wmflabs: Connection timed out during banner exchange
filippo-test-jessie2.eqiad.wmflabs: Connection timed out during banner exchange
filippo-test-precise2.eqiad.wmflabs: Connection timed out during banner exchange
ifttt-staging.eqiad.wmflabs: Connection timed out during banner exchange
jessie-bastion-01.eqiad.wmflabs: Connection timed out during banner exchange
mc2001.eqiad.wmflabs: Connection timed out during banner exchange
mdc2002.eqiad.wmflabs: Connection timed out during banner exchange
toolsbeta-proxy-02.eqiad.wmflabs: Connection timed out during banner exchange
openvas.eqiad.wmflabs: Connection timed out during banner exchange
ores-web-04.eqiad.wmflabs: Connection timed out during banner exchange
tools-exec-1204.eqiad.wmflabs: packet_write_wait: Connection to UNKNOWN port 0: Broken pipe
uploadwizard-osm-equiad.eqiad.wmflabs: Connection timed out during banner exchange
otrs-memoryleak.eqiad.wmflabs: Connection timed out during banner exchange
test-prometheus5.eqiad.wmflabs: Connection timed out during banner exchange
tools-exec-1211.eqiad.wmflabs: packet_write_wait: Connection to UNKNOWN port 0: Broken pipe
filippo-test-jessie.eqiad.wmflabs: Connection timed out during banner exchange
clush: 30/33

breakdown:

extdist-02 16a0fd3f-c751-4686-8c77-e84b58d42eed debian-8.0-jessie
confd-precise 7966c5ea-ac06-47bd-9f0b-b6c142de1650 ubuntu-12.04-precise
db2003 a1c342e1-8725-46a1-a116-74bf39549bcb ubuntu-14.04-trusty
etcd03 06eeac90-a10e-4c7c-8c49-bbb566fc3936 debian-8.0-jessie
filippo-test-jessie2 f8ddd70f-1b05-43c8-863f-55d6c2d1b59e debian-8.3-jessie
filippo-test-precise2 644872c4-db70-411f-8bde-f48346785e11 ubuntu-12.04-precise
ifttt-staging-01 4d70a62f-87f9-47a8-ad09-db7aa58a18b3 debian-8.3-jessie
ifttt-staging b57b5ff1-3619-4666-af91-f9b156566810 debian-8.3-jessie
jessie-bastion-01 6bbdb413-4b3e-44f9-8472-153402b09126 debian-8.3-jessie
mc2001 df6c5ac6-72f3-4c57-96ee-d48d9ff27bab ubuntu-14.04-trusty
mdc2002 9c5daf34-8e63-485c-a0fc-52b3a2443d86 ubuntu-14.04-trusty
toolsbeta-proxy-02 950f686c-eec5-4ecc-b2bc-f192fee4e95c debian-8.1-jessie
openvas 49fb74ed-734a-4332-9fb3-6d2c4903e98e ubuntu-14.04-trusty
ores-web-04 dbd8f69a-caef-4587-859c-40d8aff14e98 debian-8.5-jessie
tools-exec-1204 82323ee4-762e-4b1f-87a7-d7aa7afa22f6 ubuntu-12.04-precise
uploadwizard-osm-equiad 045fda94-73f2-4e58-a315-e26fbe268970 ubuntu-12.04-precise
otrs-memoryleak 46259ec3-a3ae-4129-97f6-21bd564f4a31 debian-8.3-jessie
test-prometheus5 67e83f10-208d-429d-ba24-60c6abf589e6 debian-8.3-jessie
tools-exec-1211 bfc9c147-cb7b-4c16-8d92-b37524605a12 ubuntu-12.04-precise
filippo-test-jessie2 f8ddd70f-1b05-43c8-863f-55d6c2d1b59e debian-8.3-jessie
filippo-test-jessie 7c38738f-babe-43a8-b9bc-18761f2a6fd2 debian-8.0-jessie

Of these I have looked at tools-exec-1211 and tools-exec-1204 so far, which according to nova diagnostics are still processing some io, making them probably not candidates for the same issue.
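
That "still processing some io" check can be done by sampling the hypervisor-side disk counters twice and comparing; a sketch (counter names like vda_read_req depend on the nova/libvirt versions in use):

nova diagnostics tools-exec-1211 | grep -E 'vda_(read|write)_req'
sleep 120
nova diagnostics tools-exec-1211 | grep -E 'vda_(read|write)_req'
# if the counters advance between samples, the guest is still doing disk I/O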

non-Jessie candidates

confd-precise 7966c5ea-ac06-47bd-9f0b-b6c142de1650 ubuntu-12.04-precise
db2003 a1c342e1-8725-46a1-a116-74bf39549bcb ubuntu-14.04-trusty
filippo-test-precise2 644872c4-db70-411f-8bde-f48346785e11 ubuntu-12.04-precise
mc2001 df6c5ac6-72f3-4c57-96ee-d48d9ff27bab ubuntu-14.04-trusty
mdc2002 9c5daf34-8e63-485c-a0fc-52b3a2443d86 ubuntu-14.04-trusty
openvas 49fb74ed-734a-4332-9fb3-6d2c4903e98e ubuntu-14.04-trusty
tools-exec-1204 82323ee4-762e-4b1f-87a7-d7aa7afa22f6 ubuntu-12.04-precise
uploadwizard-osm-equiad 045fda94-73f2-4e58-a315-e26fbe268970 ubuntu-12.04-precise
tools-exec-1211 bfc9c147-cb7b-4c16-8d92-b37524605a12 ubuntu-12.04-precise

inconclusive so far are:

openvas 49fb74ed-734a-4332-9fb3-6d2c4903e98e ubuntu-14.04-trusty
confd-precise 7966c5ea-ac06-47bd-9f0b-b6c142de1650 ubuntu-12.04-precise

Change 304942 had a related patch set uploaded (by Yuvipanda):
labs: Enable libvirtkvm collector

https://gerrit.wikimedia.org/r/304942

Change 304942 merged by Yuvipanda:
labs: Enable libvirtkvm collector

https://gerrit.wikimedia.org/r/304942

@Krenair just told me deployment-redis02 is in a similar state, and I see there have been no changes in the disk io numbers from nova for a while. It's a trusty instance with no NFS.

Caught yesterday

Candidate: tools-exec-1217

Details: stuck or frozen and doing no IO at all (among other things, SSH is completely inaccessible)

I used virsh domblkstat i-00000bde to diagnose the stuck I/O.
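
Concretely, that check looks something like this on the labvirt hosting the instance (a sketch; it assumes the guest's primary disk is vda):

virsh domblkstat i-00000bde vda; sleep 60; virsh domblkstat i-00000bde vda
# identical rd_req / wr_req / wr_bytes across the two samples means the guest is doing no disk I/O at all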

I looked into the process on labvirt1001 and documented the output here: https://phabricator.wikimedia.org/P3886. @BBlack was kind enough to lend another set of hands and eyes, but we were unable to determine the cause of the process lockup.

  • When I try to attach gdb to the process I see /usr/bin/qemu-system-x86_64 (deleted): No such file or directory
    • I do not see the above when connecting to a well-behaved instance on labvirt1014
    • I am able to do gdb /proc/62648/exe -p 62648
  • atop says integration-slave-trusty-1013 is a pretty big disk user, even around that time frame, but that should be small potatoes
  • A virsh reboot did not return or restore the instance. A virsh destroy (yank the power plug) was required, followed by a virsh start.
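
The recovery sequence that ended up working, roughly:

virsh reboot i-00000bde    # hangs / never takes effect on a stuck instance
virsh destroy i-00000bde   # hard power-off ("yank the power plug")
virsh start i-00000bde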

Seems to have become unavailable at 4 a.m. local time:

Screen Shot 2016-08-25 at 8.06.18 AM.png (533×897 px, 66 KB)

No noticeable usage triggers in the instance itself:

https://graphite-labs.wikimedia.org/render/?width=901&height=535&_salt=1472072883.003&target=tools.tools-exec-1217.loadavg.05&target=tools.tools-exec-1217.cpu.total.iowait&target=tools.tools-exec-1217.iostat.vda.iops&target=tools.tools-exec-1217.iostat.vda.io&from=-2d

Screen Shot 2016-08-25 at 8.09.04 AM.png (534×897 px, 46 KB)

There is a load surge on the labvirt several hours before, but it doesn't seem to be huge.

Screen Shot 2016-08-25 at 8.13.57 AM.png (535×897 px, 92 KB)

no memory usage correlation on the underlying host that I can tell

Screen Shot 2016-08-25 at 8.17.37 AM.png (536×899 px, 85 KB)

Log messages of note:

from libvirtd

2016-08-24 17:25:00.553+0000: 16223: error : qemuDomainGetIOThreadsLive:5776 : unsupported configuration: IOThreads not supported with this binary

(it's possible that the IOThreads error is just from me trying to issue: virsh iothreadinfo i-00000bde)

2016-08-24 19:24:12.352+0000: 16226: warning : qemuDomainObjBeginJobInternal:1467 : Cannot start job (query, none) for domain i-00000bde; current job is (query, none) owned by (16222 remoteDispatchDomainBlockStats, 0 <null>) for (48s, 0s)

[Wed Aug 24 18:03:10 2016] WARNING: CPU: 45 PID: 12767 at /build/linux-lts-utopic-9LCANk/linux-lts-utopic-3.16.0/security/apparmor/label.c:1767 __aa_labelset_update_all+0x48d/0x700()

[Wed Aug 24 18:03:10 2016] AppArmor WARN label_update: ((aa_label_remove_and_insert((&(((label)->ent[(label)->size - 1])->ns)->labels), label, l) != l)):

[16:27:05] <bblack> so, the last thing I'm still looking at is the strace on 64658
[16:27:35] <bblack> the pattern there is several fast KVM_RUN invocations with this in the middle of them:
[16:27:38] <bblack> ioctl(15, KVM_RUN, 0)                   = 0
[16:27:40] <bblack> write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
[16:27:43] <bblack> futex(0x7f44fa5e89a0, FUTEX_WAKE_PRIVATE, 1) = 1
[16:27:45] <bblack> ioctl(15, KVM_RUN, 0)                   = 0
[16:27:52] <bblack> and then it eventually hangs in a KVM_RUN ioctl
[16:27:55] <chasemp> I saw that too but can't make heads or tails of it
[16:28:02] <bblack> and after the pausy/hangy one, it always does:
[16:28:08] <bblack> ioctl(10, KVM_IRQ_LINE_STATUS, 0x7f44e979fa30) = 0
[16:28:13] <chasemp> hmmm
[16:28:28] <bblack> I've seen some bug pointers on that, but again they seem old
[16:28:40] <bblack> one mentioned that they fixed a similar bug by upgrading the guest kernel, because of some bug in it
[16:28:56] <bblack> but unlikely the case if you're seeing this on jessie+trusty+precise guests

considered:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1378123
https://bugs.launchpad.net/nova/+bug/1254872
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540/comments/13

Just happened to tools-k8s-etcd-01. Given it's a friday evening and I'm trying really hard to not work this weekend, I've just migrated it to a different host and rebooted it.

google-api-proxy-02.google-api-proxy.eqiad.wmflabs
noticed: 2016-09-13T16:00
ssh: no
io wait: no
io spike: no
graphite: no
nfs: no

[8765040.476194] INFO: task jbd2/vda3-8:98 blocked for more than 120 seconds.
[8765040.484287]       Not tainted 3.19.0-2-amd64 #1
[8765040.484878] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765040.486810] INFO: task nscd:14498 blocked for more than 120 seconds.
[8765040.487564]       Not tainted 3.19.0-2-amd64 #1
[8765040.488145] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765040.489294] INFO: task nscd:14499 blocked for more than 120 seconds.
[8765040.490088]       Not tainted 3.19.0-2-amd64 #1
[8765040.490630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765040.491619] INFO: task nscd:10330 blocked for more than 120 seconds.
[8765040.492385]       Not tainted 3.19.0-2-amd64 #1
[8765040.492925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765040.494444] INFO: task atop:17465 blocked for more than 120 seconds.
[8765040.495215]       Not tainted 3.19.0-2-amd64 #1
[8765040.495780] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765040.496870] INFO: task ntpq:21844 blocked for more than 120 seconds.
[8765040.498110]       Not tainted 3.19.0-2-amd64 #1
[8765040.498841] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765160.500203] INFO: task jbd2/vda3-8:98 blocked for more than 120 seconds.
[8765160.505508]       Not tainted 3.19.0-2-amd64 #1
[8765160.507062] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765160.508336] INFO: task nscd:14498 blocked for more than 120 seconds.
[8765160.509006]       Not tainted 3.19.0-2-amd64 #1
[8765160.509483] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765160.510707] INFO: task nscd:14499 blocked for more than 120 seconds.
[8765160.511385]       Not tainted 3.19.0-2-amd64 #1
[8765160.511871] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765160.512783] INFO: task nscd:14505 blocked for more than 120 seconds.
[8765160.513454]       Not tainted 3.19.0-2-amd64 #1
[8765160.513945] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Mentioned in SAL (#wikimedia-labs) [2016-09-13T16:14:57Z] <bd808> Rebooting google-api-proxy-02.google-api-proxy.eqiad.wmflabs for T141673

fqdn: deployment-aqs01.deployment-prep.eqiad.wmflabs
virt host: labvirt1005.eqiad.wmnet
noticed: 2016-09-26T23:44, shinken check for puppet staleness broken since Thu Sep 22 05:26:39 2016
ssh: yes
iowait: yes
iospike: no, https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1474931431.2&target=deployment-prep.deployment-aqs01.iostat.vda.io&from=17%3A25_20160921&until=17%3A35_20160921 suggests it died at 2016-09-21T17:30 ish
graphite: yes, seems so, though diamond is very unhappy in syslog
nfs: no

[5232812.130426] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=5875 jiffies, g=154209387, c=154209386, q=41)
[5232812.130426] INFO: Stall ended before state dump start
[11512081.913393] INFO: task jbd2/vda3-8:116 blocked for more than 120 seconds.
[11512081.915317]       Not tainted 3.16.0-4-amd64 #1
[11512081.916585] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512081.918022] INFO: task java:2967 blocked for more than 120 seconds.
[11512081.918805]       Not tainted 3.16.0-4-amd64 #1
[11512081.919387] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512081.920792] INFO: task java:2976 blocked for more than 120 seconds.
[11512081.921533]       Not tainted 3.16.0-4-amd64 #1
[11512081.922048] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512081.923026] INFO: task java:3003 blocked for more than 120 seconds.
[11512081.924723]       Not tainted 3.16.0-4-amd64 #1
[11512081.925342] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512081.926364] INFO: task java:3218 blocked for more than 120 seconds.
[11512081.927068]       Not tainted 3.16.0-4-amd64 #1
[11512081.927581] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512081.928918] INFO: task java:3219 blocked for more than 120 seconds.
[11512081.929626]       Not tainted 3.16.0-4-amd64 #1
[11512081.930141] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512081.931154] INFO: task java:3323 blocked for more than 120 seconds.
[11512081.931893]       Not tainted 3.16.0-4-amd64 #1
[11512081.932425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512201.932172] INFO: task jbd2/vda3-8:116 blocked for more than 120 seconds.
[11512201.934627]       Not tainted 3.16.0-4-amd64 #1
[11512201.935583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512201.937320] INFO: task java:2967 blocked for more than 120 seconds.
[11512201.938030]       Not tainted 3.16.0-4-amd64 #1
[11512201.938551] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512201.939570] INFO: task java:2976 blocked for more than 120 seconds.
[11512201.940635]       Not tainted 3.16.0-4-amd64 #1
[11512201.941180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

After I rebooted it, /var/log/syslog had lost all the nonsense it accumulated between roughly the time it got stuck and the restart:

Sep 21 17:25:01 deployment-aqs01 CRON[27843]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep 26 23:17:31 deployment-aqs01 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="593" x-info="http://www.rsyslog.com"] start
Sep 26 23:17:31 deployment-aqs01 systemd-modules-load[145]: Inserted module 'act_mirred'
Sep 26 23:17:31 deployment-aqs01 systemd-modules-load[145]: Inserted module 'ifb'
Sep 26 23:17:31 deployment-aqs01 systemd[1]: Started Load Kernel Modules.

Mentioned in SAL (#wikimedia-labs) [2016-09-26T23:13:57Z] <Krenair> Rebooting deployment-aqs01 for T141673

T146947
fqdn: deployment-ms-be01.deployment-prep.eqiad.wmflabs
virt host: labvirt1010.eqiad.wmnet
noticed: 2016-09-28T21:30ish, shinken check for puppet run broken since Wed Sep 28 19:02:04 2016
ssh: no, it gets as far as debug3: send packet: type 50 but there is no debug3: receive packet: type 51
salt: no
iowait: no, it was 0 until stopping at
iospike: no
graphite data: no
nfs: no
nothing in console

I rebooted tools-elasticsearch-02 today for this. I was trying to get stashbot/sal back up and running, so I didn't debug any further than noting that ssh was hung both as my user and as root@, and that the elasticsearch process was also hung, blocking the 3-node cluster from working at all.

https://phabricator.wikimedia.org/T149072#2741012 is probably this too. Given that it looks like we don't have the resources to actually investigate this, and that this status quo is going to persist for a while, I see no point in actively tracking it. I'm just going to reboot now.

Potentially this one is solved for good. I closed the umbrella task I had (T152599) and haven't noticed such a hang for a while now.

There is a subtask about upgrading qemu on the labvirts: T142866. But maybe it has been upgraded as well.

hashar claimed this task.

Got solved. It was most probably a kernel bug of some sort; see T152599 for traces etc.