
Track labs instances hanging
Closed, Resolved · Public

Description

This is purely to track labs instances hanging with messages about hung tasks in the kernel / console log. There are subtasks: T140256 for tools k8s and T141017 for k8s worker nodes. T124133 also has more information.

Instance | OS | virt host | time | ssh | iowait | io spike | graphite data | nfs | recovery
deployment-logstash2 | Jessie | labvirt1010 | 27 Jul 09:00Z | no | no | no | no | no | ?
tools-merlbot-proxy | Jessie | labvirt1010 | 31 Jul 01:00Z | no | yes | regular spikes to 300 (typically 50) | no | yes? | rebooted
novaproxy-01 | Jessie | labvirt1001 | 2 Aug ~21:15Z | no | no | no | no | yes | reboot
librarybase-reston-01 | Jessie | labvirt1006 | 07 Jul ~0200Z | no | no | no | no | no | reboot
tools-grid-master | trusty | labvirt1010 | 2 Aug ~14:00Z | no | yes | no | no | yes | slow and drawn out reboot
tools-webgrid-lighttpd-1202 | precise | labvirt1019 | 9 Aug ~05:47Z | no | no | no | no | yes | reboot
extdist-01 | jessie | labvirt1009 | Aug 17 10:19:37 (a) | no response | yes | yes (regular, still happening) | yes | no | reboot
extdist-02 | jessie | labvirt1010 | Jul 23 03:19:11 (a) | hangs after handshake | no | yes (single) | no | no | reboot
tools-mail | precise | labvirt1009 | Aug 19 00:35:01 (a) | hangs after handshake | no (max 1) | yes (jumps to steady 15k) | no | yes | slow reboot
tools-exec-1217 | precise | labvirt1001 | 24 Aug ~04 AM UTC | nope | nope | no | no | yes | ???

(a) last readable entry in /var/log/syslog before reboot

tools-puppetmaster-01 | jessie | labvirt1001 | Sep 20 13:54:01 | yes | ?? | ?? | ?? | no | reboot
Legend

Instance: name of the instance
virt host: the virt host the instance runs on (you can find this on wikitech.wikimedia.org/wiki/Nova_Resource:$fqdn)
time: time at which the host hung
ssh: whether you could still ssh into it
iowait: whether the host had high iowait (you can find this on https://graphite-labs.wikimedia.org/), e.g. tools.tools-merlbot-proxy.cpu.total.iowait
--> https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1469982906.471&target=tools.tools-merlbot-proxy.cpu.total.iowait

io spike: whether the host had io spikes (you can find this on https://graphite-labs.wikimedia.org/), e.g. tools.tools-merlbot-proxy.iostat.vda.io (vdb, ..., might also exist?)
--> https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1469983050.092&target=tools.tools-merlbot-proxy.iostat.vda.io

graphite data: whether the host kept sending graphite data even after it hung
nfs: whether the host had NFS mounted
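
A quick way to answer the iowait / io spike / graphite data questions after the fact is to hit the render API directly. A rough sketch, assuming the tools.<instance>.… metric naming shown above and the standard Graphite format=json parameter:

instance=tools-merlbot-proxy
curl -s "https://graphite-labs.wikimedia.org/render/?target=tools.${instance}.cpu.total.iowait&from=-24h&format=json" \
  | python3 -m json.tool | tail -n 20   # trailing nulls usually mean the host stopped reporting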

Event Timeline

Before deployment-logstash2 was restarted for the 2016-07-27 hang it had this console output via wikitech:

[10847049.612128] INFO: task jbd2/vda3-8:159 blocked for more than 120 seconds.
[10847049.618465] Not tainted 3.16.0-4-amd64 #1
[10847049.619349] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.621800] INFO: task jbd2/dm-0-8:308 blocked for more than 120 seconds.
[10847049.623079] Not tainted 3.16.0-4-amd64 #1
[10847049.623978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.625684] INFO: task nscd:15726 blocked for more than 120 seconds.
[10847049.626891] Not tainted 3.16.0-4-amd64 #1
[10847049.628953] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.629787] INFO: task java:10739 blocked for more than 120 seconds.
[10847049.630349] Not tainted 3.16.0-4-amd64 #1
[10847049.630765] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.631580] INFO: task java:10747 blocked for more than 120 seconds.
[10847049.632347] Not tainted 3.16.0-4-amd64 #1
[10847049.632751] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.633550] INFO: task java:10833 blocked for more than 120 seconds.
[10847049.634100] Not tainted 3.16.0-4-amd64 #1
[10847049.634505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.635257] INFO: task java:10848 blocked for more than 120 seconds.
[10847049.635833] Not tainted 3.16.0-4-amd64 #1
[10847049.636426] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.637165] INFO: task java:11608 blocked for more than 120 seconds.
[10847049.637719] Not tainted 3.16.0-4-amd64 #1
[10847049.638121] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.638944] INFO: task java:15282 blocked for more than 120 seconds.
[10847049.639505] Not tainted 3.16.0-4-amd64 #1
[10847049.640721] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10847049.641516] INFO: task <tcp:15292 blocked for more than 120 seconds.
[10847049.642078] Not tainted 3.16.0-4-amd64 #1
[10847049.642489] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

valhallasw subscribed.

Can you clarify what the columns mean? tools-merlbot-proxy is currently hanging with

https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=consoleoutput&instanceid=7b12cce8-b045-4fed-96a8-f5fe1a11e0f9&project=tools&region=eqiad

[4518960.281832] INFO: task jbd2/vda3-8:107 blocked for more than 120 seconds.
[4518960.282912]       Not tainted 3.19.0-2-amd64 #1
[4518960.283470] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

This issue seems to be prevalent, and maybe limited to Jessie hosts, though not to Jessie hosts built from the same exact image. It has seemed to affect k8s infrastructure, but it's entirely possible that a) k8s is the majority of Jessie hosts in Tools and b) we notice Tools hosts far more proactively than Jessie hosts anywhere else. We really need T130806, since most periods of unavailability also come with a lack of atop, syslog, and access, which makes all of this more difficult still. We had set up a basic log aggregation node for k8s (T141270) because the problem seemed so focused there, but with novaproxy-01 going down today and displaying such similar symptoms it may be wider (and we missed it by not having all of Tools sending in logs). It's also interesting that CI had (what I take to be) really similar Jessie-related issues that mysteriously disappeared(?) in T138281.

Console from novaproxy-01 at the time of unavailability

[11409409.150117] INFO: task cron:14372 blocked for more than 120 seconds.
[11409409.151003]       Not tainted 3.16.0-4-amd64 #1
[11409409.151647] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11409409.152879] INFO: task cron:14373 blocked for more than 120 seconds.
[11409409.153825]       Not tainted 3.16.0-4-amd64 #1
[11409409.155729] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11409409.156902] INFO: task cron:14374 blocked for more than 120 seconds.
[11409409.157905]       Not tainted 3.16.0-4-amd64 #1
[11409409.159943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

tools-grid-master: the first trusty instance to die! It also saw super spikes in load and iowait before it hung, *but* the processes on it are still functioning, with the grid still operating normally.

deployment-cache-upload04, still successfully running upload.beta.wmflabs.org on HTTP/HTTPS and responding to pings, but doesn't respond to SSH or Salt.

So I dug in and grepped all the logs for all active instances from openstack, and the following 33 instances have some form or other of:

[12156856.840147] INFO: task log-log.tcl:7702 blocked for more than 120 seconds.

for various tasks in them.

uploadwizard-osm-equiad
etcd03
tools-bastion-03
phabricator
tools-exec-1207
extdist-02
tools-exec-1212
tools-exec-1219
tools-exec-1217
otrs-memoryleak
openvas
filippo-test-precise2
tools-exec-1216
test-prometheus5
jessie-bastion-01
confd-precise
filippo-test-jessie
tools-exec-1204
deployment-cache-upload04
wpx-prod-01
toolsbeta-proxy-02
udpmx-01
mdc2002
db2003
ifttt-staging
tools-exec-1211
tools-k8s-etcd-03
ores-web-04
mc2001
diff-forge1
druid103
tools-exec-1220
filippo-test-jessie2

I'm going to try to figure out how many of them have working ssh now...

I can only ssh with my root key into the following instances:

  1. druid103.eqiad.wmflabs
  2. tools-exec-1220.eqiad.wmflabs
  3. tools-bastion-03.eqiad.wmflabs
  4. tools-exec-1207.eqiad.wmflabs
  5. tools-exec-1212.eqiad.wmflabs
  6. tools-exec-1216.eqiad.wmflabs
  7. tools-exec-1217.eqiad.wmflabs
  8. tools-exec-1219.eqiad.wmflabs
  9. tools-exec-1219.eqiad.wmflabs
  10. tools-exec-1220.eqiad.wmflabs
  11. tools-bastion-03.eqiad.wmflabs
  12. tools-exec-1207.eqiad.wmflabs
  13. tools-exec-1212.eqiad.wmflabs
  14. tools-exec-1216.eqiad.wmflabs
  15. tools-exec-1217.eqiad.wmflabs
  16. tools-exec-1219.eqiad.wmflabs

so almost exactly half?!

Looking at distribution of labvirts...

root@labcontrol1001:/home/yuvipanda/20160811# cat hosts | sort | uniq -c
      3 labvirt1001.eqiad.wmnet
      1 labvirt1002.eqiad.wmnet
      2 labvirt1004.eqiad.wmnet
      2 labvirt1006.eqiad.wmnet
      3 labvirt1007.eqiad.wmnet
      8 labvirt1008.eqiad.wmnet
      1 labvirt1009.eqiad.wmnet
     10 labvirt1010.eqiad.wmnet
      3 labvirt1011.eqiad.wmnet

And for images / OS...

16 debian-8.*-jessie
11 ubuntu-12.04-precise
 6 ubuntu-14.04-trusty
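
(Roughly how counts like these can be tallied, assuming a per-instance file whose last column is the image name; the file name here is hypothetical:)

awk '{print $NF}' hung-instances.txt | sed -E 's/^debian-8\.[0-9]+/debian-8.*/' | sort | uniq -c | sort -rn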

Mentioned in SAL [2016-08-12T19:14:49Z] <Krenair> rebooting deployment-cache-upload04, it's stuck in https://phabricator.wikimedia.org/T141673 and varnish is no longer working there afaict, so trying to bring upload.beta.wmflabs.org back up

Actual ssh failing just now in:

(clush)yuvipanda@fearless:~/code/clush$ clush -l root --hostfile=files -b 'echo hi'
extdist-02.eqiad.wmflabs: ssh_exchange_identification: Connection closed by remote host
confd-precise.eqiad.wmflabs: Connection timed out during banner exchange
db2003.eqiad.wmflabs: Connection timed out during banner exchange
etcd03.eqiad.wmflabs: Connection timed out during banner exchange
filippo-test-jessie2.eqiad.wmflabs: Connection timed out during banner exchange
filippo-test-precise2.eqiad.wmflabs: Connection timed out during banner exchange
ifttt-staging.eqiad.wmflabs: Connection timed out during banner exchange
jessie-bastion-01.eqiad.wmflabs: Connection timed out during banner exchange
mc2001.eqiad.wmflabs: Connection timed out during banner exchange
mdc2002.eqiad.wmflabs: Connection timed out during banner exchange
toolsbeta-proxy-02.eqiad.wmflabs: Connection timed out during banner exchange
openvas.eqiad.wmflabs: Connection timed out during banner exchange
ores-web-04.eqiad.wmflabs: Connection timed out during banner exchange
tools-exec-1204.eqiad.wmflabs: packet_write_wait: Connection to UNKNOWN port 0: Broken pipe
uploadwizard-osm-equiad.eqiad.wmflabs: Connection timed out during banner exchange
otrs-memoryleak.eqiad.wmflabs: Connection timed out during banner exchange
test-prometheus5.eqiad.wmflabs: Connection timed out during banner exchange
tools-exec-1211.eqiad.wmflabs: packet_write_wait: Connection to UNKNOWN port 0: Broken pipe
filippo-test-jessie.eqiad.wmflabs: Connection timed out during banner exchange
clush: 30/33

breakdown:

extdist-02 16a0fd3f-c751-4686-8c77-e84b58d42eed debian-8.0-jessie
confd-precise 7966c5ea-ac06-47bd-9f0b-b6c142de1650 ubuntu-12.04-precise
db2003 a1c342e1-8725-46a1-a116-74bf39549bcb ubuntu-14.04-trusty
etcd03 06eeac90-a10e-4c7c-8c49-bbb566fc3936 debian-8.0-jessie
filippo-test-jessie2 f8ddd70f-1b05-43c8-863f-55d6c2d1b59e debian-8.3-jessie
filippo-test-precise2 644872c4-db70-411f-8bde-f48346785e11 ubuntu-12.04-precise
ifttt-staging-01 4d70a62f-87f9-47a8-ad09-db7aa58a18b3 debian-8.3-jessie
ifttt-staging b57b5ff1-3619-4666-af91-f9b156566810 debian-8.3-jessie
jessie-bastion-01 6bbdb413-4b3e-44f9-8472-153402b09126 debian-8.3-jessie
mc2001 df6c5ac6-72f3-4c57-96ee-d48d9ff27bab ubuntu-14.04-trusty
mdc2002 9c5daf34-8e63-485c-a0fc-52b3a2443d86 ubuntu-14.04-trusty
toolsbeta-proxy-02 950f686c-eec5-4ecc-b2bc-f192fee4e95c debian-8.1-jessie
openvas 49fb74ed-734a-4332-9fb3-6d2c4903e98e ubuntu-14.04-trusty
ores-web-04 dbd8f69a-caef-4587-859c-40d8aff14e98 debian-8.5-jessie
tools-exec-1204 82323ee4-762e-4b1f-87a7-d7aa7afa22f6 ubuntu-12.04-precise
uploadwizard-osm-equiad 045fda94-73f2-4e58-a315-e26fbe268970 ubuntu-12.04-precise
otrs-memoryleak 46259ec3-a3ae-4129-97f6-21bd564f4a31 debian-8.3-jessie
test-prometheus5 67e83f10-208d-429d-ba24-60c6abf589e6 debian-8.3-jessie
tools-exec-1211 bfc9c147-cb7b-4c16-8d92-b37524605a12 ubuntu-12.04-precise
filippo-test-jessie2 f8ddd70f-1b05-43c8-863f-55d6c2d1b59e debian-8.3-jessie
filippo-test-jessie 7c38738f-babe-43a8-b9bc-18761f2a6fd2 debian-8.0-jessie

Of these I have looked at tools-exec-1211 and tools-exec-1204 so far, which according to nova diagnostics are still processing some io, making them probably not candidates for the same issue.
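
That "still processing some io" check can be done by sampling the hypervisor-side disk counters twice and comparing; a sketch (counter names like vda_read_req depend on the nova/libvirt versions in use):

nova diagnostics tools-exec-1211 | grep -E 'vda_(read|write)_req'
sleep 120
nova diagnostics tools-exec-1211 | grep -E 'vda_(read|write)_req'
# if the counters advance between samples, the guest is still doing disk I/O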

non-Jessie candidates

confd-precise 7966c5ea-ac06-47bd-9f0b-b6c142de1650 ubuntu-12.04-precise
db2003 a1c342e1-8725-46a1-a116-74bf39549bcb ubuntu-14.04-trusty
filippo-test-precise2 644872c4-db70-411f-8bde-f48346785e11 ubuntu-12.04-precise
mc2001 df6c5ac6-72f3-4c57-96ee-d48d9ff27bab ubuntu-14.04-trusty
mdc2002 9c5daf34-8e63-485c-a0fc-52b3a2443d86 ubuntu-14.04-trusty
openvas 49fb74ed-734a-4332-9fb3-6d2c4903e98e ubuntu-14.04-trusty
tools-exec-1204 82323ee4-762e-4b1f-87a7-d7aa7afa22f6 ubuntu-12.04-precise
uploadwizard-osm-equiad 045fda94-73f2-4e58-a315-e26fbe268970 ubuntu-12.04-precise
tools-exec-1211 bfc9c147-cb7b-4c16-8d92-b37524605a12 ubuntu-12.04-precise

inconclusive so far are:

openvas 49fb74ed-734a-4332-9fb3-6d2c4903e98e ubuntu-14.04-trusty
confd-precise 7966c5ea-ac06-47bd-9f0b-b6c142de1650 ubuntu-12.04-precise

Change 304942 had a related patch set uploaded (by Yuvipanda):
labs: Enable libvirtkvm collector

https://gerrit.wikimedia.org/r/304942

Change 304942 merged by Yuvipanda:
labs: Enable libvirtkvm collector

https://gerrit.wikimedia.org/r/304942

@Krenair just told me deployment-redis02 is in a similar state, and I see there have been no changes in the disk io numbers from nova for a while. It's a trusty instance with no NFS.

Caught yesterday

Candidate: tools-exec-1217

Details: stuck or frozen and doing no IO at all (among other things, SSH is completely inaccessible)

I used virsh domblkstat i-00000bde to diagnose the stuck I/O.
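
Concretely, that check looks something like this on the labvirt hosting the instance (a sketch; it assumes the guest's primary disk is vda):

virsh domblkstat i-00000bde vda; sleep 60; virsh domblkstat i-00000bde vda
# identical rd_req / wr_req / wr_bytes across the two samples means the guest is doing no disk I/O at all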

I looked into the process on labvirt1001 and documented the output here: https://phabricator.wikimedia.org/P3886. @BBlack was kind enough to lend another set of hands and eyes, but we were unable to determine the cause of the process lockup.

  • When I try to attach gdb to the process I see /usr/bin/qemu-system-x86_64 (deleted): No such file or directory
    • I do not see the above when connecting to a well-behaved instance on labvirt1014
    • I am able to do gdb /proc/62648/exe -p 62648
  • atop says integration-slave-trusty-1013 is a pretty big disk user, even around that time frame, but that should be small potatoes
  • A virsh reboot did not return or restore the instance. A virsh destroy (yank the power plug) was required, followed by a virsh start.
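
The recovery sequence that ended up working, roughly:

virsh reboot i-00000bde    # hangs / never takes effect on a stuck instance
virsh destroy i-00000bde   # hard power-off ("yank the power plug")
virsh start i-00000bde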

Seems to have become unavailable at 4 a.m. local time:

Screen Shot 2016-08-25 at 8.06.18 AM.png (533×897 px, 66 KB)

No noticeable usage triggers in the instance itself:

https://graphite-labs.wikimedia.org/render/?width=901&height=535&_salt=1472072883.003&target=tools.tools-exec-1217.loadavg.05&target=tools.tools-exec-1217.cpu.total.iowait&target=tools.tools-exec-1217.iostat.vda.iops&target=tools.tools-exec-1217.iostat.vda.io&from=-2d

Screen Shot 2016-08-25 at 8.09.04 AM.png (534×897 px, 46 KB)

There is a load surge on the labvirt several hours before, but it doesn't seem to be huge.

Screen Shot 2016-08-25 at 8.13.57 AM.png (535×897 px, 92 KB)

no memory usage correlation on the underlying host that I can tell

Screen Shot 2016-08-25 at 8.17.37 AM.png (536×899 px, 85 KB)

Log messages of note:

from libvirtd

2016-08-24 17:25:00.553+0000: 16223: error : qemuDomainGetIOThreadsLive:5776 : unsupported configuration: IOThreads not supported with this binary

(it's possible that the IOThreads error is just from me trying to issue: virsh iothreadinfo i-00000bde)

2016-08-24 19:24:12.352+0000: 16226: warning : qemuDomainObjBeginJobInternal:1467 : Cannot start job (query, none) for domain i-00000bde; current job is (query, none) owned by (16222 remoteDispatchDomainBlockStats, 0 <null>) for (48s, 0s)

[Wed Aug 24 18:03:10 2016] WARNING: CPU: 45 PID: 12767 at /build/linux-lts-utopic-9LCANk/linux-lts-utopic-3.16.0/security/apparmor/label.c:1767 __aa_labelset_update_all+0x48d/0x700()

[Wed Aug 24 18:03:10 2016] AppArmor WARN label_update: ((aa_label_remove_and_insert((&(((label)->ent[(label)->size - 1])->ns)->labels), label, l) != l)):

[16:27:05] <bblack> so, the last thing I'm still looking at is the strace on 64658
[16:27:35] <bblack> the pattern there is several fast KVM_RUN invocations with this in the middle of them:
[16:27:38] <bblack> ioctl(15, KVM_RUN, 0)                   = 0
[16:27:40] <bblack> write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
[16:27:43] <bblack> futex(0x7f44fa5e89a0, FUTEX_WAKE_PRIVATE, 1) = 1
[16:27:45] <bblack> ioctl(15, KVM_RUN, 0)                   = 0
[16:27:52] <bblack> and then it eventually hangs in a KVM_RUN ioctl
[16:27:55] <chasemp> I saw that too but can't make heads or tails of it
[16:28:02] <bblack> and after the pausy/hangy one, it always does:
[16:28:08] <bblack> ioctl(10, KVM_IRQ_LINE_STATUS, 0x7f44e979fa30) = 0
[16:28:13] <chasemp> hmmm
[16:28:28] <bblack> I've seen some bug pointers on that, but again they seem old
[16:28:40] <bblack> one mentioned that they fixed a similar bug by upgrading the guest kernel, because of some bug in it
[16:28:56] <bblack> but unlikely the case if you're seeing this on jessie+trusty+precise guests

considered:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1378123
https://bugs.launchpad.net/nova/+bug/1254872
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540/comments/13

Just happened to tools-k8s-etcd-01. Given it's a friday evening and I'm trying really hard to not work this weekend, I've just migrated it to a different host and rebooted it.

google-api-proxy-02.google-api-proxy.eqiad.wmflabs
noticed: 2016-09-13T16:00
ssh: no
io wait: no
io spike: no
graphite: no
nfs: no

[8765040.476194] INFO: task jbd2/vda3-8:98 blocked for more than 120 seconds.
[8765040.484287]       Not tainted 3.19.0-2-amd64 #1
[8765040.484878] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765040.486810] INFO: task nscd:14498 blocked for more than 120 seconds.
[8765040.487564]       Not tainted 3.19.0-2-amd64 #1
[8765040.488145] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765040.489294] INFO: task nscd:14499 blocked for more than 120 seconds.
[8765040.490088]       Not tainted 3.19.0-2-amd64 #1
[8765040.490630] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765040.491619] INFO: task nscd:10330 blocked for more than 120 seconds.
[8765040.492385]       Not tainted 3.19.0-2-amd64 #1
[8765040.492925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765040.494444] INFO: task atop:17465 blocked for more than 120 seconds.
[8765040.495215]       Not tainted 3.19.0-2-amd64 #1
[8765040.495780] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765040.496870] INFO: task ntpq:21844 blocked for more than 120 seconds.
[8765040.498110]       Not tainted 3.19.0-2-amd64 #1
[8765040.498841] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765160.500203] INFO: task jbd2/vda3-8:98 blocked for more than 120 seconds.
[8765160.505508]       Not tainted 3.19.0-2-amd64 #1
[8765160.507062] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765160.508336] INFO: task nscd:14498 blocked for more than 120 seconds.
[8765160.509006]       Not tainted 3.19.0-2-amd64 #1
[8765160.509483] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765160.510707] INFO: task nscd:14499 blocked for more than 120 seconds.
[8765160.511385]       Not tainted 3.19.0-2-amd64 #1
[8765160.511871] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[8765160.512783] INFO: task nscd:14505 blocked for more than 120 seconds.
[8765160.513454]       Not tainted 3.19.0-2-amd64 #1
[8765160.513945] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Mentioned in SAL (#wikimedia-labs) [2016-09-13T16:14:57Z] <bd808> Rebooting google-api-proxy-02.google-api-proxy.eqiad.wmflabs for T141673

fqdn: deployment-aqs01.deployment-prep.eqiad.wmflabs
virt host: labvirt1005.eqiad.wmnet
noticed: 2016-09-26T23:44, shinken check for puppet staleness broken since Thu Sep 22 05:26:39 2016
ssh: yes
iowait: yes
iospike: no, https://graphite-labs.wikimedia.org/render/?width=586&height=308&_salt=1474931431.2&target=deployment-prep.deployment-aqs01.iostat.vda.io&from=17%3A25_20160921&until=17%3A35_20160921 suggests it died at 2016-09-21T17:30 ish
graphite: yes, seems so, though diamond is very unhappy in syslog
nfs: no

[5232812.130426] INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 0, t=5875 jiffies, g=154209387, c=154209386, q=41)
[5232812.130426] INFO: Stall ended before state dump start
[11512081.913393] INFO: task jbd2/vda3-8:116 blocked for more than 120 seconds.
[11512081.915317]       Not tainted 3.16.0-4-amd64 #1
[11512081.916585] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512081.918022] INFO: task java:2967 blocked for more than 120 seconds.
[11512081.918805]       Not tainted 3.16.0-4-amd64 #1
[11512081.919387] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512081.920792] INFO: task java:2976 blocked for more than 120 seconds.
[11512081.921533]       Not tainted 3.16.0-4-amd64 #1
[11512081.922048] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512081.923026] INFO: task java:3003 blocked for more than 120 seconds.
[11512081.924723]       Not tainted 3.16.0-4-amd64 #1
[11512081.925342] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512081.926364] INFO: task java:3218 blocked for more than 120 seconds.
[11512081.927068]       Not tainted 3.16.0-4-amd64 #1
[11512081.927581] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512081.928918] INFO: task java:3219 blocked for more than 120 seconds.
[11512081.929626]       Not tainted 3.16.0-4-amd64 #1
[11512081.930141] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512081.931154] INFO: task java:3323 blocked for more than 120 seconds.
[11512081.931893]       Not tainted 3.16.0-4-amd64 #1
[11512081.932425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512201.932172] INFO: task jbd2/vda3-8:116 blocked for more than 120 seconds.
[11512201.934627]       Not tainted 3.16.0-4-amd64 #1
[11512201.935583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512201.937320] INFO: task java:2967 blocked for more than 120 seconds.
[11512201.938030]       Not tainted 3.16.0-4-amd64 #1
[11512201.938551] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11512201.939570] INFO: task java:2976 blocked for more than 120 seconds.
[11512201.940635]       Not tainted 3.16.0-4-amd64 #1
[11512201.941180] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

After I rebooted it, /var/log/syslog had lost all the nonsense it accumulated between roughly the time it got stuck and the restart:

Sep 21 17:25:01 deployment-aqs01 CRON[27843]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Sep 26 23:17:31 deployment-aqs01 rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="593" x-info="http://www.rsyslog.com"] start
Sep 26 23:17:31 deployment-aqs01 systemd-modules-load[145]: Inserted module 'act_mirred'
Sep 26 23:17:31 deployment-aqs01 systemd-modules-load[145]: Inserted module 'ifb'
Sep 26 23:17:31 deployment-aqs01 systemd[1]: Started Load Kernel Modules.

Mentioned in SAL (#wikimedia-labs) [2016-09-26T23:13:57Z] <Krenair> Rebooting deployment-aqs01 for T141673

T146947
fqdn: deployment-ms-be01.deployment-prep.eqiad.wmflabs
virt host: labvirt1010.eqiad.wmnet
noticed: 2016-09-28T21:30ish, shinken check for puppet run broken since Wed Sep 28 19:02:04 2016
ssh: no, it gets as far as debug3: send packet: type 50 but there is no debug3: receive packet: type 51
salt: no
iowait: no, it was 0 until stopping at
iospike: no
graphite data: no
nfs: no
nothing in console

I rebooted tools-elasticsearch-02 today for this. I was trying to get stashbot/sal back up and running, so I didn't debug any further than noting that ssh was hung both as my user and as root@, and that the elasticsearch process was also hung, blocking the 3-node cluster from working at all.

https://phabricator.wikimedia.org/T149072#2741012 is probably this too. Given that it looks like we don't have the resources to actually investigate this, and that this status quo is going to persist for a while, I see no point in actively tracking it. I'm just going to reboot now.

Potentially this one is solved for good. I closed the umbrella task I had (T152599) and haven't noticed such a hang for a while now.

There is a subtask about upgrading qemu on the labvirts: T142866. But maybe it has been upgraded as well.

hashar claimed this task.

Got solved. It was most probably a kernel bug of some sort; see T152599 for traces etc.