Page MenuHomePhabricator

tools-webgrid-lighttpd-1204 locked up
Closed, ResolvedPublic

Description

Another victim of the random lockup issue?

Last output in the console log:

cloud-init boot finished at Wed, 30 Dec 2015 03:15:54 +0000. Up 12.21 seconds

Ubuntu 12.04.5 LTS tools-webgrid-lighttpd-1204 ttyS0

tools-webgrid-lighttpd-1204 login: [296400.308160] INFO: task php-cgi:27643 blocked for more than 120 seconds.
[296400.313289] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[298680.312168] INFO: task lighttpd:22388 blocked for more than 120 seconds.
[298680.313741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[299520.312162] INFO: task php-cgi:27643 blocked for more than 120 seconds.
[299520.315794] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[299520.317252] INFO: task php-cgi:27747 blocked for more than 120 seconds.
[299520.318276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[299520.319270] INFO: task php-cgi:6481 blocked for more than 120 seconds.
[299520.320061] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1500960.820584] INFO: task lighttpd:18320 blocked for more than 120 seconds.
[1500960.825800] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1501080.828391] INFO: task lighttpd:18320 blocked for more than 120 seconds.
[1501080.833014] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1501200.833531] INFO: task lighttpd:18320 blocked for more than 120 seconds.
[1501200.836835] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1501320.836219] INFO: task lighttpd:18320 blocked for more than 120 seconds.
[1501320.840029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1501440.840822] INFO: task lighttpd:18320 blocked for more than 120 seconds.
[1501440.844371] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description. (Show Details)
valhallasw added a project: Toolforge.
valhallasw subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald Transcript

Can't login from tools-bastion-01, nor with my root key. Rebooting it.

From kern.log:

Jan 16 12:11:43 tools-webgrid-lighttpd-1204 kernel: [1500960.820584] INFO: task lighttpd:18320 blocked for more than 120 seconds.
Jan 16 12:11:43 tools-webgrid-lighttpd-1204 kernel: [1500960.825800] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 16 12:11:43 tools-webgrid-lighttpd-1204 kernel: [1500960.828795] lighttpd        D 0000000000000000     0 18320  18318 0x00000000
Jan 16 12:11:43 tools-webgrid-lighttpd-1204 kernel: [1500960.828802]  ffff8800b9d39c98 0000000000000086 0000000017759b13 ffffffff81c17540
Jan 16 12:11:43 tools-webgrid-lighttpd-1204 kernel: [1500960.828810]  ffff8800b9d39fd8 ffff8800b9d39fd8 ffff8800b9d39fd8 0000000000012700
Jan 16 12:11:43 tools-webgrid-lighttpd-1204 kernel: [1500960.828819]  ffff880232128000 ffff8800976a96f0 ffff8800b9d39c78 ffff88023fc12fc0
Jan 16 12:11:43 tools-webgrid-lighttpd-1204 kernel: [1500960.828829] Call Trace:
(...)
Jan 16 12:13:43 tools-webgrid-lighttpd-1204 kernel: [1501080.828391] INFO: task lighttpd:18320 blocked for more than 120 seconds.
(...)
Jan 16 12:15:43 tools-webgrid-lighttpd-1204 kernel: [1501200.833531] INFO: task lighttpd:18320 blocked for more than 120 seconds.
(...)
Jan 16 12:17:43 tools-webgrid-lighttpd-1204 kernel: [1501320.836219] INFO: task lighttpd:18320 blocked for more than 120 seconds.
(...)
Jan 16 12:19:43 tools-webgrid-lighttpd-1204 kernel: [1501440.840822] INFO: task lighttpd:18320 blocked for more than 120 seconds.
(...)
Jan 16 12:19:43 tools-webgrid-lighttpd-1204 kernel: [1501440.845503]  [<ffffffff81666982>] system_call_fastpath+0x16/0x1b

All later entries are from the reboot (~14:39)

/var/log/auth.log shows the server was still doing things in that period. I can even see my connection attempts:

Jan 16 14:37:40 tools-webgrid-lighttpd-1204 sshd[21694]: Set /proc/self/oom_score_adj to 0
Jan 16 14:37:40 tools-webgrid-lighttpd-1204 sshd[21694]: Connection from 10.68.17.228 port 54394 on 10.68.18.49 port 22
Jan 16 14:38:10 tools-webgrid-lighttpd-1204 sshd[21696]: Set /proc/self/oom_score_adj to 0
Jan 16 14:38:10 tools-webgrid-lighttpd-1204 sshd[21696]: Connection from 10.68.17.232 port 45459 on 10.68.18.49 port 22
(...)
Jan 16 14:38:15 tools-webgrid-lighttpd-1204 sshd[21696]: error: AuthorizedKeysCommand /usr/sbin/ssh-key-ldap-lookup returned status 1
Jan 16 14:38:15 tools-webgrid-lighttpd-1204 sshd[21696]: Accepted publickey for root from 10.68.17.232 port 45459 ssh2: RSA 42:3a:c1:ab:15:5e:22:71:39:c6:a0:65:fe:a9:d2:2b
Jan 16 14:38:15 tools-webgrid-lighttpd-1204 sshd[21696]: pam_unix(sshd:session): session opened for user root by (uid=0)

(!) so the authentication succeeds, but then the connection just hangs?

Same issue today on tools-webgrid-lighttpd-1412 (T124038):

[1691760.896155] INFO: task jbd2/vda1-8:175 blocked for more than 120 seconds.
[1691760.902121]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691760.903224] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691760.905020] INFO: task kworker/u8:2:7054 blocked for more than 120 seconds.
[1691760.906295]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691760.907272] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.908100] INFO: task init:1 blocked for more than 120 seconds.
[1691880.911512]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.911959] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.912751] INFO: task jbd2/vda1-8:175 blocked for more than 120 seconds.
[1691880.913308]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.913756] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.914463] INFO: task jbd2/dm-1-8:334 blocked for more than 120 seconds.
[1691880.915002]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.915456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.916200] INFO: task kworker/u8:2:7054 blocked for more than 120 seconds.
[1691880.916757]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.917204] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.917964] INFO: task cron:18501 blocked for more than 120 seconds.
[1691880.918478]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.918924] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.919642] INFO: task sort:18520 blocked for more than 120 seconds.
[1691880.920159]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.920606] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.921276] INFO: task sort:18540 blocked for more than 120 seconds.
[1691880.921781]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.922686] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1691880.923361] INFO: task sort:18544 blocked for more than 120 seconds.
[1691880.923869]       Not tainted 3.13.0-62-generic #102-Ubuntu
[1691880.924358] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

I rebooted that instance.

chasemp claimed this task.
chasemp subscribed.

It's currently up and going and I don't have more details to add here so I'm going to close the subtask.