tools-sgewebgrid-lighttpd-0915 not responding
Closed, Resolved (Public)

Description

The VM tools-sgewebgrid-lighttpd-0915 is currently offline and not responding.

PROBLEM - SSH on tools-sgewebgrid-lighttpd-0915 is CRITICAL: CRITICAL - Socket timeout after 10 seconds

Debian GNU/Linux 9 tools-sgewebgrid-lighttpd-0915 ttyS0

tools-sgewebgrid-lighttpd-0915 login: [9796737.825202] INFO: task prometheus-node:417 blocked for more than 120 seconds.
[9796737.827194]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[9796737.828751] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[9796737.830971] INFO: task prometheus-node:442 blocked for more than 120 seconds.
[9796737.832802]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[9796737.834291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[9796737.836312] INFO: task prometheus-node:1667 blocked for more than 120 seconds.
[9796737.837534]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[9796737.838572] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[9796737.839892] INFO: task prometheus-node:1668 blocked for more than 120 seconds.
[9796737.841119]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[9796737.842129] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[9796979.474344] INFO: task sd-resolve:402 blocked for more than 120 seconds.
[9796979.476146]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[9796979.477578] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[9796979.479653] INFO: task perl:25169 blocked for more than 120 seconds.
[9796979.481319]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[9796979.482768] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[9797100.300964] INFO: task sd-resolve:402 blocked for more than 120 seconds.
[9797100.302727]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[9797100.304160] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[9797100.306184] INFO: task lighttpd:14195 blocked for more than 120 seconds.
[9797100.307813]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[9797100.309262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[9797100.311362] INFO: task perl:25169 blocked for more than 120 seconds.
[9797100.312906]       Not tainted 4.9.0-8-amd64 #1 Debian 4.9.144-3
[9797100.314342] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2019-08-08T19:25:58Z] <jeh> restarting tools-sgewebgrid-lighttpd-0915 T230157

The only interesting entry in the logs:

Aug  8 19:08:48 tools-sgewebgrid-lighttpd-0915 kernel: [9796402.628987] perl[12588]: segfault at 7ffcde3fdff8 ip 00002b058f2d3458 sp 00007ffcde3fe000 error 6 in libc-2.24.so[2b058f25a000+195000]

Error 6 means a user-mode write to an address for which no page was found.
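For reference, the error number in the kernel's segfault line is the x86 page-fault error code, a bit field. The sketch below decodes it and, as a side check, computes how far the faulting address sits from the stack pointer in the line above. The function name and the dictionary layout are made up for illustration; only the bit meanings and the two addresses come from the kernel and the log.

```python
# Low bits of the x86 page-fault error code, as printed in
# "segfault at ... error N" kernel messages.
PF_PROT = 0x01   # 0: no page found, 1: protection fault
PF_WRITE = 0x02  # 0: read access,   1: write access
PF_USER = 0x04   # 0: kernel mode,   1: user mode
PF_RSVD = 0x08   # 1: reserved bit set in a page-table entry
PF_INSTR = 0x10  # 1: fault occurred during an instruction fetch


def decode_page_fault_error(code: int) -> dict:
    """Hypothetical helper: split the error code into readable fields."""
    return {
        "cause": "protection fault" if code & PF_PROT else "no page found",
        "access": "write" if code & PF_WRITE else "read",
        "mode": "user" if code & PF_USER else "kernel",
        "reserved_bit": bool(code & PF_RSVD),
        "instruction_fetch": bool(code & PF_INSTR),
    }


# Values copied from the perl[12588] segfault line in this task.
fault_addr = 0x7FFCDE3FDFF8
stack_ptr = 0x7FFCDE3FE000

print(decode_page_fault_error(6))
print("bytes below sp:", stack_ptr - fault_addr)
```

The fault address is 8 bytes below the stack pointer. That would be consistent with the process running out of stack, for example through deep recursion, although the log line alone does not prove it.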

This VM is running on cloudvirt1027.eqiad.wmnet. No other VMs were affected, and there were no system errors on the hypervisor. I think it's an application error in one of the grid jobs, but if it happens again on this host we may want to bring it down for a memtest.