tools-webgrid-lighttpd-1412 refuses ssh connections
Closed, Resolved · Public

Description

scfc@tools-bastion-01:~$ ssh tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs
ssh_exchange_identification: Connection closed by remote host
scfc@tools-bastion-01:~$

Event Timeline

scfc raised the priority of this task to Needs Triage.
scfc updated the task description.
scfc added a project: Toolforge.
scfc moved this task to Backlog on the Toolforge board.
scfc added a subscriber: scfc.
Restricted Application added subscribers: StudiesWorld, Aklapper.

Odd. Graphite doesn't seem to have /any/ recent information, and shinken is not complaining about anything. Has this host ever been online to begin with...?

Example:

[Image: pasted_file (308×586 px, 41 KB)]

where NULL is plotted as 0. Seems pretty dead to me...

Nothing obvious in the console log either:

Cloud-init v. 0.7.5 finished at Mon, 02 Nov 2015 22:07:58 +0000. Datasource DataSourceOpenStack [net,ver=2].  Up 183.07 seconds
[  480.332043] INFO: task dpkg:11688 blocked for more than 120 seconds.
[  480.333322]       Not tainted 3.13.0-62-generic #102-Ubuntu
[  480.333944] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  600.332052] INFO: task kworker/u8:1:67 blocked for more than 120 seconds.
[  600.333185]       Not tainted 3.13.0-62-generic #102-Ubuntu
[  600.333763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  600.334696] INFO: task jbd2/vda1-8:154 blocked for more than 120 seconds.
[  600.335443]       Not tainted 3.13.0-62-generic #102-Ubuntu
[  600.336050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  600.336933] INFO: task dpkg:11688 blocked for more than 120 seconds.
[  600.337587]       Not tainted 3.13.0-62-generic #102-Ubuntu
[  600.338164] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  600.339002] INFO: task sudo:11700 blocked for more than 120 seconds.
[  600.339679]       Not tainted 3.13.0-62-generic #102-Ubuntu
[  600.340279] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  600.341124] INFO: task ntpq:11701 blocked for more than 120 seconds.
[  600.341799]       Not tainted 3.13.0-62-generic #102-Ubuntu
[  600.342376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  720.340030] INFO: task kworker/u8:1:67 blocked for more than 120 seconds.
[  720.341364]       Not tainted 3.13.0-62-generic #102-Ubuntu
[  720.341944] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  720.342863] INFO: task jbd2/vda1-8:154 blocked for more than 120 seconds.
[  720.343592]       Not tainted 3.13.0-62-generic #102-Ubuntu
[  720.344197] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  720.345064] INFO: task rs:main Q:Reg:2410 blocked for more than 120 seconds.
[  720.345786]       Not tainted 3.13.0-62-generic #102-Ubuntu
[  720.346353] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  720.347636] INFO: task dpkg:11688 blocked for more than 120 seconds.
[  720.348314]       Not tainted 3.13.0-62-generic #102-Ubuntu
[  720.348922] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
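
The hung-task messages above can be condensed to see which processes are reported as stuck and how often. A minimal sketch in Python, assuming the console log is available as plain text; the function name `summarize_hung_tasks` and the `SAMPLE` excerpt are illustrative only and not part of any existing tooling:

```python
import re
from collections import Counter

# Matches kernel hung-task warnings such as:
#   [  600.334696] INFO: task jbd2/vda1-8:154 blocked for more than 120 seconds.
# Task names may themselves contain colons (e.g. "kworker/u8:1"), so the name
# group is greedy and the PID is the last colon-separated number.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds\."
)

# A few lines copied from the console log in this task.
SAMPLE = """\
[  480.332043] INFO: task dpkg:11688 blocked for more than 120 seconds.
[  600.332052] INFO: task kworker/u8:1:67 blocked for more than 120 seconds.
[  600.334696] INFO: task jbd2/vda1-8:154 blocked for more than 120 seconds.
[  600.336933] INFO: task dpkg:11688 blocked for more than 120 seconds.
[  720.345064] INFO: task rs:main Q:Reg:2410 blocked for more than 120 seconds.
"""


def summarize_hung_tasks(console_log: str) -> Counter:
    """Count how often each (task name, pid) pair is reported as blocked."""
    counts = Counter()
    for line in console_log.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            counts[(match.group("name"), int(match.group("pid")))] += 1
    return counts


if __name__ == "__main__":
    for (name, pid), n in summarize_hung_tasks(SAMPLE).most_common():
        print(f"{name} (pid {pid}): reported blocked {n} time(s)")
```

In the full log the blocked tasks include jbd2/vda1-8, the journal thread for vda1. A blocked journal thread often goes along with stalled disk I/O, though the log alone does not prove that.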

The host is responding to pings.

IRC logs give another clue:

[22:04:35] <YuviPanda>	 !log tools created tools-webgrid-lighttpd-1412 and 1413
(...)
[22:35:20] <shinken-wm>	 RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:37:05] <YuviPanda>	 Coren: andrewbogott can you attempt to ssh into tools-webgrid-lighttpd-1412.eqiad.wmflabs
[22:37:10] <YuviPanda>	 I created that instance but seems stuck?
(...)
[22:39:51] <Coren>	 Ah; userland stuck - sshd accepts() but then stalls.
[22:40:52] <andrewbogott>	 in (possibly related) news, I’m trying to rsync stuff to labvirt1005 and getting "failed verification -- update discarded (will try again)"
[22:40:56] <andrewbogott>	 which is… worrying.
[22:41:01] <andrewbogott>	 So I’m going to depool it and do some more tests.
[22:41:07] <chasemp>	 kk
[22:41:17] * Coren hmms.
[22:41:24] <andrewbogott>	 YuviPanda, Coren, possibly related since 1412 is also on 1005
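
The symptom Coren describes (the TCP connection is accepted, but the SSH identification string never arrives) can be told apart from an unreachable host with a small probe. A rough sketch in Python, assuming plain TCP access to port 22; the function name `probe_ssh_banner` and its return labels are made up for illustration:

```python
import socket


def probe_ssh_banner(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Classify how far an SSH connection attempt gets.

    Returns one of: "unreachable", "accepted-no-banner",
    "closed-before-banner", "banner", "unexpected-data".
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused, timed out, or name resolution failed.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            # An SSH server normally sends its identification string
            # ("SSH-2.0-...") right after the connection is established.
            data = sock.recv(256)
        except socket.timeout:
            # TCP connect succeeded, but nothing was sent in time.
            return "accepted-no-banner"
        except OSError:
            # For example, a connection reset by the peer.
            return "closed-before-banner"
    if not data:
        # The peer closed the connection without sending anything.
        return "closed-before-banner"
    if data.startswith(b"SSH-"):
        return "banner"
    return "unexpected-data"


if __name__ == "__main__":
    print(probe_ssh_banner("tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs"))
```

A result of "accepted-no-banner" or "closed-before-banner" would be consistent with the ssh client's "ssh_exchange_identification: Connection closed by remote host" message in the description. A host that answers pings but returns one of these results is reachable at the network level while its userland is not responding.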

So I think the host has simply never been alive. I'll try rebooting it and see if that brings it online. If so, I think it still needs to be configured further. @yuvipanda?

I think this host was stillborn. I kept it alive in case @andrewbogott wanted to investigate it, but I think it's OK to just kill it and rebirth it now. The cause was disk issues on the underlying hosts at the time this instance was started.

I'll do this on Monday if nobody gets to it by then!

I've recreated and repooled it.