scfc@tools-bastion-01:~$ ssh tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs
ssh_exchange_identification: Connection closed by remote host
scfc@tools-bastion-01:~$
Description
Event Timeline
Odd. Graphite doesn't seem to have /any/ recent information, and shinken is not complaining about anything. Has this host ever been online to begin with...?
example:
where NULL is plotted as 0. Seems pretty dead to me...
Nothing obvious in the console log either:
Cloud-init v. 0.7.5 finished at Mon, 02 Nov 2015 22:07:58 +0000. Datasource DataSourceOpenStack [net,ver=2]. Up 183.07 seconds
[ 480.332043] INFO: task dpkg:11688 blocked for more than 120 seconds.
[ 480.333322] Not tainted 3.13.0-62-generic #102-Ubuntu
[ 480.333944] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 600.332052] INFO: task kworker/u8:1:67 blocked for more than 120 seconds.
[ 600.333185] Not tainted 3.13.0-62-generic #102-Ubuntu
[ 600.333763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 600.334696] INFO: task jbd2/vda1-8:154 blocked for more than 120 seconds.
[ 600.335443] Not tainted 3.13.0-62-generic #102-Ubuntu
[ 600.336050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 600.336933] INFO: task dpkg:11688 blocked for more than 120 seconds.
[ 600.337587] Not tainted 3.13.0-62-generic #102-Ubuntu
[ 600.338164] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 600.339002] INFO: task sudo:11700 blocked for more than 120 seconds.
[ 600.339679] Not tainted 3.13.0-62-generic #102-Ubuntu
[ 600.340279] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 600.341124] INFO: task ntpq:11701 blocked for more than 120 seconds.
[ 600.341799] Not tainted 3.13.0-62-generic #102-Ubuntu
[ 600.342376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 720.340030] INFO: task kworker/u8:1:67 blocked for more than 120 seconds.
[ 720.341364] Not tainted 3.13.0-62-generic #102-Ubuntu
[ 720.341944] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 720.342863] INFO: task jbd2/vda1-8:154 blocked for more than 120 seconds.
[ 720.343592] Not tainted 3.13.0-62-generic #102-Ubuntu
[ 720.344197] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 720.345064] INFO: task rs:main Q:Reg:2410 blocked for more than 120 seconds.
[ 720.345786] Not tainted 3.13.0-62-generic #102-Ubuntu
[ 720.346353] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 720.347636] INFO: task dpkg:11688 blocked for more than 120 seconds.
[ 720.348314] Not tainted 3.13.0-62-generic #102-Ubuntu
[ 720.348922] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
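For a console log of this shape, a short script can tally which tasks the kernel reported as blocked. The following is only a sketch: the names `HUNG_RE` and `blocked_tasks` are made up for illustration, and the regular expression assumes the exact "INFO: task NAME:PID blocked for more than N seconds." wording seen in the console log above.

```python
import re
from collections import Counter

# Assumed message format, taken from the console log above:
#   [ 480.332043] INFO: task dpkg:11688 blocked for more than 120 seconds.
# Task names may themselves contain colons (e.g. "kworker/u8:1"), so the
# name is matched lazily and the PID is the last ":<digits>" before " blocked".
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def blocked_tasks(log_text):
    """Count hung-task reports per (task name, pid) in a console log."""
    counts = Counter()
    for match in HUNG_RE.finditer(log_text):
        counts[(match.group("name"), int(match.group("pid")))] += 1
    return counts


# A few lines copied from the console log above.
SAMPLE = """\
[ 480.332043] INFO: task dpkg:11688 blocked for more than 120 seconds.
[ 600.332052] INFO: task kworker/u8:1:67 blocked for more than 120 seconds.
[ 600.334696] INFO: task jbd2/vda1-8:154 blocked for more than 120 seconds.
[ 600.336933] INFO: task dpkg:11688 blocked for more than 120 seconds.
"""

if __name__ == "__main__":
    for (name, pid), count in blocked_tasks(SAMPLE).most_common():
        print(f"{name} (pid {pid}): reported {count} time(s)")
```

In the log above, dpkg is reported in every round and jbd2/vda1-8 (the journal thread for vda1) is blocked as well, which would be consistent with stalled disk I/O rather than a single misbehaving process.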
The host is responding to pings.
IRC logs give another clue:
[22:04:35] <YuviPanda> !log tools created tools-webgrid-lighttpd-1412 and 1413
(...)
[22:35:20] <shinken-wm> RECOVERY - Puppet failure on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:37:05] <YuviPanda> Coren: andrewbogott can you attempt to ssh into tools-webgrid-lighttpd-1412.eqiad.wmflabs
[22:37:10] <YuviPanda> I created that instance but seems stuck?
(...)
[22:39:51] <Coren> Ah; userland stuck - sshd accepts() but then stalls.
[22:40:52] <andrewbogott> in (possibly related) news, I’m trying to rsync stuff to labvirt1005 and getting "failed verification -- update discarded (will try again)"
[22:40:56] <andrewbogott> which is… worrying.
[22:41:01] <andrewbogott> So I’m going to depool it and do some more tests.
[22:41:07] <chasemp> kk
[22:41:17] * Coren hmms.
[22:41:24] <andrewbogott> YuviPanda, Coren, possibly related since 1412 is also on 1005
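Coren's observation (the TCP connection is accepted, but the SSH handshake never starts) can be checked without an SSH client by connecting to port 22 and waiting for the server's identification line, which begins with "SSH-". The following is a rough sketch, not a tool used in this task: the function name `probe_ssh_banner`, its return labels, and the 5 second timeout are all assumptions chosen for illustration.

```python
import socket


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Connect to an SSH port and report what happens before the banner.

    Returns a (state, detail) tuple. The state labels are illustrative:
      "connect_failed"       - TCP connection could not be established
      "accepted_but_silent"  - connected, but nothing arrived before the timeout
      "closed_before_banner" - connected, then the peer closed without sending
      "banner"               - peer sent an "SSH-..." identification line
      "unexpected"           - peer sent something that is not an SSH banner
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return ("connect_failed", str(exc))

    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(256)
        except socket.timeout:
            return ("accepted_but_silent", "")
        except OSError as exc:
            return ("connect_failed", str(exc))

    if not data:
        return ("closed_before_banner", "")
    banner = data.decode("ascii", "replace").strip()
    if banner.startswith("SSH-"):
        return ("banner", banner)
    return ("unexpected", banner)


if __name__ == "__main__":
    # Hostname taken from the session at the top of this task.
    print(probe_ssh_banner("tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs"))
```

A host that answers pings but returns "accepted_but_silent" or "closed_before_banner" here would match the "userland stuck" diagnosis: the network stack is up, but sshd never gets far enough to send its banner.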
So I think the host has simply never been alive. I'll try rebooting it and see if that brings it back online. If so, I think it still needs further configuration. @yuvipanda?
I think this host was stillborn, and I kept it around in case @andrewbogott wanted to investigate it, but I think it's ok to just kill it and recreate it now. The cause was disk issues on the underlying hosts at the time the instance was created.
I'll do this on Monday if nobody gets to it by then!