tools-webgrid-lighttpd-1412 is not accessible by ssh
Closed, Resolved (Public)

Description

ssh connections to tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs hang and then fail:

[tim@passepartout ~]$ ssh tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs
ssh_exchange_identification: Connection closed by remote host
[tim@passepartout ~]$
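The `ssh_exchange_identification: Connection closed by remote host` error means the TCP connection was accepted but was closed before the client received an SSH version banner. A minimal sketch of a probe that tells that case apart from a timeout or a refused connection, assuming the default port 22 (the function name `probe_ssh_banner` and its return strings are made up for illustration):

```python
import socket


def probe_ssh_banner(host: str, port: int = 22, timeout: float = 10.0) -> str:
    """Connect to host:port and report what happens before any SSH handshake.

    Returns one of:
      "banner: <text>"  the server sent an identification string
      "closed"          the connection was accepted, then closed without data
      "timeout"         nothing happened within `timeout` seconds
      "error: <reason>" the connection failed (refused, unreachable, reset, ...)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            data = conn.recv(256)  # the SSH banner is a single short line
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
    if not data:
        return "closed"
    return "banner: " + data.decode("ascii", errors="replace").strip()


if __name__ == "__main__":
    print(probe_ssh_banner("tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs"))
```

A result of `closed` or `timeout` would be consistent with the hang-then-fail behaviour described above, but it does not by itself say why the host stopped answering.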

The console log at https://wikitech.wikimedia.org/wiki/Special:NovaInstance shows:

[…]
Cloud-init v. 0.7.5 finished at Tue, 19 Jan 2016 14:32:10 +0000. Datasource DataSourceOpenStack [net,ver=2].  Up 17.00 seconds

Ubuntu 14.04.3 LTS tools-webgrid-lighttpd-1412 ttyS0

tools-webgrid-lighttpd-1412 login: [120960.360155] INFO: task jbd2/vda1-8:174 blocked for more than 120 seconds.
[120960.369005]       Not tainted 3.13.0-62-generic #102-Ubuntu
[120960.370015] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[121080.368243] INFO: task jbd2/vda1-8:174 blocked for more than 120 seconds.
[121080.373626]       Not tainted 3.13.0-62-generic #102-Ubuntu
[121080.374888] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[121200.376091] INFO: task jbd2/vda1-8:174 blocked for more than 120 seconds.
[121200.380392]       Not tainted 3.13.0-62-generic #102-Ubuntu
[121200.381395] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[121200.383347] INFO: task jbd2/dm-1-8:336 blocked for more than 120 seconds.
[121200.384239]       Not tainted 3.13.0-62-generic #102-Ubuntu
[121200.384947] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[121200.386097] INFO: task cron:7934 blocked for more than 120 seconds.
[121200.386944]       Not tainted 3.13.0-62-generic #102-Ubuntu
[121200.387911] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[121200.389199] INFO: task cron:7935 blocked for more than 120 seconds.
[121200.389997]       Not tainted 3.13.0-62-generic #102-Ubuntu
[121200.390949] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[121320.392238] INFO: task jbd2/vda1-8:174 blocked for more than 120 seconds.
[121320.395941]       Not tainted 3.13.0-62-generic #102-Ubuntu
[121320.396562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[121320.397566] INFO: task jbd2/dm-1-8:336 blocked for more than 120 seconds.
[121320.398329]       Not tainted 3.13.0-62-generic #102-Ubuntu
[121320.399177] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[121320.400201] INFO: task kworker/u8:0:29826 blocked for more than 120 seconds.
[121320.400952]       Not tainted 3.13.0-62-generic #102-Ubuntu
[121320.401538] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[121320.402498] INFO: task cron:7934 blocked for more than 120 seconds.
[121320.403244]       Not tainted 3.13.0-62-generic #102-Ubuntu
[121320.403833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
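To see at a glance which tasks are stuck and how often the kernel reported them, the hung-task lines can be tallied. This is a rough sketch that assumes the console log has been saved as text; the regular expression only covers the message format shown above, and `summarize_hung_tasks` is a name invented for this example.

```python
import re
from collections import Counter

# Matches lines such as:
# [121320.400201] INFO: task kworker/u8:0:29826 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\] INFO: task (?P<task>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def summarize_hung_tasks(console_log: str) -> Counter:
    """Count hung-task reports per (task name, pid) pair."""
    counts = Counter()
    for match in HUNG_TASK.finditer(console_log):
        counts[(match["task"], int(match["pid"]))] += 1
    return counts


# A few lines copied from the console log above.
SAMPLE = """\
tools-webgrid-lighttpd-1412 login: [120960.360155] INFO: task jbd2/vda1-8:174 blocked for more than 120 seconds.
[121080.368243] INFO: task jbd2/vda1-8:174 blocked for more than 120 seconds.
[121320.400201] INFO: task kworker/u8:0:29826 blocked for more than 120 seconds.
[121320.402498] INFO: task cron:7934 blocked for more than 120 seconds.
"""

if __name__ == "__main__":
    for (task, pid), count in summarize_hung_tasks(SAMPLE).most_common():
        print(f"{task} (pid {pid}): {count} report(s)")
```

In the full excerpt the most frequently reported task is the ext4 journal thread `jbd2/vda1-8`, followed by `jbd2/dm-1-8` and `cron`, which suggests the processes are waiting on disk I/O.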

Event Timeline

scfc created this task. Jan 21 2016, 4:42 PM
scfc raised the priority of this task to High.
scfc updated the task description.
scfc added subscribers: chasemp, MZMcBride, scfc and 3 others.

blargh, the host is missing from salt, and without prior console setup I'm stuck.

I did grab the running jobs on this host:

root@tools-grid-master:~# qhost -j -xml | grep -A 100 tools-webgrid-lighttpd-1412
<host name='tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs'>
  <hostvalue name='arch_string'>lx26-amd64</hostvalue>
  <hostvalue name='num_proc'>4</hostvalue>
  <hostvalue name='load_avg'>-</hostvalue>
  <hostvalue name='mem_total'>7.8G</hostvalue>
  <hostvalue name='mem_used'>-</hostvalue>
  <hostvalue name='swap_total'>23.9G</hostvalue>
  <hostvalue name='swap_used'>-</hostvalue>
  <job name='2480731'>
    <jobvalue jobid='2480731' name='priority'>'0.304877'</jobvalue>
    <jobvalue jobid='2480731' name='qinstance_name'>webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs</jobvalue>
    <jobvalue jobid='2480731' name='job_name'>lighttpd-file-siblings</jobvalue>
    <jobvalue jobid='2480731' name='job_owner'>tools.file-siblings</jobvalue>
    <jobvalue jobid='2480731' name='job_state'>r</jobvalue>
    <jobvalue jobid='2480731' name='start_time'>1453213997</jobvalue>
    <jobvalue jobid='2480731' name='queue_name'>webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs</jobvalue>
    <jobvalue jobid='2480731' name='pe_master'>MASTER</jobvalue>
  </job>
  <job name='2480732'>
    <jobvalue jobid='2480732' name='priority'>'0.304877'</jobvalue>
    <jobvalue jobid='2480732' name='qinstance_name'>webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs</jobvalue>
    <jobvalue jobid='2480732' name='job_name'>lighttpd-autolist</jobvalue>
    <jobvalue jobid='2480732' name='job_owner'>tools.autolist</jobvalue>
    <jobvalue jobid='2480732' name='job_state'>r</jobvalue>
    <jobvalue jobid='2480732' name='start_time'>1453213999</jobvalue>
    <jobvalue jobid='2480732' name='queue_name'>webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs</jobvalue>
    <jobvalue jobid='2480732' name='pe_master'>MASTER</jobvalue>
  </job>
  <job name='2488323'>
    <jobvalue jobid='2488323' name='priority'>'0.304351'</jobvalue>
    <jobvalue jobid='2488323' name='qinstance_name'>webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs</jobvalue>
    <jobvalue jobid='2488323' name='job_name'>lighttpd-limesmap</jobvalue>
    <jobvalue jobid='2488323' name='job_owner'>tools.limesmap</jobvalue>
    <jobvalue jobid='2488323' name='job_state'>r</jobvalue>
    <jobvalue jobid='2488323' name='start_time'>1453233617</jobvalue>
    <jobvalue jobid='2488323' name='queue_name'>webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs</jobvalue>
    <jobvalue jobid='2488323' name='pe_master'>MASTER</jobvalue>
  </job>
  <job name='2492682'>
    <jobvalue jobid='2492682' name='priority'>'0.304058'</jobvalue>
    <jobvalue jobid='2492682' name='qinstance_name'>webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs</jobvalue>
    <jobvalue jobid='2492682' name='job_name'>lighttpd-magog</jobvalue>
    <jobvalue jobid='2492682' name='job_owner'>tools.magog</jobvalue>
    <jobvalue jobid='2492682' name='job_state'>dr</jobvalue>
    <jobvalue jobid='2492682' name='start_time'>1453244556</jobvalue>
    <jobvalue jobid='2492682' name='queue_name'>webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs</jobvalue>
    <jobvalue jobid='2492682' name='pe_master'>MASTER</jobvalue>
  </job>
  <job name='2521687'>
    <jobvalue jobid='2521687' name='priority'>'0.301964'</jobvalue>
    <jobvalue jobid='2521687' name='qinstance_name'>webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs</jobvalue>
    <jobvalue jobid='2521687' name='job_name'>lighttpd-kmlexport</jobvalue>
    <jobvalue jobid='2521687' name='job_owner'>tools.kmlexport</jobvalue>
    <jobvalue jobid='2521687' name='job_state'>r</jobvalue>
    <jobvalue jobid='2521687' name='start_time'>1453322719</jobvalue>
    <jobvalue jobid='2521687' name='queue_name'>webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs</jobvalue>
    <jobvalue jobid='2521687' name='pe_master'>MASTER</jobvalue>
  </job>
  <job name='2525812'>
    <jobvalue jobid='2525812' name='priority'>'0.301661'</jobvalue>
    <jobvalue jobid='2525812' name='qinstance_name'>webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs</jobvalue>
    <jobvalue jobid='2525812' name='job_name'>lighttpd-whois</jobvalue>
    <jobvalue jobid='2525812' name='job_owner'>tools.whois</jobvalue>
    <jobvalue jobid='2525812' name='job_state'>r</jobvalue>
    <jobvalue jobid='2525812' name='start_time'>1453334046</jobvalue>
    <jobvalue jobid='2525812' name='queue_name'>webgrid-lighttpd@tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs</jobvalue>
    <jobvalue jobid='2525812' name='pe_master'>MASTER</jobvalue>
  </job>
</host>
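For comparing this host with other freeze candidates, the `<host>` element can be boiled down to one line per job. The sketch below assumes a single well-formed `<host>...</host>` fragment like the one pasted above has been saved as a string; `jobs_on_host` is a hypothetical helper, and only the element and attribute names visible in the paste are relied on.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def jobs_on_host(host_xml: str) -> list:
    """Extract id, name, owner, state and UTC start time for each job."""
    host = ET.fromstring(host_xml)
    jobs = []
    for job in host.iter("job"):
        # Each <jobvalue name='...'> child carries one field of the job.
        values = {jv.get("name"): (jv.text or "") for jv in job.iter("jobvalue")}
        jobs.append({
            "id": job.get("name"),
            "name": values.get("job_name"),
            "owner": values.get("job_owner"),
            "state": values.get("job_state"),
            # start_time is seconds since the Unix epoch.
            "started": datetime.fromtimestamp(int(values["start_time"]), tz=timezone.utc),
        })
    return jobs


# One job copied from the qhost output above, trimmed to the fields used here.
SAMPLE = """\
<host name='tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs'>
  <job name='2480731'>
    <jobvalue jobid='2480731' name='job_name'>lighttpd-file-siblings</jobvalue>
    <jobvalue jobid='2480731' name='job_owner'>tools.file-siblings</jobvalue>
    <jobvalue jobid='2480731' name='job_state'>r</jobvalue>
    <jobvalue jobid='2480731' name='start_time'>1453213997</jobvalue>
  </job>
</host>
"""

if __name__ == "__main__":
    for job in jobs_on_host(SAMPLE):
        print(job["id"], job["name"], job["state"], job["started"].isoformat())
```

For job 2480731 the start time decodes to 2016-01-19 14:33:17 UTC, about a minute after the Cloud-init "finished" line in the console log.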

The job list seems pretty small? Maybe I'm looking at the wrong thing, but I am curious to compare it to other freeze candidates.

The host is back after a reboot for now.

fwiw it seems like it should be a valid salt client:

root@labcontrol1001:~# salt-key -L | grep 1412
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs

but it wasn't responsive this time around

chasemp closed this task as Resolved. Jan 21 2016, 5:19 PM
chasemp claimed this task.