
tools-exec-1216 down
Closed, Resolved, Public

Description

SGE job 7513178 needs to be killed.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.
valhallasw renamed this task from "SGE zombie needs killed" to "tools-exec-1216 down". Jun 27 2016, 9:02 PM
[2249041.585992] INFO: task nscd:19454 blocked for more than 120 seconds.
[2249041.586769] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2249041.587741] INFO: task nscd:19456 blocked for more than 120 seconds.
[2249041.589474] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

It would be neat if we could get alerts set up for these kernel errors.
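
As a starting point for such an alert, here is a minimal sketch that picks the hung-task warnings out of kernel log lines like the ones above. The function name `find_hung_tasks` and the returned dict layout are made up for illustration, and how the result gets turned into an actual alert (Icinga, mail, IRC) is left open.

```python
import re

# Matches the kernel's hung-task warning, for example:
# "[2249041.585992] INFO: task nscd:19454 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return one dict per hung-task warning found in the given log lines.

    Hypothetical helper: the name and the dict layout are illustrative only.
    """
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append({
                "uptime": float(match.group("uptime")),
                "task": match.group("name"),
                "pid": int(match.group("pid")),
                "seconds": int(match.group("seconds")),
            })
    return hits
```

Feeding it the four lines pasted above should report two hits, both for nscd; the "echo 0 > ..." hint lines are ignored because they do not match the pattern.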

HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
tools-exec-1216.eqiad.wmflabs lx26-amd64      4     -    7.8G       -   23.9G       -
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
   7480069 0.32404 all-cat    tools.zkbot  dr    06/15/2016 02:56:20 task@tools MASTER
   7483283 0.32389 rmiw.commo tools.yifeib dr    06/15/2016 04:54:10 task@tools MASTER
   7512885 0.32247 cron-tools tools.himo   r     06/15/2016 23:00:11 task@tools MASTER
   7512903 0.32247 ASB        tools.james  r     06/15/2016 23:00:11 task@tools MASTER
   7513114 0.32246 add_candid tools.wikida r     06/15/2016 23:07:02 task@tools MASTER
   7513147 0.32245 cleanup    tools.catsca r     06/15/2016 23:09:02 task@tools MASTER
   7513178 0.32245 csd_report tools.betaco dt    06/15/2016 23:10:05 task@tools MASTER
   7513197 0.32245 fix        tools.mjbmr- t     06/15/2016 23:15:05 task@tools MASTER
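
To spot the stuck jobs in a listing like this without reading it by eye, something along these lines might help. It is only a sketch: `parse_job_lines` and `pending_deletion` are hypothetical helpers, the column positions are taken from the listing above, and it assumes that a `d` in the state column means the job has been marked for deletion.

```python
def parse_job_lines(text):
    """Parse the job rows of the listing into dicts.

    Header rows, separator rows and host rows are skipped. Column positions
    are assumed from the listing above: job-ID, prior, name, user, state, ...
    """
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        # A job row starts with a numeric job ID and has at least 9 columns.
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        jobs.append({
            "job_id": int(fields[0]),
            "name": fields[2],
            "user": fields[3],
            "state": fields[4],
        })
    return jobs


def pending_deletion(jobs):
    """Return jobs whose state contains 'd' (assumed: marked for deletion)."""
    return [job for job in jobs if "d" in job["state"]]
```

On the listing above this would single out the jobs in state `dr` and `dt`, which includes 7513178 from the task description.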

The corresponding jobs are:

exec_file:                  job_scripts/7480069
script_file:                /data/project/zkbot/all-cat.sh
exec_file:                  job_scripts/7483283
job_args:                   /shared/pywikipedia/core/pwb.py,/data/project/yifeibot/pywikibot-shared/addbot.py,-family:commons,-lang:commons
script_file:                /usr/bin/python2.7
exec_file:                  job_scripts/7512885
job_args:                   /data/project/himo/catday.sh
script_file:                /bin/dash
exec_file:                  job_scripts/7512903
job_args:                   /data/project/james/adminstatsbot/adminstatsbot.py
script_file:                /usr/bin/python2.7
exec_file:                  job_scripts/7513114
script_file:                /data/project/wikidata-todo/scripts/duplicity_bot/add_candidates.php
exec_file:                  job_scripts/7513147
script_file:                /data/project/catscan2/cleanup.sh
exec_file:                  job_scripts/7513178
job_args:                   /data/project/betacommand-dev/svn_copy/sql_csd.py
script_file:                /usr/bin/python2.7
exec_file:                  job_scripts/7513197
script_file:                /data/project/mjbmr-tools/.sys/scripts/fix
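
The flat dump above is easier to read once the fields are grouped per job. A rough sketch, assuming each job's block starts with its `exec_file` line and that `job_args` is a comma-separated list (as it appears above); `group_job_details` and `command_line` are illustrative names, not existing tools.

```python
def group_job_details(text):
    """Group 'key: value' lines into one dict per job.

    Assumption: every job's block begins with its exec_file line.
    """
    jobs = []
    for line in text.splitlines():
        # Split on the first colon only, so values such as
        # "-family:commons" stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "exec_file":
            jobs.append({"exec_file": value})
        elif jobs:
            jobs[-1][key] = value
    return jobs


def command_line(job):
    """Rebuild the submitted command: script_file followed by job_args."""
    args = job.get("job_args", "")
    return [job["script_file"]] + (args.split(",") if args else [])
```

For job 7513178 this would give `/usr/bin/python2.7` followed by the `sql_csd.py` path, which makes it clearer that `script_file` is just the interpreter when `job_args` is present.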
valhallasw claimed this task.

I have rebooted the host and it should be back up shortly.

tom29739 reopened this task as Open. Jun 27 2016, 9:30 PM
tom29739 subscribed.

@valhallasw looks like it's down again.

tom29739@tools-bastion-03:~$ ssh tools-exec-1216
ssh: connect to host tools-exec-1216 port 22: No route to host

@tom29739: Thanks for reopening the bug!

The host is now in ERROR state. @Andrew, is that related to the shortage of resources on labs?

This also happened to exec-1212. I migrated it and it seems to be back up, though I'm not sure whether it is pooled again. I'll try migrating tools-exec-1216 and see what happens.

More hosts that are dead (unreachable via root ssh):

tools-exec-1219.eqiad.wmflabs
tools-exec-1402.eqiad.wmflabs
tools-webgrid-lighttpd-1407.eqiad.wmflabs
tools-webgrid-lighttpd-1402.eqiad.wmflabs
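
A list like this could be produced with a quick sweep over the hosts. The sketch below only checks whether TCP port 22 accepts a connection, which is weaker than an actual root ssh login; `ssh_port_open` and `unreachable` are hypothetical helper names, and the 5-second timeout is an arbitrary choice.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only proves the port is open, not that sshd lets anyone log in.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, "No route to host", timeouts and
        # name-resolution failures.
        return False


def unreachable(hosts, port=22, timeout=5.0):
    """Return the subset of hosts whose ssh port could not be reached."""
    return [host for host in hosts if not ssh_port_open(host, port, timeout)]
```

A single failed probe is not conclusive, so retrying before declaring a host dead is probably worthwhile.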

I've migrated 1216 to 1011 and it's back up. 1219 survived the restart, and so did webgrid-lighttpd-1402. tools-webgrid-lighttpd-1407 seems to be fine without a restart; I'm not sure why my earlier ssh to it failed.

All are back up now.