SGE job 7513178 needs killed
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Andrew | T137857 Labs instances failing with "internal error: No PCI buses available"
Resolved | | valhallasw | T138787 tools-exec-1216 down
Event Timeline
[2249041.585992] INFO: task nscd:19454 blocked for more than 120 seconds.
[2249041.586769] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2249041.587741] INFO: task nscd:19456 blocked for more than 120 seconds.
[2249041.589474] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
It would be neat if we could get alerts set up for these kernel errors.
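As a starting point for such an alert, here is a minimal sketch that scans kernel log text for hung-task warnings like the ones above. The function name `find_hung_tasks` and the returned dict keys are made up for illustration. How the log text is collected (for example from `dmesg` output) and how the result is fed into monitoring are left open.

```python
import re

# Matches kernel hung-task warnings such as:
#   [2249041.585992] INFO: task nscd:19454 blocked for more than 120 seconds.
HUNG_TASK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return one dict per hung-task warning found in log_text.

    Hypothetical helper: the name and the dict layout are assumptions,
    not part of any existing monitoring setup.
    """
    return [
        {
            "timestamp": float(m.group("ts")),
            "task": m.group("comm"),
            "pid": int(m.group("pid")),
            "seconds": int(m.group("secs")),
        }
        for m in HUNG_TASK_RE.finditer(log_text)
    ]
```

A check script could call `find_hung_tasks()` on the current kernel log and alert when the returned list is non-empty.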
HOSTNAME                        ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                          -               -     -       -       -       -       -
tools-exec-1216.eqiad.wmflabs   lx26-amd64      4     -    7.8G       -   23.9G       -
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
   7480069 0.32404 all-cat    tools.zkbot  dr    06/15/2016 02:56:20 task@tools MASTER
   7483283 0.32389 rmiw.commo tools.yifeib dr    06/15/2016 04:54:10 task@tools MASTER
   7512885 0.32247 cron-tools tools.himo   r     06/15/2016 23:00:11 task@tools MASTER
   7512903 0.32247 ASB        tools.james  r     06/15/2016 23:00:11 task@tools MASTER
   7513114 0.32246 add_candid tools.wikida r     06/15/2016 23:07:02 task@tools MASTER
   7513147 0.32245 cleanup    tools.catsca r     06/15/2016 23:09:02 task@tools MASTER
   7513178 0.32245 csd_report tools.betaco dt    06/15/2016 23:10:05 task@tools MASTER
   7513197 0.32245 fix        tools.mjbmr- t     06/15/2016 23:15:05 task@tools MASTER
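In Grid Engine state codes, `d` marks a job flagged for deletion, so the `dr` and `dt` rows in this listing are the jobs that were deleted but never went away. A rough sketch for pulling those rows out of such a listing follows. The function name `jobs_pending_deletion` is hypothetical, and the parsing assumes whitespace-separated columns in the order shown here, with no spaces inside job names.

```python
def jobs_pending_deletion(listing):
    """Return (job_id, name, owner, state) for jobs whose state contains 'd'.

    Hypothetical helper. Assumes columns are whitespace-separated in the
    order job-ID, prior, name, user, state, as in the listing above.
    """
    stuck = []
    for line in listing.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; this skips the header,
        # the dashed rules, and the host rows.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        job_id, _priority, name, owner, state = fields[:5]
        if "d" in state:
            stuck.append((int(job_id), name, owner, state))
    return stuck


# Three rows copied from the listing in this task.
SAMPLE = """\
7480069 0.32404 all-cat tools.zkbot dr 06/15/2016 02:56:20 task@tools MASTER
7512885 0.32247 cron-tools tools.himo r 06/15/2016 23:00:11 task@tools MASTER
7513178 0.32245 csd_report tools.betaco dt 06/15/2016 23:10:05 task@tools MASTER
"""

for job in jobs_pending_deletion(SAMPLE):
    print(job)
```

The user column is truncated in this listing, so the owner field returned here may be a shortened tool name.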
The corresponding jobs are:
exec_file: job_scripts/7480069
script_file: /data/project/zkbot/all-cat.sh

exec_file: job_scripts/7483283
job_args: /shared/pywikipedia/core/pwb.py,/data/project/yifeibot/pywikibot-shared/addbot.py,-family:commons,-lang:commons
script_file: /usr/bin/python2.7

exec_file: job_scripts/7512885
job_args: /data/project/himo/catday.sh
script_file: /bin/dash

exec_file: job_scripts/7512903
job_args: /data/project/james/adminstatsbot/adminstatsbot.py
script_file: /usr/bin/python2.7

exec_file: job_scripts/7513114
script_file: /data/project/wikidata-todo/scripts/duplicity_bot/add_candidates.php

exec_file: job_scripts/7513147
script_file: /data/project/catscan2/cleanup.sh

exec_file: job_scripts/7513178
job_args: /data/project/betacommand-dev/svn_copy/sql_csd.py
script_file: /usr/bin/python2.7

exec_file: job_scripts/7513197
script_file: /data/project/mjbmr-tools/.sys/scripts/fix
@valhallasw looks like it's down again.
tom29739@tools-bastion-03:~$ ssh tools-exec-1216
ssh: connect to host tools-exec-1216 port 22: No route to host
This also happened to exec-1212. I migrated it and it seems to be back up, though I'm not sure whether it is pooled again. I'll try migrating tools-exec-1216 and see what happens.
More hosts that are dead (unreachable via root ssh):
tools-exec-1219.eqiad.wmflabs
tools-exec-1402.eqiad.wmflabs
tools-webgrid-lighttpd-1407.eqiad.wmflabs
tools-webgrid-lighttpd-1402.eqiad.wmflabs
I've migrated 1216 to 1011 and it's back up. 1219 just survived the restart, so did webgrid-lighttpd-1402. tools-webgrid-lighttpd-1407 seems to be fine without a restart - not sure why my earlier ssh to it failed.