
tools-exec-1207 hanging
Closed, Resolved · Public

Description

Ubuntu 12.04.5 LTS tools-exec-1207 ttyS0

tools-exec-1207 login: [769561.116084] INFO: task mono-sgen:23722 blocked for more than 120 seconds.
[769561.123011] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[769561.125403] INFO: task mono-sgen:23724 blocked for more than 120 seconds.
[769561.126359] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[769561.127463] INFO: task mono-sgen:23725 blocked for more than 120 seconds.
[769561.128364] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[769561.129375] INFO: task mono-sgen:23726 blocked for more than 120 seconds.
[769561.130109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[769561.130996] INFO: task mono-sgen:23727 blocked for more than 120 seconds.
[769561.131919] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[769561.132984] INFO: task mono-sgen:23729 blocked for more than 120 seconds.
[769561.134287] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[769561.135245] INFO: task mono-sgen:23736 blocked for more than 120 seconds.
[769561.136112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[769561.137186] INFO: task mono-sgen:23794 blocked for more than 120 seconds.
[769561.138077] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[769681.137028] INFO: task mono-sgen:23722 blocked for more than 120 seconds.
[769681.145458] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[769681.146831] INFO: task mono-sgen:23724 blocked for more than 120 seconds.
[769681.147731] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
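
The console output repeats the same warning for several threads. A minimal sketch for condensing such a log into one line per blocked task, assuming the message format shown above; `summarize_hung_tasks` is a hypothetical helper name, not an existing tool:

```shell
# Summarise kernel hung-task warnings read from stdin.
# Prints "<count> <name>:<pid>" for each distinct blocked task.
# Assumes the "INFO: task NAME:PID blocked for more than N seconds." format.
summarize_hung_tasks() {
    sed -n 's/.*INFO: task \([^ ]*\) blocked for more than.*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $1, $2 }'
}

# Possible usage on the affected host (not run here):
#   dmesg | summarize_hung_tasks
```

Lines that do not match the warning format, such as the `hung_task_timeout_secs` hint, are ignored.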

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper.
16:45 <Kelson> valhallasw`cloud: I guess the node went out of memory and is probably frozen
16:45 <valhallasw`cloud> why do you think so?
16:46 <Kelson> valhallasw`cloud: because I get an error in the job log about "out of memory"

For now, I'm killing/rescheduling all jobs on that host. @chasemp, do you want to investigate the deeper cause or shall we just reboot the host?

valhallasw@tools-bastion-02:/data/project/enwp10$ qhost -j -h tools-exec-1207.eqiad.wmflabs
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
tools-exec-1207.eqiad.wmflabs lx26-amd64      4     -    7.8G       -   23.9G       -
   job-ID  prior   name       user         state submit/start at     queue      master ja-task-ID
   ----------------------------------------------------------------------------------------------
   2562456 0.53512 enwiki_upd tools.xtools Rr    05/25/2016 17:35:32 continuous MASTER
   5998409 0.34276 rmiw.w3    tools.yifeib r     05/05/2016 09:51:06 continuous MASTER
   6454040 0.31875 comsign    tools.yifeib r     05/18/2016 10:35:17 continuous MASTER
   6822525 0.30121 welcome    tools.dimast r     05/27/2016 23:00:18 continuous MASTER
   6668316 0.30886 wp10-selec tools.enwp10 dr    05/23/2016 19:26:31 task@tools MASTER
   6839085 0.30037 rdallvoy   tools.avicbo r     05/28/2016 10:01:13 task@tools MASTER
valhallasw@tools-bastion-02:/data/project/enwp10$ qdel -f 6668316 6839085
warning: valhallasw forced the deletion of job 6668316
warning: valhallasw forced the deletion of job 6839085
valhallasw@tools-bastion-02:/data/project/enwp10$ qmod -rj 2562456 5998409 6454040 6822525
Pushed rescheduling of job 2562456 on host tools-exec-1207.eqiad.wmflabs
Pushed rescheduling of job 5998409 on host tools-exec-1207.eqiad.wmflabs
Pushed rescheduling of job 6454040 on host tools-exec-1207.eqiad.wmflabs
Pushed rescheduling of job 6822525 on host tools-exec-1207.eqiad.wmflabs

The host is now empty.
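
The manual steps above (reschedule the continuous jobs, force-delete the task jobs) could be scripted roughly as follows. This is only a sketch: `jobs_in_queue` is a hypothetical helper, and it assumes the `qhost -j` column layout shown above, with the job ID in field 1 and the queue name in field 8. Job names containing spaces would break this parsing.

```shell
# Print the IDs of jobs listed in `qhost -j` output (read from stdin)
# whose queue column starts with the given prefix.
# Assumed layout: job-ID prior name user state date time queue master
jobs_in_queue() {
    awk -v q="$1" '$1 ~ /^[0-9]+$/ && index($8, q) == 1 { print $1 }'
}

# Possible usage (not run here; `xargs -r` is a GNU extension):
#   host=tools-exec-1207.eqiad.wmflabs
#   qhost -j -h "$host" | jobs_in_queue continuous | xargs -r qmod -rj
#   qhost -j -h "$host" | jobs_in_queue task       | xargs -r qdel -f
```

Header, separator, and host lines are skipped because their first field is not a numeric job ID.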

@Kelson, the wp10-select task was force-deleted;
@Avicennasis, the rdallvoy task was also force-deleted. Please resubmit the task if it should be run again.

I would reboot this for now; I'm fairly comfortable saying this is likely fallout from the NFS maintenance. Thanks

Rebooted via wikitech; should be online again in a short while.

valhallasw claimed this task.

Jobs are being scheduled again.