Grid job stuck at 't' state
Closed, Resolved · Public

Description

Yesterday:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
6269380 0.32841 php_transl tools.liange t     05/28/2016 10:04:01 task@tools-exec-1408.eqiad.wmf     1

Today:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
6269380 0.32969 php_transl tools.liange t     05/29/2016 03:54:01 task@tools-exec-1408.eqiad.wmf     1

qstat -j 6269380:

==============================================================
job_number:                 6269380
exec_file:                  job_scripts/6269380
submission_time:            Fri May 13 00:21:01 2016
owner:                      tools.liangent-php
uid:                        51117
group:                      tools.liangent-php
gid:                        51117
sge_o_home:                 /data/project/liangent-php
sge_o_log_name:             tools.liangent-php
sge_o_path:                 /data/project/liangent-shared/bin/crontab:/usr/local/bin:/usr/bin:/bin
sge_o_shell:                /bin/sh
sge_o_workdir:              /data/project/liangent-php
sge_o_host:                 tools-cron-01
account:                    sge
stderr_path_list:           NONE:NONE:/data/project/liangent-php/php_translateVariants_zhwiki.err
hard resource_list:         h_vmem=2097152k,release=trusty
mail_list:                  tools.liangent-php@tools.wmflabs.org
notify:                     FALSE
job_name:                   php_translateVariants_zhwiki
stdout_path_list:           NONE:NONE:/data/project/liangent-php/php_translateVariants_zhwiki.out
jobshare:                   0
hard_queue_list:            task
env_list:                     
job_args:                   /data/project/liangent-php/mw/maintenance/translateVariants.php,--wiki=zhwiki~sysop~wgDisabledVariants=,--lang=zh,--ns=8,--delete,--table=ns8,--init=User:Liangent-bot/message/ns8-noteta-it
script_file:                /usr/bin/php5
usage    1:                 cpu=00:00:00, mem=0.00000 GBs, io=0.00000, vmem=N/A, maxvmem=N/A  
scheduling info:            queue instance "continuous@tools-exec-1401.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=31.460000 (= 31.460000 + 0.50 * 0.000000 with nproc=4) >= 1.75  
                            queue instance "continuous@tools-exec-1406.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.395000 (= 2.395000 + 0.50 * 0.000000 with nproc=4) >= 1.75  
                            queue instance "continuous@tools-exec-1213.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.827500 (= 2.827500 + 0.50 * 0.000000 with nproc=4) >= 1.75  
                            queue instance "mailq@tools-exec-1401.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=31.460000 (= 31.460000 + 0.50 * 0.000000 with nproc=4) >= 2.25
                            queue instance "mailq@tools-exec-1406.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.395000 (= 2.395000 + 0.50 * 0.000000 with nproc=4) >= 2.25
                            queue instance "mailq@tools-exec-1213.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.827500 (= 2.827500 + 0.50 * 0.000000 with nproc=4) >= 2.25
                            queue instance "task@tools-exec-1213.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.827500 (= 2.827500 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "task@tools-exec-1406.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=2.395000 (= 2.395000 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "task@tools-exec-1401.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=31.460000 (= 31.460000 + 0.50 * 0.000000 with nproc=4) >= 1.75
                            queue instance "continuous@tools-exec-1407.eqiad.wmflabs" dropped because it is disabled
                            queue instance "continuous@tools-exec-1219.eqiad.wmflabs" dropped because it is disabled
                            queue instance "continuous@tools-exec-1216.eqiad.wmflabs" dropped because it is disabled
                            queue instance "continuous@tools-exec-1218.eqiad.wmflabs" dropped because it is disabled
                            queue instance "mailq@tools-exec-1407.eqiad.wmflabs" dropped because it is disabled
                            queue instance "mailq@tools-exec-1219.eqiad.wmflabs" dropped because it is disabled
                            queue instance "mailq@tools-exec-1216.eqiad.wmflabs" dropped because it is disabled
                            queue instance "mailq@tools-exec-1218.eqiad.wmflabs" dropped because it is disabled
                            queue instance "task@tools-exec-1407.eqiad.wmflabs" dropped because it is disabled
                            queue instance "task@tools-exec-1219.eqiad.wmflabs" dropped because it is disabled
                            queue instance "task@tools-exec-1216.eqiad.wmflabs" dropped because it is disabled
                            queue instance "task@tools-exec-1218.eqiad.wmflabs" dropped because it is disabled

Event Timeline

liangent created this task. May 29 2016, 3:57 AM
Restricted Application added a project: Cloud-Services. May 29 2016, 3:57 AM
Restricted Application added subscribers: Zppix, Aklapper.
liangent updated the task description. May 29 2016, 3:58 AM
valhallasw added a subscriber: valhallasw. May 29 2016, 8:15 AM (edited)

State 't' means 'in process of being transferred to the exec host'. The thing that surprises me is that the start time changes:

05/28/2016 10:04:01
05/29/2016 03:54:01

and now

05/29/2016 07:59:01

which suggests that the job is being rescheduled over and over, with each attempt failing for some reason. It's currently in state dt, which is also weird: as long as the exec host daemon responds, there is no reason for a job not to be deleted.

The grid engine master log shows the following:

05/13/2016 00:21:02|worker|tools-grid-master|W|job 6269380.1 failed on host tools-exec-1408.eqiad.wmflabs general assumedly before job because: can't get password entry for user "tools.liangent-php". Either the user does not exist or NIS error!
05/13/2016 00:21:02|worker|tools-grid-master|W|rescheduling job 6269380.1
05/13/2016 13:42:22| timer|tools-grid-master|W|failed to deliver job 6269380.1 to queue "task@tools-exec-1408.eqiad.wmflabs"
[this last message repeats every ten minutes]

The exec host shows nothing apart from

05/13/2016 00:21:02|  main|tools-exec-1408|E|can't start job "6269380": can't get password entry for user "tools.liangent-php". Either the user does not exist or NIS error!

There are a total of four jobs in t state:

valhallasw@tools-bastion-03:/data/project/.system/gridengine/spool/qmaster$ qstat -u "*" | grep " d\?t "
6269306 0.33000 clean      tools.avicbo t     05/29/2016 08:04:04 task@tools-exec-1214.eqiad.wmf     1
6269323 0.33000 inv-wes-21 tools.invadi t     05/29/2016 08:04:11 task@tools-exec-1201.tools.eqi     1
6269346 0.33000 wmcounter  tools.wmcoun t     05/29/2016 08:04:12 task@tools-exec-1210.eqiad.wmf     1
6269380 0.33000 php_transl tools.liange dt    05/29/2016 08:04:01 task@tools-exec-1408.eqiad.wmf     1
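
The same filter can be sketched in Python for anyone who wants the job ID, state, and queue as structured data rather than grep output. This is only a rough sketch: it assumes the whitespace-separated column layout shown above (state in the fifth column, queue in the eighth), and the function name `jobs_in_transfer` and the third sample row (a running job) are made up for illustration.

```python
# Two rows copied from the qstat output above, plus one hypothetical
# running job (state "r") to show that other states are ignored.
QSTAT_OUTPUT = """\
6269306 0.33000 clean      tools.avicbo t     05/29/2016 08:04:04 task@tools-exec-1214.eqiad.wmf     1
6269380 0.33000 php_transl tools.liange dt    05/29/2016 08:04:01 task@tools-exec-1408.eqiad.wmf     1
6269999 0.33000 example    tools.exampl r     05/29/2016 08:04:01 task@tools-exec-1408.eqiad.wmf     1
"""

# Same states the grep pattern " d\?t " matches.
TRANSFER_STATES = {"t", "dt"}


def jobs_in_transfer(qstat_text):
    """Return (job_id, state, queue) for every job in state t or dt."""
    stuck = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; skip headers and separators.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        job_id, state, queue = fields[0], fields[4], fields[7]
        if state in TRANSFER_STATES:
            stuck.append((job_id, state, queue))
    return stuck


for job_id, state, queue in jobs_in_transfer(QSTAT_OUTPUT):
    print(job_id, state, queue)
```

To run it against the live grid, the text would have to come from `qstat -u "*"`, for example captured with `subprocess.run`; that step is left out here.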

All of them have the same issue with resolving users:

05/13/2016 00:17:09|worker|tools-grid-master|W|job 6269306.1 failed on host tools-exec-1214.eqiad.wmflabs general assumedly before job because: can't get password entry for user "tools.avicbot". Either the user does not exist or NIS error!
05/13/2016 00:19:01|worker|tools-grid-master|W|job 6269323.1 failed on host tools-exec-1201.tools.eqiad.wmflabs general assumedly before job because: can't get password entry for user "tools.invadibot". Either the user does not exist or NIS error!
05/13/2016 00:20:07|worker|tools-grid-master|W|job 6269346.1 failed on host tools-exec-1210.eqiad.wmflabs general assumedly before job because: can't get password entry for user "tools.wmcounter". Either the user does not exist or NIS error!

but it's not clear to me why these jobs got stuck and others did not...
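
To check whether other jobs hit the same error without getting stuck, the master log could be scanned for this message. The following is a rough sketch, assuming the pipe-separated log format quoted above; the regular expression and the name `password_entry_failures` are my own and not part of grid engine.

```python
import re

# Two lines copied from the grid engine master log quoted above.
MASTER_LOG = """\
05/13/2016 00:21:02|worker|tools-grid-master|W|job 6269380.1 failed on host tools-exec-1408.eqiad.wmflabs general assumedly before job because: can't get password entry for user "tools.liangent-php". Either the user does not exist or NIS error!
05/13/2016 00:21:02|worker|tools-grid-master|W|rescheduling job 6269380.1
"""

# Matches only the "can't get password entry" failure message.
FAILURE_RE = re.compile(
    r"job (?P<job>\d+)\.(?P<task>\d+) failed on host (?P<host>\S+) .*"
    r"can't get password entry for user "
    r'"(?P<user>[^"]+)"'
)


def password_entry_failures(lines):
    """Return one dict per log line reporting a failed user lookup."""
    failures = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match is None:
            continue
        entry = match.groupdict()
        # The timestamp is the first pipe-separated field of the log line.
        entry["timestamp"] = line.split("|", 1)[0]
        failures.append(entry)
    return failures


for failure in password_entry_failures(MASTER_LOG.splitlines()):
    print(failure["timestamp"], failure["job"], failure["host"], failure["user"])
```

Comparing the job IDs found this way with the jobs currently in t or dt state would show which failed lookups recovered on rescheduling and which did not.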

@Avicennasis, @abian, @Emijrp, @Pcoombe, @liangent, I have force-deleted the jobs in the post above; please resubmit the jobs (they should not have started, so it is probably safe to do so). Sorry for the inconvenience.

valhallasw@tools-bastion-03:/data/project/.system/gridengine/spool/qmaster$ qdel -f 6269306 6269323 6269346 6269380
warning: valhallasw forced the deletion of job 6269306
warning: valhallasw forced the deletion of job 6269323
warning: valhallasw forced the deletion of job 6269346
warning: valhallasw forced the deletion of job 6269380
chasemp closed this task as Resolved. May 31 2016, 2:51 PM
chasemp claimed this task.
chasemp added a subscriber: chasemp.

There seem to be no jobs stuck in this state today, so I'll close this. Please reopen if I'm missing something.