
Toolforge: Completed jobs not available via qstat
Closed, Resolved (Public)

Description

Jobs that exited recently can't be looked up with qstat -j #JOB, which contradicts the behavior documented at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Returning_the_status_of_a_particular_job.

Notes from IRC:

15:13:03 <bd808> legoktm: did the job exit recently? We rotate the data file for that state lookup
15:13:44 <legoktm> bd808: it exited probably within ~10 min of me trying to look at the exit status
15:13:56 <bd808> hmm
15:14:39 <bd808> sigint almost always means OOM
15:14:54 <legoktm> bd808: 7142037 exited about a minute ago and doesn't exist according to `qstat -j`
15:15:44 <legoktm> yeah, I guessed as much. At this point the `qstat -j` questions are more to figure out whether the documentation is out of date or something isn't working as expected
15:16:16 <bd808> `qstat -j '*'` has stuff, but not nearly as much as I would expect
15:17:21 <bd808> it should list all the things that you can see at https://sge-status.toolforge.org/
15:17:33 <bd808> and it pretty obviously does not
15:18:43 <bd808> hmm.. or does it
15:18:59 <bd808>  /usr/bin/qstat -j '*' | grep job_number|wc -l == 757 jobs
15:25:53 <bd808> legoktm: I am not sure why, but qstat seems to only show running jobs even when looked up by id and not any historical jobs
15:26:24 <bd808> we haven't done anything purposefully to change the grid for a long time
15:26:32 <legoktm> should I file a bug?
15:27:06 <bd808> I wonder if tracking historic jobs got messed up by nfs restarts or something?
15:27:10 <bd808> legoktm: sure

Event Timeline

According to the man page, qstat only shows stuck, errored, and running jobs. Personally, I'd expect qacct -j <jobname> to be the way to look up historical jobs. I don't recall qstat ever showing historical information when I was rebuilding the grid. If it could, I expect it would take as long to return as qacct, because it would have to scan the accounting file.
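The split described above (qstat for jobs the scheduler still tracks, qacct for finished ones) could be wrapped in a small helper. This is only a sketch: the function name job_info is hypothetical, and it assumes that qstat -j exits with a non-zero status when the job is no longer known to the scheduler.

```shell
# job_info: hypothetical helper, not part of Grid Engine or Toolforge.
# Tries the scheduler's live view first, then falls back to the accounting file.
job_info() {
  # Assumption: qstat -j returns non-zero once the job has left the scheduler.
  if qstat -j "$1" 2>/dev/null; then
    return 0
  fi
  # Finished jobs: qacct scans the accounting file, so this can be slow.
  qacct -j "$1"
}

# Usage on a grid bastion:
#   job_info 7142037
```

Trying qstat first keeps the common case fast, since only jobs that have already exited pay the cost of the accounting-file scan.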

Note: qacct -j 7142037 took at least 3 minutes to return in my terminal:

bstorm@tools-sgebastion-09:~$ qacct -j 7142037
==============================================================
qname        task
hostname     tools-sgeexec-0913.tools.eqiad.wmflabs
group        tools.newusers
owner        tools.newusers
project      NONE
department   defaultdepartment
jobname      build3
jobnumber    7142037
taskid       undefined
account      sge
priority     0
qsub_time    Fri Jun 26 22:01:19 2020
start_time   Fri Jun 26 22:01:32 2020
end_time     Fri Jun 26 22:12:52 2020
granted_pe   NONE
slots        1
failed       37  : qmaster enforced h_rt, h_cpu, or h_vmem limit
exit_status  130                  (Interrupt)
ru_wallclock 680s
ru_utime     0.000s
ru_stime     0.012s
ru_maxrss    6.039KB
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    814
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     7
ru_nivcsw    6
cpu          256.210s
mem          41.985GBs
io           150.539GB
iow          0.000s
maxvmem      1.004GB
arid         undefined
ar_sub_time  undefined
category     -q task -l h_vmem=1048576k
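Most of the record above is resource-usage detail; the fields that explain how the job ended are failed, exit_status, and maxvmem. As a rough sketch, they can be pulled out with awk. The function name summarize_qacct is hypothetical, and the code assumes the "key value" line layout shown in the output above.

```shell
# summarize_qacct: hypothetical helper that reads qacct -j style output on stdin
# and prints the fields that explain how the job ended.
summarize_qacct() {
  awk '
    $1 == "jobnumber"   { job = $2 }
    $1 == "failed"      { failed = $2 }
    $1 == "exit_status" { status = $2 }
    $1 == "maxvmem"     { maxvmem = $2 }
    END { printf "job=%s failed=%s exit_status=%s maxvmem=%s\n", job, failed, status, maxvmem }
  '
}

# Usage on a grid bastion:
#   qacct -j 7142037 | summarize_qacct
```

For the record above this would print job=7142037 failed=37 exit_status=130 maxvmem=1.004GB. If qacct returns several records, as it may for array jobs, this sketch reports only the last one.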
Legoktm claimed this task.
Legoktm added a project: Documentation.

After discussion on IRC, this turned out to be mostly a documentation issue, which should now be resolved: https://wikitech.wikimedia.org/w/index.php?title=Help%3AToolforge%2FGrid&type=revision&diff=1871328&oldid=1870115