
qsub job running for over 4 months on tool labs exec
Closed, Invalid · Public


I have a qsub job on tools-exec that has been running for over four months without anyone noticing or deleting it. This has stopped my bot for four months. I'm not sure why it got stuck in the queue, but it seems like jobs should not be allowed to run this long.

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
5323934 0.37235 cron-tools tools.deltaq dr    09/17/2018 18:00:16     1

Apologies if I'm not adding the right tags.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 22 2019, 2:56 AM
JJMC89 updated the task description. · Jan 22 2019, 3:02 AM
Legoktm removed a subscriber: Toolforge. · Jan 22 2019, 3:04 AM

Sorry to hear that your bot was offline for 4 months. In general, a job that runs for so long is not an issue -- for example, Wikibugs regularly runs for months at a time without job resubmission.

However, you should be able to set a maximum runtime by hand by adding -l h_rt=0:00:10 (for a 10-second maximum runtime) to the jsub command. For example:

valhallasw@tools-bastion-03:~$ jsub -l h_rt=0:00:10 sleep 100
Your job 1543310 ("sleep") has been submitted

valhallasw@tools-bastion-03:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
1543310 0.30000 sleep      valhallasw   r     01/22/2019 20:28:35     1

valhallasw@tools-bastion-03:~$ sleep 20; qstat
# no output -- job has been killed
bd808 closed this task as Invalid. · Jan 22 2019, 9:00 PM
bd808 added a subscriber: bd808.

Monitoring of the jobs associated with a tool is the responsibility of the tool's maintainers. The Toolforge admins will often notice a job that is causing a performance impact to the job grid that is affecting other tools, but we do not have staff or tooling to watch for jobs that are misbehaving in a "quiet" way. As @valhallasw says above, there are many tools that are intended to run for months at a time without interruption.

I'm going to close this task as invalid because I am not seeing anything actionable in the report.
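
For maintainers who want to do the monitoring described above themselves, here is a minimal sketch of a shell helper that reads qstat-style output and lists jobs whose submit/start time is older than a chosen number of days. The function name old_jobs and its arguments are hypothetical, not part of any Toolforge tooling. It assumes bash, GNU date, and the column layout shown in the qstat output in this task.

```shell
#!/usr/bin/env bash
# Sketch only: old_jobs is a hypothetical helper, not a Toolforge command.
# Assumes GNU date and qstat columns in this order:
#   job-ID prior name user state MM/DD/YYYY HH:MM:SS ...

# Usage: old_jobs MAX_DAYS [NOW_EPOCH] < qstat_output
old_jobs() {
    local max_days=$1
    local now=${2:-$(date +%s)}   # NOW_EPOCH is optional, defaults to the current time
    local id prior name user state day time rest started age_days

    while read -r id prior name user state day time rest || [[ -n $id ]]; do
        # Skip header and separator lines: real job IDs are purely numeric.
        [[ $id =~ ^[0-9]+$ ]] || continue

        # GNU date parses "MM/DD/YYYY HH:MM:SS"; skip rows it cannot parse.
        started=$(date -d "$day $time" +%s 2>/dev/null) || continue

        age_days=$(( (now - started) / 86400 ))
        if (( age_days > max_days )); then
            # Print the state too, so stuck states such as "dr" stand out.
            printf '%s %s %s %s days\n' "$id" "$name" "$state" "$age_days"
        fi
    done
}
```

A possible use is `qstat | old_jobs 30` from a tool account, run by hand or from a periodic check. Tools that are meant to run for months would need a higher threshold or would be excluded by name.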