Since around the beginning of August 2016, some of my bot's jobs have been locking up. That is, the job is submitted and I can see it with qstat, but it never actually starts, and when cron tries to submit a new job it fails with the error "there is a job named 'my job' already active".
I'll use my "perm_clerk" job as an example. Here are some notes:
- Cron looks like */10 * * * * jsub -l release=trusty -mem 350m -once ~/perm_clerk.sh >/dev/null 2>&1 (my other jobs look similar)
- perm_clerk.sh just exports Ruby stuff to PATH and runs the Ruby file
- The Ruby script writes to another log as soon as it starts. I don't see this output when the job locks up
- When in a locked state, qstat looks normal: 9897506 0.30204 perm_clerk tools.musikb r 08/17/2016 17:30:14 task@tools-exec-1407.eqiad.wmf 1
- I have to manually qdel the job for it to get back in working order
- Sometimes the job will lock up on the next run. I.e. the issue is consistent but not predictable
- My stale_drafts job locked up but it only runs once a week, so I don't think this happens because a job is submitted when the old one hasn't finished
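
For anyone wanting to spot this state automatically, here is a rough sketch of parsing a qstat row like the one above and flagging a job that has sat in state r for longer than expected. The column layout is assumed from that single sample row (a running job with a queue column), and the function names and the 10-minute threshold are made up for illustration; this is not something the grid provides.

```python
from datetime import datetime, timedelta

def parse_qstat_line(line):
    """Split one qstat row for a running job into named fields.

    Assumes the whitespace-separated layout seen in the sample row:
    id, priority, name, user, state, date, time, queue, slots.
    """
    parts = line.split()
    return {
        "job_id": parts[0],
        "priority": float(parts[1]),
        "name": parts[2],
        "user": parts[3],
        "state": parts[4],
        "started": datetime.strptime(parts[5] + " " + parts[6], "%m/%d/%Y %H:%M:%S"),
        "queue": parts[7],
        "slots": int(parts[8]),
    }

def looks_stuck(job, now, max_runtime):
    """Return True if the job is marked running and has exceeded max_runtime."""
    return job["state"] == "r" and now - job["started"] > max_runtime

# Sample row copied from the notes above.
sample = ("9897506 0.30204 perm_clerk tools.musikb r 08/17/2016 17:30:14 "
          "task@tools-exec-1407.eqiad.wmf 1")
job = parse_qstat_line(sample)

# Hypothetical threshold: the job is scheduled every 10 minutes.
stuck = looks_stuck(job, datetime(2016, 8, 17, 18, 0, 0), timedelta(minutes=10))
print(job["job_id"], job["name"], "stuck" if stuck else "ok")
```

A check like this only reports a suspicious job; whether to qdel it automatically is a separate decision, and it does nothing to explain why the job locks up in the first place.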
Surely others are experiencing this?