Since around the beginning of August 2016, I've found that some of my bot's jobs are locking up. That is, the job is submitted and I can see it with `qstat`, but it never actually starts, and when the cron tries to submit a new job it errors with `there is a job named 'my job' already active`.
I'll use my "perm_clerk" job as an example. Here are some notes:
* Cron looks like `*/10 * * * * jsub -l release=trusty -mem 350m -once ~/perm_clerk.sh >/dev/null 2>&1` (my other jobs look similar)
* `perm_clerk.sh` just exports Ruby stuff to PATH and runs the Ruby file
* The Ruby script writes to another log as soon as it is run. I don't see this output when the job locks up
* When in a locked state, `qstat` looks normal: `9897506 0.30204 perm_clerk tools.musikb r 08/17/2016 17:30:14 task@tools-exec-1407.eqiad.wmf 1`
* I have to manually `qdel` the job for it to get back in working order
* Sometimes the job will lock up again on the very next run. I.e. the issue is consistent but not predictable
* My `stale_drafts` job locked up but it only runs once a week, so I don't think this happens because a job is submitted when the old one hasn't finished
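
As a stopgap while the cause is unknown, a small watchdog could flag jobs that have sat in state `r` for longer than they should ever need. The following is only a sketch: it assumes `qstat` rows always follow the column layout of the sample line above (job ID, priority, name, user, state, date, time, queue, slots), and the names `parse_qstat`, `stale_jobs` and `QstatJob` are made up for illustration. A long-running job is not necessarily a locked-up one, so the age threshold is a heuristic, and `qstat` may shorten long job names in the name column.

```ruby
require 'time'

# One row of `qstat` output. The column order is assumed from the sample line above.
QstatJob = Struct.new(:id, :name, :state, :started_at)

# Parse `qstat` text into QstatJob structs, skipping the header, the separator
# line, and anything else that does not start with a numeric job ID.
def parse_qstat(output)
  output.each_line.map do |line|
    fields = line.split
    next unless fields.length >= 7 && fields[0] =~ /\A\d+\z/

    # Columns 6 and 7 hold the submit/start date and time, e.g. "08/17/2016 17:30:14".
    started_at = Time.strptime("#{fields[5]} #{fields[6]}", '%m/%d/%Y %H:%M:%S')
    QstatJob.new(fields[0], fields[2], fields[4], started_at)
  end.compact
end

# Return jobs with the given name that have been in state "r" for more than
# max_age seconds. This is a heuristic for "probably locked up", not proof.
def stale_jobs(output, name:, max_age:, now: Time.now)
  parse_qstat(output).select do |job|
    job.name == name && job.state == 'r' && (now - job.started_at) > max_age
  end
end

# Possible use on the grid host (not run here):
#   stale_jobs(`qstat`, name: 'perm_clerk', max_age: 3600).each do |job|
#     warn "job #{job.id} has been in state r since #{job.started_at}"
#     # system('qdel', job.id)  # only once you trust the threshold
#   end
```

This only automates the manual `qdel` step. It does not explain why the job never starts.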
Surely others are experiencing this?