SGE Job gerrit_reviewer_bot (9999704) stuck in delete state
Closed, Resolved, Public

Description

Hi all,

I'm having trouble with a job that is not running (even though SGE thinks it is) and that does not disappear after running qdel. Because the job is started from a crontab using -once, the stale entry is blocking new runs of the job.

Specifically:

Job 9999704 (gerrit_reviewer_bot) is supposed to be short-running (~1 minute, explicitly capped at 1 hour). Yet according to qstat it has been running for nearly three days:

tools.gerrit-reviewer-bot@tools-sgebastion-07:~$ date
Sun Mar 28 13:28:27 UTC 2021
tools.gerrit-reviewer-bot@tools-sgebastion-07:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
9999704 0.25753 gerrit_rev tools.gerrit dr    03/25/2021 17:49:16 task@tools-sgeexec-0920.tools.     1
   1606 0.25729 lighttpd-g tools.gerrit r     03/25/2021 18:16:16 webgrid-lighttpd@tools-sgewebg     1
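
For anyone wanting to spot this condition automatically, here is a rough sketch (not an existing Toolforge tool) that parses qstat output like the above and reports jobs that are flagged for deletion but have outlived their runtime cap. The function name find_stuck_deletions, the SAMPLE_QSTAT constant, and the assumption that the default column layout is used and that qstat timestamps are in the same timezone as the `date` output are all mine.

```python
from datetime import datetime, timedelta

# Timestamp layout of the "submit/start at" column in the qstat output above.
QSTAT_TIME_FORMAT = "%m/%d/%Y %H:%M:%S"

# The two job rows from the transcript, copied verbatim for illustration.
SAMPLE_QSTAT = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
9999704 0.25753 gerrit_rev tools.gerrit dr    03/25/2021 17:49:16 task@tools-sgeexec-0920.tools.     1
   1606 0.25729 lighttpd-g tools.gerrit r     03/25/2021 18:16:16 webgrid-lighttpd@tools-sgewebg     1
"""


def find_stuck_deletions(qstat_output, now, max_runtime):
    """Return (job_id, name, state, elapsed) for jobs whose state contains
    'd' (deletion requested) and whose start time is older than max_runtime.

    Assumes the default qstat column order shown above (hypothetical helper).
    """
    stuck = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; skip the header and separator.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        job_id, _prior, name, _user, state, day, clock = fields[:7]
        started = datetime.strptime(f"{day} {clock}", QSTAT_TIME_FORMAT)
        elapsed = now - started
        if "d" in state and elapsed > max_runtime:
            stuck.append((job_id, name, state, elapsed))
    return stuck


if __name__ == "__main__":
    # "now" is the value printed by `date` in the transcript.
    now = datetime(2021, 3, 28, 13, 28, 27)
    for job_id, name, state, elapsed in find_stuck_deletions(
        SAMPLE_QSTAT, now, timedelta(hours=1)
    ):
        print(f"{job_id} {name} state={state} elapsed={elapsed}")
        # prints: 9999704 gerrit_rev state=dr elapsed=2 days, 19:39:11
```

With the 1 hour cap from the job definition, only job 9999704 is reported; the lighttpd job is in plain r state and is ignored.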

I have tried the following to clear the job:

  • qdel 9999704: this did change the job status to d but did not actually stop it.
  • qdel -f 9999704: no (additional) effect.
  • Logging into tools-sgeexec-0920 to kill the job: nothing seems to be running there.

The job ID makes me wonder if this is some sort of rollover issue.

For now I've changed the job name in crontab so -once is not blocking a new run from starting.

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2021-03-28T19:31:34Z] <legoktm> legoktm@tools-sgebastion-08:~$ sudo qdel -f 9999704 # T278645

Legoktm claimed this task.
Legoktm added a subscriber: Legoktm.