
Jobs Disappearing from SGE
Closed, Resolved (Public)

Description

A number of jobs have disappeared from the grid. At least one of them (621) was low numbered and I suspect the jobid reset last month might have something to do with it.

The following is from qacct -j 621
https://tools.wmflabs.org/paste/view/267ff131

Event Timeline

A930913 set the priority of this task to Needs Triage.
A930913 updated the task description.
A930913 added a project: Toolforge.
A930913 subscribed.

IRC logs show death throes.

22:13 <wm-bot2> BracketBot: Unloading script.
22:13 <wm-bot2> BracketBot: Loading script.
22:13 <wm-bot2> BracketBot: Loading script.
22:14 <wm-bot2> BracketBot: Unloading script.
22:14 <wm-bot2> BracketBot: Loading script.
Day changed to Thu, 16 Apr 2015

The "Unloading script." message is sent from the finally block of the Python script that the executed script pipes its output into.
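For anyone wondering how a finally block gets to run when the grid terminates a job: by default Python dies on SIGTERM without running finally blocks, so something has to turn the signal into an exception first. Below is a minimal sketch of that pattern. It is an assumption about how such a wrapper could look, not BracketBot's actual code; `run`, `announce`, `work` and `handle_sigterm` are hypothetical names.

```python
import signal
import sys


def handle_sigterm(signum, frame):
    # Turn SIGTERM into SystemExit so that pending finally blocks run.
    # 128 + signal number mirrors the usual shell exit-status convention.
    sys.exit(128 + signum)


def run(announce, work):
    """Hypothetical wrapper: announce load, do the work, always announce unload."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    announce("Loading script.")
    try:
        work()
    finally:
        # Runs on normal return, on exceptions, and on the SystemExit
        # raised by handle_sigterm.
        announce("Unloading script.")
```

With a wrapper like this, each forced restart of the job would show up as an "Unloading script." line followed by a "Loading script." line, which matches the pattern in the IRC log.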

So, to clarify, this happened on 15 Apr 2015 around 22:14 UTC? It could be related to T95555: Disable idmap entirely on Labs Precise instances, then, but @coren knows better about the reboots for that. Unfortunately, tools-exec-15 is no more, so it's also hard to trace back in the logs what happened...

valhallasw claimed this task.

It's not exactly resolved, but by now, there's no way to figure out what has happened, so closing this seems the most reasonable option.

Happened again, 13th Aug. Both DefconBot and BracketBot lost continuous jobs.

16:00 <wm-bot2> BracketBot: Loading script.
16:00 <wm-bot2> BracketBot: Unloading script.
16:01 <wm-bot2> BracketBot: Loading script.
16:01 <wm-bot2> BracketBot: Unloading script.
16:01 <wm-bot2> BracketBot: Loading script.

hostname     tools-exec-1214.eqiad.wmflabs
group        tools.bracketbot
owner        tools.bracketbot
project      NONE
department   defaultdepartment
jobname      run
jobnumber    5543
taskid       undefined
account      sge
priority     10
qsub_time    Fri Jun 19 17:25:56 2015
start_time   Fri Jun 19 17:25:56 2015
end_time     Thu Aug 13 15:01:18 2015
granted_pe   NONE
slots        1
failed       100 : assumedly after job
exit_status  143
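To compare records like the one above across several jobs, it can help to parse the qacct output into a dict. This is a rough sketch that assumes the "key, whitespace, value" layout shown above and the timestamp format seen in this record; `parse_qacct` and `wallclock` are hypothetical helpers, not part of SGE.

```python
from datetime import datetime

# Timestamp layout as it appears in the record above (assumed stable).
QACCT_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"

SAMPLE = """\
hostname     tools-exec-1214.eqiad.wmflabs
jobnumber    5543
start_time   Fri Jun 19 17:25:56 2015
end_time     Thu Aug 13 15:01:18 2015
failed       100 : assumedly after job
exit_status  143
"""


def parse_qacct(text):
    """Split each 'key   value' line on the first run of whitespace."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


def wallclock(record):
    """Return end_time - start_time as a timedelta."""
    start = datetime.strptime(record["start_time"], QACCT_TIME_FORMAT)
    end = datetime.strptime(record["end_time"], QACCT_TIME_FORMAT)
    return end - start
```

For example, `wallclock(parse_qacct(SAMPLE))` gives how long the job had been running before it died.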

The timing corresponds to the server restarts, which suggests that for some reason the job was not gracefully restarted on another host. I don't see any 'job failed' emails either, which I think means the job was not killed but still running when the server went down.

However, exit_status 143 corresponds to SIGTERM (128 + 15), which would suggest the job was qdel'ed. I've asked @A930913 to try to qmod -rj the task themselves, so that we can see if that is the cause of the issue.
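The 143 to SIGTERM mapping follows the common "128 + signal number" convention for processes killed by a signal. A small sketch for decoding such values, assuming the exit_status reported by qacct follows that convention (`describe_exit_status` is a hypothetical helper):

```python
import signal


def describe_exit_status(status):
    """Return the signal name for a 128+N exit status, or None otherwise."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            # Not a signal number known on this platform.
            return None
    return None
```

Calling `describe_exit_status(143)` returns the name for signal 15, which is SIGTERM.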

qmod -rj triggers

16:12 <wm-bot2> BracketBot: Loading script.
16:12 <wm-bot2> BracketBot: Unloading script.
16:12 <wm-bot2> BracketBot: Loading script.

but the bot is correctly rescheduled.

Looking a bit further, qacct shows the job being rescheduled (with no date/time recorded, but that's apparently how SGE rolls), then rescheduled again onto the same host (-1214) before dying.

The same happens for defconbot. It is first rescheduled from -1210 to -1214, then from -1214 to -1214, and then dies.

@Andrew, could it be possible the continuous queue on tools-exec-1214 was not disabled correctly, thus causing jobs to be rescheduled back there?

After discussing, we figured out it was a bug in our management scripts. This caused jobs to not be rescheduled, so they then died during the reboot. We have now fixed the script, so that issue should be resolved.