
toolforge grid master queue error cleanup
Closed, Resolved, Public

Description

This ticket tracks the cleanup of queue error states on the Toolforge grid.

Event Timeline

Fetching information, per the instructions at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#GridEngine_Master:

aborrero@tools-sgegrid-master:~$ qstat -f | grep E
task@tools-sgeexec-0907.tools. BI    0/0/50         0.00     lx-amd64      dE
continuous@tools-sgeexec-0907. BC    0/0/50         0.00     lx-amd64      dE
aborrero@tools-sgegrid-master:~$ qstat -explain E -xml | grep -e name -e state -e message | grep QERROR
      <message>queue task marked QERROR as result of job 9239617&apos;s failure at host tools-sgeexec-0907.tools.eqiad.wmflabs</message>
      <message>queue task marked QERROR as result of job 9239619&apos;s failure at host tools-sgeexec-0907.tools.eqiad.wmflabs</message>
      <message>queue task marked QERROR as result of job 9239629&apos;s failure at host tools-sgeexec-0907.tools.eqiad.wmflabs</message>
      <message>queue continuous marked QERROR as result of job 9239656&apos;s failure at host tools-sgeexec-0907.tools.eqiad.wmflabs</message>
aborrero@tools-sgegrid-master:~$ qstat -j 9239629
Following jobs do not exist: 
9239629
aborrero@tools-sgegrid-master:~$ qstat -j 9239619
Following jobs do not exist: 
9239619
aborrero@tools-sgegrid-master:~$ qstat -j 9239617
Following jobs do not exist: 
9239617
aborrero@tools-sgegrid-master:~$ qstat -j 9239656
Following jobs do not exist: 
9239656
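
The job IDs in the QERROR messages above were looked up one at a time with `qstat -j`. As a rough sketch, assuming the message format stays as shown in that output, the IDs could be pulled out of the XML in one pass. The function name `qerror_job_ids` is hypothetical and not part of Grid Engine; the sample input is copied from the output above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): reads the output of
# `qstat -explain E -xml` on stdin and prints the unique job IDs that
# appear in QERROR messages, one per line.
qerror_job_ids() {
    grep QERROR \
        | sed -n 's/.*result of job \([0-9][0-9]*\).*/\1/p' \
        | sort -u
}

# Sample input: two of the message lines from the output above.
sample='      <message>queue task marked QERROR as result of job 9239617&apos;s failure at host tools-sgeexec-0907.tools.eqiad.wmflabs</message>
      <message>queue continuous marked QERROR as result of job 9239656&apos;s failure at host tools-sgeexec-0907.tools.eqiad.wmflabs</message>'

# Print the job IDs found in the sample.
printf '%s\n' "$sample" | qerror_job_ids
```

On the grid master the same function could presumably be fed live data, for example `qstat -explain E -xml | qerror_job_ids | while read -r id; do qstat -j "$id"; done`, to confirm in one go that none of the failed jobs still exist. This ticket does not record the command used to clear the error states; on Grid Engine that is normally done with `qmod -c` against the affected queue instances.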

Mentioned in SAL (#wikimedia-cloud) [2021-09-13T08:57:06Z] <arturo> cleared grid queues error states (T290844)