This ticket tracks the cleanup of queue error states on the Toolforge grid.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | dcaro | T290970 File System corruption on cloud-vps instances
Resolved | | aborrero | T290798 tools-sgeexec-0907 filesystem corruption
Resolved | | aborrero | T290844 toolforge grid master queue error cleanup
Event Timeline
Fetching information, per the instructions at https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#GridEngine_Master:
aborrero@tools-sgegrid-master:~$ qstat -f | grep E
task@tools-sgeexec-0907.tools. BI 0/0/50 0.00 lx-amd64 dE
continuous@tools-sgeexec-0907. BC 0/0/50 0.00 lx-amd64 dE
aborrero@tools-sgegrid-master:~$ qstat -explain E -xml | grep -e name -e state -e message | grep QERROR
<message>queue task marked QERROR as result of job 9239617's failure at host tools-sgeexec-0907.tools.eqiad.wmflabs</message>
<message>queue task marked QERROR as result of job 9239619's failure at host tools-sgeexec-0907.tools.eqiad.wmflabs</message>
<message>queue task marked QERROR as result of job 9239629's failure at host tools-sgeexec-0907.tools.eqiad.wmflabs</message>
<message>queue continuous marked QERROR as result of job 9239656's failure at host tools-sgeexec-0907.tools.eqiad.wmflabs</message>
aborrero@tools-sgegrid-master:~$ qstat -j 9239629
Following jobs do not exist: 9239629
aborrero@tools-sgegrid-master:~$ qstat -j 9239619
Following jobs do not exist: 9239619
aborrero@tools-sgegrid-master:~$ qstat -j 9239617
Following jobs do not exist: 9239617
aborrero@tools-sgegrid-master:~$ qstat -j 9239656
Following jobs do not exist: 9239656
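The QERROR messages in the session above follow a fixed wording, so the queue, job ID and host can be pulled out mechanically rather than read by eye. The following is a minimal sketch only: `summarize_qerrors` is a hypothetical helper name, not part of Grid Engine or of the Toolforge admin tooling, and it assumes the message text keeps exactly the format shown in this ticket.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Grid Engine): reads the output of
# `qstat -explain E -xml` on stdin and prints one "queue job_id host"
# line per QERROR message. Assumes the message wording seen in this ticket.
summarize_qerrors() {
  # grep -o emits each match on its own line, so several messages on a
  # single input line are still handled one at a time.
  grep -oE "queue [^ ]+ marked QERROR as result of job [0-9]+'s failure at host [^<[:space:]]+" \
    | sed -E "s/^queue ([^ ]+) marked QERROR as result of job ([0-9]+)'s failure at host (.+)/\1 \2 \3/"
}

# Example with one message copied from the session above:
printf '%s\n' "<message>queue task marked QERROR as result of job 9239617's failure at host tools-sgeexec-0907.tools.eqiad.wmflabs</message>" \
  | summarize_qerrors
# prints: task 9239617 tools-sgeexec-0907.tools.eqiad.wmflabs
```

On the grid master this could be fed directly, for example `qstat -explain E -xml | summarize_qerrors`, and the second column then gives the job IDs to check with `qstat -j`.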
Mentioned in SAL (#wikimedia-cloud) [2021-09-13T08:57:06Z] <arturo> cleared grid queues error states (T290844)