Toolforge grid queue problem: epilog failed
Closed, Resolved, Public

Description

We had this grid queue error today:

aborrero@tools-sgewebgrid-lighttpd-0914:~$ qstat -explain E -xml | grep -e name -e state -e message
[..]
      <name>webgrid-lighttpd@tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs</name>
      <state>E</state>
      <message>queue webgrid-lighttpd marked QERROR as result of job 1780991&apos;s failure at host tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs</message>
[..]
aborrero@tools-sgewebgrid-lighttpd-0914:~$ grep 1780991 /data/project/.system_sge/gridengine/spool/qmaster/messages*
/data/project/.system_sge/gridengine/spool/qmaster/messages:03/27/2022 16:00:43|worker|tools-sgegrid-master|W|job 1780991.1 failed on host tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs general in epilog because: 03/27/2022 16:00:43 [600:10737]: exit_status of epilog = 1
/data/project/.system_sge/gridengine/spool/qmaster/messages:03/27/2022 16:00:43|worker|tools-sgegrid-master|E|queue webgrid-lighttpd marked QERROR as result of job 1780991's failure at host tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs
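When triaging several of these at once, it may help to pull the job ID, host, and exit status out of the qmaster `messages` lines rather than reading them by eye. The following is a minimal Python sketch, assuming the line format is exactly the one shown in the log excerpt above; the names `HookFailure` and `parse_failure` are hypothetical and not part of any Toolforge tooling, and matching `prolog` in addition to `epilog` is an assumption, since only an epilog failure appears in this task.

```python
import re
from typing import NamedTuple, Optional


class HookFailure(NamedTuple):
    """Fields extracted from a 'job ... failed on host ...' log line."""
    job_id: str
    task_id: str
    host: str
    phase: str        # "epilog" or "prolog" (prolog wording is assumed)
    exit_status: int


# Pattern derived from the single log line quoted in this task.
_FAILURE_RE = re.compile(
    r"job (?P<job>\d+)\.(?P<task>\d+) failed on host (?P<host>\S+) "
    r".*?in (?P<phase>epilog|prolog) because: .*?"
    r"exit_status of (?:epilog|prolog) = (?P<status>-?\d+)"
)


def parse_failure(line: str) -> Optional[HookFailure]:
    """Return the parsed failure, or None if the line is not a hook failure."""
    m = _FAILURE_RE.search(line)
    if m is None:
        return None
    return HookFailure(
        job_id=m["job"],
        task_id=m["task"],
        host=m["host"],
        phase=m["phase"],
        exit_status=int(m["status"]),
    )
```

Lines that do not describe a failed job, such as the `marked QERROR` line, simply return `None`, so the function can be mapped over the whole `grep` output.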

The grid supports a prolog/epilog mechanism, which runs some code before (prolog) and after (epilog) the job itself.
On the webgrid, the epilog runs the portgrabber/portreleaser tooling (https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Dynamicproxy#Grid_web_services_portgrabber_and_portreleaser) to hook the webservice to the front proxy.

If a web job fails with an epilog- or prolog-related error, it most likely failed to allocate or release a port. A single occurrence is not a big deal; it only deserves a closer look if the failures form a pattern.
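One rough way to check for a pattern is to count epilog/prolog failures per host across the qmaster `messages` files. The sketch below is illustrative only: the function name `recurring_hook_failures` and the default threshold of 3 are hypothetical choices, and the regular expression assumes that prolog failures are logged with the same wording as the epilog failure quoted in the description.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed wording, based on the one epilog failure line in this task.
_HOOK_FAILURE_RE = re.compile(
    r"job \d+\.\d+ failed on host (?P<host>\S+) "
    r".*?in (?P<phase>epilog|prolog) because:"
)


def recurring_hook_failures(
    lines: Iterable[str], threshold: int = 3
) -> List[Tuple[str, str, int]]:
    """Return (host, phase, count) for hosts at or above the threshold.

    The threshold is an arbitrary illustration, not an agreed policy.
    Results are sorted with the most frequent failures first.
    """
    counts: Counter = Counter()
    for line in lines:
        m = _HOOK_FAILURE_RE.search(line)
        if m:
            counts[(m["host"], m["phase"])] += 1
    return sorted(
        ((host, phase, n) for (host, phase), n in counts.items() if n >= threshold),
        key=lambda item: -item[2],
    )
```

A possible use is to feed it the open `messages` file line by line; an empty result suggests the failure was a one-off, while a host that shows up repeatedly is a candidate for investigation.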

Event Timeline

Mentioned in SAL (#wikimedia-cloud) [2022-03-28T09:32:18Z] <wm-bot> cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud (T304816) - cookbook ran by arturo@nostromo