
`qstat -xml` failures lead to incomprehensible error messages from jsub/jstart/job
Open, Needs Triage · Public

Description

I received the below error email from cron on 2019-06-09 at 06:34.

Cron <tools.jjmc89-bot@tools-sgecron-01> jstart -e /dev/null -o /dev/null purge_dup_args
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
Traceback (most recent call last):
  File "/usr/bin/job", line 48, in <module>
    root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 1345, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

I was able to start the job from tools-sgebastion-07 at 06:45 and the next cron job started fine, so it might be an intermittent or one-off issue.

Event Timeline

JJMC89 created this task. Jun 9 2019, 6:57 AM
Restricted Application added a subscriber: Aklapper. Jun 9 2019, 6:57 AM
bd808 renamed this task from failed receiving gdi request response for mid=1 to `qstat -xml` failures lead to incomprehensible error messages from jsub/jstart/job. Jun 16 2019, 5:13 PM

I received more of these on 2019-06-16 at 06:34 (1 job) and 06:48 (2 jobs).

I received more of these on 2019-06-23 at 06:44 (2 jobs) and 06:48 (1 job).

Bstorm added a subscriber: Bstorm. Tue, Oct 1, 12:40 AM

The most important line is the first one. Everything after the Traceback comes from the Python wrapper (which could use some error handling, I think). That error, "failed receiving gdi request response for...", seems to happen when there's a broader issue with the NFS or network that the grid depends on. One of my own cron jobs got hit with it when we had an issue across Toolforge very early this morning.

What's interesting to me is that there are no real qmaster messages that correlate at the moment, but the regularity of that message suggests the qmaster in the grid is getting wedged here and there (possibly from cron floods). Something to watch.
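
For the error-handling point, here is a rough sketch of how the wrapper could turn an empty or broken `qstat -xml` response into a readable message instead of a bare `ParseError`. This is not the actual code in `/usr/bin/job`: the names `QstatError`, `parse_qstat_output` and `query_grid` are made up for illustration, and the exact arguments the real wrapper passes to `qstat` are not shown in the traceback, so plain `qstat -xml` is an assumption.

```python
import subprocess
import xml.etree.ElementTree as ET


class QstatError(RuntimeError):
    """Hypothetical exception for unusable qstat output."""


def parse_qstat_output(stdout, stderr="", returncode=0):
    """Parse `qstat -xml` output, raising QstatError with a readable message."""
    # An empty stdout is what produces "no element found: line 1, column 0",
    # so check for it (and for a failing exit status) before parsing.
    if returncode != 0 or not stdout.strip():
        detail = stderr.strip() or "no output"
        raise QstatError(
            "qstat -xml failed (exit status {}): {}".format(returncode, detail)
        )
    try:
        return ET.fromstring(stdout)
    except ET.ParseError as exc:
        raise QstatError("qstat -xml returned malformed XML: {}".format(exc))


def query_grid(command=("qstat", "-xml")):
    """Run qstat (command is an assumption) and return the parsed XML root."""
    try:
        proc = subprocess.run(
            command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
    except OSError as exc:
        raise QstatError("could not run {}: {}".format(command[0], exc))
    return parse_qstat_output(
        proc.stdout, proc.stderr.decode(errors="replace"), proc.returncode
    )
```

The caller could then catch `QstatError`, print a single line such as "unable to query the grid, try again later", and exit non-zero, so the cron email would contain the grid's own error text without the traceback.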