Page MenuHomePhabricator

`qstat -xml` failures lead to incomprehensible error messages from jsub/jstart/job
Closed, DeclinedPublic

Description

I received the below error email from cron on 2019-06-09 at 06:34.

Cron <tools.jjmc89-bot@tools-sgecron-01> jstart -e /dev/null -o /dev/null purge_dup_args
error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
Traceback (most recent call last):
  File "/usr/bin/job", line 48, in <module>
    root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 1345, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

I was able to start the job from tools-sgebastion-07 at 06:45 and the next cron job started fine, so it might be an intermittant or one-off issue.

Event Timeline

bd808 renamed this task from failed receiving gdi request response for mid=1 to `qstat -xml` failures lead to incomprehensible error messages from jsub/jstart/job.Jun 16 2019, 5:13 PM

I received more of these on 2019-06-16 at 06:34 (1 job) and 06:48 (2 jobs).

I received more of these on 2019-06-23 at 06:44 (2 jobs) and 06:48 (1 job).

The most important line is the first one. Everything after Traceback is from the python wrapper, obviously (which could use some error handling, I think). Overall, that error "failed receiving gdi request response for..." seems to happen when there's an overall issue on the NFS or network that the grid depends on. I see a cron of mine got hit with it when we had an issue across Toolforge very early this morning.

What's interesting to me is that there's no real qmaster messages that correlate at the moment, but a regularity in that message suggests the qmaster in the grid is getting wedged here and there (possibly from cron floods). Something to watch.

After not seeing this for a while, I just got a batch of these today at 06:34, 06:44, and 06:48.

After not seeing this for a while, I just got a batch of these today at 06:34, 06:44, and 06:48.

Another batch today at the same three times.

It looks like this happens Sundays starting around 06:30.

Over the last two days, I've gotten five of these.

We just got one earlier today (520 AM PST). I don't think we have a clear pattern on timing just yet.

On the one we got today, it looks like the job was fine. qstat was the only error. That's very interesting. So it seems that qstat just fails sometimes due to network lag or whatever. I'm not seeing a timeout parameter yet, but this suggests a path to making the messages from the toollabs package a bit better.