job
Closed, DeclinedPublic
Actions

Assigned To

None

Authored By

	JJMC89
	Jun 9 2019, 6:57 AM

Description

I received the below error email from cron on 2019-06-09 at 06:34.

Cron <tools.jjmc89-bot@tools-sgecron-01> jstart -e /dev/null -o /dev/null purge_dup_args

error: failed receiving gdi request response for mid=1 (got syncron message receive timeout error).
Traceback (most recent call last):
  File "/usr/bin/job", line 48, in <module>
    root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 1345, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

I was able to start the job from tools-sgebastion-07 at 06:45 and the next cron job started fine, so it might be an intermittant or one-off issue.

Related Objects

Mentioned In: T183216: grid engine master crashes when any tools-services-* host is added to DNS

Event Timeline

JJMC89 created this task.Jun 9 2019, 6:57 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 9 2019, 6:57 AM

bd808 renamed this task from failed receiving gdi request response for mid=1 to `qstat -xml` failures lead to incomprehensible error messages from jsub/jstart/job.Jun 16 2019, 5:13 PM

I received more of these on 2019-06-16 at 06:34 (1 job) and 06:48 (2 jobs).

I received more of these on 2019-06-23 at 06:44 (2 jobs) and 06:48 (1 job).

Krinkle mentioned this in T183216: grid engine master crashes when any tools-services-* host is added to DNS.Jul 25 2019, 7:37 PM

• Bstorm added a project: cloud-services-team (Kanban).Oct 1 2019, 12:04 AM

The most important line is the first one. Everything after Traceback is from the python wrapper, obviously (which could use some error handling, I think). Overall, that error "failed receiving gdi request response for..." seems to happen when there's an overall issue on the NFS or network that the grid depends on. I see a cron of mine got hit with it when we had an issue across Toolforge very early this morning.

What's interesting to me is that there's no real qmaster messages that correlate at the moment, but a regularity in that message suggests the qmaster in the grid is getting wedged here and there (possibly from cron floods). Something to watch.

After not seeing this for a while, I just got a batch of these today at 06:34, 06:44, and 06:48.

In T225373#5759137, @JJMC89 wrote:

After not seeing this for a while, I just got a batch of these today at 06:34, 06:44, and 06:48.

Another batch today at the same three times.

It looks like this happens Sundays starting around 06:30.

• Bstorm triaged this task as Medium priority.Jan 22 2020, 10:19 PM

Over the last two days, I've gotten five of these.

I also got one last night.

We just got one earlier today (520 AM PST). I don't think we have a clear pattern on timing just yet.

On the one we got today, it looks like the job was fine. qstat was the only error. That's very interesting. So it seems that qstat just fails sometimes due to network lag or whatever. I'm not seeing a timeout parameter yet, but this suggests a path to making the messages from the toollabs package a bit better.

fnegri edited projects, added cloud-services-team; removed cloud-services-team (Kanban).Jan 18 2023, 7:33 PM

fnegri moved this task from Kanban to Inbox on the cloud-services-team board.

taavi closed this task as Declined.Mar 14 2024, 5:32 PM

`qstat -xml` failures lead to incomprehensible error messages from jsub/jstart/jobClosed, DeclinedPublicActions

Description

Related Objects

Event Timeline

`qstat -xml` failures lead to incomprehensible error messages from jsub/jstart/job
Closed, DeclinedPublic
Actions