
grid engine master crashes when any tools-services-* host is added to DNS
Closed, Invalid · Public

Description

I've had three cron jobs for tools.jjmc89-bot fail within the last hour with the error below.

Jobs
jstart -e /dev/null -o /dev/null wikinews_importer
jstart -e /dev/null -o /dev/null touch_hourly
jstart -mem 2048m -e /dev/null -o /dev/null bsicons_replacer
Error
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error
Traceback (most recent call last):
  File "/usr/bin/job", line 48, in <module>
    root = xml.etree.ElementTree.fromstring(proc.stdout.read())
  File "/usr/lib/python3.4/xml/etree/ElementTree.py", line 1326, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

Event Timeline

JJMC89 created this task. Dec 19 2017, 5:16 AM
Restricted Application added a subscriber: Aklapper. Dec 19 2017, 5:16 AM
zhuyifei1999 triaged this task as Unbreak Now! priority. Dec 19 2017, 5:17 AM
zhuyifei1999 added a subscriber: zhuyifei1999.
Restricted Application added subscribers: Liuxinyu970226, Jay8g, TerraCodes. Dec 19 2017, 5:17 AM

That's generating one mail every 5 minutes for me. By the way, the host itself is responding, so maybe simply rebooting it will solve this issue.

@Andrew restarted the service, but it crashed again with a segfault.

Oh, ok. Do you know why a backup master isn't running on tools-grid-shadow?

Andrew claimed this task. Dec 19 2017, 5:55 AM
Andrew lowered the priority of this task from Unbreak Now! to High.

The master seems to be up and running again. Previously it was crashing, like this:

Dec 19 05:25:34 tools-grid-master kernel: [15594167.689091] sge_qmaster[13437]: segfault at 68 ip 000000000049d85a sp 00007fc6bf0f8c40 error 6 in sge_qmaster[400000+2a3000]
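
For anyone reading the kernel line above, its fields can be decoded mechanically. This is only a sketch: it assumes the x86 `segfault at ... ip ... sp ... error ... in ...` log format shown here and the standard x86 page-fault error-code bits. The helper name `decode_segfault` and its result keys are made up for illustration and are not part of any grid engine or Toolforge tooling.

```python
import re

# Matches the x86 kernel "segfault" log line format seen in this task.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    err = int(m.group("err"), 16)  # the kernel prints this field in hex
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        # Offset of the faulting instruction inside the mapped object.
        "offset": ip - base,
        # x86 page-fault error-code bits.
        "page_present": bool(err & 0x1),  # 0 means the page was not mapped
        "write": bool(err & 0x2),         # 1 means the access was a write
        "user_mode": bool(err & 0x4),     # 1 means the fault came from user space
    }


LINE = (
    "Dec 19 05:25:34 tools-grid-master kernel: [15594167.689091] "
    "sge_qmaster[13437]: segfault at 68 ip 000000000049d85a "
    "sp 00007fc6bf0f8c40 error 6 in sge_qmaster[400000+2a3000]"
)

info = decode_segfault(LINE)
```

For this line the sketch reports a user-mode write to an unmapped page at address 0x68, with the faulting instruction at offset 0x9d85a into `sge_qmaster`. A fault address that small is consistent with, though not proof of, a write through a null pointer.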

There are a few theories as to what went wrong, and I don't feel like testing any of them at midnight.

  • I was working on two presumed-unrelated test nodes, 'tools-services-trustytest' and 'tools-services-jessietest', each of which was puppetized as a live services node. It's possible that something crazy was happening there, with one or the other confusing the master. They were submit nodes, but they shouldn't really have been doing anything.
  • There are some new prometheus collectors in play. We shut them down during troubleshooting, which might have been related to the recovery.
  • We also (accidentally) restarted the master as root immediately before recovery. That /could/ have fixed something by cleaning up a stuck log file or other state that was preventing the master from running.

I might see if I can reproduce the first suspect tomorrow when I'm more awake.

Andrew renamed this task from "error: commlib error: got select error (Connection refused)" to "grid engine master crashes". Dec 19 2017, 3:37 PM

These scripts should be modifiable to dump and restore state: https://github.com/gridengine/gridengine/tree/master/source/dist/util/upgrade_modules

load_sge_config.sh
save_sge_config.sh

I just confirmed -- creating a host named 'tools-services-temp.tools.eqiad.wmflabs' caused the grid master to immediately crash. This was before I'd even signed the puppet cert request on tools-puppetmaster-01, so the -temp host didn't have any secret credentials.

Deleted the host and restarted the grid master, and everything is fine again.

Something interesting is happening!

Andrew removed Andrew as the assignee of this task. Jan 17 2018, 3:58 PM
bd808 renamed this task from "grid engine master crashes" to "grid engine master crashes when any tools-services-* host is added to DNS". Mar 11 2018, 1:22 AM
Bstorm closed this task as Invalid. Oct 1 2019, 12:04 AM
Bstorm added a subscriber: Bstorm.

This doesn't seem in any way related to the mentioned tasks. The segfault has not been seen in years to my knowledge. We are not using the same version of grid engine, kernel or any other supporting software at this point (including python). I am closing this task, not so much as resolved, but as upgraded away.

The Python ElementTree error appears whenever gridengine is unavailable for any reason, because the Python parsers in Toolforge then fail to receive valid XML from the server. That's not the same issue as a segfault in gridengine. Webservice error handling is definitely something we might want to revisit and improve.
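
As a rough illustration of what improved handling could look like, the sketch below checks for empty or malformed output before treating it as XML, so an unreachable qmaster produces a clear message instead of a `ParseError` traceback. It is not the actual `/usr/bin/job` code: `GridUnavailable` and `parse_grid_xml` are hypothetical names, and the caller is assumed to pass in whatever the grid engine command wrote to stdout.

```python
import xml.etree.ElementTree as ET


class GridUnavailable(RuntimeError):
    """Raised when the grid engine did not return usable XML."""


def parse_grid_xml(raw):
    """Parse grid engine XML output, failing with a readable error.

    `raw` is the bytes (or str) read from the command's stdout. When the
    qmaster is unreachable this is typically empty, which is what makes
    ElementTree report "no element found: line 1, column 0".
    """
    if not raw.strip():
        raise GridUnavailable(
            "grid engine returned no XML; the qmaster may be unreachable"
        )
    try:
        return ET.fromstring(raw)
    except ET.ParseError as exc:
        raise GridUnavailable(
            "grid engine returned malformed XML: %s" % exc
        ) from exc
```

A caller could catch `GridUnavailable`, print a one-line message, and exit non-zero, which would keep the failure visible without mailing a full traceback on every cron run.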

Restricted Application removed a subscriber: Liuxinyu970226. Oct 1 2019, 12:04 AM