Grid engine masters down
Closed, Resolved · Public
Assigned To

Authored By

	valhallasw
	May 27 2015, 7:53 PM

Description

Apparently the ~wmf1 gridengine package was never installed on gridengine-master. The automatic installation of +wmf2 then caused the entire master to fail.

The underlying issue seems to be hostname resolution (/etc/hosts or something similar):

/var/lib/gridengine/default/common/act_qmaster is being set to 'localhost' by the master,
20:39 <YuviPanda> valhallasw: [pid 27542] write(2, "error: sge_gethostbyname failed\n", 32) = 32

We tried:

- Setting various combinations of localhost/tools-master in /etc/hosts. See e.g. https://scidom.wordpress.com/2012/01/18/sge-on-single-pc/, http://informatics.malariagen.net/2011/06/01/gridengine-the-ubuntu-debian-way/ and http://talby.rcs.manchester.ac.uk/~ri/_notes_sge/name-and-address-resolution-and-troubleshooting.html
- Rebooting the server. This did solve some issues, so DNS was somehow borked as well.

Basic behavior:

valhallasw@tools-bastion-01:/data/project/.system/gridengine/default/common$ qstat
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "localhost": got send error

then, after forcing act_qmaster to tools-master.eqiad.wmflabs:

error: commlib error: access denied (server host resolves rdata host "tools-bastion-01.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)")

we have also seen

error: commlib error: access denied (server host resolves destination host "tools-master.eqiad.wmflabs" as "tools-master")

(after fiddling with /etc/hosts)
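Both "access denied" messages above report that the name the master resolves for a peer is not the name it expects. A minimal sketch of checking that forward/reverse round trip from any host, assuming Python's standard `socket` module. The function name `check_roundtrip` and its injectable resolver parameters are illustrative only and are not part of gridengine:

```python
import socket


def check_roundtrip(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name -> address -> name and report whether the names agree.

    Returns (address, reverse_name, consistent). If a lookup fails, the
    missing fields are None and consistent is False.
    """
    try:
        address = forward(name)
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return (None, None, False)
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        reverse_name = reverse(address)[0]
    except OSError:
        return (address, None, False)
    return (address, reverse_name, reverse_name.lower() == name.lower())
```

Calling `check_roundtrip("tools-master.eqiad.wmflabs")` on the master and on a bastion would show whether the two hosts see the same name. A reverse name of plain `tools-master` would correspond to the second error message above.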

Details

	Subject	Repo	Branch	Lines +/-
	tools: Include labsdb aliases only in exec hosts	operations/puppet	production	+2 -2

Related Objects

Status	Assigned	Task
Resolved	valhallasw	T100554 Grid engine masters down
Declined	scfc	T100660 Test if grid engine master non-failure depends on the lengths of /etc/hosts lines
Resolved	yuvipanda	T100662 Figure out why exec_environ was included in gridengine master / shadow
Declined	yuvipanda	T100564 Investigate why nscd is used in labs
Resolved	coren	T90546 Test and verify that OGE master/shadow failover works as expected
Resolved	Krenair	T63897 Move LabsDB aliases to DNS
Resolved	scfc	T91733 /usr/bin/sql should query DNS as well to determine whether a database has been replicated
Resolved	Andrew	T93691 dhclient overwrites /etc/resolv.conf
Resolved	coren	T95288 Designate should support split horizon resolution to yield private IP of instances behind a public DNS entry
Resolved	Andrew	T99133 New server for labs dns recursor

Event Timeline

valhallasw created this task. May 27 2015, 7:53 PM

valhallasw raised the priority of this task to Unbreak Now!

valhallasw updated the task description.

valhallasw added a project: Toolforge.

valhallasw added subscribers: valhallasw, yuvipanda, scfc, coren.

Restricted Application added a subscriber: Aklapper. May 27 2015, 7:53 PM

@BBlack jumped in to help, and we seemed to have SGE working again, but it's down again.

Okay, with /etc/hosts being

127.0.0.1 tools-master
10.68.16.9 tools-master.eqiad.wmflabs tools-master

and

22:02 <bblack> I restarted nscd, I think it was caching Bad Things for some reason
22:02 <bblack> just on tools-master

seems to have made it work again.
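In the /etc/hosts quoted above, `tools-master` appears under both the loopback address and 10.68.16.9. Whether that ambiguity matters for this outage is not established here, but listing such names makes hand-edited files easier to review. A small sketch, with the hypothetical helper names `parse_hosts` and `ambiguous_names`:

```python
def parse_hosts(text):
    """Map each hostname in /etc/hosts-style text to the addresses it is listed under."""
    names = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        fields = line.split()
        address, hostnames = fields[0], fields[1:]
        for hostname in hostnames:
            names.setdefault(hostname, []).append(address)
    return names


def ambiguous_names(text):
    """Return the hostnames that are listed under more than one distinct address."""
    return {name: addrs for name, addrs in parse_hosts(text).items() if len(set(addrs)) > 1}


# The two lines quoted in the comment above.
HOSTS = """\
127.0.0.1 tools-master
10.68.16.9 tools-master.eqiad.wmflabs tools-master
"""
print(ambiguous_names(HOSTS))  # prints {'tools-master': ['127.0.0.1', '10.68.16.9']}
```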

... and, now that I've typed this, it's broken again:

tools.nlwikibots@tools-bastion-01:~$ qstat
error: commlib error: access denied (server host resolves rdata host "tools-bastion-01.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)")

jeremyb-phone set Security to None. May 27 2015, 9:08 PM

jeremyb-phone added a subscriber: jeremyb.

Puppet is disabled on -master and -shadow, and nscd is disabled on -master as well. /etc/hosts is in some hacked-up state, and 'tools-master.eqiad.wmflabs' was hand-entered into /var/lib/gridengine/default/common/act_qmaster. Intermittent failures are still happening, but not as badly as before.

yuvipanda mentioned this in T100564: Investigate why nscd is used in labs. May 27 2015, 9:17 PM

Ricordisamoa subscribed. May 27 2015, 9:20 PM

I've been running:

# Poll qstat in a loop; print a dot per success and stop at the first failure.
while /bin/true; do
    qstat > /dev/null
    if [ $? -ne 0 ]; then
        exit
    fi
    echo -n .
done

on a shell and it hasn't broken in a while (it used to break).
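The same polling idea can be sketched in Python so that the number of successful runs before the first failure is recorded, which makes "it used to break" easier to compare across attempts. This is only an illustration: `poll_until_failure` and `quiet_call` are hypothetical names, and the `run` parameter exists so the loop can be exercised without a grid engine installation.

```python
import subprocess
import time


def quiet_call(cmd):
    """Run cmd with stdout discarded and return its exit status."""
    return subprocess.call(cmd, stdout=subprocess.DEVNULL)


def poll_until_failure(cmd, max_runs=None, delay=0.0, run=quiet_call):
    """Run cmd repeatedly and return how many runs succeeded before the first failure.

    If max_runs is given and that many runs succeed, max_runs is returned.
    """
    successes = 0
    while max_runs is None or successes < max_runs:
        if run(cmd) != 0:
            break
        successes += 1
        time.sleep(delay)
    return successes
```

For example, `poll_until_failure(["qstat"], max_runs=1000, delay=0.5)` would stop either at the first non-zero exit status or after 1000 clean runs.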

We aren't sure what the underlying cause is. This started happening right after gridengine-master was restarted due to a package upgrade, and persisted even after the package was reverted to a previous state. The running theory is that whatever broke this was a much earlier change that had not taken effect because the service had not been restarted since.

Experiencing errors myself when attempting to start a webservice:

error: commlib error: access denied (server host resolves rdata host "tools-bastion-01.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)")
error: unable to contact qmaster using port 6444 on host "tools-master"
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 274, in <module>
    main()
  File "/usr/local/bin/webservice", line 233, in main
    job = get_job_xml(job_name)
  File "/usr/local/bin/webservice", line 79, in get_job_xml
    output = subprocess.check_output(['qstat', '-xml'])
  File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['qstat', '-xml']' returned non-zero exit status 1

Hope the log helps...
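The traceback shows a single `subprocess.check_output(['qstat', '-xml'])` call turning one transient qmaster error into a failed command. As a hedged illustration only (this is not what /usr/local/bin/webservice does, and it would mask the symptom rather than fix the resolution problem), a retry wrapper might look like the following. The name `check_output_with_retry` and its parameters are assumptions:

```python
import subprocess
import time


def check_output_with_retry(cmd, attempts=3, delay=1.0, run=subprocess.check_output):
    """Call run(cmd), retrying on CalledProcessError; re-raise after the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return run(cmd)
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            time.sleep(delay)
```

A caller would use `check_output_with_retry(['qstat', '-xml'])` in place of the direct call; the `run` parameter is there so the behaviour can be tested without qstat.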

It failed twice and then started working right away afterwards :| I did a dig on tools-master, and that was the only thing I did between it failing and not failing, but I don't think that's relevant.

Not sure if this is related, but the perceived frequency of mails from sudo such as:

From:  <diamond@tools.wmflabs.org>
Subject: *** SECURITY information for tools-redis-slave ***
To: root@tools.wmflabs.org
Date: Mon, 25 May 2015 19:52:35 +0000 (2 days, 2 hours ago)

tools-redis-slave : May 25 19:52:35 : diamond : unable to resolve host tools-redis-slave

has increased in the past few days.

zhuyifei1999 subscribed. May 28 2015, 10:44 AM

@BBlack suspects that the problems are caused by overly large /etc/hosts files; reverting them seems to have fixed the issues.
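To compare hosts files before and after such a revert, a few size figures are enough. The sketch below is an assumption-laden illustration: `hosts_stats` is a hypothetical helper, and no threshold is implied, since this task does not establish at what size or line length (if any) the master starts failing.

```python
def hosts_stats(text):
    """Return (entry_count, longest_line_length, total_bytes) for /etc/hosts-style text.

    Only lines with content outside comments count as entries; the longest
    line is measured over all lines, comments included.
    """
    lines = text.splitlines()
    entries = [line for line in lines if line.split("#", 1)[0].strip()]
    longest = max((len(line) for line in lines), default=0)
    return (len(entries), longest, len(text.encode("utf-8")))
```

Running it on the contents of /etc/hosts from an exec host and from tools-master would show how much the generated labsdb aliases add.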

Change 214338 had a related patch set uploaded (by Yuvipanda):
tools: Include labsdb aliases only in exec hosts

https://gerrit.wikimedia.org/r/214338

gerritbot added a project: Patch-For-Review. May 28 2015, 1:45 PM

Change 214338 merged by Yuvipanda:
tools: Include labsdb aliases only in exec hosts

https://gerrit.wikimedia.org/r/214338

yuvipanda mentioned this in rOPUP5d6e8e5425d2: tools: Include labsdb aliases only in exec hosts. May 28 2015, 1:52 PM

I've puppetized the fix (remove /etc/hosts generation for labsdb from tools-master and -shadow), and everything seems to be ok. Filing a load of followup bugs atm.

Puppet and nscd are enabled again on both hosts.

yuvipanda added a subtask: T100564: Investigate why nscd is used in labs. May 28 2015, 3:02 PM

yuvipanda mentioned this in T90546: Test and verify that OGE master/shadow failover works as expected. May 28 2015, 3:08 PM

yuvipanda added a subtask: T90546: Test and verify that OGE master/shadow failover works as expected.