
Grid engine masters down
Closed, Resolved (Public)

Description

Apparently the ~wmf1 gridengine package was never installed on gridengine-master. The automatic installation of +wmf2 then caused the entire master to fail.

The underlying issue seems to be /etc/hosts or something like that:

  • /var/lib/gridengine/default/common/act_qmaster is being set to 'localhost' by the master,
  • 20:39 <YuviPanda> valhallasw: [pid 27542] write(2, "error: sge_gethostbyname failed\n", 32) = 32

We tried:

Basic behavior:

valhallasw@tools-bastion-01:/data/project/.system/gridengine/default/common$ qstat
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "localhost": got send error

then, after forcing act_qmaster to tools-master.eqiad.wmflabs:

error: commlib error: access denied (server host resolves rdata host "tools-bastion-01.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)")

we have also seen

error: commlib error: access denied (server host resolves destination host "tools-master.eqiad.wmflabs" as "tools-master")

(after fiddling with /etc/hosts)
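
Since the errors above are about how one host resolves another, it may help to compare forward and reverse lookups on the machine in question. A minimal sketch, assuming `getent` is available; the function name `check_resolution` is made up for illustration and is not part of gridengine:

```shell
# check_resolution NAME
# Print how NAME resolves on this machine and what the resulting address
# maps back to. Returns non-zero if the forward lookup fails.
# Hypothetical helper for illustration only.
check_resolution() {
    name=$1
    # getent goes through the NSS sources configured in nsswitch.conf
    forward=$(getent hosts "$name") || {
        echo "$name: not resolvable" >&2
        return 1
    }
    addr=$(printf '%s\n' "$forward" | awk 'NR == 1 { print $1 }')
    # Given an address, getent hosts performs a reverse lookup
    reverse=$(getent hosts "$addr" | awk 'NR == 1 { print $2 }')
    echo "$name -> $addr -> ${reverse:-<no reverse entry>}"
}
```

Running `check_resolution tools-bastion-01.eqiad.wmflabs` on the master (the side that reports HOST_NOT_RESOLVABLE) would show whether the name resolves there at all and whether the address maps back to the same name.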

Event Timeline

valhallasw raised the priority of this task to Unbreak Now!.
valhallasw updated the task description.
valhallasw added a project: Toolforge.
valhallasw added subscribers: valhallasw, yuvipanda, scfc, coren.

@BBlack jumped in to help, and we seemed to have SGE working again, but it's down again.

Okay, with /etc/hosts being

127.0.0.1 tools-master
10.68.16.9 tools-master.eqiad.wmflabs tools-master

and

22:02 <bblack> I restarted nscd, I think it was caching Bad Things for some reason
22:02 <bblack> just on tools-master

seems to have made it work again.
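
In the hosts file above, tools-master is listed under two addresses. To see which names a hosts file maps to more than one address, something like the following could be used. This is only a sketch; `dup_hosts` is a made-up name, and whether such double entries matter for gridengine is not established here:

```shell
# dup_hosts FILE
# Print every name that FILE lists under more than one address,
# followed by those addresses. Hypothetical helper for illustration only.
dup_hosts() {
    awk '
        { sub(/#.*/, "") }          # strip comments
        NF < 2 { next }             # skip blank and address-only lines
        {
            for (i = 2; i <= NF; i++) {
                key = $i
                if ((key, $1) in seen)
                    continue        # same name/address pair seen before
                seen[key, $1] = 1
                if (count[key]++)
                    addrs[key] = addrs[key] " " $1
                else
                    addrs[key] = $1
            }
        }
        END {
            for (name in count)
                if (count[name] > 1)
                    print name ": " addrs[name]
        }
    ' "$1"
}
```

Usage would be `dup_hosts /etc/hosts`.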

... and, now that I've typed this, it's broken again:

tools.nlwikibots@tools-bastion-01:~$ qstat
error: commlib error: access denied (server host resolves rdata host "tools-bastion-01.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)")

puppet is disabled on -master and -shadow, and nscd is disabled on -master as well. /etc/hosts is in some hacked up state, and 'tools-master.eqiad.wmflabs' was hand entered into /var/lib/gridengine/default/act_qmaster. Intermittent failures are still happening, but not as badly as before.

I've been running:

while /bin/true; do
    qstat > /dev/null
    if [ $? -ne 0 ]; then
        exit
    fi
    echo -n .
done

on a shell and it hasn't broken in a while (it used to break).
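
A variant that keeps going after a failure and tallies the failures might give a better picture of how intermittent the problem is. A minimal sketch; the function name `probe` and its arguments are made up for illustration and are not part of any gridengine tooling:

```shell
# probe COUNT CMD [ARGS...]
# Run CMD COUNT times, discard its output, and print "failed/total".
# Hypothetical helper for illustration only.
probe() {
    total=$1
    shift
    failed=0
    i=0
    while [ "$i" -lt "$total" ]; do
        if ! "$@" > /dev/null 2>&1; then
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "$failed/$total"
}
```

For example, `probe 100 qstat` would print how many of 100 qstat calls failed. There is no delay between calls; adding a `sleep` inside the loop would be kinder to the master.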

We aren't sure what the underlying cause is. This started happening right after gridengine-master was restarted due to a package upgrade, and persisted even after the package was reverted to a previous state. The running theory is that whatever broke this was a much earlier change that simply hadn't taken effect because the service had not been restarted since.

I'm seeing errors myself when attempting to start a webservice:

error: commlib error: access denied (server host resolves rdata host "tools-bastion-01.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)")
error: unable to contact qmaster using port 6444 on host "tools-master"
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 274, in <module>
    main()
  File "/usr/local/bin/webservice", line 233, in main
    job = get_job_xml(job_name)
  File "/usr/local/bin/webservice", line 79, in get_job_xml
    output = subprocess.check_output(['qstat', '-xml'])
  File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['qstat', '-xml']' returned non-zero exit status 1

Hope the log helps...

It failed twice and then started working right away afterwards :| I did a dig on tools-master and that was the only intervening thing I did between it failing and not failing, but I don't know / think that's relevant.
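
As a stopgap while failures are intermittent, a caller could retry a few times before giving up. A rough sketch, not what webservice actually does; `retry` and its arguments are hypothetical:

```shell
# retry ATTEMPTS DELAY CMD [ARGS...]
# Run CMD until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 on the first success, 1 otherwise.
# Hypothetical helper for illustration only.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}
```

For example, `retry 3 1 qstat -xml` would try up to three times, one second apart. This only papers over the problem and does nothing about the underlying resolution failures.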

Not sure if this is related, but the perceived frequency of mails from sudo:

From:  <diamond@tools.wmflabs.org>
Subject: *** SECURITY information for tools-redis-slave ***
To: root@tools.wmflabs.org
Date: Mon, 25 May 2015 19:52:35 +0000 (2 days, 2 hours ago)

tools-redis-slave : May 25 19:52:35 : diamond : unable to resolve host tools-redis-slave

has increased in the past few days.

@BBlack figured that the problems are perhaps caused by overly large /etc/hosts files, and reverting them seems to have fixed the issues.

Change 214338 had a related patch set uploaded (by Yuvipanda):
tools: Include labsdb aliases only in exec hosts

https://gerrit.wikimedia.org/r/214338

Change 214338 merged by Yuvipanda:
tools: Include labsdb aliases only in exec hosts

https://gerrit.wikimedia.org/r/214338

I've puppetized the fix (remove /etc/hosts generation for labsdb from tools-master and -shadow), and everything seems to be ok. Filing a load of followup bugs atm.

Puppet and nscd are both enabled on both hosts.

yuvipanda lowered the priority of this task from Unbreak Now! to Medium. May 28 2015, 3:11 PM

(resetting priority now that gridengine masters are not down)

valhallasw claimed this task.