Apparently the ~wmf1 gridengine package was never installed on gridengine-master. The automatic installation of +wmf2 then caused the entire master to fail.
The underlying issue appears to be hostname resolution on the master (/etc/hosts or something related):
- /var/lib/gridengine/default/common/act_qmaster is being set to 'localhost' by the master
- 20:39 <YuviPanda> valhallasw: [pid 27542] write(2, "error: sge_gethostbyname failed\n", 32) = 32
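A quick way to spot the first symptom is to check whether act_qmaster holds a loopback name. The following is a minimal sketch, not part of gridengine: the path is the one quoted above, while the function name and the list of loopback names are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

# Path as quoted in these notes; other installations may use a different cell.
ACT_QMASTER = Path("/var/lib/gridengine/default/common/act_qmaster")

# Names that only make sense on the master itself (illustrative list).
LOOPBACK_NAMES = {"localhost", "localhost.localdomain", "127.0.0.1", "127.0.1.1", "::1"}


def act_qmaster_problem(contents: str) -> Optional[str]:
    """Return a description of the problem, or None if the name looks usable."""
    name = contents.strip().lower()
    if not name:
        return "act_qmaster is empty"
    if name in LOOPBACK_NAMES:
        return f"act_qmaster is {name!r}; clients would try to reach the qmaster on themselves"
    return None


if __name__ == "__main__":
    if ACT_QMASTER.exists():
        print(act_qmaster_problem(ACT_QMASTER.read_text()) or "act_qmaster looks fine")
    else:
        print(f"{ACT_QMASTER} not found")
```

This only inspects the file. It does not explain why the master wrote 'localhost' in the first place.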
We tried:
- setting various combinations of localhost/tools-master in /etc/hosts. See e.g. https://scidom.wordpress.com/2012/01/18/sge-on-single-pc/, http://informatics.malariagen.net/2011/06/01/gridengine-the-ubuntu-debian-way/, and http://talby.rcs.manchester.ac.uk/~ri/_notes_sge/name-and-address-resolution-and-troubleshooting.html
- rebooting the server. This did solve some of the issues, which suggests DNS was somehow borked as well.
Basic behavior:
valhallasw@tools-bastion-01:/data/project/.system/gridengine/default/common$ qstat
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "localhost": got send error
then, after forcing act_qmaster to tools-master.eqiad.wmflabs:
error: commlib error: access denied (server host resolves rdata host "tools-bastion-01.eqiad.wmflabs" as "(HOST_NOT_RESOLVABLE)")
we have also seen
error: commlib error: access denied (server host resolves destination host "tools-master.eqiad.wmflabs" as "tools-master")
(after fiddling with /etc/hosts)
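Both "access denied" errors above point at the master resolving a host to a different name than the one the peer uses. A minimal sketch for checking forward and reverse resolution from any host follows. It uses only the Python standard library; the helper names are hypothetical, and the exact comparison gridengine performs internally is an assumption here.

```python
import socket


def names_match(claimed: str, resolved: str) -> bool:
    """Compare two hostnames, ignoring case and a trailing dot."""
    def norm(host: str) -> str:
        return host.strip().rstrip(".").lower()
    return norm(claimed) == norm(resolved)


def round_trip(hostname: str):
    """Forward-resolve hostname, then reverse-resolve the resulting address.

    Returns (address, reverse_name, error); error is None on success.
    """
    try:
        addr = socket.gethostbyname(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return None, None, f"forward lookup failed: {exc}"
    try:
        reverse_name = socket.gethostbyaddr(addr)[0]
    except OSError as exc:  # socket.herror is a subclass of OSError
        return addr, None, f"reverse lookup failed: {exc}"
    return addr, reverse_name, None


if __name__ == "__main__":
    # Hostnames taken from the error messages above.
    for host in ("tools-master.eqiad.wmflabs", "tools-bastion-01.eqiad.wmflabs"):
        addr, reverse_name, error = round_trip(host)
        if error:
            print(f"{host}: {error}")
        elif names_match(host, reverse_name):
            print(f"{host}: {addr} resolves back to the same name")
        else:
            print(f"{host}: {addr} resolves back to {reverse_name!r} (mismatch)")
```

Running this on both the master and the bastion should show whether a short name such as "tools-master" comes back where the FQDN is expected, which is the mismatch reported in the last error.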