
Toolforge: figure out how to work with the new domain in the grid
Closed, Resolved (Public)

Description

The Toolforge grid migration from Stretch to Buster means we are also moving from the tools.eqiad.wmflabs domain to tools.eqiad1.wikimedia.cloud. In toolsbeta, the whole grid already uses toolsbeta.eqiad1.wikimedia.cloud.

CURRENT STATUS AS OF TASK FILING:

  • grid master == stretch
  • grid shadow == buster

Some issues detected:

  • the hiera key that stores the master will need to be updated when the grid master is migrated to buster, for example:
-sonofgridengine::gridmaster: tools-sgegrid-master.tools.eqiad.wmflabs
+sonofgridengine::gridmaster: tools-sgegrid-master.tools.eqiad1.wikimedia.cloud
  • the grid master somehow has the old domain encoded in its configuration:
aborrero@tools-sgegrid-master:~$ sudo qconf -ss | grep sgegrid
tools-sgegrid-master.tools.eqiad.wmflabs
tools-sgegrid-shadow.tools.eqiad.wmflabs
aborrero@tools-sgegrid-master:~$ sudo qconf -dh tools-sgegrid-shadow.tools.eqiad.wmflabs
can't resolve hostname "tools-sgegrid-shadow.tools.eqiad.wmflabs"
aborrero@tools-sgegrid-master:~$ sudo qconf -ah tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud
tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud added to administrative host list
aborrero@tools-sgegrid-master:~$ sudo qconf -ss | grep sgegrid
tools-sgegrid-master.tools.eqiad.wmflabs
tools-sgegrid-shadow.tools.eqiad.wmflabs

aborrero@tools-sgegrid-shadow:~$ qstat -f
error: commlib error: access denied (server host resolves rdata host "tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud" as "tools-sgegrid-shadow.tools.eqiad.wmflabs")
error: unable to contact qmaster using port 6444 on host "tools-sgegrid-master.tools.eqiad.wmflabs"
  • I've detected a few places where this might be hardcoded:
aborrero@tools-sgegrid-master:/var/lib/gridengine$ cat default/common/shadow_masters
tools-sgegrid-master.tools.eqiad.wmflabs
tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud
tools-sgegrid-shadow.tools.eqiad.wmflabs
  • the master daemon insists on the old domain:
Mar 23 12:46:07 tools-sgegrid-master sge_qmaster[3764]: can't resolve host name "tools-sgegrid-shadow.tools.eqiad.wmflabs": undefined commlib error code
Mar 23 12:46:07 tools-sgegrid-master sge_qmaster[3764]: can't resolve host name "tools-sgegrid-shadow.tools.eqiad.wmflabs": undefined commlib error code
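
A read-only way to list where the old domain still appears could be sketched as follows. The helper name `find_old_domain` is hypothetical, and the example path is simply the directory shown in the session above. The function only reports matches and does not modify any grid configuration.

```shell
#!/bin/bash
# Hypothetical helper: list files and lines under a directory that still
# mention the old domain. Read-only; nothing is changed on disk.
OLD_DOMAIN='tools.eqiad.wmflabs'

find_old_domain() {
    local dir="$1"
    # -r: recurse into the directory
    # -n: print line numbers
    # -F: match the domain as a fixed string, so the dots are literal
    grep -rnF -- "$OLD_DOMAIN" "$dir"
}

# Example (path taken from the session above):
# find_old_domain /var/lib/gridengine/default/common
```

Because the match is a fixed string, entries that already use tools.eqiad1.wikimedia.cloud are not reported.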

Event Timeline

This apparently fixed itself, but some weird things remain:

aborrero@tools-sgegrid-shadow:~$ qstat
error: commlib error: access denied (server host resolves rdata host "tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud" as "tools-sgegrid-shadow.tools.eqiad.wmflabs")
error: unable to contact qmaster using port 6444 on host "tools-sgegrid-master.tools.eqiad.wmflabs"

aborrero@tools-sgegrid-master:~$ sudo systemctl restart gridengine-master.service
[..]
Mar 23 12:46:07 tools-sgegrid-master sge_qmaster[3764]: can't resolve host name "tools-sgegrid-shadow.tools.eqiad.wmflabs": undefined commlib error code
Mar 23 12:46:07 tools-sgegrid-master sge_qmaster[3764]: can't resolve host name "tools-sgegrid-shadow.tools.eqiad.wmflabs": undefined commlib error code
[..]

aborrero@tools-sgegrid-shadow:~$ qstat
error: denied: host "tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud" is neither submit nor admin host

aborrero@tools-sgegrid-master:~$ sudo qconf -ah tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud
tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud added to administrative host list

aborrero@tools-sgegrid-shadow:~$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
task@tools-sgeexec-0901.tools. BI    0/8/50         0.75     lx-amd64      
---------------------------------------------------------------------------------

aborrero@tools-sgegrid-master:$ qconf -ss | grep sgegrid
tools-sgegrid-master.tools.eqiad.wmflabs
tools-sgegrid-shadow.tools.eqiad.wmflabs
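
The commlib error above reports that the master resolves the shadow's new name as the old one. A quick, read-only way to see what a given host's resolver returns for both names might look like this. The helper name `show_resolution` is hypothetical, and the sketch assumes `getent` is available on the host. It does not touch the grid configuration.

```shell
#!/bin/bash
# Hypothetical helper: print what the local resolver (NSS: /etc/hosts, DNS, ...)
# returns for each hostname given as an argument.
show_resolution() {
    local name result
    for name in "$@"; do
        # getent exits non-zero when the name cannot be resolved
        if result=$(getent hosts "$name"); then
            printf '%s -> %s\n' "$name" "$result"
        else
            printf '%s -> (does not resolve)\n' "$name"
        fi
    done
}

# Example, run on the master, using the hostnames from the session above:
# show_resolution tools-sgegrid-shadow.tools.eqiad.wmflabs \
#                 tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud
```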

In both the tools and toolsbeta projects we now have buster master/shadow servers, and they are working fine.

We don't plan to introduce new domains in the foreseeable future, so I think this task can be closed. No concrete solution was found, but we don't expect to hit this problem again.