
Remove modules/toollabs/files/host_aliases
Closed, Resolved · Public

Description

host_aliases was created as an emergency fix when the switch to the new DNS structure (*.eqiad.wmflabs => *.tools.eqiad.wmflabs) failed. Its effects can be confusing. In the long term, we should move to use the "true" host names instead and remove host_aliases.
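
As a rough illustration of the renaming described above, the following sketch maps an old-style name to its new-style equivalent. The function name `to_tools_fqdn` is hypothetical, and it assumes that only the two suffixes mentioned in this task are in play.

```shell
# Hypothetical helper, not part of any existing tooling.
# Assumption: hosts are either short names, *.eqiad.wmflabs (old style)
# or *.tools.eqiad.wmflabs (new style).
to_tools_fqdn() {
    case "$1" in
        # Already new-style: print unchanged.
        *.tools.eqiad.wmflabs) printf '%s\n' "$1" ;;
        # Old-style: strip the old suffix and append the new one.
        *.eqiad.wmflabs) printf '%s\n' "${1%.eqiad.wmflabs}.tools.eqiad.wmflabs" ;;
        # Short name: append the new suffix.
        *) printf '%s\n' "$1.tools.eqiad.wmflabs" ;;
    esac
}
```

For example, `to_tools_fqdn tools-exec-1201.eqiad.wmflabs` prints `tools-exec-1201.tools.eqiad.wmflabs`.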

As host names are probably cached by SGE, to avoid catastrophic failures the modus operandi should probably be:

  1. For one submit host, execution host, etc., remove the old host name from its respective function, e.g. disable an execution host and drain all jobs running on it first.
  2. Remove the alias from host_aliases.
  3. Restart the grid master service.
  4. Add the new host name to its respective function, e.g. add an execution host as usual and enable it.
  5. Increase "one" in 1. to a comfortable number and repeat the process.
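
The steps above could be sketched for a single execution host roughly as follows. This is a dry run that only prints commands and executes nothing; the function name `plan_exec_host_switch` is hypothetical, and the exact sequence is an assumption based on the commands used elsewhere in this task rather than a tested procedure.

```shell
# Dry run only: prints a plan for one execution host, runs nothing.
# Argument: the short host name, e.g. tools-exec-1201.
# Assumption: old and new names differ only in the ".tools." component.
plan_exec_host_switch() {
    host=$1
    old="$host.eqiad.wmflabs"
    new="$host.tools.eqiad.wmflabs"
    cat <<EOF
# 1. Disable all queue instances on the old name and let running jobs drain
qmod -d '*@$old'
# (remove $old from queues and host groups first: qconf -mq <queue>, qconf -mhgrp <group>)
qconf -de $old
# 2. Remove the alias for $host from host_aliases (Puppet change)
# 3. Restart the grid master
sudo service gridengine-master restart
# 4. Re-add the host under its new name (qconf -ae or qconf -Ae <file>), then enable it
qmod -e '*@$new'
EOF
}
```

Printing the plan first makes it easier to review each command before running it by hand, which seems prudent given that host names may be cached by SGE.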

Event Timeline

scfc raised the priority of this task to Lowest.
scfc updated the task description.
scfc added a project: Toolforge.
scfc moved this task to Ready to be worked on on the Toolforge board.
scfc subscribed.
Restricted Application added a subscriber: Aklapper.
scfc raised the priority of this task from Lowest to High (Aug 31 2015, 11:28 PM).
scfc moved this task from Ready to be worked on to In Progress on the Toolforge board.

We have run into this problem a couple of times lately, and I'd like to get this out of the way sooner rather than later.

For tools-webgrid-generic-1404:

qmod -d webgrid-generic\@tools-webgrid-generic-1404.eqiad.wmflabs
qconf -mq webgrid-generic
qmod -rj 1766173 499843 499859
qconf -de tools-webgrid-generic-1404.eqiad.wmflabs

For tools-exec-1201:

qconf -de tools-exec-1201

For tools-bastion-02:

qconf -dh tools-bastion-02.eqiad.wmflabs
qconf -ds tools-bastion-02.eqiad.wmflabs

For tools-checker-01 (I can't see which instance has the public IP, but -01 had 208.80.155.255 and -02 208.80.155.229, so assuming the latter is not live):

qconf -ds tools-checker-01.eqiad.wmflabs

For tools-services-02:

qconf -ds tools-services-02.eqiad.wmflabs

Left tools-exec-giftbot alone even though no jobs were running, because it launches its fireworks on the first of the month IIRC.

Change 235157 had a related patch set uploaded (by Tim Landscheidt):
Tools: Remove gridengine aliases for some hosts

https://gerrit.wikimedia.org/r/235157

Plan for after the change gets merged:

  1. On tools-master, sudo service gridengine-master restart. This should be safe and not cause any loss of data.
  2. On tools-exec-1201 and tools-webgrid-generic-1404, sudo service gridengine-exec restart (after checking that there are no jobs running on that machine). This is in case the host_aliases file gets cached by the local daemons.
  3. Undo the commands in T109485#1591260.
  4. Wait a few hours for any signs of trouble.

Preparation work: Disabled all queues on tools-exec-1218, tools-exec-1401, tools-webgrid-generic-1401, tools-webgrid-lighttpd-1201 and tools-webgrid-lighttpd-1402.

I have re-added tools-bastion-02 as a submit host because people are actually using it (cf. T110982). So the actual switch of the host name would be done between 1. and 3. above, and that should be the maximum span during which that host is not available as a submit host.

After the number of pending jobs grew, I undid the preparation work by:

scfc@tools-bastion-01:~$ for host in tools-exec-1218 tools-exec-1401 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1201 tools-webgrid-lighttpd-1402; do qmod -e "*@$host"; done
scfc@tools-bastion-01.eqiad.wmflabs changed state of "continuous@tools-exec-1218.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "task@tools-exec-1218.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "mailq@tools-exec-1218.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "continuous@tools-exec-1401.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "task@tools-exec-1401.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "mailq@tools-exec-1401.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "webgrid-generic@tools-webgrid-generic-1401.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "webgrid-lighttpd@tools-webgrid-lighttpd-1201.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "webgrid-lighttpd@tools-webgrid-lighttpd-1402.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01:~$

Re-enabled submit host status for tools-services-02, as it was running webservicemonitor.

Ah, https://wikitech.wikimedia.org/wiki/Hiera:Tools customizes role::labs::tools::services::active_host. I had only looked at hieradata/, sorry.

… so disabled tools-services-01 as submit host.

Change 235157 merged by coren:
Tools: Remove gridengine aliases for some hosts

https://gerrit.wikimedia.org/r/235157

After restarting the grid engine master and execd on tools-exec-1201, qstat -f still showed the queues for that instance as au, but:

scfc@tools-bastion-01:~$ qmod -e \*@tools-exec-1201.tools.eqiad.wmflabs
invalid queue "*@tools-exec-1201.tools.eqiad.wmflabs"
scfc@tools-bastion-01:~$ qmod -e \*@tools-exec-1201.eqiad.wmflabs
Queue instance "continuous@tools-exec-1201.eqiad.wmflabs" is already in the specified state: enabled
Queue instance "mailq@tools-exec-1201.eqiad.wmflabs" is already in the specified state: enabled
Queue instance "task@tools-exec-1201.eqiad.wmflabs" is already in the specified state: enabled
scfc@tools-bastion-01:~$

Then I remembered that the host was still listed in the host group (qconf -mhgrp \@general) as tools-exec-1201.eqiad.wmflabs, so I changed it there. I then re-added the host with qconf -ae, realized that I should have used qconf -Ae /var/lib/gridengine/etc/exechosts/$hostname instead, and fixed that with qconf -me $hostname. Finally, I had to restart execd on tools-exec-1201 for the host to be recognized, and the status in qstat -f went to okay.

tools-bastion-02:

scfc@tools-master:~$ qconf -as tools-bastion-02
tools-bastion-02.tools.eqiad.wmflabs added to submit host list
scfc@tools-master:~$ qconf -ah tools-bastion-02
tools-bastion-02.tools.eqiad.wmflabs added to administrative host list

and:

scfc@tools-bastion-01:~$ qconf -ds tools-bastion-02.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-bastion-02.eqiad.wmflabs" from submit host list

tools-checker-01 was readded as a submit host (?), but I was bold:

scfc@tools-bastion-01:~$ qconf -as tools-checker-01
tools-checker-01.tools.eqiad.wmflabs added to submit host list
scfc@tools-bastion-01:~$ qconf -ds tools-checker-01.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-checker-01.eqiad.wmflabs" from submit host list
scfc@tools-bastion-01:~$ qconf -ss | fgrep checker-01
tools-checker-01.tools.eqiad.wmflabs
scfc@tools-bastion-01:~$

tools-services-02 was still a submit host, but tools-services-01 was not:

scfc@tools-bastion-01:~$ qconf -ds tools-services-02.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-services-02.eqiad.wmflabs" from submit host list
scfc@tools-bastion-01:~$ qconf -as tools-services-02
tools-services-02.tools.eqiad.wmflabs added to submit host list
scfc@tools-bastion-01:~$ qconf -ss | fgrep services
tools-services-01.eqiad.wmflabs
tools-services-02.tools.eqiad.wmflabs
scfc@tools-bastion-01:~$
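
To spot leftovers like tools-services-01 above, a small filter along these lines might help. The function name `list_old_style_hosts` is hypothetical; the sketch assumes one host name per line on stdin, as printed by qconf -ss or qconf -sel.

```shell
# Hypothetical filter: reads host names on stdin, one per line, and prints
# those that end in .eqiad.wmflabs but lack the ".tools." component.
list_old_style_hosts() {
    awk '/\.eqiad\.wmflabs$/ && !/\.tools\.eqiad\.wmflabs$/'
}
```

Usage would be something like `qconf -ss | list_old_style_hosts` for submit hosts or `qconf -sel | list_old_style_hosts` for execution hosts; empty output would suggest that no old-style names remain in that list.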

tools-webgrid-generic-1404 has been reenabled (?), so I'll disable it, reschedule the jobs, fix the host name in the host groups, restart execd and reenable the queue.

I misread the process table: the queue appears to be disabled and no jobs are running on this host, but processes for the tools clickstream-api, faces and languageproofing (all started August 31st) and valhallasw-testing-tool (started August 14th) were still running there. I rebooted the host to start from scratch.

After the mishap with T113614, to get tools-webgrid-generic-1404 working again I did:

  1. Restarted gridengine-exec and gridengine-master. The master kept complaining about:
09/26/2015 03:05:05|worker|tools-master|E|commlib error: local host name error (IP based host name resolving "tools-webgrid-generic-1404.eqiad.wmflabs" doesn't match client host name from connect message "tools-webgrid-generic-1404.tools.eqiad.wmflabs")
  2. After much trial and error I noticed that (despite host_aliases not listing them) qconf -sel showed both host names. So I removed the old host name with qconf -de (a step I had probably assumed was already done), and after several restarts of gridengine-master and gridengine-exec (perhaps I was just too impatient), they recognized each other again.
  3. I added the host back to the queue with qconf -mq webgrid-generic.

(I also found a warning about tools-exec-wmt not resolving in messages, so I removed that host name with qconf -de as well.)
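
To find further hosts affected by the commlib error quoted above, a filter like the following could pull the mismatching name pairs out of the master's messages file. The function name `extract_name_mismatches` is hypothetical, and the pattern is matched only against the single log line quoted in this task, so other variants of the message may not be caught.

```shell
# Hypothetical filter: reads grid engine log lines on stdin and prints
# "<resolved name> <name from connect message>" for each host name
# mismatch reported by commlib, de-duplicated.
# The "." in "doesn.t" stands in for the apostrophe to keep quoting simple.
extract_name_mismatches() {
    sed -n 's/.*IP based host name resolving "\([^"]*\)" doesn.t match client host name from connect message "\([^"]*\)".*/\1 \2/p' |
        sort -u
}
```

It would be fed the messages file on stdin, e.g. `extract_name_mismatches < messages`, where the location of that file depends on the local installation.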

Forgot: tools-webgrid-generic-1404.eqiad.wmflabs was referenced as submit host, so I added the .tools. variant and removed the old host name.

Change 241582 had a related patch set uploaded (by Tim Landscheidt):
Tools: Unpuppetize host_aliases

https://gerrit.wikimedia.org/r/241582

Change 241582 abandoned by Tim Landscheidt:
Tools: Unpuppetize host_aliases

https://gerrit.wikimedia.org/r/241582

scfc removed scfc as the assignee of this task (Dec 2 2016, 1:02 AM).
scfc removed a project: Patch-For-Review.
scfc moved this task from Waiting for code review to Ready to be worked on on the Toolforge board.

Change 496680 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: Cleanup host_aliases and exim4 conf for Trusty grid

https://gerrit.wikimedia.org/r/496680

Change 496680 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: Cleanup host_aliases and exim4 conf for Trusty grid

https://gerrit.wikimedia.org/r/496680

Change 499031 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: remove deleted Trusty host aliases

https://gerrit.wikimedia.org/r/499031

Change 499031 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: remove deleted Trusty host aliases

https://gerrit.wikimedia.org/r/499031

bd808 claimed this task.