
Remove modules/toollabs/files/host_aliases
Closed, Resolved · Public

Description

host_aliases was created as an emergency fix when the switch to the new DNS structure (*.eqiad.wmflabs => *.tools.eqiad.wmflabs) failed. Its effects can be confusing. In the long term, we should move to use the "true" host names instead and remove host_aliases.

As host names are probably cached by SGE, to avoid catastrophic failures the modus operandi should probably be:

  1. Pick one host (a submit host, execution host, etc.) and remove its old host name from the corresponding function; for an execution host, disable it and drain all jobs running on it first.
  2. Remove the alias from host_aliases.
  3. Restart the grid master service.
  4. Add the new host name back to the corresponding function; for an execution host, add it as usual and enable it.
  5. Increase "one" in step 1 to a comfortable number and repeat the process.
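
As a rough illustration, the steps above for a single execution host could be written down as a dry run before touching the grid. This is only a sketch: print_exec_host_migration is a made-up helper name, it prints commands rather than running them, and it uses only commands that also appear in the comments on this task (qmod -d, qmod -e, qconf -de, service gridengine-master restart). Editing queues and host groups and re-adding the host are left as comments because they depend on the host.

```shell
# Dry run: print the command sequence for moving one execution host from its
# old name to its new name. Nothing is executed against the grid.
# print_exec_host_migration is a hypothetical helper, not an existing tool.
print_exec_host_migration() {
    short=$1                          # short host name, e.g. tools-exec-1201
    old="$short.eqiad.wmflabs"        # name used before the DNS switch
    new="$short.tools.eqiad.wmflabs"  # "true" host name
    printf '%s\n' \
        "# 1. disable the queues on the host, wait for jobs to drain, drop the old name" \
        "qmod -d '*@$old'" \
        "# (remove $old from queues and host groups before deleting it)" \
        "qconf -de $old" \
        "# 2. remove the alias for $short from host_aliases" \
        "# 3. restart the grid master" \
        "sudo service gridengine-master restart" \
        "# 4. add $new as an execution host as usual, then enable it" \
        "qmod -e '*@$new'"
}

print_exec_host_migration tools-exec-1201
```

Printing the plan first makes it easier to review the old and new names side by side before anything is removed.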

Event Timeline

scfc created this task. Aug 18 2015, 5:44 PM
scfc updated the task description.
scfc set the priority of this task to Lowest.
scfc added a project: Toolforge.
scfc moved this task to Backlog on the Toolforge board.
scfc added a subscriber: scfc.
Restricted Application added a project: Cloud-Services. Aug 18 2015, 5:44 PM
Restricted Application added a subscriber: Aklapper.
scfc moved this task from Backlog to In Progress on the Toolforge board. Aug 31 2015, 11:28 PM
scfc raised the priority of this task from Lowest to High.

We have run into this problem a couple of times lately, and I'd like to get it out of the way sooner rather than later.

scfc claimed this task. Aug 31 2015, 11:29 PM
scfc added a comment. Aug 31 2015, 11:57 PM

For tools-webgrid-generic-1404:

qmod -d webgrid-generic\@tools-webgrid-generic-1404.eqiad.wmflabs
qconf -mq webgrid-generic
qmod -rj 1766173 499843 499859
qconf -de tools-webgrid-generic-1404.eqiad.wmflabs

For tools-exec-1201:

qconf -de tools-exec-1201

For tools-bastion-02:

qconf -dh tools-bastion-02.eqiad.wmflabs
qconf -ds tools-bastion-02.eqiad.wmflabs

For tools-checker-01 (I can't see which instance has the public IP, but -01 had 208.80.155.255 and -02 208.80.155.229, so assuming the latter is not live):

qconf -ds tools-checker-01.eqiad.wmflabs

For tools-services-02:

qconf -ds tools-services-02.eqiad.wmflabs

Left tools-exec-giftbot alone even though no jobs were running, because that launches its fireworks on the first of the month IIRC.

Change 235157 had a related patch set uploaded (by Tim Landscheidt):
Tools: Remove gridengine aliases for some hosts

https://gerrit.wikimedia.org/r/235157

scfc added a comment. Sep 1 2015, 12:12 AM

Plan for after the change gets merged:

  1. On tools-master, sudo service gridengine-master restart. This should be safe and not cause any loss of data.
  2. On tools-exec-1201 and tools-webgrid-generic-1404, sudo service gridengine-exec restart (after checking that there are no jobs running on that machine). This is in case the host_aliases file gets cached by the local daemons.
  3. Undo the commands in T109485#1591260.
  4. Wait a few hours for any signs of trouble.
scfc added a comment. Sep 1 2015, 12:20 AM

Preparation work: Disabled all queues on tools-exec-1218, tools-exec-1401, tools-webgrid-generic-1401, tools-webgrid-lighttpd-1201 and tools-webgrid-lighttpd-1402.

scfc added a comment. Sep 1 2015, 1:34 AM

I have re-added tools-bastion-02 as a submit host because people are actually using it (cf. T110982). So the actual switch of the host name would be done between steps 1 and 3 above, and that should be the maximum span during which the host is unavailable as a submit host.

scfc added a comment. Sep 1 2015, 3:02 AM

After the number of pending jobs grew, I undid the preparation work by:

scfc@tools-bastion-01:~$ for host in tools-exec-1218 tools-exec-1401 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1201 tools-webgrid-lighttpd-1402; do qmod -e "*@$host"; done
scfc@tools-bastion-01.eqiad.wmflabs changed state of "continuous@tools-exec-1218.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "task@tools-exec-1218.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "mailq@tools-exec-1218.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "continuous@tools-exec-1401.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "task@tools-exec-1401.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "mailq@tools-exec-1401.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "webgrid-generic@tools-webgrid-generic-1401.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "webgrid-lighttpd@tools-webgrid-lighttpd-1201.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01.eqiad.wmflabs changed state of "webgrid-lighttpd@tools-webgrid-lighttpd-1402.eqiad.wmflabs" (enabled)
scfc@tools-bastion-01:~$

Re-enabled it for tools-services-02; it was running webservicemonitor.

scfc added a comment. Sep 1 2015, 8:37 PM

Ah, https://wikitech.wikimedia.org/wiki/Hiera:Tools customizes role::labs::tools::services::active_host. I had only looked at hieradata/, sorry.

scfc added a comment. Sep 1 2015, 8:44 PM

… so disabled tools-services-01 as submit host.

Change 235157 merged by coren:
Tools: Remove gridengine aliases for some hosts

https://gerrit.wikimedia.org/r/235157

scfc added a comment. Sep 24 2015, 2:22 PM

After restarting the grid engine master and execd on tools-exec-1201, qstat -f still showed the queues for that instance as au, but:

scfc@tools-bastion-01:~$ qmod -e \*@tools-exec-1201.tools.eqiad.wmflabs
invalid queue "*@tools-exec-1201.tools.eqiad.wmflabs"
scfc@tools-bastion-01:~$ qmod -e \*@tools-exec-1201.eqiad.wmflabs
Queue instance "continuous@tools-exec-1201.eqiad.wmflabs" is already in the specified state: enabled
Queue instance "mailq@tools-exec-1201.eqiad.wmflabs" is already in the specified state: enabled
Queue instance "task@tools-exec-1201.eqiad.wmflabs" is already in the specified state: enabled
scfc@tools-bastion-01:~$

Then I remembered that the host was still listed in qconf -mhgrp \@general as tools-exec-1201.eqiad.wmflabs, so I changed it there. I then re-added the host with qconf -ae, remembered that I should have used qconf -Ae /var/lib/gridengine/etc/exechosts/$hostname instead, and fixed that with qconf -me $hostname. Finally, I had to restart execd on tools-exec-1201 for the host to be recognized, and the status in qstat -f went to okay.
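
The mix-up above comes down to two spellings of the same host. A small helper could make the intended name explicit before it is passed to qconf or qmod. This is a sketch only: canonical_name is a hypothetical name, and the two domain suffixes are the ones given in the task description.

```shell
# canonical_name (hypothetical helper): map a short or old-style host name
# to the new-style FQDN under .tools.eqiad.wmflabs.
canonical_name() {
    case $1 in
        *.tools.eqiad.wmflabs) printf '%s\n' "$1" ;;   # already new-style
        *.eqiad.wmflabs)       printf '%s\n' "${1%.eqiad.wmflabs}.tools.eqiad.wmflabs" ;;
        *.*)                   echo "unexpected domain: $1" >&2; return 1 ;;
        *)                     printf '%s\n' "$1.tools.eqiad.wmflabs" ;;   # short name
    esac
}

# Build the queue instance pattern for the new name, in the form used with qmod -e.
printf '*@%s\n' "$(canonical_name tools-exec-1201.eqiad.wmflabs)"
```

Whether a given pattern is accepted still depends on which name the master currently has registered, as the transcript above shows.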

scfc added a comment. Sep 24 2015, 2:28 PM

tools-bastion-02:

scfc@tools-master:~$ qconf -as tools-bastion-02
tools-bastion-02.tools.eqiad.wmflabs added to submit host list
scfc@tools-master:~$ qconf -ah tools-bastion-02
tools-bastion-02.tools.eqiad.wmflabs added to administrative host list

and:

scfc@tools-bastion-01:~$ qconf -ds tools-bastion-02.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-bastion-02.eqiad.wmflabs" from submit host list
scfc added a comment. Sep 24 2015, 2:31 PM

tools-checker-01 was re-added as a submit host (?), but I was bold:

scfc@tools-bastion-01:~$ qconf -as tools-checker-01
tools-checker-01.tools.eqiad.wmflabs added to submit host list
scfc@tools-bastion-01:~$ qconf -ds tools-checker-01.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-checker-01.eqiad.wmflabs" from submit host list
scfc@tools-bastion-01:~$ qconf -ss | fgrep checker-01
tools-checker-01.tools.eqiad.wmflabs
scfc@tools-bastion-01:~$
scfc added a comment. Sep 24 2015, 2:33 PM

tools-services-02 was still a submit host, but tools-services-01 was not:

scfc@tools-bastion-01:~$ qconf -ds tools-services-02.eqiad.wmflabs
scfc@tools-bastion-01.eqiad.wmflabs removed "tools-services-02.eqiad.wmflabs" from submit host list
scfc@tools-bastion-01:~$ qconf -as tools-services-02
tools-services-02.tools.eqiad.wmflabs added to submit host list
scfc@tools-bastion-01:~$ qconf -ss | fgrep services
tools-services-01.eqiad.wmflabs
tools-services-02.tools.eqiad.wmflabs
scfc@tools-bastion-01:~$
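
A filter along these lines could list the hosts that are still registered under an old-style name, instead of grepping for them one by one. It is a sketch: stale_hosts is a hypothetical helper that reads one host name per line on stdin, so on a grid host it would be fed from qconf -ss (or qconf -sel); the sample input below is the qconf -ss output shown above.

```shell
# stale_hosts (hypothetical helper): print host names that end in
# .eqiad.wmflabs but are not under the new .tools.eqiad.wmflabs domain.
stale_hosts() {
    awk '/\.eqiad\.wmflabs$/ && !/\.tools\.eqiad\.wmflabs$/'
}

# On a grid host this would be: qconf -ss | stale_hosts
printf '%s\n' \
    tools-services-01.eqiad.wmflabs \
    tools-services-02.tools.eqiad.wmflabs | stale_hosts
# prints: tools-services-01.eqiad.wmflabs
```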
scfc added a comment. Sep 24 2015, 2:34 PM

tools-webgrid-generic-1404 has been re-enabled (?), so I'll disable it, reschedule the jobs, fix the host name in the host groups, restart execd and re-enable the queue.

scfc added a comment. Sep 24 2015, 2:46 PM

I misread the process table: the queue seems to be disabled and no jobs are running on this host, but processes for the tools clickstream-api, faces and languageproofing (all started August 31st) and valhallasw-testing-tool (started August 14th) were still running there. I rebooted the host to start from scratch.

scfc added a comment. Sep 26 2015, 3:29 AM

After the mishap with T113614, to get tools-webgrid-generic-1404 working again I did:

  1. Restarted gridengine-exec and gridengine-master. The master kept complaining about:
09/26/2015 03:05:05|worker|tools-master|E|commlib error: local host name error (IP based host name resolving "tools-webgrid-generic-1404.eqiad.wmflabs" doesn't match client host name from connect message "tools-webgrid-generic-1404.tools.eqiad.wmflabs")
  2. After lengthy trial and error I noticed that (despite host_aliases not listing them) qconf -sel showed both host names. So I removed the old host name with qconf -de (which I had probably assumed was already done), and after several restarts of gridengine-master and gridengine-exec (perhaps I was just too impatient), they recognized each other again.
  3. I added the host back to the queue with qconf -mq webgrid-generic.
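
The double registration described in the list above could be spotted with a short pipeline over the execution host list. This is a sketch under assumptions: duplicate_hosts is a hypothetical helper that reads one host name per line on stdin (on a grid host, from qconf -sel), and the sample input is illustrative, built from host names mentioned in this task.

```shell
# duplicate_hosts (hypothetical helper): strip the old and new domain suffixes
# and print every short name that occurs more than once, i.e. hosts that are
# registered under both names at the same time.
duplicate_hosts() {
    sed -e 's/\.tools\.eqiad\.wmflabs$//' -e 's/\.eqiad\.wmflabs$//' | sort | uniq -d
}

# On a grid host this would be: qconf -sel | duplicate_hosts
printf '%s\n' \
    tools-exec-1201.tools.eqiad.wmflabs \
    tools-webgrid-generic-1404.eqiad.wmflabs \
    tools-webgrid-generic-1404.tools.eqiad.wmflabs | duplicate_hosts
# prints: tools-webgrid-generic-1404
```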

(I also found a warning about tools-exec-wmt not resolving in messages, so I removed that host name with qconf -de as well.)
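
To find such mismatches without reading the master's messages by eye, the two host names could be pulled out of each commlib error line. The following is a sketch: mismatched_names is a hypothetical helper, the pattern is written against the single error line quoted above, and other message variants may need a different expression.

```shell
# mismatched_names (hypothetical helper): read log lines on stdin and, for each
# commlib "local host name error", print the resolved name and the name the
# client announced, separated by a space.
mismatched_names() {
    sed -n 's/.*IP based host name resolving "\([^"]*\)" doesn.t match client host name from connect message "\([^"]*\)").*/\1 \2/p'
}

# Sample input: the error line quoted in this comment.
mismatched_names <<'EOF'
09/26/2015 03:05:05|worker|tools-master|E|commlib error: local host name error (IP based host name resolving "tools-webgrid-generic-1404.eqiad.wmflabs" doesn't match client host name from connect message "tools-webgrid-generic-1404.tools.eqiad.wmflabs")
EOF
```

The first name is what the master resolved from the IP address and the second is what the execution daemon announced, so a pair that differs only in the .tools. part points at a leftover old-style registration.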

scfc added a comment. Sep 26 2015, 3:37 AM

Forgot: tools-webgrid-generic-1404.eqiad.wmflabs was referenced as submit host, so I added the .tools. variant and removed the old host name.

Change 241582 had a related patch set uploaded (by Tim Landscheidt):
Tools: Unpuppetize host_aliases

https://gerrit.wikimedia.org/r/241582

Change 241582 abandoned by Tim Landscheidt:
Tools: Unpuppetize host_aliases

https://gerrit.wikimedia.org/r/241582

scfc removed scfc as the assignee of this task. Dec 2 2016, 1:02 AM
scfc removed a project: Patch-For-Review.
scfc moved this task from Waiting for code review to Backlog on the Toolforge board.

Change 496680 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: Cleanup host_aliases and exim4 conf for Trusty grid

https://gerrit.wikimedia.org/r/496680

Change 496680 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: Cleanup host_aliases and exim4 conf for Trusty grid

https://gerrit.wikimedia.org/r/496680

Change 499031 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: remove deleted Trusty host aliases

https://gerrit.wikimedia.org/r/499031

Change 499031 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: remove deleted Trusty host aliases

https://gerrit.wikimedia.org/r/499031

bd808 closed this task as Resolved. Mar 27 2019, 11:28 PM
bd808 claimed this task.