
tools-exec-1401/-catscan down (alarm/unknown) due to incorrect DNS/host_aliases
Closed, Resolved (Public)

Description

Currently,

  • tools-exec-1401 and
  • tools-exec-catscan

are effectively offline due to DNS issues. Their queues

valhallasw@tools-bastion-01:~$ qstat -f | grep 'exec-1401\|catscan'
mailq@tools-exec-1401.tools.eq BP    0/0/5          -NA-     -NA-          au
task@tools-exec-1401.tools.eqi BIP   0/0/50         -NA-     -NA-          au
continuous@tools-exec-1401.too BC    0/0/50         -NA-     -NA-          au
catscan@tools-exec-catscan.too BIC   0/0/1000       -NA-     -NA-          au

are all in au (alarm, unknown) state. The cause of this is that SGE actually knows the hosts as

valhallasw@tools-bastion-01:/data/project/admin/public_html$ qhost -h tools-exec-1401 tools-exec-catscan
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
tools-exec-1401.eqiad.wmflabs lx26-amd64      4  0.01    7.8G  234.0M   23.9G     0.0
tools-exec-catscan.eqiad.wmflabs lx26-amd64      4  0.01    7.8G  233.7M    1.9G     0.0

i.e. without .tools. in the hostname, while the queues do have .tools. in the hostname.
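The mismatch described above can be checked mechanically. The following is a minimal sketch, not an existing tool: the function names are hypothetical, and it assumes the full (untruncated) queue instance names and execution host names have already been collected, since plain qstat -f output cuts names short.

```python
def split_queue_instance(name):
    """Split a queue instance name of the form 'queue@host' into (queue, host)."""
    queue, _, host = name.partition("@")
    return queue, host


def mismatched_instances(queue_instances, exec_hosts):
    """Return the queue instances whose host part is not a known execution host."""
    known = set(exec_hosts)
    return [qi for qi in queue_instances
            if split_queue_instance(qi)[1] not in known]


# Sample data modelled on the names in this task (illustrative, not live output).
exec_hosts = [
    "tools-exec-1401.eqiad.wmflabs",
    "tools-exec-1402.eqiad.wmflabs",
]
queue_instances = [
    "mailq@tools-exec-1401.tools.eqiad.wmflabs",
    "mailq@tools-exec-1402.eqiad.wmflabs",
]

print(mismatched_instances(queue_instances, exec_hosts))
```

With this sample data, only the queue instance that carries .tools. in its host part is reported.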

See also T109485: Remove modules/toollabs/files/host_aliases

Event Timeline

valhallasw raised the priority of this task from to Needs Triage.
valhallasw updated the task description.
valhallasw added a project: Toolforge.
valhallasw added subscribers: valhallasw, coren, scfc.
Restricted Application added a subscriber: Aklapper.
valhallasw renamed this task from "SGE hosts down (alarm/unknown) due to incorrect DNS/host_aliases" to "tools-exec-1401/-catscan down (alarm/unknown) due to incorrect DNS/host_aliases". Aug 19 2015, 6:20 PM
valhallasw set Security to None.
valhallasw updated the task description.
valhallasw added a subscriber: Kolossos.

I don't think that this is the cause. For example:

scfc@tools-bastion-01:~$ qhost -h tools-exec-1402
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
tools-exec-1402.eqiad.wmflabs lx26-amd64      4  0.31    7.8G  834.1M   23.9G     0.0
scfc@tools-bastion-01:~$

So the host name without .tools. is the norm.

Sorry, I clarified it in the task description. It's caused by the discrepancy between the queue name and the hostname. The queues are named ....@tools-exec-1401.tools.eqiad, while the hosts are called tools-exec-1401.eqiad.

tools-exec-1402's queues don't have .tools.:

valhallasw@tools-bastion-01:~$ qstat -f | grep 'exec-1402'
mailq@tools-exec-1402.eqiad.wm BP    0/0/5          0.10     lx26-amd64
task@tools-exec-1402.eqiad.wmf BIP   0/0/50         0.10     lx26-amd64
continuous@tools-exec-1402.eqi BC    0/5/50         0.10     lx26-amd64

Deleting them doesn't seem to work:

scfc@tools-bastion-01:~$ qconf -dq mailq@tools-exec-1401
denied: cluster queue "mailq@tools-exec-1401" does not exist
scfc@tools-bastion-01:~$ qconf -dq mailq@tools-exec-1401.tools.eqiad.wmflabs
denied: cluster queue "mailq@tools-exec-1401.tools.eqiad.wmflabs" does not exist
scfc@tools-bastion-01:~$ qconf -dq mailq@tools-exec-1401.eqiad.wmflabs
denied: cluster queue "mailq@tools-exec-1401.eqiad.wmflabs" does not exist
scfc@tools-bastion-01:~$

I ran qconf -mhgrp \@general, removed .tools. from tools-exec-1401's host name, and now:

scfc@tools-bastion-01:~$ qstat -f | grep 'exec-1401'
mailq@tools-exec-1401.eqiad.wm BP    0/0/5          0.14     lx26-amd64    
task@tools-exec-1401.eqiad.wmf BIP   0/0/50         0.14     lx26-amd64    
continuous@tools-exec-1401.eqi BC    0/0/50         0.14     lx26-amd64    
scfc@tools-bastion-01:~$

For tools-exec-catscan, I ran qconf -mq catscan and removed .tools. from the host listed there:

scfc@tools-bastion-01:~$ qstat -f | grep catscan
catscan@tools-exec-catscan.eqi BIC   0/0/1000       0.13     lx26-amd64    
scfc@tools-bastion-01:~$

Now qstat -f shows no queues in non-normal status. Is that correct?

(And if someone could write up which command to use for which function, that would be much appreciated :-). Every time I try to tackle one of those issues, I basically start from scratch.)

Yep, looks good to me. The only oddity left is

valhallasw@tools-bastion-01:~$ qstat -f -xml | less | grep '\.tools'
      <name>webgrid-lighttpd@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs</name>

where both the host /and/ the queue have the new DNS name.
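The same check can be run on the XML output without grep. This is only a sketch: the sample document is hand-written, and apart from the `<name>` element shown above, its wrapper elements are assumptions about what qstat -f -xml emits. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; only the <name> element is taken from the output above,
# the surrounding elements are assumed for illustration.
SAMPLE_XML = """<job_info>
  <queue_info>
    <Queue-List>
      <name>webgrid-lighttpd@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs</name>
    </Queue-List>
    <Queue-List>
      <name>mailq@tools-exec-1402.eqiad.wmflabs</name>
    </Queue-List>
  </queue_info>
</job_info>"""


def queue_names_containing(xml_text, needle):
    """Return the text of every <name> element that contains `needle`."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("name")
            if el.text and needle in el.text]


print(queue_names_containing(SAMPLE_XML, ".tools."))
```

Searching by element rather than by line avoids matching .tools. in unrelated fields, should any exist.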

AFAIUI, that is to be expected as tools-webgrid-lighttpd-1411 is not in host_aliases. In other words (and again: AFAIUI), if a host is in host_aliases, the alias must be used consistently; if it is not, the "true" host name must be used.
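That rule can be sketched as a lookup. The example assumes the host_aliases format in which each line lists whitespace-separated names and the first one is the name SGE uses. The sample entry is hypothetical (the actual file contents are not shown in this task), as are the function names.

```python
def parse_host_aliases(text):
    """Map every name on a host_aliases line to the first name on that line."""
    canonical = {}
    for line in text.splitlines():
        names = line.split()
        if not names:
            continue  # skip blank lines
        for name in names:
            canonical[name] = names[0]
    return canonical


def sge_name(host, canonical):
    """Return the aliased name if the host is listed, else its "true" name."""
    return canonical.get(host, host)


# Hypothetical entry: the name without .tools. is the one SGE should use.
SAMPLE_HOST_ALIASES = (
    "tools-exec-1401.eqiad.wmflabs tools-exec-1401.tools.eqiad.wmflabs\n"
)

aliases = parse_host_aliases(SAMPLE_HOST_ALIASES)
print(sge_name("tools-exec-1401.tools.eqiad.wmflabs", aliases))
print(sge_name("tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs", aliases))
```

Under these assumptions, a listed host resolves to its alias, while a host missing from the file keeps its own name, which matches the behaviour seen for tools-webgrid-lighttpd-1411.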

valhallasw claimed this task.

OK. Then I suggest keeping this one as .tools., and we can then slowly migrate towards a full .tools. environment.