
exec-manage fails because the grid master is not a submit host
Closed, Resolved, Public

Description

At the time of filing this task, the exec-manage script fails to reschedule jobs when depooling a node, because the grid master is not a submit host.
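
For reference, a minimal sketch of how this condition could be checked. It assumes that `qconf -ss` prints the configured submit hosts one per line, as it does on standard Grid Engine installations. The helper name `is_submit_host` and the host names in the example are hypothetical and not part of exec-manage.

```shell
# Hypothetical helper, not part of exec-manage.
# Reads a list of submit hosts on stdin (one per line, as `qconf -ss` is
# assumed to print them) and succeeds if "$1" is exactly one of them.
is_submit_host() {
    grep -qxF -- "$1"
}

# Example with a made-up host list; on a real grid the input would come
# from `qconf -ss` instead of printf.
if printf '%s\n' 'bastion.example' 'cron.example' |
        is_submit_host 'master.example'; then
    echo "submit host"
else
    echo "not a submit host"
fi
```

On the grid master this might be used as `qconf -ss | is_submit_host "$(hostname -f)"`.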

The code:

case $cmd in
    depool)
        check_missing_param $#

        # Collect the list of jobs running on this host and convert it to a
        # pipe-separated string, which is used to show the status of these
        # jobs after the drain
        job_list=$(/usr/bin/qhost -j -h "$exec_host" |
                      awk '{ print $1; }' |
                      grep -E '^[0-9]' |
                      awk -v ORS='|' '{ print $1; }')


        # Disable all the queues running on this host. The *@ is a special
        # syntax that means 'all queues @ host'
        /usr/bin/qmod -d "*@$exec_host"

        # List all the jobs running on the host, and attempt to reschedule them,
        # match jobs that say 'are not rerunable' and delete them (these need
        # to be rescheduled manually)
        /usr/bin/qhost -j -h "$exec_host" |
            awk '{ print $1; }' |
            grep -E '^[0-9]' |
            xargs -L1 qmod -rj |
            grep 'are not rerunable' |
            awk '{ print $3; }' |
            xargs --no-run-if-empty -L1 qdel

        echo "This exec node has been depooled, and jobs that were running \
              prior have been rescheduled (if rerunable). Current status: "
        /usr/bin/qhost -j | grep -E "${job_list%|*}"

        ;;
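
To illustrate how the script builds the pipe-separated job list and the final `grep -E` pattern, here is a small sketch that runs the same pipeline on sample input. The sample text is a made-up, simplified stand-in for `qhost -j -h <host>` output, and the variable names `sample_output` and `pattern` are hypothetical.

```shell
# Made-up, simplified stand-in for `qhost -j -h <host>` output.
sample_output='HOSTNAME ARCH
exec-node.example lx-amd64
   101 0.25 jobA user r
   102 0.25 jobB user r'

# Same pipeline as in the script: keep the first field of every line,
# keep only fields that start with a digit (the job ids), and join them
# with "|" as the output record separator.
job_list=$(printf '%s\n' "$sample_output" |
    awk '{ print $1; }' |
    grep -E '^[0-9]' |
    awk -v ORS='|' '{ print $1; }')

# The list ends with a trailing "|"; ${job_list%|*} strips it so the
# result can be used as an alternation pattern for grep -E.
pattern=${job_list%|*}

echo "job_list: $job_list"
echo "pattern:  $pattern"
```

With this sample input, `job_list` is `101|102|` and `pattern` is `101|102`.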

Result:

aborrero@tools-sgegrid-master:~$ sudo exec-manage depool tools-sgeexec-0913
Queue instance "continuous@tools-sgeexec-0913.tools.eqiad.wmflabs" is already in the specified state: disabled
Queue instance "task@tools-sgeexec-0913.tools.eqiad.wmflabs" is already in the specified state: disabled
denied: host "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud" is not a submit host
denied: host "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud" is not a submit host
denied: host "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud" is not a submit host
denied: host "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud" is not a submit host
aborrero@tools-sgegrid-master:~$ echo $?
123
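
An exit status of 123 is what GNU xargs returns when any command it invoked exited with a status between 1 and 125, which is consistent with the `qmod -rj` calls being denied. How that status becomes the script's own exit status is not visible in the excerpt above. A minimal sketch of the xargs behavior, assuming GNU findutils and using `false` as a stand-in for the failing `qmod -rj` (the job ids are made up):

```shell
# `false` stands in for `qmod -rj <job_id>` failing on every job.
# The job ids 101 and 102 are made up.
status=0
printf '101\n102\n' | xargs -L1 false || status=$?

echo "xargs exit status: $status"
```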

This also makes our cookbook for depooling hosts fail with an ugly traceback:

arturo@nostromo:~ [spicerack] $ cookbook wmcs.toolforge.grid.node.lib.depool --project tools --node-hostname tools-sgeexec-0913
START - Cookbook wmcs.toolforge.grid.node.lib.depool
PASS |                                                                                                                                                                             |   0% (0/1) [00:07<?, ?hosts/s]
FAIL |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:07<00:00,  7.06s/hosts]
Exception raised while executing cookbook wmcs.toolforge.grid.node.lib.depool:
Traceback (most recent call last):
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/home/arturo/git/wmf/operations/cookbooks/cookbooks/wmcs/toolforge/grid/node/lib/depool.py", line 97, in run
    grid_controller.depool_node(host_fqdn=node)
  File "/home/arturo/git/wmf/operations/cookbooks/cookbooks/wmcs/toolforge/grid/__init__.py", line 307, in depool_node
    self._master_node.run_sync(f"exec-manage depool {hostname}", print_output=False)
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/remote.py", line 520, in run_sync
    return self._execute(
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/remote.py", line 720, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed")
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook wmcs.toolforge.grid.node.lib.depool (exit_code=99)

Event Timeline

aborrero triaged this task as Medium priority. Mar 1 2022, 10:22 AM
aborrero created this task.
taavi claimed this task.

Fixed.