Currently, the exec-manage script fails to reschedule jobs when depooling a node because the grid master is not a submit host.
The code:
case $cmd in
    depool)
        check_missing_param $#
        # Collect the list of jobs running on this host, and convert them
        # to pipe separated string, this is useful to show status of these jobs
        # after the drain
        job_list=`/usr/bin/qhost -j -h $exec_host | awk '{print $1; }' | grep -E ^[0-9] | awk -vORS='|' '{print $1; }'`

        # Disable all the queues running on this host. The *@ is a special
        # syntax that means 'all queues @ host'
        /usr/bin/qmod -d "*@$exec_host"

        # List all the jobs running on the host, and attempt to reschedule them,
        # match jobs that say 'are not rerunable' and delete them (these need
        # to be rescheduled manually)
        /usr/bin/qhost -j -h $exec_host | awk '{ print $1; }' | egrep ^[0-9] | xargs -L1 qmod -rj | grep 'are not rerunable' | awk '{ print $3; }' | xargs --no-run-if-empty -L1 qdel

        echo "This exec node has been depooled, and jobs that were running \
              prior have been rescheduled (if rerunable). Current status: "
        /usr/bin/qhost -j | grep -E "${job_list%|*}"
        ;;
Result:
aborrero@tools-sgegrid-master:~$ sudo exec-manage depool tools-sgeexec-0913
Queue instance "continuous@tools-sgeexec-0913.tools.eqiad.wmflabs" is already in the specified state: disabled
Queue instance "task@tools-sgeexec-0913.tools.eqiad.wmflabs" is already in the specified state: disabled
denied: host "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud" is not a submit host
denied: host "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud" is not a submit host
denied: host "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud" is not a submit host
denied: host "tools-sgegrid-master.tools.eqiad1.wikimedia.cloud" is not a submit host
aborrero@tools-sgegrid-master:~$ echo $?
123
This also makes our cookbook for depooling hosts fail with an ugly traceback:
arturo@nostromo:~ [spicerack] $ cookbook wmcs.toolforge.grid.node.lib.depool --project tools --node-hostname tools-sgeexec-0913
START - Cookbook wmcs.toolforge.grid.node.lib.depool
PASS |                                                                                                                                                                     |   0% (0/1) [00:07<?, ?hosts/s]
FAIL |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:07<00:00, 7.06s/hosts]
Exception raised while executing cookbook wmcs.toolforge.grid.node.lib.depool:
Traceback (most recent call last):
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/home/arturo/git/wmf/operations/cookbooks/cookbooks/wmcs/toolforge/grid/node/lib/depool.py", line 97, in run
    grid_controller.depool_node(host_fqdn=node)
  File "/home/arturo/git/wmf/operations/cookbooks/cookbooks/wmcs/toolforge/grid/__init__.py", line 307, in depool_node
    self._master_node.run_sync(f"exec-manage depool {hostname}", print_output=False)
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/remote.py", line 520, in run_sync
    return self._execute(
  File "/home/arturo/git/wmf/operations/software/spicerack/spicerack/remote.py", line 720, in _execute
    raise RemoteExecutionError(ret, "Cumin execution failed")
spicerack.remote.RemoteExecutionError: Cumin execution failed (exit_code=2)
END (FAIL) - Cookbook wmcs.toolforge.grid.node.lib.depool (exit_code=99)
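A note on the exit status 123 seen in the manual run: this is the status GNU xargs returns when any invocation of its command exits with a status between 1 and 125. That is consistent with one of the xargs stages in the depool pipeline running a grid command that was denied. The output alone does not show which stage (qmod -rj or qdel) produced the denials, or how the script propagates the status, so treat the following as a minimal sketch rather than a diagnosis. It uses `false` as a stand-in for the denied command and made-up job IDs; it does not call any grid tooling.

```shell
#!/bin/bash
# Stand-in for a grid command that is denied: `false` always exits with 1.
# The job IDs below are invented for illustration only.
printf '%s\n' 101 102 103 | xargs -L1 false
status=$?

# With GNU xargs, a failing invocation (exit 1-125) yields status 123.
echo "xargs exit status: $status"
```

Running this on a host with GNU xargs should report 123, matching what `echo $?` showed after `exec-manage depool`.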