
Depool procedure doesn't work in SGE cluster
Closed, Resolved · Public

Description

As documented here: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin#Draining_a_node_of_Jobs

root@tools-sgegrid-master:~# exec-manage depool tools-sgeexec-0916.tools.eqiad.wmflabs
Queue instance "task@tools-sgeexec-0916.tools.eqiad.wmflabs" is already in the specified state: disabled
Queue instance "continuous@tools-sgeexec-0916.tools.eqiad.wmflabs" is already in the specified state: disabled
denied: host "tools-sgegrid-master.tools.eqiad.wmflabs" is not a submit host
denied: host "tools-sgegrid-master.tools.eqiad.wmflabs" is not a submit host
denied: host "tools-sgegrid-master.tools.eqiad.wmflabs" is not a submit host
denied: host "tools-sgegrid-master.tools.eqiad.wmflabs" is not a submit host
denied: host "tools-sgegrid-master.tools.eqiad.wmflabs" is not a submit host
denied: host "tools-sgegrid-master.tools.eqiad.wmflabs" is not a submit host

root@tools-sgebastion-07:~# exec-manage
-su: exec-manage: command not found

Event Timeline

GTirloni created this task. · Feb 25 2019, 9:37 AM
Restricted Application added a subscriber: Aklapper. · Feb 25 2019, 9:37 AM
GTirloni updated the task description. · Feb 25 2019, 9:42 AM
Bstorm added a subscriber: Bstorm. · Feb 25 2019, 3:17 PM

Last I checked, I'd removed all the exec node commands after we tested it recently. I wonder what I missed. Thanks for the heads up!

It cannot work on a bastion and should not need any submit commands.

Bstorm claimed this task. · Feb 25 2019, 3:18 PM

If rescheduling is a "submit" command, we'll just make masters submit hosts...which is probably fine in the end.

Testing things.

Bstorm added a comment. · Edited · Feb 25 2019, 4:04 PM

I believe I see why it was missed. Jobs that are not rerunnable are given a qdel. That's a submit host command because grid engine is goofy. So this is an edge case, but one that will be common on web nodes.

I'm not sure I can come up with a reason not to enable submit on the masters. Keeping them as the only admin nodes is the more important feature. As long as they cannot be logged into... It's just a quick command.
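
The branch described in the comment above can be sketched roughly as follows. This is only an illustration, not the actual exec-manage code (which is not shown in this task): `drain_command` is a hypothetical helper, and how the rerunnable flag is obtained for each job is left out. `qmod -rj` and `qdel` are the standard grid engine commands for rescheduling and deleting a job.

```shell
#!/bin/sh
# Hypothetical helper (not taken from exec-manage): print the command a
# drain step would run for a single job.
#   $1 = job id
#   $2 = "y" if the job is rerunnable, anything else if it is not
drain_command() {
    job_id=$1
    rerunnable=$2
    if [ "$rerunnable" = "y" ]; then
        # Rerunnable jobs can be rescheduled onto another node.
        echo "qmod -rj $job_id"
    else
        # Non-rerunnable jobs are deleted instead. This is the qdel step
        # that is denied when the calling host is not a submit host.
        echo "qdel $job_id"
    fi
}
```

Printing the command rather than running it keeps the sketch safe to try outside the grid; for example, `drain_command 1234 n` prints the qdel command for job 1234.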

Bstorm added a comment. · Edited · Feb 25 2019, 4:16 PM

Yup, that's it. It works fine on exec nodes in most cases...but I'm quite sure that qdel will throw an error if I can find a way to trigger it.

Bstorm closed this task as Resolved. · Feb 25 2019, 4:54 PM

Added the shadow and master to submit nodes instead of removing that from the script.
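
For reference, a manual change of this kind might look like the sketch below. It assumes the standard `qconf -ss` (list submit hosts) and `qconf -as` (add a submit host) commands; `is_submit_host` is a hypothetical helper, and the exact commands used for this task are not recorded here.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if host $1 appears in a submit-host list
# read from stdin, one host per line (the format printed by `qconf -ss`).
is_submit_host() {
    # -F fixed string, -x match the whole line, -q no output
    grep -Fqx -- "$1"
}

# Usage on a grid admin host (assumes the FQDN matches the registered name):
#   host=$(hostname -f)
#   qconf -ss | is_submit_host "$host" || qconf -as "$host"
```

The whole-line match avoids treating a host as present just because its name is a prefix of another host's name.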

Bstorm reopened this task as Open. · Mar 8 2019, 9:02 PM

Oops, the configurator script removes what I did. Must add them as submit nodes in puppet to prevent that.

Change 496979 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: Correctly set masters and shadow masters as submithosts

https://gerrit.wikimedia.org/r/496979

Change 496979 merged by Bstorm:
[operations/puppet@production] sonofgridengine: Correctly set masters and shadow masters as submithosts

https://gerrit.wikimedia.org/r/496979

Change 496987 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: move the client package out of master.pp

https://gerrit.wikimedia.org/r/496987

Change 496987 merged by Bstorm:
[operations/puppet@production] sonofgridengine: move the client package out of master.pp

https://gerrit.wikimedia.org/r/496987

Change 496993 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: Read observer creds from file with cli fallback

https://gerrit.wikimedia.org/r/496993

Change 496993 merged by Bstorm:
[operations/puppet@production] sonofgridengine: Read observer creds from file with cli fallback

https://gerrit.wikimedia.org/r/496993

Change 497001 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: remove gridengine-client from shadow_master class

https://gerrit.wikimedia.org/r/497001

Change 497001 merged by Bstorm:
[operations/puppet@production] sonofgridengine: remove gridengine-client from shadow_master class

https://gerrit.wikimedia.org/r/497001

Bstorm closed this task as Resolved. · Mar 15 2019, 11:01 PM

Ok, now they stay submit hosts.

Change 497421 merged by CRusnov:
[operations/puppet@production] sonofgridengine: remove gridengine-client from shadow_master class

https://gerrit.wikimedia.org/r/497421