This would allow tools to spawn jobs themselves. We will also need to enforce a per-user limit (128 concurrent jobs?) before this can be enabled, to prevent buggy code from accidentally spawning thousands of jobs.
Subject | Repo | Branch | Lines +/-
---|---|---|---
sonofgridengine: make exec hosts into submit hosts | operations/puppet | production | +1 -1
Status | Assigned | Task
---|---|---
Resolved | Bstorm | T199271 Upgrade the tools gridengine system
Resolved | Bstorm | T123270 Make gridengine exec hosts also submit hosts
Resolved | Bstorm | T67777 Limit number of jobs users can execute in parallel
Resolved | Bstorm | T213183 Set up puppet to handle the global and scheduler configuration of gridengine
Event Timeline
max_u_jobs is a parameter of sge_conf (qconf -sconf) and controls the total number of active jobs (running, qw, on hold, etc) for each user. maxujobs is a parameter of sched_conf (qconf -ssconf) and controls the total number of jobs a user may have running.
We set the latter, but I think we should set the former.
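To make the difference between the two parameters concrete, here is a toy model in Python. It is only a sketch and not how gridengine is implemented: the `ToyGrid` class and its methods are hypothetical, and treating 0 as "unlimited" is an assumption of the model. The point it illustrates is that max_u_jobs rejects a submission outright, while maxujobs lets the job queue and only delays when it starts.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGrid:
    """Hypothetical toy model of the two per-user limits (not real gridengine code)."""
    max_u_jobs: int  # cap on a user's active jobs (queued + running); 0 means no cap here
    maxujobs: int    # cap on a user's concurrently running jobs; 0 means no cap here
    queued: dict = field(default_factory=dict)
    running: dict = field(default_factory=dict)

    def submit(self, user):
        """Return False if the submission is rejected by max_u_jobs."""
        active = self.queued.get(user, 0) + self.running.get(user, 0)
        if self.max_u_jobs and active >= self.max_u_jobs:
            return False  # the submit command itself fails
        self.queued[user] = self.queued.get(user, 0) + 1
        return True

    def schedule(self, user):
        """Start queued jobs until maxujobs is reached; return how many started."""
        started = 0
        while self.queued.get(user, 0) > 0 and (
            not self.maxujobs or self.running.get(user, 0) < self.maxujobs
        ):
            self.queued[user] -= 1
            self.running[user] = self.running.get(user, 0) + 1
            started += 1
        return started
```

With `ToyGrid(max_u_jobs=3, maxujobs=0)` the fourth submission for a user is refused; with `ToyGrid(max_u_jobs=0, maxujobs=2)` every submission is accepted, but only two jobs run at a time and the rest wait in the queue.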
Why? Limiting max_u_jobs would effectively limit the number of jobs a user can submit, so they would have to account for the possibility that jsub may fail. This would break many setups where, for example, a job per wiki is submitted without expecting that this could fail.
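For illustration, this is roughly what "accounting for a failed submission" would mean for a tool maintainer. It is a minimal sketch: `submit_with_retry` is a hypothetical helper, and it assumes the submit command signals rejection with a non-zero exit status, which is not confirmed here for jsub.

```python
import subprocess
import time


def submit_with_retry(argv, attempts=3, delay=1.0):
    """Run a submit command, retrying while it exits non-zero.

    argv is the full command line as a list, e.g. a jsub invocation.
    Returns True once a run succeeds, False if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            time.sleep(delay)  # wait before trying again
    return False
```

A script that submits one job per wiki would need to wrap every submission like this and decide what to do when it still returns False, which is exactly the extra burden described above.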
IIRC we also already had some mechanism that limits the number of concurrently running jobs per user. Let me see if I can find it again.
(No, I didn't as the task took a different turn.) Currently, qconf -srqs is:
{
   name         user_slots
   description  Users have 60 user_slot to allocate
   enabled      TRUE
   limit        users {*} hosts * to user_slot=60
}
(from T54976), so the 16 jobs/queue limit (if it worked; I honestly can't remember) got lost in the meantime.
@Giftpflanze actually needs more than this because of their array jobs. I'm setting this to 1000 now, which should be enough even for extreme use cases. It might also still be enough to kill gridengine for jobs that are spread out (as opposed to the single giftbot queue), but at least it's better than infinite.
Can this enforce that all sub-spawned processes are running on the same exec host (or can the spawning command 'enforce' that)? As my bots are currently set up, the sub-processes communicate with the mother process through TCP, which means that they (at the moment) can not communicate between exec hosts (this would help with T123121). (I could make the communication through MySQL or files, but that would be quite a task).
To make this a bit less confusing, let's make this task about setting up execution nodes as submit hosts and T67777 about limiting the number of parallel tasks a user can execute.
At least in theory, -l hostname=<hostname> could be used to select a specific exec node. Making jsub do that by default seems like a bad idea because one of the main reasons to spawn new jobs is to spread load across the grid.
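As a sketch of the opt-in variant, a parent job could build its own submit command pinned to the host it is running on. The helper name `pinned_submit_args` is hypothetical, and the code assumes that jsub passes `-l hostname=...` through to the scheduler and that the name returned by `socket.gethostname()` matches the name gridengine uses for the exec node; neither is verified here.

```python
import socket


def pinned_submit_args(script, host=None, submit_cmd="jsub"):
    """Build a submit command that requests a specific exec node.

    If host is None, the current machine's hostname is used, so child
    jobs land on the same node as the parent that spawns them.
    """
    if host is None:
        host = socket.gethostname()
    return [submit_cmd, "-l", f"hostname={host}", script]
```

The resulting list can be handed to `subprocess.run`. Keeping this in the caller rather than in jsub's defaults preserves the normal behaviour of spreading jobs across the grid.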
As my bots are currently set up, the sub-processes communicate with the mother process through TCP, which means that they (at the moment) can not communicate between exec hosts (this would help with T123121). (I could make the communication through MySQL or files, but that would be quite a task).
Is the blocker to IPC via TCP working across grid hosts the problem of discovering the other nodes that the jobs are running on, or something else? Off the top of my head I can't think of a reason that a job could not open a TCP socket that can be reached from other instances inside the job grid.
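A minimal sketch of what cross-host IPC could look like, assuming the network allows connections between exec nodes (not confirmed in this task): the parent listens on all interfaces on an OS-assigned port, and each child is told the parent's hostname and port, for example through its command-line arguments. All function names here are hypothetical.

```python
import socket


def start_parent(bind_host=""):
    """Listen on an OS-assigned port; "" binds all interfaces, not only loopback."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((bind_host, 0))
    srv.listen()
    return srv, srv.getsockname()[1]


def serve_one(srv, received):
    """Accept one child connection, record its message, and acknowledge it."""
    conn, _addr = srv.accept()
    with conn:
        data = conn.recv(1024)
        received.append(data)
        conn.sendall(b"ack:" + data)


def child_send(parent_host, port, payload):
    """Connect from a child job to the parent and return the parent's reply."""
    with socket.create_connection((parent_host, port), timeout=5) as s:
        s.sendall(payload)
        return s.recv(1024)
```

The parent would pass `socket.gethostname()` and the port returned by `start_parent` to each job it spawns. If the bots currently connect to `localhost`, that alone would explain why they only work on a single exec host.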
Change 485129 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: make exec hosts into submit hosts
Since concurrent jobs are now limited by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/toolforge/grid-scheduler-config#6 and by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/toolforge/grid-global-config#43, I think we can set the stretch grid exec nodes to have submit powers.
Change 485129 merged by Bstorm:
[operations/puppet@production] sonofgridengine: make exec hosts into submit hosts