Make gridengine exec hosts also submit hosts
Closed, Resolved · Public

Description

To allow tools to be able to spawn jobs. Will also need to enforce a per-user limit (128 concurrent jobs?) before this can be enabled, to prevent buggy code from accidentally spawning thousands of jobs.

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a project: Toolforge.
yuvipanda subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

I've set up a limit of 128 concurrent jobs.

max_u_jobs is a parameter of sge_conf (qconf -sconf) and controls the
total number of active jobs (running, qw, on hold, etc) for each user.

maxujobs is a parameter of sched_conf (qconf -ssconf) and controls the
total number of jobs a user may have running.

We set the latter, but I think we should set the former.
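To make the distinction above easier to check, here is a minimal sketch that reads both values. The helper name get_param is made up for illustration, and the sketch assumes each parameter appears as a "name value" line in the qconf output.

```shell
#!/bin/sh
# Print the value of a "key value" parameter read from stdin,
# e.g. the output of `qconf -sconf` or `qconf -ssconf`.
# get_param is a hypothetical helper, not part of gridengine.
get_param() {
    awk -v key="$1" '$1 == key { print $2; exit }'
}

if command -v qconf >/dev/null 2>&1; then
    # Cap on all of a user's active jobs (running, qw, on hold, ...).
    echo "max_u_jobs: $(qconf -sconf | get_param max_u_jobs)"
    # Cap on the number of jobs a user may have running.
    echo "maxujobs:   $(qconf -ssconf | get_param maxujobs)"
else
    echo "qconf not found; run this on a grid host" >&2
fi
```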

Why? Limiting max_u_jobs would effectively limit the number of jobs a user can submit, so every user would have to account for the possibility that jsub may fail. This would break many setups where, for example, one job per wiki is submitted without expecting that submission could fail.

IIRC we also already had some mechanism that limits the number of concurrently running jobs per user. Let me see if I can find it again.
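If a submission-time limit such as max_u_jobs were enforced, scripts that submit one job per wiki would need to handle a failing jsub. A rough sketch of what that could look like follows; submit_with_retry and the script name in the usage comment are invented for illustration, and the sketch assumes jsub exits non-zero when a submission is rejected.

```shell
#!/bin/sh
# submit_with_retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_TRIES attempts have been made.
# Hypothetical helper; assumes the submit command signals rejection
# with a non-zero exit status.
submit_with_retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "submission failed after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Usage (hypothetical script name), e.g. inside a per-wiki loop:
#   submit_with_retry 5 60 jsub ./update-wiki.sh "$wiki"
```

Retrying only delays the work; a script that submits thousands of jobs at once would still need to throttle itself to stay under the limit.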

In T60949#616510, @scfc wrote:

(In reply to comment #1)

Reading the IRC log, I don't quite understand why you need a *node* of your
own. Apparently, you want to run 200 jobs in parallel, and the problem is
the 12 concurrent jobs/user limit. So you really want to have the limit for
your bot raised to 200?

[...]

Just checked: Currently the limit seemed to be defined by:

scfc@tools-login:~$ qconf -srqs
{
name jobs
description NONE
enabled FALSE
limit users {*} queues {continuous,task} to jobs=16
}
scfc@tools-login:~$

*but* this a) is "enabled FALSE" and b) apparently allows *32* jobs per user even in one queue (tested with "for NR in {1..100}; do qsub -q task -b y sleep 1m; done").

I changed "enabled" to "TRUE" and added a first rule:

scfc@tools-login:~$ sudo qconf -srqs
{
name jobs
description NONE
enabled TRUE
limit users scfc to jobs=200
limit users {*} queues {continuous,task} to jobs=16
}
scfc@tools-login:~$

But I was still only able to launch 32 jobs, so I changed it back.

Further digging brought up:

scfc@tools-login:~$ qconf -ssconf
[...]
maxujobs 32
[...]

Ah! I'll test how to set per-user quotas over the next few days.

(No, I didn't as the task took a different turn.) Currently, qconf -srqs is:

{
   name         user_slots
   description  Users have 60 user_slot to allocate
   enabled      TRUE
   limit        users {*} hosts * to user_slot=60
}

(from T54976), so the 16 jobs/queue limit (if it ever worked; I honestly can't remember) got lost in the meantime.

@Giftpflanze actually needs more than this because of their array jobs. I'm setting this to 1000 now, which should be enough even for extreme use cases. It might also still be enough to kill gridengine for jobs that are spread out (as opposed to the single giftbot queue), but at least it's better than infinite.

Can this enforce that all sub-spawned processes run on the same exec host (or can the spawning command 'enforce' that)? As my bots are currently set up, the sub-processes communicate with the mother process through TCP, which means that they (at the moment) cannot communicate between exec hosts (this would help with T123121). (I could make the communication go through MySQL or files, but that would be quite a task.)

To make this a bit less confusing, let's make this task about setting up execution nodes as submit hosts and T67777 about limiting the number of parallel tasks a user can execute.

scfc triaged this task as Low priority. Feb 16 2017, 10:43 PM
scfc moved this task from Backlog to Ready to be worked on on the Toolforge board.

Can this enforce that all sub-spawned processes are running on the same exec-host (or can the spawning command 'enforce' that).

At least in theory, -l hostname=<hostname> could be used to select a specific exec node. Making jsub do that by default seems like a bad idea because one of the main reasons to spawn new jobs is to spread load across the grid.
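As a sketch of the -l hostname=<hostname> idea, the function below only prints the qsub command line that a parent job could use to keep a child on its own exec host. pinned_qsub_cmd and worker.sh are made-up names, and it is an assumption that `uname -n` returns the host name under which the grid knows the exec node.

```shell
#!/bin/sh
# Print the qsub command line that would pin a child job to HOST.
# -l hostname=<hostname> is the resource request mentioned above;
# -b y submits the command as a binary, as in the qsub test earlier
# in this thread. pinned_qsub_cmd is a hypothetical helper.
pinned_qsub_cmd() {
    host=$1
    shift
    printf 'qsub -l hostname=%s -b y' "$host"
    printf ' %s' "$@"
    printf '\n'
}

# Dry run from inside a job: pin the child to the parent's exec host.
# Assumption: `uname -n` matches the name the grid uses for this node.
pinned_qsub_cmd "$(uname -n)" ./worker.sh
```

Printing instead of executing keeps the sketch safe to try; the trade-off noted above still applies, since pinned children no longer spread load across the grid.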

As my bots are currently set up, the sub-processes communicate with the mother process through TCP, which means that they (at the moment) can not communicate between exec hosts (this would help with T123121). (I could make the communication through MySQL or files, but that would be quite a task).

Is the blocker to TCP-based IPC across grid hosts the problem of discovering which nodes the jobs are running on, or something else? Off the top of my head I can't think of a reason that a job could not open a TCP socket that can be reached from other instances inside the job grid.
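If discovery is the only blocker, one possible approach is for the parent to hand its own address to each child through the job environment. The sketch below prints such a command line using qsub's -v option; child_qsub_cmd, PARENT_HOST, PARENT_PORT, port 4000 and worker.sh are all invented for illustration, and it assumes the parent listens on a non-loopback address that other grid instances can reach.

```shell
#!/bin/sh
# Print a qsub command line that exports the parent's address to a child
# job via -v, so the child can connect back over TCP from any exec host.
# child_qsub_cmd and the variable names are hypothetical.
child_qsub_cmd() {
    parent_host=$1
    parent_port=$2
    shift 2
    printf 'qsub -v PARENT_HOST=%s,PARENT_PORT=%s -b y' "$parent_host" "$parent_port"
    printf ' %s' "$@"
    printf '\n'
}

# Dry run: the child would read $PARENT_HOST and $PARENT_PORT at startup.
child_qsub_cmd "$(uname -n)" 4000 ./worker.sh
```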

Change 485129 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: make exec hosts into submit hosts

https://gerrit.wikimedia.org/r/485129

Change 485129 merged by Bstorm:
[operations/puppet@production] sonofgridengine: make exec hosts into submit hosts

https://gerrit.wikimedia.org/r/485129

Bstorm claimed this task.

All new grid exec hosts (stretch) are now submit hosts.