Make gridengine exec hosts also submit hosts
Closed, Resolved (Public)

Assigned To

Authored By

	yuvipanda
	Jan 11 2016, 7:19 PM

Description

To allow tools to be able to spawn jobs. Will also need to enforce a per-user limit (128 concurrent jobs?) before this can be enabled, to prevent buggy code from accidentally spawning thousands of jobs.

Details

	Subject	Repo	Branch	Lines +/-
	sonofgridengine: make exec hosts into submit hosts	operations/puppet	production	+1 -1


Related Objects

Status	Assigned	Task
Resolved	• Bstorm	T199271 Upgrade the tools gridengine system
Resolved	• Bstorm	T123270 Make gridengine exec hosts also submit hosts
Resolved	• Bstorm	T67777 Limit number of jobs users can execute in parallel
Resolved	• Bstorm	T213183 Set up puppet to handle the global and scheduler configuration of gridengine

Event Timeline

yuvipanda created this task. Jan 11 2016, 7:19 PM

yuvipanda set the priority of this task to Needs Triage.

yuvipanda updated the task description.

yuvipanda added a project: Toolforge.

yuvipanda subscribed.

Restricted Application added a project: Cloud-Services. Jan 11 2016, 7:19 PM

Restricted Application added subscribers: StudiesWorld, Aklapper.

valhallasw mentioned this in T123121: Linkwatcher spawns many processes without parent. Jan 11 2016, 7:24 PM

I've set up a limit of 128 concurrent jobs.

max_u_jobs is a parameter of sge_conf (qconf -sconf) and controls the
total number of active jobs (running, qw, on hold, etc) for each user.

maxujobs is a parameter of sched_conf (qconf -ssconf) and controls the
total number of jobs a user may have running.

We currently set the latter (maxujobs), but I think we should set the former (max_u_jobs).
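To compare the two settings side by side, something like the following could be used. It is only a sketch: `get_param` is a hypothetical helper name, and it assumes the qconf output lists each parameter as a name followed by its value on one line, as in the `maxujobs 32` excerpt quoted later in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one "name value" parameter
# read from qconf-style output on stdin.
get_param() {
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# Only query the grid when qconf is actually available on this host.
if command -v qconf >/dev/null 2>&1; then
    # Global configuration: limit on all active jobs per user.
    echo "max_u_jobs: $(qconf -sconf | get_param max_u_jobs)"
    # Scheduler configuration: limit on running jobs per user.
    echo "maxujobs:   $(qconf -ssconf | get_param maxujobs)"
fi
```

On a host without qconf the script prints nothing, so it is safe to source just for the helper.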

Luke081515 awarded a token. Jan 11 2016, 8:35 PM

Why? Limiting max_u_jobs would effectively limit the number of jobs a user can submit, so users would have to account for the possibility that jsub may fail. This would break many setups where, for example, one job per wiki is submitted without expecting that the submission could fail.
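For illustration, a tool that had to cope with rejected submissions might wrap the submit call roughly like this. This is a sketch, not an existing Toolforge feature: `submit_with_retry` is a hypothetical helper, and it assumes the submit command exits with a non-zero status when the job is rejected.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a submit command and retry with a growing
# delay while it keeps failing.
# Usage: submit_with_retry MAX_TRIES BASE_DELAY_SECONDS command [args...]
submit_with_retry() {
    local max_tries=$1 delay=$2 try=1
    shift 2
    until "$@"; do
        if [ "$try" -ge "$max_tries" ]; then
            echo "giving up after $try attempts: $*" >&2
            return 1
        fi
        sleep "$delay"
        delay=$((delay * 2))   # double the wait before the next attempt
        try=$((try + 1))
    done
}

# Hypothetical usage: one job per wiki, tolerating a temporarily full quota.
# (update.sh and the wiki list are placeholders.)
# for wiki in enwiki dewiki; do
#     submit_with_retry 5 30 jsub ./update.sh "$wiki"
# done
```

Every existing tool that submits jobs in a loop would need a wrapper of this kind, which is the cost being pointed out here.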

IIRC we also already had some mechanism that limits the number of concurrently running jobs per user. Let me see if I can find it again.

In T60949#616510, @scfc wrote:

(In reply to comment #1)

Reading the IRC log, I don't quite understand why you need a *node* of your own. Apparently, you want to run 200 jobs in parallel, and the problem is the 12 concurrent jobs/user limit. So you really want to have the limit for your bot raised to 200?

[...]

Just checked: Currently the limit seemed to be defined by:

scfc@tools-login:~$ qconf -srqs
{
   name         jobs
   description  NONE
   enabled      FALSE
   limit        users {*} queues {continuous,task} to jobs=16
}
scfc@tools-login:~$

*but* which a) is "enabled FALSE" and b) apparently allows *32* jobs per user even in one queue ("for NR in {1..100}; do qsub -q task -b y sleep 1m; done").

I changed "enabled" to "TRUE" and added a first rule:

scfc@tools-login:~$ sudo qconf -srqs
{
   name         jobs
   description  NONE
   enabled      TRUE
   limit        users scfc to jobs=200
   limit        users {*} queues {continuous,task} to jobs=16
}
scfc@tools-login:~$

But I was still only able to launch 32 jobs, so I changed it back.

Further digging brought up:

scfc@tools-login:~$ qconf -ssconf
[...]
maxujobs 32
[...]

Ah! I'll test how to set per-user quotas over the next few days.

(No, I didn't as the task took a different turn.) Currently, qconf -srqs is:

{
   name         user_slots
   description  Users have 60 user_slot to allocate
   enabled      TRUE
   limit        users {*} hosts * to user_slot=60
}

(from T54976), so the 16 jobs/queue limit (if it worked – I honestly can't remember) got lost in the meantime.

Beetstra subscribed. Jan 12 2016, 3:12 AM

@Giftpflanze actually needs more than this because of their array jobs. I'm setting this to 1000 now, which should be enough even for extreme use cases. It might also still be enough to kill gridengine for jobs that are spread out (as opposed to the single giftbot queue), but at least it's better than infinite.

Can this enforce that all sub-spawned processes are running on the same exec host (or can the spawning command 'enforce' that)? As my bots are currently set up, the sub-processes communicate with the mother process through TCP, which means that they (at the moment) cannot communicate between exec hosts (this would help with T123121). (I could make the communication go through MySQL or files, but that would be quite a task.)

yuvipanda unsubscribed. Jun 30 2016, 2:01 PM

zhuyifei1999 merged a task: T103968: qstat missing on exec hosts. Feb 8 2017, 6:47 PM

zhuyifei1999 added subscribers: Steinsplitter, coren, yuvipanda, zhuyifei1999.

zhuyifei1999 merged a task: T67777: Limit number of jobs users can execute in parallel. Feb 8 2017, 7:01 PM

zhuyifei1999 added subscribers: Nemo_bis, Petrb.

To make this a bit less confusing, let's make this task about setting up execution nodes as submit hosts and T67777 about limiting the number of parallel tasks a user can execute.

scfc mentioned this in T67777: Limit number of jobs users can execute in parallel. Feb 9 2017, 8:39 AM

scfc triaged this task as Low priority. Feb 16 2017, 10:43 PM

scfc moved this task from Backlog to Ready to be worked on on the Toolforge board.

• Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:45 PM

bd808 added a parent task: T67777: Limit number of jobs users can execute in parallel. Jan 5 2019, 5:23 PM

bd808 edited parent tasks, added: T199271: Upgrade the tools gridengine system; removed: T67777: Limit number of jobs users can execute in parallel. Jan 8 2019, 4:06 AM

bd808 added a subtask: T67777: Limit number of jobs users can execute in parallel.

In T123270#1940526, @Beetstra wrote:

Can this enforce that all sub-spawned processes are running on the same exec host (or can the spawning command 'enforce' that)?

At least in theory, -l hostname=<hostname> could be used to select a specific exec node. Making jsub do that by default seems like a bad idea because one of the main reasons to spawn new jobs is to spread load across the grid.
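As a rough illustration of that option, a job could build the request for its own host before spawning helpers. This is an untested sketch: `same_host_request` is a hypothetical helper, `tools-exec-01` in the comment is a made-up host name, and the value has to match the name the grid knows the host by, which may be the fully qualified name rather than what `hostname` prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the resource request that pins a job to one
# exec host. With no argument it defaults to the host the script runs on.
same_host_request() {
    printf -- '-l hostname=%s\n' "${1:-$(hostname)}"
}

# Example: same_host_request tools-exec-01
# prints:  -l hostname=tools-exec-01

# Hypothetical usage from inside a running job, spawning a helper on the
# same node. The substitution is left unquoted on purpose so that it
# splits into the two arguments "-l" and "hostname=...".
# qsub $(same_host_request) -b y sleep 60
```

Pinning every child job this way gives up load spreading, so it only makes sense for helpers that genuinely need to share a host with their parent.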

As my bots are currently set up, the sub-processes communicate with the mother process through TCP, which means that they (at the moment) cannot communicate between exec hosts (this would help with T123121). (I could make the communication go through MySQL or files, but that would be quite a task.)

Is the blocker to TCP-based IPC across grid hosts the problem of discovering which nodes the jobs are running on, or something else? Off the top of my head, I can't think of a reason that a job could not open a TCP socket that can be reached from other instances inside the job grid.

bd808 closed subtask T67777: Limit number of jobs users can execute in parallel as Resolved. Jan 17 2019, 11:56 PM

Change 485129 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] sonofgridengine: make exec hosts into submit hosts

https://gerrit.wikimedia.org/r/485129

gerritbot edited projects, added Patch-For-Review; removed Cloud-Services. Jan 18 2019, 12:06 AM

Since concurrent jobs are now limited by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/toolforge/grid-scheduler-config#6 and by https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/toolforge/grid-global-config#43, I think we can give the stretch grid exec nodes submit powers.

Change 485129 merged by Bstorm:
[operations/puppet@production] sonofgridengine: make exec hosts into submit hosts

https://gerrit.wikimedia.org/r/485129