Concurrent generated jobs from a single user overloaded grid engine
Closed, Resolved, Public

Description

@Debenben wrote a neat bash script that generated a collection of jobs to process dump files for each wiki. Unfortunately, the script had no rate limiting, so it flooded the job grid with concurrent jobs until all of the "slots" we allocate for starting jobs were full.

@chasemp deleted the jobs and temporarily blocked @Debenben from logging into tools-login until we can contact them and find a way to make the jobs queue up rather than all running in parallel.
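One way a submission script could throttle itself on the client side is sketched below. This is only an illustration, not the script from this task: `count_my_jobs`, `wait_for_slot`, `MAX_JOBS`, `POLL_SECONDS`, `process-dump.sh`, and `wikis.txt` are all hypothetical names, and the sketch assumes `qstat` prints two header lines followed by one line per job.

```shell
#!/usr/bin/env bash
# Sketch: keep at most MAX_JOBS of our own jobs on the grid at any time.
# All names here are assumptions made for illustration.

MAX_JOBS="${MAX_JOBS:-2}"
POLL_SECONDS="${POLL_SECONDS:-30}"

# Count this user's jobs (running or queued). qstat prints two header
# lines, then one line per job, and nothing at all when no jobs exist.
count_my_jobs() {
    qstat -u "${USER:-$(id -un)}" |
        awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Block until the number of our jobs drops below MAX_JOBS.
wait_for_slot() {
    while [ "$(count_my_jobs)" -ge "$MAX_JOBS" ]; do
        sleep "$POLL_SECONDS"
    done
}

# Usage (hypothetical job script and input list):
# while read -r wiki; do
#     wait_for_slot
#     jsub -N "dump-$wiki" process-dump.sh "$wiki"
# done < wikis.txt
```

Polling `qstat` is crude but keeps the submitter from ever holding more than a handful of slots, whether or not a server-side quota exists.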

@Quiddity was nice enough to start the conversation on Debenben's talk page: https://wikitech.wikimedia.org/w/index.php?title=User_talk:Debenben&diff=1793869&oldid=122980

Event Timeline

Following the advice from https://serverfault.com/a/184214/6479, I am adding a quota limiting Debenben's user account to 2 simultaneous jobs:

$ sudo -i qconf -srqs debenben_max_slots
{
   name         debenben_max_slots
   description  "Limit user debenben to 2 slots"
   enabled      TRUE
   limit        users {debenben} hosts * to slots=2
}
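For anyone needing a similar rule for another account, a resource quota set like this one can be written to a file and then loaded with `qconf -Arqs`. The helper below is a rough sketch of generating that file. `write_rqs` and the file path in the usage comment are hypothetical, and loading the rule requires grid engine manager rights.

```shell
#!/usr/bin/env bash
# Sketch: emit a resource quota set (RQS) that caps one user's slots.
# write_rqs is a hypothetical helper name; the output mirrors the rule
# shown by `qconf -srqs debenben_max_slots`.

write_rqs() {
    local user="$1" slots="$2"
    cat <<EOF
{
   name         ${user}_max_slots
   description  "Limit user ${user} to ${slots} slots"
   enabled      TRUE
   limit        users {${user}} hosts * to slots=${slots}
}
EOF
}

# Usage (hypothetical path; run on a host with grid engine admin tools):
# write_rqs debenben 2 > /tmp/debenben_max_slots.rqs
# sudo -i qconf -Arqs /tmp/debenben_max_slots.rqs
```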

Mentioned in SAL (#wikimedia-cloud) [2018-06-05T17:38:49Z] <bd808> Added grid engine quota to limit user debenben to 2 concurrent jobs (T196486)

The quota seems to work:

$ for n in $(seq 1 9); do jsub -N test-$n test-concurrency.sh; done
Your job 6900851 ("test-1") has been submitted
Your job 6900852 ("test-2") has been submitted
Your job 6900853 ("test-3") has been submitted
Your job 6900854 ("test-4") has been submitted
Your job 6900855 ("test-5") has been submitted
Your job 6900856 ("test-6") has been submitted
Your job 6900857 ("test-7") has been submitted
Your job 6900858 ("test-8") has been submitted
Your job 6900859 ("test-9") has been submitted
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
6900851 0.30000 test-1     debenben     r     06/05/2018 17:50:53 task@tools-exec-1430.tools.eqi     1
6900852 0.30000 test-2     debenben     r     06/05/2018 17:50:53 task@tools-exec-1418.tools.eqi     1
6900853 0.30000 test-3     debenben     qw    06/05/2018 17:50:53                                    1
6900854 0.30000 test-4     debenben     qw    06/05/2018 17:50:53                                    1
6900855 0.30000 test-5     debenben     qw    06/05/2018 17:50:53                                    1
6900856 0.30000 test-6     debenben     qw    06/05/2018 17:50:53                                    1
6900857 0.30000 test-7     debenben     qw    06/05/2018 17:50:53                                    1
6900858 0.30000 test-8     debenben     qw    06/05/2018 17:50:53                                    1
6900859 0.30000 test-9     debenben     qw    06/05/2018 17:50:54                                    1
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
6900855 0.30000 test-5     debenben     r     06/05/2018 17:51:55 task@tools-exec-1430.tools.eqi     1
6900856 0.30000 test-6     debenben     r     06/05/2018 17:51:56 task@tools-exec-1418.tools.eqi     1
6900857 0.30000 test-7     debenben     qw    06/05/2018 17:50:53                                    1
6900858 0.30000 test-8     debenben     qw    06/05/2018 17:50:53                                    1
6900859 0.30000 test-9     debenben     qw    06/05/2018 17:50:54                                    1
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
6900859 0.30000 test-9     debenben     r     06/05/2018 17:52:57 task@tools-exec-1430.tools.eqi     1
$ qstat
$ 

Mentioned in SAL (#wikimedia-cloud) [2018-06-05T18:02:51Z] <bd808> Forced puppet run on tools-bastion-03 to re-enable logins by dubenben (T196486)

@Debenben, I think you can resume your work. The job grid should now automatically limit you to 2 concurrent jobs, regardless of how many you submit at the same time. Jobs that are queued for later execution will show with state qw in the output of qstat. The current limit of 2 is very conservative. We can revisit it if you find that running only 2 at a time will make your project take weeks or months to finish. I don't think we can go much above 6-8 concurrent jobs, however, and still be fair to others, since your parallel dump parsing will be so IO intensive for the NFS servers.
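To see at a glance how many jobs are running versus waiting, the qstat output can be summarised by its state column. This is a small sketch. `summarise_states` is a hypothetical helper, and it assumes the column layout shown in the qstat listings in this task (two header lines, state in the fifth column).

```shell
#!/usr/bin/env bash
# Sketch: count jobs per state from qstat output read on stdin.
# Assumes two header lines and the job state in column 5
# (r = running, qw = queued and waiting).

summarise_states() {
    awk 'NR > 2 && NF > 0 { count[$5]++ }
         END { printf "running=%d queued=%d\n", count["r"], count["qw"] }'
}

# Usage:
# qstat | summarise_states
```

Applied to the first qstat listing in this task, with two jobs in state r and seven in state qw, this would print `running=2 queued=7`.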

bd808 claimed this task.

Just to note my related actions: I found another flood of error messages coming in and killed the run of jobs just in case. The problem was just that jobs were going to web queues. I've suggested that the user submit to the task queue expressly. The queuing is working fine.
