
tools.iabot is overloading the grid by running too many workers in parallel
Closed, Resolved · Public

Description

Currently tools.iabot is running 60 parallel jobs on grid exec nodes. From the crontab it looks like there should be 20, but the jobs are launched every minute of the day, which means existing workers probably aren't finishing before new jobs are started.

We are dealing with high iowait on the grid right now, so we scaled this back today: we commented out all but 3 workers in the crontab and killed the rest of the running workers, leaving only 3 running. Current status:

tools.iabot@tools-bastion-05:~$ crontab -l
PATH=/usr/local/bin:/usr/bin:/bin
##LOAD CRONTAB##
#* * * * * crontab $HOME/crontab

###IABOT ON DEMAND WORKERS###
* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker1 -o $HOME/Workers/Worker1.out -e $HOME/Workers/Worker1.err php guiworker.php 0 worker1
* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker2 -o $HOME/Workers/Worker2.out -e $HOME/Workers/Worker2.err php guiworker.php 0 worker2
* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker3 -o $HOME/Workers/Worker3.out -e $HOME/Workers/Worker3.err php guiworker.php 0 worker3
####### Commented out by a Tool admin -- this is overwhelming the grid 2017-4-1
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker4 -o $HOME/Workers/Worker4.out -e $HOME/Workers/Worker4.err php guiworker.php 0 worker4
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker5 -o $HOME/Workers/Worker5.out -e $HOME/Workers/Worker5.err php guiworker.php 0 worker5
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker6 -o $HOME/Workers/Worker6.out -e $HOME/Workers/Worker6.err php guiworker.php 0 worker6
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker7 -o $HOME/Workers/Worker7.out -e $HOME/Workers/Worker7.err php guiworker.php 0 worker7
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker8 -o $HOME/Workers/Worker8.out -e $HOME/Workers/Worker8.err php guiworker.php 0 worker8
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker9 -o $HOME/Workers/Worker9.out -e $HOME/Workers/Worker9.err php guiworker.php 0 worker9
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker10 -o $HOME/Workers/Worker10.out -e $HOME/Workers/Worker10.err php guiworker.php 0 worker10
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker11 -o $HOME/Workers/Worker11.out -e $HOME/Workers/Worker11.err php guiworker.php 0 worker11
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker12 -o $HOME/Workers/Worker12.out -e $HOME/Workers/Worker12.err php guiworker.php 0 worker12
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker13 -o $HOME/Workers/Worker13.out -e $HOME/Workers/Worker13.err php guiworker.php 0 worker13
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker14 -o $HOME/Workers/Worker14.out -e $HOME/Workers/Worker14.err php guiworker.php 0 worker14
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker15 -o $HOME/Workers/Worker15.out -e $HOME/Workers/Worker15.err php guiworker.php 0 worker15
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker16 -o $HOME/Workers/Worker16.out -e $HOME/Workers/Worker16.err php guiworker.php 0 worker16
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker17 -o $HOME/Workers/Worker17.out -e $HOME/Workers/Worker17.err php guiworker.php 0 worker17
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker18 -o $HOME/Workers/Worker18.out -e $HOME/Workers/Worker18.err php guiworker.php 0 worker18
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker19 -o $HOME/Workers/Worker19.out -e $HOME/Workers/Worker19.err php guiworker.php 0 worker19
#* * * * * cd  $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker20 -o $HOME/Workers/Worker20.out -e $HOME/Workers/Worker20.err php guiworker.php 0 worker20
tools.iabot@tools-bastion-05:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
3257207 0.30049 lighttpd-i tools.iabot  r     04/01/2017 10:06:57 webgrid-lighttpd@tools-webgrid     1
3267758 0.30000 worker3    tools.iabot  r     04/01/2017 15:04:07 task@tools-exec-1420.tools.eqi     1
3267759 0.30000 worker2    tools.iabot  r     04/01/2017 15:04:07 task@tools-exec-1421.tools.eqi     1
3267760 0.30000 worker1    tools.iabot  r     04/01/2017 15:04:07 task@tools-exec-1422.tools.eqi     1

When you get to this, please scale back the number of parallel workers and add some delay between the cron runs, so jobs don't keep starting in parallel and growing in number.
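For example (a sketch only; the exact schedule is up to the tool maintainer), staggering the jobs and launching each one every 10 minutes instead of every minute would look something like:

*/10 * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker1 -o $HOME/Workers/Worker1.out -e $HOME/Workers/Worker1.err php guiworker.php 0 worker1
3-59/10 * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker2 -o $HOME/Workers/Worker2.out -e $HOME/Workers/Worker2.err php guiworker.php 0 worker2
6-59/10 * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker3 -o $HOME/Workers/Worker3.out -e $HOME/Workers/Worker3.err php guiworker.php 0 worker3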

Event Timeline

I think the best approach is some kind of locking mechanism to prevent a new worker from starting if an existing one of the same name is still running. Otherwise it's a war of escalation any time job run times start to lengthen.
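A minimal sketch of what I mean, using flock(1) in a wrapper script around the worker (the wrapper itself and the $HOME/locks directory are assumptions; nothing like this exists today):

#!/usr/bin/env bash
# Hypothetical wrapper: exit quietly if another copy of this worker
# already holds the lock, otherwise replace ourselves with the worker.
worker="$1"
mkdir -p "$HOME/locks"
exec 9>"$HOME/locks/$worker.lock"
flock -n 9 || exit 0    # lock is held by a running instance; do nothing
cd "$HOME/public_html"
# The lock on fd 9 is inherited across exec, so it is held for the
# lifetime of the PHP process.
exec /usr/bin/php guiworker.php 0 "$worker"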

I should note the -once flag is set, so it shouldn't be submitting more jobs if the worker is already running. This sounds like a grid problem.

Hmmm, it does look like jsub -once should not start more workers with the same name; there may be some funkiness going on there. I will follow up on Monday.
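One quick way to check whether -once is actually leaking duplicate jobs (this just counts the name column of the qstat output shown above):

qstat | awk '$3 ~ /^worker/ {print $3}' | sort | uniq -c

Any count above 1 for the same worker name would mean -once failed to deduplicate.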

Well, considering this is looking more like a grid issue than a tool issue, I'm going to remove InternetArchiveBot from the tags.

@Cyberpower678 I feel like we have been down this road before. What are you doing with the IABot account that requires such an intense amount of resources? Locking 10 GB of RAM and 20 CPU cores on the shared job grid is a pretty big slice of the community's shared resources to take.

I agree that there is some issue with the jsub -once command being exposed here as well, but any tool that executes 20 parallel jobs once a minute via cron is abusing the shared resource pool. The bigbrother watchdog script is probably a better fit for ensuring that a job is always running than this pattern of cron plus jsub -once. As noted in our documentation: "Scheduling a command more often than every five minutes (e.g. * * * * * command) is highly discouraged, even if the command is 'only' jsub."

On the point of needing 20 workers, do you have any reasoning for this? Looking at the Worker1.out log file, this single worker is almost always idle, logging "No jobs to work on at the moment. Sleeping for 1 minute.". The log file has 29,651 lines of output (wc -l Worker1.out). Of those, 14,766 are blank and 14,580 are the sleep notice, which leaves 305 lines showing the script doing work of value. Since there are no timestamps it is difficult to tell over what period those 305 useful tasks were executed, but if the worker really sleeps for 60 seconds each time it logs that message, this covers at least 10 days of execution. The logs for Worker2 and Worker3 show a similar distribution. It looks to me like this could be dialed down to a single worker and then monitored; if the job backlog rises to an unacceptable level, you could slowly add more executors to keep it at a reasonable level.
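(For reference, counts like the ones above can be reproduced with standard tools; the grep pattern here is an assumption based on the quoted log message:)

wc -l Worker1.out                          # total lines: 29,651
grep -c '^$' Worker1.out                   # blank lines: 14,766
grep -c 'No jobs to work on' Worker1.out   # sleep notices: 14,580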

I wasn't aware that bigbrother can watch jobs other than the web service. Thank you for pointing that out. Now that you mention it, 20 workers was indeed overkill, and I didn't expect that 20 idling scripts doing almost nothing would cause problems. In that case, since usage is not very high at the moment, I have no issue using only 1 worker until the job queue grows out of control.

I am sorry for the inconvenience.

FWIW, I still see the same 3 workers from triage here and in the cron file. That is fine, I think, if you want to leave it. We do not seem to have leaked any workers since April 1st.

I set up this .bigbrotherrc file, commented out all of the cron job lines, and killed the worker2 and worker3 processes.

/data/project/iabot/.bigbrotherrc
jstart -N worker1 -quiet -mem 512m -o $HOME/Workers/Worker1.out -e $HOME/Workers/Worker1.err php $HOME/public_html/guiworker.php 0 worker1

@Cyberpower678 if you see problems with the bigbrother monitor, open a new ticket and we will help you work through it.

Well, bigbrother isn't able to do its job. My workers will not start.

As far as I can see,

valhallasw@tools-bastion-03:/data/project/iabot$ qstat -u tools.iabot
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
4010649 0.30109 lighttpd-i tools.iabot  r     04/17/2017 04:28:01 webgrid-lighttpd@tools-webgrid     1
4029925 0.30001 worker1    tools.iabot  r     04/17/2017 16:19:56 task@tools-exec-1415.tools.eqi     1

the worker1 process specified in .bigbrotherrc is running?

I'll let @bd808 explain.

Apparently the tool had to be restarted manually because it requires the current working directory to be $HOME/public_html in order to function properly. I have created a thin shell script shim, run from $HOME/.bigbrotherrc via jstart, to correct this:

run_guiworker.sh
#!/usr/bin/env bash
# Launch a guiworker.php process
#
# All command line arguments are passed through to the
# $HOME/public_html/guiworker.php script.
#
# usage: $0 0 worker1

# Exit with error code if anything in this script fails
set -e

# Change working directory so that relative includes and PHP default search
# path are as expected.
cd "$HOME/public_html"

# Replace this script with the php process
exec /usr/bin/php guiworker.php "$@"
.bigbrotherrc
jstart -N worker1 -quiet -mem 512m -o $HOME/Workers/Worker1.out -e $HOME/Workers/Worker1.err $HOME/run_guiworker.sh 0 worker1