Currently tools.iabot is running 60 parallel jobs on grid exec nodes. From the crontab it looks like there should be 20, but the jobs are launched every minute of the day, which means existing workers probably haven't finished before new jobs are started.
We are dealing with high iowait on the grid right now, so we scaled this back today. We commented out all but 3 workers in the crontab and killed all the running workers, so only 3 are running now. Current status:
tools.iabot@tools-bastion-05:~$ crontab -l
PATH=/usr/local/bin:/usr/bin:/bin
##LOAD CRONTAB##
#* * * * * crontab $HOME/crontab
###IABOT ON DEMAND WORKERS###
* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker1 -o $HOME/Workers/Worker1.out -e $HOME/Workers/Worker1.err php guiworker.php 0 worker1
* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker2 -o $HOME/Workers/Worker2.out -e $HOME/Workers/Worker2.err php guiworker.php 0 worker2
* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker3 -o $HOME/Workers/Worker3.out -e $HOME/Workers/Worker3.err php guiworker.php 0 worker3
####### Commented out by a Tool admin -- this is overwhelming the grid 2017-4-1
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker4 -o $HOME/Workers/Worker4.out -e $HOME/Workers/Worker4.err php guiworker.php 0 worker4
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker5 -o $HOME/Workers/Worker5.out -e $HOME/Workers/Worker5.err php guiworker.php 0 worker5
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker6 -o $HOME/Workers/Worker6.out -e $HOME/Workers/Worker6.err php guiworker.php 0 worker6
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker7 -o $HOME/Workers/Worker7.out -e $HOME/Workers/Worker7.err php guiworker.php 0 worker7
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker8 -o $HOME/Workers/Worker8.out -e $HOME/Workers/Worker8.err php guiworker.php 0 worker8
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker9 -o $HOME/Workers/Worker9.out -e $HOME/Workers/Worker9.err php guiworker.php 0 worker9
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker10 -o $HOME/Workers/Worker10.out -e $HOME/Workers/Worker10.err php guiworker.php 0 worker10
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker11 -o $HOME/Workers/Worker11.out -e $HOME/Workers/Worker11.err php guiworker.php 0 worker11
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker12 -o $HOME/Workers/Worker12.out -e $HOME/Workers/Worker12.err php guiworker.php 0 worker12
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker13 -o $HOME/Workers/Worker13.out -e $HOME/Workers/Worker13.err php guiworker.php 0 worker13
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker14 -o $HOME/Workers/Worker14.out -e $HOME/Workers/Worker14.err php guiworker.php 0 worker14
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker15 -o $HOME/Workers/Worker15.out -e $HOME/Workers/Worker15.err php guiworker.php 0 worker15
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker16 -o $HOME/Workers/Worker16.out -e $HOME/Workers/Worker16.err php guiworker.php 0 worker16
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker17 -o $HOME/Workers/Worker17.out -e $HOME/Workers/Worker17.err php guiworker.php 0 worker17
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker18 -o $HOME/Workers/Worker18.out -e $HOME/Workers/Worker18.err php guiworker.php 0 worker18
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker19 -o $HOME/Workers/Worker19.out -e $HOME/Workers/Worker19.err php guiworker.php 0 worker19
#* * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker20 -o $HOME/Workers/Worker20.out -e $HOME/Workers/Worker20.err php guiworker.php 0 worker20
tools.iabot@tools-bastion-05:~$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
3257207 0.30049 lighttpd-i tools.iabot  r     04/01/2017 10:06:57 webgrid-lighttpd@tools-webgrid     1
3267758 0.30000 worker3    tools.iabot  r     04/01/2017 15:04:07 task@tools-exec-1420.tools.eqi     1
3267759 0.30000 worker2    tools.iabot  r     04/01/2017 15:04:07 task@tools-exec-1421.tools.eqi     1
3267760 0.30000 worker1    tools.iabot  r     04/01/2017 15:04:07 task@tools-exec-1422.tools.eqi     1
When you get to this, please scale back the number of parallel workers and add some delay between cron runs, so that jobs don't keep starting in parallel and growing in number.
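One way the requested delay could look, as a rough sketch only: the 10-minute interval and the one-minute offsets between workers are illustrative assumptions, not values anyone has agreed on, and the jsub arguments are copied unchanged from the current crontab. Whether 10 minutes is long enough for a guiworker.php run to finish is something only the tool maintainers can judge.

```
# Hypothetical schedule: each worker is submitted every 10 minutes instead of
# every minute, offset by one minute so the submissions do not all land at once.
# worker1 at minutes 0,10,20,...  worker2 at 1,11,21,...  worker3 at 2,12,22,...
*/10 * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker1 -o $HOME/Workers/Worker1.out -e $HOME/Workers/Worker1.err php guiworker.php 0 worker1
1-59/10 * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker2 -o $HOME/Workers/Worker2.out -e $HOME/Workers/Worker2.err php guiworker.php 0 worker2
2-59/10 * * * * cd $HOME/public_html && jsub -quiet -mem 512m -cwd -once -N worker3 -o $HOME/Workers/Worker3.out -e $HOME/Workers/Worker3.err php guiworker.php 0 worker3
```

The range-with-step form (for example 1-59/10) is standard crontab syntax. If more than 3 workers are re-enabled later, the same pattern extends by picking further offsets within the interval.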