
Can not submit Grid jobs in Toolforge
Closed, Resolved, Public

Description

I have a bot on the toolserver. It uses Node.js to run its tasks.
Since midnight today (2019-07-26 00:01 UTC), I have been unable to run my bot: https://tools.wmflabs.org/sge-jobs/tool/cewbot

When I run qstat-full, it shows this:
6332752 0.00000 cron-tools.cewbot-20180511.headline tools.cewbot qw 2019-07-26T05:05:04 1
The job is just queued and never goes any further.

The node version changed to v12.7.0 yesterday, but I don't think that matters, since I also can't run scripts using /usr/bin/node.
I have changed the description of the bot; there were no other changes.

I tried the simplest script:
/usr/bin/jsub -N env_on_jobs -once -quiet node /data/project/cewbot/wikibot/env.js
But it still hung:
6333079 0.00000 env_on_jobs tools.cewbot qw 2019-07-26T05:17:45 1

Here is the script:
https://github.com/kanasimi/wikibot/blob/master/archive/env.js

The scripts run normally from the command line.

This also won't run:
/usr/bin/jsub -N test_job -once -quiet perl -e "print 12345"
qstat-full
6334080 0.00000 test_job tools.cewbot qw 2019-07-26T05:52:04 1

I see that others can still run their tasks: https://tools.wmflabs.org/sge-status/
A friend helped me by running the script, and he executed it successfully: https://tools.wmflabs.org/sge-jobs/tool/hamishbot
So it seems to be a problem with my tool only, but I don't know what is going wrong.

My bot also runs daily jobs on several wikis, so this is very troublesome for me.
Could someone help me? Thank you.


Event Timeline

Update:

Some tasks got through just now. So it's not just my problem?

But when I try to execute new tasks, they hang again.

It seems that when I run a .js file manually in the toolserver CLI, the corresponding task (the same .js file) also gets through. But I still don't know why.

This comment was removed by Kanashimi.
Kanashimi claimed this task.

It seems to be back to normal.

Krinkle renamed this task from Can not run code in toolserver to Can not submit Grid jobs in Toolforge. Jul 26 2019, 2:45 PM

The situation has occurred again. Perhaps the grid engine does not have enough resources? So we just have to wait?

Am I understanding correctly that when you run /usr/bin/jsub -N env_on_jobs -once -quiet node /data/project/cewbot/wikibot/env.js the job env_on_jobs is queued unexpectedly for a very long time?

$ sudo become cewbot
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 966599 0.75000 cron-tools tools.cewbot Rr    04/10/2019 10:07:20 continuous@tools-sgeexec-0913.     1
 966619 0.59113 cron-tools tools.cewbot Rr    05/15/2019 15:43:27 continuous@tools-sgeexec-0914.     1
1754591 0.74831 cron-tools tools.cewbot r     04/10/2019 19:05:06 continuous@tools-sgeexec-0927.     1
1756453 0.74811 cron-tools tools.cewbot r     04/10/2019 20:07:05 continuous@tools-sgeexec-0906.     1
1994763 0.59103 cron-tools tools.cewbot Rr    05/15/2019 11:33:12 continuous@tools-sgeexec-0910.     1
1994764 0.72252 cron-tools tools.cewbot r     04/16/2019 11:31:00 continuous@tools-sgeexec-0923.     1
1994765 0.72252 cron-tools tools.cewbot r     04/16/2019 11:31:00 continuous@tools-sgeexec-0922.     1
1994766 0.72252 cron-tools tools.cewbot r     04/16/2019 11:31:00 continuous@tools-sgeexec-0916.     1
2012358 0.72057 cron-tools tools.cewbot r     04/16/2019 21:52:15 continuous@tools-sgeexec-0912.     1
2821843 0.63474 cron-tools tools.cewbot r     05/05/2019 20:02:06 continuous@tools-sgeexec-0936.     1
4227313 0.48687 cron-tools tools.cewbot r     06/07/2019 10:31:22 continuous@tools-sgeexec-0921.     1
5822724 0.32115 cron-tools tools.cewbot r     07/13/2019 23:26:17 continuous@tools-sgeexec-0934.     1
6091676 0.29103 cron-tools tools.cewbot r     07/20/2019 14:48:17 continuous@tools-sgeexec-0927.     1
6295283 0.26944 cron-tools tools.cewbot r     07/25/2019 09:04:08 continuous@tools-sgeexec-0904.     1
6498998 0.25000 cron-tools tools.cewbot r     07/29/2019 15:55:06 task@tools-sgeexec-0927.tools.     1
6499031 0.25000 cron-tools tools.cewbot r     07/29/2019 15:55:06 task@tools-sgeexec-0938.tools.     1

The Grid Engine is configured to limit each tool to a maximum of 16 concurrently running tasks. If there are 16 already running (like the qstat above shows) then additional jobs submitted will be queued and not started until one of the active tasks exits.
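
As a rough illustration of the check described above, here is a minimal Python sketch that counts running and waiting jobs in `qstat` text output. It assumes the column layout shown in this task (job state in the fifth column) and the limit of 16 stated above. The function names and the sample text are illustrative only and are not part of any Toolforge tooling.

```python
RUNNING_LIMIT = 16  # per-tool limit on concurrently running jobs, as stated above

# Illustrative sample assembled from rows quoted in this task.
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 966599 0.75000 cron-tools tools.cewbot Rr    04/10/2019 10:07:20 continuous@tools-sgeexec-0913.     1
1754591 0.74831 cron-tools tools.cewbot r     04/10/2019 19:05:06 continuous@tools-sgeexec-0927.     1
6333079 0.00000 env_on_jobs tools.cewbot qw 2019-07-26T05:17:45 1
"""


def count_jobs(qstat_output: str) -> dict:
    """Count running and waiting jobs in qstat text output."""
    counts = {"running": 0, "waiting": 0}
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; skip the header and the dashed rule.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        state = fields[4]
        if "r" in state:      # matches 'r' and 'Rr' (restarted, running)
            counts["running"] += 1
        elif "qw" in state:   # queued and waiting for a free slot
            counts["waiting"] += 1
    return counts


def has_free_slot(counts: dict, limit: int = RUNNING_LIMIT) -> bool:
    """Return True if a newly submitted job could start right away."""
    return counts["running"] < limit
```

Feeding the full `qstat` listing above into `count_jobs` would show all 16 slots in use, which is why newly submitted jobs stay in the `qw` state.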

@bd808 thank you for explaining the reason. May I increase the limit?


Currently we do not have a process to raise the limit for an individual tool. The current limit is a global configuration which applies to all users of the job grid.

In the short term, here are some things you could do to work within the limit:

  • Combine some of your continuous jobs into a single job that does more things
  • Replace some or all continuous jobs with periodic jobs (cron) that do a thing and then exit to allow another job to start
  • Create a second tool account (for example 'cewbot2') and create new jobs there
  • Create a separate tool account for each project (for example 'cewbot-zhwiki') and move jobs as appropriate
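
The first two suggestions can be sketched as a single entry point that runs several tasks in turn and then exits, so that one grid slot is used briefly instead of several being held continuously. This is only a sketch in Python under stated assumptions: the task names are hypothetical placeholders, and a Node.js bot would use the same shape in its own language.

```python
from typing import Callable, Dict


def run_all(tasks: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run every task in turn inside one process and report each outcome."""
    results = {}
    for name, task in tasks.items():
        try:
            task()
            results[name] = "ok"
        except Exception as exc:
            # Keep going so that one broken task does not block the others.
            results[name] = f"failed: {exc}"
    return results


# Hypothetical placeholder tasks; a real bot would do its wiki work here.
def check_signatures() -> None:
    pass


def update_headlines() -> None:
    pass


if __name__ == "__main__":
    outcome = run_all({
        "signature_check": check_signatures,
        "headline": update_headlines,
    })
    for name, status in outcome.items():
        print(f"{name}: {status}")
```

Submitted as one periodic job, a script shaped like this finishes and frees its slot once all tasks are done.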

@Kanashimi looking at https://tools.wmflabs.org/sge-jobs/tool/cewbot, another suggestion for splitting the tool up would be to make a 'cewbot-sigcheck' tool account and move your cewbot-20170515.signature_check.* jobs there.