Page MenuHomePhabricator

Tool Labs: jsub starts multiple instances of tasks declared as "once"
Closed, ResolvedPublic

Description

In bug 53629, Coren mentioned: "there is no locking, so if you jstart twice within a very short period of time (a few seconds) both invocations would so none running and start" "Locking would be a reasonable added safeguard, and even when cron gets replaced it would remain useful, but has a few implementation gotchas that will be tricky to get right. Nevertheless, having a bug for it would not be a bad thing."

Details

Reference
bz60862

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 2:55 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz60862.
  • Bug 69867 has been marked as a duplicate of this bug. ***

Hm.. if bug 69867 is a dupe of this, how come it was jstarted twice in such a short period of time? It runs every 2 or 5 minutes. Surely 'jstart' doesn't take that long to run?

I assume (= my reasoning for flagging bug #69867 as a dupe) that network congestion and/or load on the client, the SGE server or the DNS server can cause jstart to take such "long" to run (or fail at the "check if job is running" phase).

Happened again at tools.wmfdbbbot:

qstat:
3343394 0.31914 dbbot-wm tools.wmfdbb r 08/21/2014 18:40:18 continuous@tools-exec-13.eqiad 1
3488693 0.30383 dbbot-wm tools.wmfdbb r 08/27/2014 04:35:23 continuous@tools-exec-10.eqiad 1
3505161 0.30205 dbbot-wm tools.wmfdbb r 08/27/2014 19:40:27 continuous@tools-exec-06.eqiad 1

crontab:
*/5 * * * * /usr/bin/jsub -N dbbot-wm -once -continuous -quiet -mem 1280M -o /dev/null php ~/apps/ts-krinkle-Kribo/Init.php

And tools.ecmabot:

3343333 0.31965 ecmabot-wm tools.ecmabo r 08/21/2014 18:36:18 continuous@tools-exec-09.eqiad 1
3371234 0.31670 ecmabot-wm tools.ecmabo r 08/22/2014 19:36:27 continuous@tools-exec-08.eqiad 1
3376532 0.31614 ecmabot-wm tools.ecmabo r 08/23/2014 00:22:27 continuous@tools-exec-10.eqiad 1
3383752 0.31536 ecmabot-wm tools.ecmabo r 08/23/2014 07:00:27 continuous@tools-exec-13.eqiad 1
3450460 0.30830 ecmabot-wm tools.ecmabo r 08/25/2014 18:56:16 continuous@tools-exec-10.eqiad 1
3467240 0.30656 ecmabot-wm tools.ecmabo r 08/26/2014 09:44:23 continuous@tools-exec-11.eqiad 1
3477929 0.30546 ecmabot-wm tools.ecmabo r 08/26/2014 19:06:23 continuous@tools-exec-01.eqiad 1
3484066 0.30482 ecmabot-wm tools.ecmabo r 08/27/2014 00:34:23 continuous@tools-exec-11.eqiad 1
3484971 0.30472 ecmabot-wm tools.ecmabo r 08/27/2014 01:20:23 continuous@tools-exec-11.eqiad 1
3486786 0.30454 ecmabot-wm tools.ecmabo r 08/27/2014 02:56:23 continuous@tools-exec-04.eqiad 1
3488164 0.30440 ecmabot-wm tools.ecmabo r 08/27/2014 04:06:23 continuous@tools-exec-13.eqiad 1
3495944 0.30356 ecmabot-wm tools.ecmabo r 08/27/2014 11:16:23 continuous@tools-exec-06.eqiad 1
3499118 0.30321 ecmabot-wm tools.ecmabo r 08/27/2014 14:10:26 continuous@tools-exec-13.eqiad 1
3503921 0.30270 ecmabot-wm tools.ecmabo r 08/27/2014 18:34:27 continuous@tools-exec-04.eqiad 1
3506230 0.30245 ecmabot-wm tools.ecmabo r 08/27/2014 20:38:27 continuous@tools-exec-06.eqiad 1

Is there some change we're expected to make in how we invoke it from cron?

(In reply to Krinkle from comment #5)

[...]

Is there some change we're expected to make in how we invoke it from cron?

You can use bigbrother as described in http://permalink.gmane.org/gmane.org.wikimedia.labs/2757. As there is only one bigbrother instance that processes all jobs sequentially, no race conditions can occur. The interval between checks is 10 seconds when idling.

Still happening. ecmabot-wm had many concurrent instances running despite -once.

(In reply to Tim Landscheidt from comment #6)

(In reply to Krinkle from comment #5)

[...]

Is there some change we're expected to make in how we invoke it from cron?

You can use bigbrother as described in
http://permalink.gmane.org/gmane.org.wikimedia.labs/2757. As there is only
one bigbrother instance that processes all jobs sequentially, no race
conditions can occur. The interval between checks is 10 seconds when idling.

I'm generally avoiding to use things recommended in mailing lists because they're fixed in time. There's no way of telling whether that recommendation has changed, and for new users they'd never know. If this is really what users should be using instead of jsub, maybe it should be documented at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help

Krinkle renamed this task from Tool Labs: jsub should prevent starting duplicate jobs for -once tasks to Tool Labs: jsub starts multiple instances of tasks declared as "once".EditedJan 13 2015, 9:54 PM
Krinkle updated the task description. (Show Details)
Krinkle set Security to None.

Still happening. Just had to kill a dozen instances of ecmabot that built up over time.

tools.ecmabot@tools-login:~$ cat crontab.txt

*/2 * * * * /usr/bin/jsub -N ecmabot-wm -once -continuous -quiet -stderr -mem 1700M node ~/apps/oftn-bot/wm-ecmabot.js > /dev/null 2>&1
$ qstat
5157291 0.42391 ecmabot-wm tools.ecmabo r     10/27/2014 20:02:06 continuous@tools-exec-04.eqiad     1        
5319489 0.41475 ecmabot-wm tools.ecmabo r     11/02/2014 14:34:10 continuous@tools-exec-09.eqiad     1        
6663769 0.33684 ecmabot-wm tools.ecmabo r     12/21/2014 16:48:23 continuous@tools-exec-13.eqiad     1        
6811426 0.32820 ecmabot-wm tools.ecmabo r     12/27/2014 03:23:32 continuous@tools-exec-04.eqiad     1        
6819919 0.32769 ecmabot-wm tools.ecmabo r     12/27/2014 11:08:05 continuous@tools-exec-02.eqiad     1  

.. and again. These were active on Jan 29

$ qstat
4486516 0.48030 dbbot-wm   tools.wmfdbb Rr    01/28/2015 06:53:14 continuous@tools-exec-02.eqiad     1        
7131408 0.33326 dbbot-wm   tools.wmfdbb Rr    01/27/2015 22:26:03 continuous@tools-exec-13.eqiad     1        
7664329 0.30319 dbbot-wm   tools.wmfdbb r     01/27/2015 22:26:33 continuous@tools-exec-05.eqiad     1        
coren removed coren as the assignee of this task.Mar 25 2015, 3:21 PM
coren triaged this task as High priority.
coren subscribed.

Timing shows this indeed happens in cases of NFS stalls or other systemic congestion (such as the recent hardware outages). Adding locking to jstart should fix this for good.

Indeed. Just spotted that dbbot-wm on IRC had three concurrent incarnations running. Presumably caused by the NFS maintenance earlier this week? I've migrated wmfdbbot to bigbrother. (I didn't realise it supports jsub in addition to web services.)

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
8070590 0.36091 dbbot-wm   tools.wmfdbb Rr    03/30/2015 21:48:13 continuous@tools-exec-11.eqiad     1        
8404582 0.34471 dbbot-wm   tools.wmfdbb Rr    03/30/2015 21:47:43 continuous@tools-exec-02.eqiad     1        
9503603 0.30064 dbbot-wm   tools.wmfdbb r     03/31/2015 05:21:30 continuous@tools-exec-06.eqiad     1        
9504117 0.30062 dbbot-wm   tools.wmfdbb r     03/31/2015 06:08:54 continuous@tools-exec-11.eqiad     1  
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
8404516 0.35188 ecmabot-wm tools.ecmabo Rr    03/30/2015 21:47:43 continuous@tools-exec-06.eqiad     1        
8404554 0.35188 ecmabot-wm tools.ecmabo Rr    03/30/2015 21:48:43 continuous@tools-exec-15.eqiad     1        
9503458 0.30851 ecmabot-wm tools.ecmabo r     03/31/2015 05:15:47 continuous@tools-exec-02.eqiad     1        
9503518 0.30851 ecmabot-wm tools.ecmabo r     03/31/2015 05:20:45 continuous@tools-exec-14.eqiad     1        
9503698 0.30850 ecmabot-wm tools.ecmabo r     03/31/2015 05:30:04 continuous@tools-exec-11.eqiad     1        
9504089 0.30848 ecmabot-wm tools.ecmabo r     03/31/2015 06:08:11 continuous@tools-exec-15.eqiad     1        
9504160 0.30846 ecmabot-wm tools.ecmabo r     03/31/2015 06:16:43 continuous@tools-exec-14.eqiad     1        
9551773 0.30615 ecmabot-wm tools.ecmabo r     04/02/2015 02:15:11 continuous@tools-exec-14.eqiad     1        
9551989 0.30611 ecmabot-wm tools.ecmabo r     04/02/2015 03:21:46 continuous@tools-exec-15.eqiad     1        
9552018 0.30611 ecmabot-wm tools.ecmabo r     04/02/2015 03:22:01 continuous@tools-exec-15.eqiad     1        
9552824 0.30599 ecmabot-wm tools.ecmabo r     04/02/2015 05:19:42 continuous@tools-exec-13.eqiad     1        
9587426 0.30456 ecmabot-wm tools.ecmabo r     04/03/2015 08:39:11 continuous@tools-exec-08.eqiad     1        
[23:37 UTC] tools.ecmabot at tools-bastion-01.eqiad.wmflabs in ~
$ qdel 8404516 8404554 9503458 9503518 9503698 9504089 9504160 9551773 9551989 9552018 9552824 9587426
...
# crontab
*/2 * * * * /usr/bin/jsub -N ecmabot-wm -once -continuous -quiet -stderr -mem 1700M node ~/apps/oftn-bot/wm-ecmabot.js > /dev/null 2>&1
valhallasw lowered the priority of this task from High to Medium.Aug 2 2015, 1:35 PM
valhallasw subscribed.

I got hit by this again after some NFS issue / maintenance:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2176874 0.33696 php_dispat tools.liange Rr    01/21/2016 20:14:03 continuous@tools-exec-1403.eqi     1        
2557022 0.31185 lighttpd-l tools.liange r     01/21/2016 21:30:57 webgrid-lighttpd@tools-webgrid     1        
2674248 0.30267 php_cleanu tools.liange r     01/25/2016 23:03:03 task@tools-exec-1405.eqiad.wmf     1        
2695870 0.30098 php_dispat tools.liange r     01/26/2016 17:01:31 continuous@tools-exec-1408.eqi     1        
2699329 0.30069 php_cleanu tools.liange r     01/26/2016 20:04:51 task@tools-exec-1401.eqiad.wmf     1        
2710470 0.30000 php_popula tools.liange r     01/27/2016 03:27:03 task@tools-exec-1407.eqiad.wmf     1

job ID 2176874 and job ID 2695870, but note that 2176874 was started a long time ago than 2695870.

I've tried to reproduce this issue without success. Running jstart multiple times in sequence with -once never started a new task.

Closing task for now since there have been too many changes to the infrastructure since 2014-2016. Feel free to re-open if the issue happens again.