
Dibot running many copies of same job on job grid
Closed, Resolved (Public)

Description

The report at https://tools.wmflabs.org/grid-jobs/tool/dibot currently shows many duplicate jobs running for dibot:

Job 	Total seen 	Active 	Last seen (exit)
filemoves_replacer 	84 	31 	Currently running
inc_check 	129 	33 	Currently running
inc_image 	4 	2 	Currently running
inc_main 	7 	5 	Currently running
inc_mritog 	13 	5 	Currently running
inc_redirect_deleter 	7 	2 	Currently running
inc_remindbot 	3 	0 	2018-06-22 03:26
lighttpd-dibot 	1 	1 	Currently running
nullbot 	13 	4 	Currently running
pats-gadget 	84 	30 	Currently running
removeout 	1 	0 	2018-06-19 23:58
statbot 	3 	1 	Currently running

The crontab for this tool includes the -once flag for all of these jobs except the pats-gadget and filemoves_replacer jobs. In theory this flag should have prevented multiple jobs with the same name from starting. In practice it obviously did not.

There are two issues to address here:

  • Stopping extra jobs that are running to free up grid engine capacity for all tools
  • Understanding what could have made jsub -once fail like this
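
To scope the first item, the number of surplus jobs can be estimated from the report table above. The following is only a rough sketch: it assumes that exactly one running copy per job name is intended (which is what -once is meant to enforce), and the helper name `extra_jobs` is made up for illustration. The rows are copied from the table; the "Last seen" column is omitted.

```python
# Rows copied from the grid-jobs report in this task: job, total seen, active.
REPORT = """\
filemoves_replacer 84 31
inc_check 129 33
inc_image 4 2
inc_main 7 5
inc_mritog 13 5
inc_redirect_deleter 7 2
inc_remindbot 3 0
lighttpd-dibot 1 1
nullbot 13 4
pats-gadget 84 30
removeout 1 0
statbot 3 1
"""


def extra_jobs(report):
    """Return {job name: surplus copies}, assuming one copy per name is wanted."""
    extras = {}
    for line in report.splitlines():
        if not line.strip():
            continue
        name, _total_seen, active = line.split()
        active = int(active)
        # Assumption: anything beyond a single active copy is a duplicate.
        if active > 1:
            extras[name] = active - 1
    return extras


if __name__ == "__main__":
    extras = extra_jobs(REPORT)
    # Print the worst offenders first.
    for name, count in sorted(extras.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {count} extra")
    print("total extra:", sum(extras.values()))
```

Under that assumption, jobs with a single active copy (such as lighttpd-dibot and statbot) are left alone, and the bulk of the surplus comes from inc_check, filemoves_replacer, and pats-gadget.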

This may be related to T194380: Identify bots using AES128-SHA maintainers running on toolforge and to T195834: mono-based bot hangs after mono version upgrade, because these jobs all include a -v MONO_TLS_PROVIDER=btls flag in the crontab.

Event Timeline

bd808 created this task. Jun 22 2018, 6:18 PM
Restricted Application added a subscriber: Aklapper. Jun 22 2018, 6:18 PM
bd808 added a comment. Jun 22 2018, 6:21 PM

@Dmitry89, can I safely stop the duplicated jobs and expect that the next cron-triggered run will pick up where they left off? Also, can the concurrent jobs for pats-gadget and filemoves_replacer be limited using -once as well?

bd808 closed this task as Resolved. Jun 26 2018, 11:56 PM
bd808 claimed this task.
bd808 added subscribers: zhuyifei1999, MBH.

20:15:37 <Nemo_bis> This host is too busy with mono bots anyway ;) Reminded me to check back. Using T195834#4241876, there are 110 such processes from dibot and one from mbh (@MaxBioHazard yours is tools.mbh /mnt/nfs/labstore-secondary-tools-project/mbh/mono48/bin/mono-sgen /data/project/mbh/bots/retired_counter.exe).

Looking at dibot's crontab, it now contains -once and -mem 4G, so I'm assuming the maintainer added the args after the re-enable, and forgot to kill the affected jobs. I'll kill them.

bd808 reassigned this task from bd808 to zhuyifei1999. Jun 26 2018, 11:56 PM
Vvjjkkii renamed this task from Dibot running many copies of same job on job grid to zfaaaaaaaa. Jul 1 2018, 1:02 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed zhuyifei1999 as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
JJMC89 renamed this task from zfaaaaaaaa to Dibot running many copies of same job on job grid. Jul 1 2018, 4:26 AM
JJMC89 closed this task as Resolved.
JJMC89 assigned this task to zhuyifei1999.
JJMC89 raised the priority of this task from High to Needs Triage.
JJMC89 updated the task description.