bigbrother only watches users jobs if they already have a job running
Closed, Declined · Public

Assigned To

None

Authored By

	scfc
	Jan 30 2015, 5:22 PM

Description

bigbrother reads the list of all running jobs and then only looks for .bigbrotherrc in the home directories of users that appear in that list. So if a user wants to schedule their first job, or if all of a user's jobs were terminated due to an outage, it won't (re-)start those.

Related Objects

Status	Subtype	Assigned	Task
Declined		None	T91414 make bigbrother or its replacement reliable
Declined		None	T88122 bigbrother only watches users jobs if they already have a job running

Event Timeline

scfc created this task. Jan 30 2015, 5:22 PM

scfc raised the priority of this task from to Needs Triage.

scfc updated the task description.

scfc added a project: Toolforge.

scfc subscribed.

Restricted Application added a subscriber: Aklapper. Jan 30 2015, 5:22 PM

scfc moved this task from Backlog to Ready to be worked on on the Toolforge board. Feb 23 2015, 2:30 AM

All users with .bigbrotherrcs:

sudo find /data/project /home -mindepth 2 -maxdepth 2 -type f -name .bigbrotherrc | sed -ne 's!/data/project/\([^/]\+\)/\.bigbrotherrc$!tools.\1!p;' -e 's!/home/\([^/]\+\)/\.bigbrotherrc$!\1!p;' | sort

All users with jobs in qstat:

qstat -u \* -xml | sed -ne 's/^ *<JB_owner>\(.*\)<\/JB_owner>$/\1/p;' | sort -u

Users in the first list who are not in the second are the problem. At the moment there are only two, and both are irrelevant (one "user user" who wants a webservice, and one "tool user" with an empty .bigbrotherrc).
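The comparison of the two lists can be sketched as a set difference. This is only an illustration: the function name and the sample user names are made up, and in practice the inputs would be the output lines of the find and qstat commands above.

```python
def users_without_jobs(rcfile_users, job_owners):
    """Return users that have a .bigbrotherrc but own no job in qstat, sorted."""
    return sorted(set(rcfile_users) - set(job_owners))


# Hypothetical sample data standing in for the output of the two commands.
rcfile_users = ["alice", "tools.examplebot", "tools.otherbot"]
job_owners = ["tools.otherbot", "bob"]

for user in users_without_jobs(rcfile_users, job_owners):
    print(user)
```

Every name printed is a user whose .bigbrotherrc bigbrother would currently never read.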

yuvipanda added subscribers: yuvipanda, coren. Feb 23 2015, 2:20 PM

JanZerebecki added a parent task: T91414: make bigbrother or its replacement reliable. Mar 3 2015, 5:17 PM

scfc triaged this task as Lowest priority. Apr 6 2015, 5:53 AM

scfc mentioned this in T94500: bigbrother doesn't stop.

scfc set Security to None.

scfc merged a task: T105223: [Regression] BigBrother isn't handling jstart. Jul 9 2015, 12:42 AM

scfc added a subscriber: Krinkle.

Restricted Application added a project: Cloud-Services. Jul 9 2015, 12:42 AM

scfc mentioned this in T105223: [Regression] BigBrother isn't handling jstart. Jul 9 2015, 12:44 AM

@bd808, does the new bigbrother start jobs listed in .bigbrotherrc for users that have no jobs running at bigbrother start?

In T88122#2885398, @scfc wrote:

> @bd808, does the new bigbrother start jobs listed in .bigbrotherrc for users that have no jobs running at bigbrother start?

Likely not if the perl version didn't handle this. My rewrite was mostly a perl->python change to make it more likely that we would maintain the script.

In T88122#2885401, @bd808 wrote:

> Likely not if the perl version didn't handle this. My rewrite was mostly a perl->python change to make it more likely that we would maintain the script.

It looks like a user basically needs to have had an active job on the job grid at some point while the bigbrother.py process has been running for bigbrother to watch for and restart missing jobs. update_db looks at all running jobs on the grid and then checks for an rcfile for each job owner. Once an rcfile has been processed, it will be rechecked 0-60 minutes later.

One way to fix this bug would be to add a step between update_db and check_watches that looks for rcfiles that are not currently tracked in the internal state of the bigbrother.py process and loads them with read_config.
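A rough sketch of what that extra step could look like, not the actual bigbrother.py code. The helper name find_untracked_rcfiles, its parameters, and the shape of the tracked-user set are assumptions made for illustration; the directory roots and the tools.<name> naming follow the find/sed command earlier in this task.

```python
import glob
import os

# (root directory, user name prefix) pairs, mirroring the find/sed command
# earlier in this task: tools live in /data/project and run as tools.<name>.
DEFAULT_ROOTS = (("/data/project", "tools."), ("/home", ""))


def find_untracked_rcfiles(tracked_users, roots=DEFAULT_ROOTS):
    """Map user name -> rcfile path for rcfiles whose owner is not tracked yet.

    tracked_users is assumed to be a set of user names that the running
    process already knows about (hypothetical stand-in for its internal state).
    """
    untracked = {}
    for root, prefix in roots:
        pattern = os.path.join(root, "*", ".bigbrotherrc")
        for path in glob.glob(pattern):
            if not os.path.isfile(path):
                continue
            user = prefix + os.path.basename(os.path.dirname(path))
            if user not in tracked_users:
                untracked[user] = path
    return untracked


# Hypothetical use between update_db and check_watches; the real signature
# of read_config in bigbrother.py may differ:
#
#     for user, path in find_untracked_rcfiles(tracked_users).items():
#         read_config(user, path)
```

Scanning both directory trees on every pass could be slow on NFS, so a real implementation would probably want to run this step less often than the main loop.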

bd808 mentioned this in T154527: Bigbrother doesn't restart wikilinkbot. Jan 3 2017, 10:18 PM

Phabricator_maintenance removed a subscriber: yuvipanda. Jun 7 2017, 6:55 PM

bd808 mentioned this in T208357: toolforge - Deprecate BigBrother in Grid Engine. Oct 31 2018, 7:50 PM

BigBrother has been disabled in Toolforge.

https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Bigbrother_(Deprecated)

GTirloni closed this task as Declined. Dec 12 2018, 8:56 AM

GTirloni edited projects, added cloud-services-team (Kanban); removed Cloud-Services.