bigbrother only watches users' jobs if they already have a job running
Closed, Declined (Public)

Description

bigbrother reads the list of all running jobs and then looks for .bigbrotherrc only in the home directories of users that appear in that list. So if a user wants to schedule their first job, or all of a user's jobs were terminated due to an outage, bigbrother won't (re-)start them.

Event Timeline

scfc raised the priority of this task to Needs Triage.
scfc updated the task description.
scfc added a project: Toolforge.
scfc added a subscriber: scfc.

All users with .bigbrotherrcs:

sudo find /data/project /home -mindepth 2 -maxdepth 2 -type f -name .bigbrotherrc | sed -ne 's!/data/project/\([^/]\+\)/\.bigbrotherrc$!tools.\1!p;' -e 's!/home/\([^/]\+\)/\.bigbrotherrc$!\1!p;' | sort

All users with jobs in qstat:

qstat -u \* -xml | sed -ne 's/^ *<JB_owner>\(.*\)<\/JB_owner>$/\1/p;' | sort -u

Users in the first list who are not in the second are the problem. At the moment there are only two, and both are irrelevant (one "user user" who wants a webservice, and one "tool user" with an empty .bigbrotherrc).
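The comparison of the two lists can also be sketched in Python. This is only an illustration that mirrors the two sed expressions above; the function names (rc_owner, job_owners, unwatched_users) are made up for this example and are not part of bigbrother.

```python
import re

# Same path-to-user mapping as the sed expressions above:
# /data/project/<tool>/.bigbrotherrc -> tools.<tool>
# /home/<user>/.bigbrotherrc         -> <user>
TOOL_RC = re.compile(r"^/data/project/([^/]+)/\.bigbrotherrc$")
USER_RC = re.compile(r"^/home/([^/]+)/\.bigbrotherrc$")
JB_OWNER = re.compile(r"^ *<JB_owner>(.*)</JB_owner>$")


def rc_owner(path):
    """Map an rcfile path to its grid user name, or None if it does not match."""
    m = TOOL_RC.match(path)
    if m:
        return "tools." + m.group(1)
    m = USER_RC.match(path)
    if m:
        return m.group(1)
    return None


def job_owners(qstat_xml_lines):
    """Collect the distinct job owners from lines of `qstat -u \\* -xml` output."""
    owners = set()
    for line in qstat_xml_lines:
        m = JB_OWNER.match(line.rstrip("\n"))
        if m:
            owners.add(m.group(1))
    return owners


def unwatched_users(rc_paths, qstat_xml_lines):
    """Users that have a .bigbrotherrc but no job in qstat, sorted."""
    rc_users = {rc_owner(p.rstrip("\n")) for p in rc_paths} - {None}
    return sorted(rc_users - job_owners(qstat_xml_lines))
```

Feeding it the output lines of the find command and of qstat yields exactly the users that bigbrother currently ignores.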

scfc triaged this task as Lowest priority. Apr 6 2015, 5:53 AM
scfc set Security to None.

@bd808, does the new bigbrother start jobs listed in .bigbrotherrc for users that have no jobs running at bigbrother start?

Likely not if the perl version didn't handle this. My rewrite was mostly a perl->python change to make it more likely that we would maintain the script.

It looks like a user basically needs to have had an active job on the job grid at some point while the bigbrother.py process has been running for bigbrother to watch for and restart missing jobs. update_db looks at all running jobs on the grid and then checks for an rcfile for each job owner. Once the rcfile has been processed once, it will be rechecked 0-60 minutes later.

One way to fix this bug would be to add a step between update_db and check_watches that looks for rcfiles that are not currently tracked in the internal state of the bigbrother.py process and loads them with read_config.
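A rough sketch of what such a step might look like follows. It is not taken from bigbrother.py: the function names find_untracked_rcfiles and load_untracked, the shape of the tracked set, and the (user, path) call signature assumed for read_config are all hypothetical. The search locations are borrowed from the find command earlier in this task.

```python
import glob
import os

# Assumed rcfile locations and user-name prefixes, taken from the
# find/sed command earlier in this task, not from bigbrother.py itself.
DEFAULT_PATTERNS = [
    ("/data/project/*/.bigbrotherrc", "tools."),
    ("/home/*/.bigbrotherrc", ""),
]


def find_untracked_rcfiles(tracked, patterns=DEFAULT_PATTERNS):
    """Return {user: rcfile_path} for rcfiles whose owner is not in `tracked`.

    `tracked` stands in for whatever internal state records the users
    bigbrother already knows about (hypothetical).
    """
    found = {}
    for pattern, prefix in patterns:
        for path in sorted(glob.glob(pattern)):
            if not os.path.isfile(path):
                continue
            # The directory holding the rcfile names the user or tool.
            user = prefix + os.path.basename(os.path.dirname(path))
            if user not in tracked:
                found[user] = path
    return found


def load_untracked(tracked, read_config, patterns=DEFAULT_PATTERNS):
    """Call read_config(user, path) for every untracked rcfile.

    The (user, path) signature is an assumption; the real read_config
    may take different arguments.
    """
    loaded = []
    for user, path in find_untracked_rcfiles(tracked, patterns).items():
        read_config(user, path)
        loaded.append(user)
    return loaded
```

Called between update_db and check_watches on each pass, something along these lines would pick up users who have an rcfile but no running job. Globbing both directory trees on every pass may be expensive on a shared filesystem, so running it less often than the main loop is probably worth considering.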