bigbrother reads the list of all running jobs and then looks for .bigbrotherrc only in the home directories of users who appear in that list. So if a user wants to schedule their first job, or if all of a user's jobs were terminated by an outage, bigbrother won't (re-)start them.
|Declined||None||T91414 make bigbrother or its replacement reliable|
|Declined||None||T88122 bigbrother only watches users jobs if they already have a job running|
All users with .bigbrotherrcs:
sudo find /data/project /home -mindepth 2 -maxdepth 2 -type f -name .bigbrotherrc | sed -ne 's!/data/project/\([^/]\+\)/\.bigbrotherrc$!tools.\1!p;' -e 's!/home/\([^/]\+\)/\.bigbrotherrc$!\1!p;' | sort
All users with jobs in qstat:
qstat -u \* -xml | sed -ne 's/^ *<JB_owner>\(.*\)<\/JB_owner>$/\1/p;' | sort -u
Users in the first list who are not in the second are the problem. At the moment there are only two, and both are irrelevant (one "user user" who wants a webservice, and one "tool user" with an empty .bigbrotherrc).
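The comparison of the two lists can be sketched in Python. This is only an illustration: the function names `rc_path_to_owner` and `unwatched_owners` are made up here, and the path-to-owner mapping simply mirrors the sed expressions in the find command above.

```python
import re

# Same mapping as the sed expressions: tool accounts live under
# /data/project/<name> and become "tools.<name>"; user accounts
# live under /home/<name> and keep their name.
_TOOL_RC = re.compile(r"^/data/project/([^/]+)/\.bigbrotherrc$")
_USER_RC = re.compile(r"^/home/([^/]+)/\.bigbrotherrc$")


def rc_path_to_owner(path):
    """Return the job owner name for a .bigbrotherrc path, or None."""
    match = _TOOL_RC.match(path)
    if match:
        return "tools." + match.group(1)
    match = _USER_RC.match(path)
    if match:
        return match.group(1)
    return None


def unwatched_owners(rc_paths, job_owners):
    """Owners that have a .bigbrotherrc but no job in the qstat list."""
    rc_owners = {
        owner
        for owner in (rc_path_to_owner(p) for p in rc_paths)
        if owner is not None
    }
    return sorted(rc_owners - set(job_owners))
```

Feeding it the output lines of the find command (without the sed step) and of the qstat command yields the users bigbrother currently ignores.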
It looks like a user needs to have had an active job on the job grid at some point while the bigbrother.py process has been running for bigbrother to watch for and restart missing jobs. update_db looks at all running jobs on the grid and then checks for an rcfile for each job owner. Once the rcfile has been processed, it will be rechecked 0-60 minutes later.
One way to fix this bug would be to add a step between update_db and check_watches that looks for rcfiles that are not currently tracked in the internal state of the bigbrother.py process and loads them with read_config.
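A rough sketch of what such a step could look like follows. It is not based on the actual bigbrother.py internals: the `tracked` set stands in for whatever internal state the process keeps, `read_config` is passed in as a callable because its real signature is not shown here, and `find_rcfiles` and `load_untracked_rcfiles` are hypothetical helper names.

```python
import glob
import os

# Assumed layout, taken from the find command above: tool homes under
# /data/project (owner "tools.<name>"), user homes under /home.
RC_ROOTS = {"/data/project": "tools.", "/home": ""}


def find_rcfiles(roots=RC_ROOTS):
    """Return {owner: rcfile path} for every home dir with a .bigbrotherrc."""
    found = {}
    for root, prefix in roots.items():
        for path in glob.glob(os.path.join(root, "*", ".bigbrotherrc")):
            name = os.path.basename(os.path.dirname(path))
            found[prefix + name] = path
    return found


def load_untracked_rcfiles(tracked, read_config, rcfiles=None):
    """Load rcfiles of owners that are not tracked yet.

    tracked     -- set of owner names already known (hypothetical state)
    read_config -- callable(owner, path); the real signature may differ
    rcfiles     -- optional {owner: path} mapping, defaults to a disk scan
    """
    if rcfiles is None:
        rcfiles = find_rcfiles()
    loaded = []
    for owner, path in sorted(rcfiles.items()):
        if owner not in tracked:
            read_config(owner, path)
            tracked.add(owner)
            loaded.append(owner)
    return loaded
```

Called between update_db and check_watches on each pass, this would pick up users who have never had a job on the grid while the process was running, at the cost of one directory scan per pass.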