toolforge - Deprecate BigBrother in Grid Engine
Closed, Resolved · Public

Description

BigBrother is a monitoring tool used by a few tools hosted on the Toolforge grid engine. For more details, see:

https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Bigbrother

We would like to deprecate it because:

  • Very few users are actually using it
  • Similar functionality exists in Kubernetes (Deployments will keep the minimum number of replicas running)
  • It's a maintenance burden we would prefer not to have, and removing it simplifies cluster configuration
GTirloni created this task. Oct 30 2018, 9:11 PM
Bstorm added a subscriber: Bstorm. Oct 30 2018, 10:09 PM

I've made an email list of maintainers I could find via LDAP. We can communicate out a schedule for deprecation pretty easily with that if we can come up with one :)

So what is the alternative in the case of (Son of) Grid Engine? I understand k8s has such a feature built in, but for many wiki-editing bots that have many tasks, running on the grid is significantly easier than running on k8s, and many such bots (e.g. those that listen on eventstreams) need a way to ensure they are continuously running.

BigBrother gives a false sense of security in some ways, because it doesn't trigger alerts for reboot loops and things like that (which I've seen happen before). So I'm not sure it is providing very good service in the first place.

Really, what is needed is an alert on error or stoppage, which I think does go out sometimes via email already (though I haven't been happy with it as a tool maintainer myself--got nothing on many failures).

There are very few tools using bigbrother at this point, but I have to admit that some of those are pretty important and busy ones. Thinking...

bd808 added a subscriber: bd808. Oct 31 2018, 7:50 PM

many such bots (e.g. those that listen on eventstreams) need a way to ensure they are continuously running

This is the complete collection of jobs tracked by bigbrother in its last eval loop:

2018-10-31 19:49:38.173248
tools.asurabot:subster_irc:STARTING:1541015291.87:1541015403.87
tools.asurabot:rahu:Rr:2018-06-06 18:06:21:0
tools.usrd-tools:start:r:2018-06-06 18:12:26:0
tools.usrd-tools:USRDbot:r:2018-06-06 18:05:40:0
tools.webarchivebot:WebArchiveBOT:r:2018-10-20 16:15:16:0
tools.cluebot3:cluebot3:r:2018-07-18 23:24:02:0
tools.wlm-de-utils:commonsbot:r:2018-06-06 18:32:21:0
tools.csbot:csbot8:Rr:2018-06-06 16:16:53:0
tools.stewardbots:sulwatcher:r:2018-10-22 17:11:54:0
tools.stewardbots:stewardbot:r:2018-10-27 19:53:11:0
tools.wikilinkbot:linkbotv11:r:2018-10-28 08:14:03:0
tools.serobot:serobot-books:r:2018-08-31 18:00:43:0
tools.serobot:serobot:r:2018-10-08 12:23:06:0
tools.algo-news:celery:r:2018-10-08 09:28:36:0
tools.cluebotng:cbng_bot:r:2018-10-31 18:32:03:0
tools.cluebotng:cbng_core:Rr:2018-10-19 01:00:17:0
tools.cluebotng:cbng_relay:r:2018-09-24 07:27:54:0
tools.embeddeddata:rcwatcher:r:2018-09-30 15:39:33:0
tools.embeddeddata:worker1:r:2018-09-30 15:39:50:0
tools.embeddeddata:worker3:r:2018-09-30 15:39:52:0
tools.embeddeddata:worker2:r:2018-10-01 08:18:35:0
tools.embeddeddata:worker5:r:2018-10-01 12:22:30:0
tools.embeddeddata:worker4:r:2018-09-30 15:39:56:0
tools.urbanecmbot:patrolTrusted:r:2018-10-27 17:18:09:0
tools.urbanecmbot:patrolSandbox:r:2018-10-27 17:18:09:0
tools.urbanecmbot:patrolAfterPatrol:r:2018-10-27 17:18:11:0
tools.iabot:worker1:r:2018-09-27 00:50:15:0

14 tool accounts out of 2151 total tool accounts (0.65%) are making use of this shared service. Low usage is not in and of itself a reason to deprecate the service, but it does give some idea of the relative importance of this service to the larger Toolforge developer community.
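The account count can be re-derived from the listing above. What follows is a minimal Python sketch, assuming each entry has the form `tools.<account>:<job>:<state>:...` (a layout inferred from the dump, not from any documented format). The function name `tracked_tool_accounts` and the shortened `sample` are illustrative only.

```python
def tracked_tool_accounts(dump: str) -> set[str]:
    """Collect the distinct tool account names found in a bigbrother status dump."""
    accounts = set()
    for line in dump.splitlines():
        line = line.strip()
        # Skip blank lines and the timestamp header; entries start with "tools."
        if not line.startswith("tools."):
            continue
        account = line.split(":", 1)[0]  # e.g. "tools.cluebotng"
        accounts.add(account.removeprefix("tools."))
    return accounts


# Shortened sample taken from the listing above (assumed field layout).
sample = """\
2018-10-31 19:49:38.173248
tools.asurabot:subster_irc:STARTING:1541015291.87:1541015403.87
tools.asurabot:rahu:Rr:2018-06-06 18:06:21:0
tools.cluebotng:cbng_bot:r:2018-10-31 18:32:03:0
"""

print(sorted(tracked_tool_accounts(sample)))  # ['asurabot', 'cluebotng']
print(f"{14 / 2151:.2%}")                     # 0.65%
```

Run against the full listing, the same function should yield the 14 distinct accounts quoted above.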

Some related tickets:

Here are all the times BigBrother acted in 2018. It seems to be mostly masking errors in code and/or misunderstandings in how crontabs and BigBrother are configured.

I would like to propose we send an email to these 14 tools informing them BigBrother will be disabled in 30 days.

==> asurabot
Last event: 2018-11-22 14:50:10 info: Restarting job 'subster_irc'
Event count in year: 255151
Comment: Invalid command in .bigbrotherrc that never works (opened T210155)

==> cluebotng
Last event: 2018-11-22 14:48:08 info: Restarting job 'cbng_bot'
Event count in year: 4203
Comment: It seems the bot finishes its work and exits. BigBrother is used to run it again immediately. Non-continuous job masquerading as continuous through BigBrother.

==> stewardbots
Last event: 2018-11-20 10:23:46 info: Restarting job 'stewardbot'
Event count in year: 28
Comment: It seems the bot finishes its work and exits. BigBrother is used to run it again immediately. Non-continuous job masquerading as continuous through BigBrother.

==> urbanecmbot
Last event: 2018-11-18 01:38:08 info: Restarting job 'patrolAfterPatrol'
Event count in year: 378
Comment: Unhandled error causes the bot to die (usually "User 's52741' has exceeded the 'max_user_connections' resource (current value: 10)")

==> iabot
Last event: 2018-11-16 19:32:29 info: Restarting job 'worker1'
Event count in year: 1359
Comment: Race condition (There is a crontab submitting the same BigBrother job every 5 minutes with the same name)

==> embeddeddata
Last event: 2018-11-14 17:15:37 info: Restarting job 'worker3'
Event count in year: 83
Comment: Unhandled Python exceptions

==> webarchivebot
Last event: 2018-10-20 16:15:14 info: Restarting job 'WebArchiveBOT'
Event count in year: 32
Comment: Unclear why process died

==> ipwatcher
Last event: 2018-10-28 13:28:10 info: Restarting job 'monitorEdits'
Event count in year: 648
Comment: Surge of failures on 7/11-12. Other failures are due to unhandled exceptions ("Table 's53595__ipwatcher.ips' doesn't exist")

==> wikilinkbot
Last event: 2018-10-28 08:14:02 info: Restarting job 'linkbotv11'
Event count in year: 8
Comment: Race condition (There is a crontab submitting the same BigBrother job every 10 minutes with the same name)

==> serobot
Last event: 2018-08-31 18:00:41 info: Restarting job 'serobot-books'
Event count in year: 16
Comment: Unclear why process died

==> media-dubiety
Last event: 2018-08-08 18:45:03 info: Restarting job 'ircbot'
Event count in year: 521
Comment: Unclear why process died (surge of failures on 7/11, 4/1, 3/31 and 1/12)

==> cluebot3
Last event: 2018-07-18 23:24:00 info: Restarting job 'cluebot3'
Event count in year: 2
Comment: Probably first time running

==> wlm-de-utils
Last event: 2018-06-06 18:32:19 info: Restarting job 'commonsbot'
Event count in year: 2
Comment: Probably first time running

==> usrd-tools
Last event: 2018-06-06 18:12:24 info: Restarting job 'start'
Event count in year: 2
Comment: Probably first time running
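For anyone wanting to reproduce a per-job tally like the one above, here is a minimal Python sketch. It assumes log lines shaped like the "Last event" entries quoted in this summary. How the counts above were actually produced is not stated, and they may include events other than restarts. The names `RESTART_RE` and `count_restarts` are hypothetical, and the sample log is illustrative data, not real log content.

```python
import re
from collections import Counter

# Pattern inferred from the "Last event" lines quoted above (an assumption).
RESTART_RE = re.compile(
    r"^(?P<year>\d{4})-\d{2}-\d{2} \d{2}:\d{2}:\d{2} info: "
    r"Restarting job '(?P<job>[^']+)'"
)


def count_restarts(log_lines, year):
    """Count 'Restarting job' events per job name for the given year."""
    counts = Counter()
    for line in log_lines:
        match = RESTART_RE.match(line.strip())
        if match and int(match.group("year")) == year:
            counts[match.group("job")] += 1
    return counts


# Illustrative sample only; timestamps and job mix are made up.
log = [
    "2018-11-22 14:50:10 info: Restarting job 'subster_irc'",
    "2018-11-22 14:48:08 info: Restarting job 'cbng_bot'",
    "2017-12-31 23:59:59 info: Restarting job 'cbng_bot'",
    "2018-11-22 14:49:00 info: Restarting job 'subster_irc'",
]
print(count_restarts(log, 2018).most_common())
# [('subster_irc', 2), ('cbng_bot', 1)]
```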

So what is the alternative here, if the job must be restarted asap after it
exits? A crontab with ‘* * * * *’ to fire every minute to check the status
of the job?

(oh and no, generic grid exec nodes are not submit hosts, so I can’t just
fire up a continuous job that does the check periodically)

I don't know what the alternative is for grid jobs. In Kubernetes we could rely on the controllers taking care of this.

Suggestions are welcome!

I had time to check a few tools that had the race condition I mentioned previously. I assumed their crontab entries were enough and renamed .bigbrotherrc to bigbrotherrc.old (and added a comment explaining why). The tools were wikilinkbot, iabot and asurabot. If the crontab isn't triggered frequently enough, we can adjust it, but it seems various tools were already using cron to resubmit jobs over and over, which I think makes the impact of deprecating BigBrother even smaller.

Mentioned in SAL (#wikimedia-cloud) [2018-11-23T12:37:15Z] <gtirloni> Renamed .bigbrotherrc to bigbrotherrc.old (Tool is running on Kubernetes and doesn't use Grid Engine / BigBrother anymore) T208357

GTirloni added a comment. Edited Fri, Nov 23, 12:52 PM

Emailed maintainers asking permission to convert .bigbrotherrc to a crontab alternative:

  • cluebot3 & cluebotng
  • serobot
  • usrd-tools
  • stewardbots
  • webarchivebot
  • wlm-de-utils

I couldn't find the contact information for all of them, so I emailed at least one maintainer from each project (and found other contacts when emails bounced).

GTirloni triaged this task as Normal priority. Fri, Nov 23, 1:20 PM
GTirloni claimed this task.

Change 477330 had a related patch set uploaded (by Urbanecm; owner: Giovanni Tirloni):
[labs/tools/urbanecmbot@master] Add script to ensure continuous jobs are always running

https://gerrit.wikimedia.org/r/477330

Change 477330 merged by jenkins-bot:
[labs/tools/urbanecmbot@master] Add script to ensure continuous jobs are always running

https://gerrit.wikimedia.org/r/477330

Change 478898 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] toolforge: Absent BigBrother

https://gerrit.wikimedia.org/r/478898

Change 478898 merged by GTirloni:
[operations/puppet@production] toolforge: Absent BigBrother

https://gerrit.wikimedia.org/r/478898

Change 478926 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] toolforge: Remove BigBrother puppet code

https://gerrit.wikimedia.org/r/478926

Change 478926 merged by GTirloni:
[operations/puppet@production] toolforge: Remove BigBrother puppet code

https://gerrit.wikimedia.org/r/478926

BigBrother has been removed from Toolforge.

Users are encouraged to use cron to restart their jobs (see docs).
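As a rough illustration of the cron approach, the Python sketch below shows a check-then-submit step that a crontab entry could run periodically from a submit host. It is not the script from change 477330 or from the docs. It assumes the list of current jobs comes from XML such as `qstat -xml` output with `JB_name` elements, and that `submit` wraps whatever submission command the tool already uses. The function names and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable


def job_names_from_qstat_xml(xml_text: str) -> set[str]:
    """Extract job names from qstat-style XML (assumed layout with JB_name)."""
    root = ET.fromstring(xml_text)
    return {el.text for el in root.iter("JB_name") if el.text}


def ensure_running(
    job_name: str,
    current_jobs: Iterable[str],
    submit: Callable[[str], None],
) -> bool:
    """Submit job_name only if it is not already listed; return True if submitted."""
    if job_name in set(current_jobs):
        return False  # already queued or running, so avoid a duplicate
    submit(job_name)
    return True


# Trimmed, assumed example of what the scheduler might report.
SAMPLE_XML = """\
<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>1234</JB_job_number>
      <JB_name>patrolTrusted</JB_name>
    </job_list>
  </queue_info>
</job_info>
"""

submitted = []  # stand-in for a real submission command
jobs = job_names_from_qstat_xml(SAMPLE_XML)
for name in ("patrolTrusted", "patrolSandbox"):
    ensure_running(name, jobs, submitted.append)
print(submitted)  # ['patrolSandbox']
```

Checking before submitting is what keeps cron from queuing a second copy of a job that is still alive, which is the duplicate-name situation noted for some tools earlier in this task.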

GTirloni closed this task as Resolved. Tue, Dec 11, 1:18 PM

Mentioned in SAL (#wikimedia-cloud) [2018-12-11T13:18:59Z] <gtirloni> Removed BigBrother (T208357)