
Replace bigbrother and ssh-cron-thingy with service manifests
Closed, ResolvedPublic

Description

bigbrother currently lets tool authors put a .bigbrotherrc file in their tool's home directory, written in a quasi-familiar language, and have the bigbrother Perl daemon watch the listed jobs and restart them if they go down. This is 'opt in', requiring tool authors to explicitly create this file. It can be used for continuous jobs as well as webservices.

ssh-cron-thingy is the mechanism by which running crontab -e on tools-login (or elsewhere) actually SSHes to tools-submit and wraps your cron command in a jsub, so it runs on the grid instead of on the host itself.

A service manifest would be a single, clearly defined, extensible file that lets tool authors specify all of this, without having to worry about how it is executed.

It should support the following types of tasks, at least:

  1. Running webservices in varying languages
  2. Flexible cron scheduling
  3. 'Continuous' jobs that are always running

It should also support (later on?):

  1. Monitoring (alert if X happens)
  2. Meta info (author, description, license, etc.?)

Example structure: P327
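
For illustration only (this is not the actual P327 paste; every key below is made up), a manifest covering the three task types might be parsed like this:

import yaml  # PyYAML

manifest = yaml.safe_load("""
web:
  type: lighttpd                   # which webservice flavour to run
cron:
  - schedule: "0 * * * *"
    command: php purge_cache.php   # would run on the grid via jsub
continuous:
  - command: python irc_bot.py     # kept running at all times
""")
for section, config in manifest.items():
    print(section, config)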

Event Timeline

yuvipanda raised the priority of this task to Needs Triage.
yuvipanda updated the task description.
yuvipanda added subscribers: Aklapper, yuvipanda.

So it would be a daemon running on one of the hosts, constantly looking for changes to manifest files in tools' directories. It keeps no external state (everything is in memory), so the source of truth is always just the files themselves.

Cron is maintained by having it edit the appropriate crontab and wrap the commands with jsub.
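
A rough sketch of how the cron half could work; the paths, the manifest keys, and the use of PyYAML are all assumptions, not the real implementation:

import glob
import subprocess

import yaml

for path in glob.glob('/data/project/*/service.manifest'):
    manifest = yaml.safe_load(open(path))
    lines = ['%s jsub %s' % (e['schedule'], e['command'])
             for e in manifest.get('cron', [])]
    if lines:
        # The real daemon would install this as the tool's own user,
        # e.g. via sudo -u <tool> crontab -
        subprocess.run(['crontab', '-'], input='\n'.join(lines) + '\n',
                       text=True, check=True)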

I'm not convinced that combining the upstart-like thing and the cron-like thing into one mechanism, rather than having two separate ones, is the way to go. There seems to me to be very little in common between the two, and the result would end up being two distinct programs simply duct-taped together.

The idea is that the entire state of 'what needs to be run to make sure this tool runs' is captured in the manifest, similar to Heroku's Procfiles (https://devcenter.heroku.com/articles/procfile). So I don't particularly care whether we end up with two programs or one, as long as the information is in one standard form in one place.

Ah, yes, a single manifest is the cleanest way to go.

I wrote somewhere else that, IIRC, I wouldn't want to mash up bigbrother with jstart, and the same goes for crontab :-). In general, I'd prefer that we stick to standard Linux utilities, so that users (or admins :-)) who have experience with those feel at (a familiar) home, and those who don't learn something they can use elsewhere as well. But that falls into the wishlist category, not "OMG!!!eleven! The world's gonna end!"

What I like about crontab and dislike about bigbrother is the former's immediate feedback:

scfc@tools-login:~$ /usr/bin/crontab -l
You (scfc) are not allowed to use this program (/usr/bin/crontab)
See crontab(1) for more information
scfc@tools-login:~$

or:

scfc@tools-exec-15:~$ echo äöp | /usr/bin/crontab 
"-":0: bad minute
errors in crontab file, can't install.
scfc@tools-exec-15:~$

With bigbrother, I'm never quite sure whether the file is .bigbrother, bigbrotherrc or .bigbrotherrc, whether it picked up changes, whether it complained about something, etc. Yes, there's bigbrother.log, but it's asynchronous to the update, and if you try to set up a .bigbrotherrc in a regular user's home directory, no one will complain at all.

So I'd favour if we could use a similar mechanism for "service manifests".

(And all of this reminded me that I still want to set up a proof-of-concept filter from .bigbrotherrc to monit configuration for T76850.)

(I'll note that the only thing I really want is to make sure that we have service manifests. How we do that is still up for discussion. Anti-NIH-ness is the priority.)

(will be moved to Gerrit soon, and renamed, etc)

As for software design, it will be implemented as multiple stateless daemons that can easily be run on multiple hosts for redundancy. Each will do the following in a loop (sketched after the list):

  1. Collect all manifests
  2. Do 'something' about them (check if they are running, start them, fix cron, whatever)
  3. Go to 1
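
A skeleton of one such daemon (the function bodies are placeholder stubs):

import time

def collect_manifests():
    return []   # stub: read service.manifest from every tool's directory

def reconcile(manifest):
    pass        # stub: check the declared jobs, start whatever is missing

while True:
    for manifest in collect_manifests():   # 1. collect all manifests
        reconcile(manifest)                # 2. do 'something' about them
    time.sleep(60)                         # 3. go to 1

Because each pass rebuilds its view from the files, a daemon can die or be moved to another host without losing anything.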

The first one will be a webservice maintainer of sorts, which will ensure that webservices are up for tools that have a service manifest. Its step 2 should be (see the sketch after this list):

  1. Fork a child process with the tool's uid
  2. Check if appropriate webservice is running
  3. If so, ok. If not, start it.
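
Sketched out, with the check/start functions as hypothetical stubs:

import os
import pwd

def is_webservice_running(tool):
    return False    # stub: e.g. look for the tool's webservice job in qstat

def start_webservice(tool):
    pass            # stub: e.g. invoke the webservice start command

def ensure_webservice(tool):
    pid = os.fork()
    if pid == 0:
        # Child: drop privileges to the tool's uid/gid before doing anything.
        pw = pwd.getpwnam(tool)
        os.setgid(pw.pw_gid)
        os.setuid(pw.pw_uid)
        if not is_webservice_running(tool):
            start_webservice(tool)
        os._exit(0)
    os.waitpid(pid, 0)  # parent: reap the child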

This feels nicer / safer to me than a bunch of sudo calls, but of course the code will need to be carefully reviewed to avoid security issues. Limits also need to be put in place to make sure we don't end up forkbombing the host. This will also scale to when we have far too many machines, via simple consistent hashing: the collect phase will still be fairly quick, and then we just run actions on different parts of the list from different hosts.
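
By that I mean something along these lines (a plain modulus over a fixed host list stands in for a proper hash ring here, and the host names are made up):

import hashlib

HOSTS = ['tools-services-01', 'tools-services-02']
ME = 'tools-services-01'

def is_mine(tool):
    h = int(hashlib.md5(tool.encode()).hexdigest(), 16)
    return HOSTS[h % len(HOSTS)] == ME

# Each daemon collects the full list but only acts on its own share:
# [t for t in all_tools if is_mine(t)]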

Well, if the bigbrother replacement would use sudo, we could just use monit :-).

I don't think you need to fork child processes just for checking. Just parse qstat -xml | sed …, and there you'll have all the information about running jobs/web services for all users. Compare that to the list of jobs/web services that should be running, and you're done.
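
For example, with ElementTree instead of sed (the element names are from gridengine's qstat XML output as I remember it, so treat them as an assumption):

import subprocess
import xml.etree.ElementTree as ET

xml_out = subprocess.check_output(['qstat', '-u', '*', '-xml'])
tree = ET.fromstring(xml_out)
running = {(job.findtext('JB_owner'), job.findtext('JB_name'))
           for job in tree.iter('job_list')}
# Compare 'running' against the (owner, job name) pairs the manifests
# say should exist; anything missing needs to be (re)started.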

@scfc it will need to use sudo / fork/setuid at some point to start the webservice, no?

Yes, for the purpose of starting a job/web service (and that'd be no reason to use monit). But your comment above sounded like you thought you needed to sudo/fork + setuid for the checks as well.

Aha, so I just found out that some of my problems were because I was trying to parse 'qstat -j "*" -xml' instead of 'qstat -u "*" -xml', and the latter is far simpler...

Ok, with the latest commit it kind of works \o/

Still a lot to be done before we can turn this on:

  1. Documentation, both internal and external
  2. Code cleanup
  3. Generate service.manifest for all webtools
  4. Logging, both for the service itself and to the users to notify them of restarts
  5. Rate limiting? (I am unsure if we should actually do this, and if we do it should be pretty non-aggressive)
  6. Convert to asyncio for much better performance (since this just blocks on I/O a lot, it should be a pretty good boost)
  7. Security review
  8. T95095

(just bigbrother replacement so far, btw)

Regarding rate limiting: the problem with bigbrother is that in situations where the infrastructure fails and thus jobs never endure, its limit of 10 starts/day (?) can leave web services unrestarted for 23 hours once the infrastructure is back up. So IMHO it is important that the time window be small (one hour?).
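
Something like a sliding one-hour window would avoid that failure mode (the numbers here are only examples):

import time
from collections import defaultdict, deque

WINDOW = 3600   # seconds: forget restart attempts older than an hour
LIMIT = 10      # restarts allowed per tool per window

attempts = defaultdict(deque)

def may_restart(tool):
    now = time.time()
    q = attempts[tool]
    while q and q[0] < now - WINDOW:
        q.popleft()
    if len(q) >= LIMIT:
        return False    # over budget; try again once the window moves on
    q.append(now)
    return True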

As a Python newbie, asyncio makes me shudder :-). Is performance that important for this tool?

Yeah, rate limiting is going to be interesting, but I think I would rather start with no rate limiting and implement it later than the other way around. Also important are metrics, so we know when we need to add features.

As for performance: right now, starting a webservice can take quite a few seconds since it blocks, so if a large number of tools need restarting it could take forever. But maybe that isn't the end of the world.

I look at it the other way: if you start jobs in parallel, you need to take measures not to overload the server that the "launcher" runs on and not to overload the grid master (I don't know whether it has any built-in protections against that). So if a large number of tools need restarting, probably due to a crash, I'd try to avoid putting another monkey on your back, especially since, if there is a major outage, the maintainers of the 300th tool in line will not be overly thankful that their tool got back online a few minutes earlier :-).

Fair enough :)

/me strikes asyncio from list.

scfc triaged this task as Medium priority (Apr 6 2015, 5:53 AM).

So webservices are done now. Need to set tasks up for cron and worker services.

yuvipanda claimed this task.

Marking this as resolved because:

  1. We switched to using webservice monitor
  2. Future efforts are going to be focused on moving things to k8s rather than patching our gridengine-based setup further.