
Replace bigbrother and ssh-cron-thingy with service manifests
Closed, ResolvedPublic

Description

bigbrother currently lets tool authors put a .bigbrotherrc file in their tool's home directory, written in a quasi-familiar language, and have the bigbrother Perl daemon watch the listed jobs and restart them if they go down. This is 'opt in', requiring tool authors to explicitly create this file. It can be used for continuous jobs as well as webservices.

ssh-cron-thingy is the mechanism by which running crontab -e on tools-login (or elsewhere) actually SSHes to tools-submit and wraps your cron command in a jsub, so it runs on the grid instead of on the host itself.

A service manifest would be a single, clearly defined, extensible file that lets tool authors specify all of this, without having to worry about how it is executed.

It should support the following types of tasks, at least:

  1. Running webservices in varying languages
  2. Flexible cron scheduling
  3. 'Continuous' jobs that are always running

It should also support (later on?):

  1. Monitoring (alert if X happens)
  2. Meta info (author, description, license, etc.?)

Example structure: P327
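
For illustration only (this is not the actual P327 paste; every key below is made up), a manifest covering the three task types might be parsed like this:

import yaml  # PyYAML

manifest = yaml.safe_load("""
web:
  type: lighttpd                   # which webservice flavour to run
cron:
  - schedule: "0 * * * *"
    command: php purge_cache.php   # would run on the grid via jsub
continuous:
  - command: python irc_bot.py     # kept running at all times
""")
for section, config in manifest.items():
    print(section, config)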

Event Timeline

yuvipanda raised the priority of this task to Needs Triage.
yuvipanda updated the task description.
yuvipanda added subscribers: Aklapper, yuvipanda.

So it would be a daemon running on one of the hosts, constantly looking for changes to manifest files in tools' directories. It keeps no external state (everything is in memory), so the source of truth is always just the files themselves.

Cron is maintained by having it edit the appropriate crontab and wrap the commands with jsub.
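
A rough sketch of how the cron half could work; the paths, the manifest keys, and the use of PyYAML are all assumptions, not the real implementation:

import glob
import subprocess

import yaml

for path in glob.glob('/data/project/*/service.manifest'):
    manifest = yaml.safe_load(open(path))
    lines = ['%s jsub %s' % (e['schedule'], e['command'])
             for e in manifest.get('cron', [])]
    if lines:
        # The real daemon would install this as the tool's own user,
        # e.g. via sudo -u <tool> crontab -
        subprocess.run(['crontab', '-'], input='\n'.join(lines) + '\n',
                       text=True, check=True)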

I'm not convinced that combining the upstart-like thing and the cron-like thing into one mechanism, rather than having two separate ones, is the way to go. There seems to me to be very little in common between the two, and the result would end up being two distinct programs simply duct-taped together.

The idea is that the entire state of 'what needs to be run to make sure this tool runs' is captured in the manifest, similar to Heroku's Procfiles (https://devcenter.heroku.com/articles/procfile). So I don't particularly care whether we end up with two programs or one, as long as the information is in one standard form in one place.

Ah, yes, a single manifest is the cleanest way to go.

I wrote somewhere else that, IIRC, I wouldn't want to mash up bigbrother with jstart, and the same goes for crontab :-). In general, I'd prefer that we stick to standard Linux utilities, so that users (or admins :-)) who have experience with those feel at (a familiar) home, and those who don't learn something they can use elsewhere as well. But that falls into the wishlist category, not "OMG!!!eleven! The world's gonna end!"

What I like about crontab and dislike about bigbrother is the former's immediate feedback:

scfc@tools-login:~$ /usr/bin/crontab -l
You (scfc) are not allowed to use this program (/usr/bin/crontab)
See crontab(1) for more information
scfc@tools-login:~$

or:

scfc@tools-exec-15:~$ echo äöp | /usr/bin/crontab 
"-":0: bad minute
errors in crontab file, can't install.
scfc@tools-exec-15:~$

With bigbrother, I'm never quite sure whether the file is .bigbrother, bigbrotherrc or .bigbrotherrc, whether it picked up changes, whether it complained about something, etc. Yes, there's bigbrother.log, but it's asynchronous to the update, and if you try to set up a .bigbrotherrc in a regular user's home directory, no one will complain at all.

So I'd favour if we could use a similar mechanism for "service manifests".

(And all of this reminded me that I still want to set up a proof-of-concept filter from .bigbrotherrc to monit configuration for T76850.)

(I'll note that the only thing I really want is to make sure that we have service manifests. How we do that is still up for discussion. Anti-NIH-ness is the priority.)

(will be moved to Gerrit soon, and renamed, etc)

As for software design, it will be implemented as multiple stateless daemons that can easily be run on multiple hosts for redundancy. Each will do the following in a loop (sketched after the list):

  1. Collect all manifests
  2. Do 'something' about them (check if they are running, start them, fix cron, whatever)
  3. Go to 1
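
A skeleton of one such daemon (the function bodies are placeholder stubs):

import time

def collect_manifests():
    return []   # stub: read service.manifest from every tool's directory

def reconcile(manifest):
    pass        # stub: check the declared jobs, start whatever is missing

while True:
    for manifest in collect_manifests():   # 1. collect all manifests
        reconcile(manifest)                # 2. do 'something' about them
    time.sleep(60)                         # 3. go to 1

Because each pass rebuilds its view from the files, a daemon can die or be moved to another host without losing anything.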

The first one will be a webservice maintainer of sorts, which will ensure that webservices are up for tools that have a service manifest. Its step 2 should be (see the sketch after this list):

  1. Fork a child process with the tool's uid
  2. Check if appropriate webservice is running
  3. If so, ok. If not, start it.
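
Sketched out, with the check/start functions as hypothetical stubs:

import os
import pwd

def is_webservice_running(tool):
    return False    # stub: e.g. look for the tool's webservice job in qstat

def start_webservice(tool):
    pass            # stub: e.g. invoke the webservice start command

def ensure_webservice(tool):
    pid = os.fork()
    if pid == 0:
        # Child: drop privileges to the tool's uid/gid before doing anything.
        pw = pwd.getpwnam(tool)
        os.setgid(pw.pw_gid)
        os.setuid(pw.pw_uid)
        if not is_webservice_running(tool):
            start_webservice(tool)
        os._exit(0)
    os.waitpid(pid, 0)  # parent: reap the child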

This feels nicer / safer to me than a bunch of sudo calls, but of course the code will need to be carefully reviewed to avoid security issues. Limits also need to be put in place to make sure we don't end up forkbombing the host. This will also scale to when we have far too many machines, via simple consistent hashing: the collect phase will still be fairly quick, and then we just run actions on different parts of the list from different hosts.
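
By that I mean something along these lines (a plain modulus over a fixed host list stands in for a proper hash ring here, and the host names are made up):

import hashlib

HOSTS = ['tools-services-01', 'tools-services-02']
ME = 'tools-services-01'

def is_mine(tool):
    h = int(hashlib.md5(tool.encode()).hexdigest(), 16)
    return HOSTS[h % len(HOSTS)] == ME

# Each daemon collects the full list but only acts on its own share:
# [t for t in all_tools if is_mine(t)]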

Well, if the bigbrother replacement would use sudo, we could just use monit :-).

I don't think you need to fork child processes just for checking. Just parse qstat -xml | sed …, and there you'll have all the information about running jobs/web services for all users. Compare that to the list of jobs/web services that should be running, and you're done.
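
For example, with ElementTree instead of sed (the element names are from gridengine's qstat XML output as I remember it, so treat them as an assumption):

import subprocess
import xml.etree.ElementTree as ET

xml_out = subprocess.check_output(['qstat', '-u', '*', '-xml'])
tree = ET.fromstring(xml_out)
running = {(job.findtext('JB_owner'), job.findtext('JB_name'))
           for job in tree.iter('job_list')}
# Compare 'running' against the (owner, job name) pairs the manifests
# say should exist; anything missing needs to be (re)started.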

@scfc it will need to use sudo / fork/setuid at some point to start the webservice, no?

Yes, for the purpose of starting a job/web service (and that'd be no reason to use monit). But your comment above sounded like you thought you needed to sudo/fork + setuid for the checks as well.

Aha, so I just found out that some of my problems were because I was trying to parse 'qstat -j "*" -xml' instead of 'qstat -u "*" -xml', and the latter is far simpler...

Ok, with the latest commit it kind of works \o/

Still a lot to be done before we can turn this on:

  1. Documentation, both internal and external
  2. Code cleanup
  3. Generate service.manifest for all webtools
  4. Logging, both for the service itself and to the users to notify them of restarts
  5. Rate limiting? (I am unsure if we should actually do this, and if we do it should be pretty non-aggressive)
  6. Convert to asyncio for much better performance (since this just blocks on I/O a lot, it should be a pretty good boost)
  7. Security review
  8. T95095

(just bigbrother replacement so far, btw)

Regarding rate limiting: the problem with bigbrother is that in situations where the infrastructure fails and thus jobs never endure, its limit of 10 starts/day (?) can leave web services unrestarted for 23 hours once the infrastructure is back up. So IMHO it is important that the time window be small (one hour?).
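
Something like a sliding one-hour window would avoid that failure mode (the numbers here are only examples):

import time
from collections import defaultdict, deque

WINDOW = 3600   # seconds: forget restart attempts older than an hour
LIMIT = 10      # restarts allowed per tool per window

attempts = defaultdict(deque)

def may_restart(tool):
    now = time.time()
    q = attempts[tool]
    while q and q[0] < now - WINDOW:
        q.popleft()
    if len(q) >= LIMIT:
        return False    # over budget; try again once the window moves on
    q.append(now)
    return True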

As a Python newbie, asyncio makes me shudder :-). Is performance that important for this tool?

Yeah, rate limiting is going to be interesting, but I think I would rather start with no rate limiting and implement it later than the other way around. Also important are metrics, so we know when we need to add features.

As for performance: right now, starting a webservice can take quite a few seconds since it blocks, so if a large number of tools need restarting it could take forever. But maybe that isn't the end of the world.

I look at it the other way: if you start jobs in parallel, you need to take measures not to overload the server that the "launcher" runs on and not to overload the grid master (I don't know whether it has any built-in protections against that). So if a large number of tools need restarting, probably due to a crash, I'd try to avoid putting another monkey on your back, especially since, if there is a major outage, the maintainers of the 300th tool in line will not be overly thankful that their tool got back online a few minutes earlier :-).

Fair enough :)

/me strikes asyncio from list.

scfc triaged this task as Medium priority (Apr 6 2015, 5:53 AM).

So webservices are done now. Need to set tasks up for cron and worker services.

yuvipanda claimed this task.

Marking this as resolved because:

  1. We switched to using webservice monitor
  2. Future efforts are going to be focused on moving things to k8s rather than patching our gridengine-based setup further.