
Evaluate a 'cluster solution' for use on Tool Labs
Closed, Resolved · Public

Description

We need a gridengine replacement to schedule and manage arbitrary user applications in a flexible, user-friendly way.

The candidates that currently (IMO) combine substantial adoption, a healthy development community, and feature-completeness are:

  1. Mesos+Marathon+Chronos
  2. Kubernetes

The current solution in use is OGE, so that'll also be evaluated just for comparison.

The chosen product should at least:

  1. Allow arbitrary processes to be executed on an arbitrary number of machines, with specific resource requirements (a rough sketch of what this could look like is at the end of this description)
  2. Respond to node crashes by rescheduling user processes on a different machine
  3. Provide configurable process isolation (memory, CPU and networking)
  4. Offer proper user authentication / authorization that can tie into our existing system
  5. Offer an interface flexible enough to fully mimic our current Tool Labs workflows / setup

Bonuses:

  1. Allows running one-off tasks interactively
  2. Cron-like functionality to run user processes at specific times
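
For illustration only, here is a minimal sketch of what a resource-constrained, isolated one-off task could look like against a Kubernetes-style API, using the Kubernetes Python client. The image name, namespace, and paths are hypothetical placeholders, not an agreed-upon setup.

```
# Hedged sketch: submit a one-off task with explicit resource requests/limits.
# Everything named "example-*" is hypothetical.
from kubernetes import client, config

config.load_kube_config()  # assumes credentials for an evaluation cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="example-tool-task"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="OnFailure",  # criterion 2: rerun if the process or node dies
                containers=[
                    client.V1Container(
                        name="task",
                        image="example-registry/toollabs-trusty:latest",  # hypothetical image
                        command=["/data/project/example-tool/task.sh"],
                        resources=client.V1ResourceRequirements(          # criteria 1 and 3
                            requests={"cpu": "250m", "memory": "256Mi"},
                            limits={"cpu": "1", "memory": "512Mi"},
                        ),
                    )
                ],
            )
        )
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="example-tool", body=job)
```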

Event Timeline

yuvipanda raised the priority of this task to Needs Triage.
yuvipanda updated the task description.
yuvipanda added a project: Cloud-Services.
yuvipanda subscribed.
yuvipanda updated the task description.

We need a gridengine replacement to schedule and manage arbitrary user applications in a flexible, user-friendly way.

The candidates that currently (IMO) combine substantial adoption, a healthy development community, and feature-completeness are:

  1. Mesos+Marathon+Chronos
  2. Kubernetes

The chosen product should at least:

  1. Allow arbitrary processes to be executed on an arbitrary number of machines, with specific resource requirements
  2. Respond to node crashes by rescheduling user processes on a different machine
  3. Provide cron-like functionality to run user processes at specific times
  4. Provide configurable process isolation (memory, CPU and networking)
  5. Offer proper user authentication / authorization that can tie into our existing system
  6. Have a healthy development community
  7. Have substantial adoption
  8. Offer easy configurability/administration
    • preferably compatible with a puppet/hiera model

Bonuses:

  1. Allows running one off tasks interactively
  2. Cron-like functionality to run user processes at specific times
    • imo, this is a 'bonus' as we can always use normal cron instead
  3. interface which maps closely to the existing interface

3b. If the interface maps poorly to the existing one, then its being flexible enough to provide a legacy interface that does.

  1. Have a healthy development community
  2. Have substantial adoption
  3. Easy configurability/administration
    • preferably compatible with a puppet/hiera model

These are all already tracked in the spreadsheet: https://docs.google.com/spreadsheets/d/1YkVsd8Y5wBn9fvwVQmp9Sf8K9DZCqmyJ-ew-PAOb4R4/edit?usp=sharing

Bonuses:

  1. Allows running one off tasks interactively
  2. Cron-like functionality to run user processes at specific times
    • imo, this is a 'bonus' as we can always use normal cron instead

Fair enough, although I think it'll be a fairly big bonus.

  1. interface which maps closely to the existing interface

GridEngine's? I think that might be more of a negative than a positive - I like Coren's 3b below better :)

yuvipanda updated the task description.

Updated based on feedback.

A few points:

  • Do you and the users really like the current interface? Do we want users to still ssh into the system? I don't think that's a good idea, and neither of these systems allows that in a sane way, btw.
  • We cannot use cron - as in, traditional cron - on these systems, so that functionality might be important.
  • I don't think being able to set up a node with puppet will be an issue, but I don't see us configuring the running state of any of these clusters via puppet.

So the actual quarterly goal is to make an alternate way to run webservices available. Currently webservices are run as gridengine jobs in precise or trusty environments that are provisioned by puppet, as single processes, with the code coming from NFS. I think the logical migration step is to make them run in pre-built Docker containers that are exact replicas of the current precise / trusty environments (in terms of packages available, that is), running code from NFS but not on gridengine. This lets us take gridengine out of the equation. Once we complete this migration, we can just start allowing people to specify in generic declarative terms what containers they want to run and whether they want NFS or not, and then slowly migrate people to not want NFS...
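
To make the idea above concrete, a rough sketch using the Docker SDK for Python: a pre-built "trusty replica" image with the tool's code still coming from the existing NFS mount. The image name, tool name, and uid below are made up; the real containers and mounts would be provisioned differently.

```
# Hedged sketch: run a pre-built "trusty replica" image with code from NFS.
# All "example-*" names are hypothetical.
import docker

client = docker.from_env()

container = client.containers.run(
    "example-registry/toollabs-trusty-web:latest",    # hypothetical pre-built image
    command=["/data/project/example-tool/service.sh"],
    user="51234",                                      # run as the tool's uid, as on the grid
    volumes={
        "/data/project/example-tool": {"bind": "/data/project/example-tool", "mode": "rw"},
    },
    mem_limit="512m",
    detach=True,
)
print(container.short_id)
```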

  • Do you and the users really like the current interface? Do we want users to still ssh into the system? I don't think that's a good idea, and neither of these systems allows that in a sane way, btw.

An interactive bastion from which people can submit jobs doesn't seem such a terrible idea. I think anything that requires people to use git for everything is a non-starter atm, and we need a migration path from gridengine - so I think we need to have a bastion host. Of course, they don't need to be able to ssh into the execution nodes...

  • We cannot use cron - as in, traditional cron - on these systems, so that functionality might be important.

Can't we have a traditional cron fire off containers at certain times? Super ugly, yes - and I agree this functionality might be important.
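
For what it's worth, a rough sketch of that "ugly" approach, with hypothetical paths, image name, and schedule: a normal crontab entry (say, "0 * * * * /usr/bin/python /data/project/example-tool/cron_fire.py") could invoke a tiny wrapper that fires off a one-shot container.

```
# cron_fire.py - hedged sketch of traditional cron firing off a one-shot container.
# All "example-*" names and paths are hypothetical.
import subprocess

subprocess.check_call([
    "docker", "run", "--rm",
    "--memory", "256m",                                              # basic isolation
    "-v", "/data/project/example-tool:/data/project/example-tool",   # code/data from NFS
    "example-registry/toollabs-trusty:latest",                       # hypothetical image
    "/data/project/example-tool/hourly.sh",
])
```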

Another point that should be evaluated as well (maybe it's covered by the "Monitoring" part of the spreadsheet, but I thought I should add it) is accountability: if someone misuses a job to send e.g. spam or run a DDoS, are we able to easily pin down who did it?

@MoritzMuehlenhoff what kind of logs do you expect this to have? Just accounting of which jobs ran from which users at what times on what hosts?

@MoritzMuehlenhoff what kind of logs do you expect this to have? Just accounting of which jobs ran from which users at what times on what hosts?

At the minimum, that. Bonus points if it also logs additional data/statistics on resource consumption/network use (like how many megabytes of outbound traffic a job caused, etc.).
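
To pin down what that would mean concretely, here is a sketch of the kind of per-job accounting record being asked for. The field names are illustrative only and not taken from any particular scheduler.

```
# Hedged sketch: one per-job accounting record covering who/what/when/where,
# plus the bonus resource/network statistics. Field names are illustrative.
import json

record = {
    "tool": "example-tool",             # which user/tool the job belonged to
    "job_name": "hourly-job",
    "host": "exec-host-01",             # where it ran
    "started": "2015-08-06T12:00:00Z",  # when it ran
    "finished": "2015-08-06T12:03:21Z",
    "exit_code": 0,
    "cpu_seconds": 12.4,                # bonus: resource consumption
    "max_rss_bytes": 83886080,
    "net_tx_bytes": 1048576,            # bonus: e.g. ~1 MB of outbound traffic
}
print(json.dumps(record))
```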

So the actual quarterly goal is to make an alternate way to run webservices available. […]

Why? ("Goal" vs. "means", etc.) What is the problem that we are trying to solve at the moment and that thus is the benchmark for an alternative to the current setup? (Besides SGE being … selectively user- and administrator-friendly.)

We currently have SGE, and bigbrother, and webservicemonitor, and all of them would probably complain of a lot of missing TLC. Adding another element doesn't decrease the maintenance burden.

In addition, a Docker-type solution is orthogonal to the current Toolforge paradigm. And if that were the way forward (performance concerns and the like notwithstanding), I'd rather jump head first into a Google Cloud-like setup where we don't have to make users' existing software work, but can say: here is this shiny new thing, and if you want a bite, here's your account. And BTW: SCM mandatory. No legacy! :-) So +1 to #Tool-Cloud, but I don't want to see it hammered into the current environment.

I don't think a brand new environment will work - we had that opportunity during the toolserver migration but didn't take it (IMO). The only way to actually be able to kill our current set of things (SGE, NFS, bigbrother, webservicemonitor, the cronhack) is to remove them one at a time, keeping the interface consistent in a legacy setup while also providing new features (the entire cloud / container thingy) for people to make use of. Otherwise we end up having to support both of them, which is untenable... SCM mandatory + you have to redo a bunch of things and we're going to stop supporting the old thing = (well deserved) pitchforks :)

A deprecation schedule would be:

  1. SGE for webservices
  2. webservicemonitor
  3. SGE for continuous jobs
  4. bigbrother
  5. SGE (all of it!)
  6. NFS (eventually)

You get pitchforks if you force (however gently) all users to migrate to yet another new system with no reasoning and no compensation for the disturbance.

So if the reason is that it has been decreed from on high that we move to a Docker/container setup and scuttle Toolforge in its present form, then that should be part of the specification in this task's description, so that solutions can be measured against that yardstick.

You get pitchforks if you force (however gently) all users to migrate to yet another new system with no reasoning and no compensation for the disturbance.

So if the reason is that it has been decreed from on high that we move to a Docker/container setup and scuttle Toolforge in its present form, then that should be part of the specification in this task's description, so that solutions can be measured against that yardstick.

Ah, oops. So to quote from the quarterly goal again: "Allow Tool Labs users to experiment with their web services on the new cluster environment" - and that's all this is. There's no migration planned and no deprecation of gridengine planned. I've been so deep into it in my head that I've not been communicating clearly, and I apologize. I should've said 'offer an alternative to X so good that people would want to switch' earlier instead of 'deprecate'. There are a lot of things you can't do with GridEngine.

While not making any promises about supporting GridEngine forever, I can promise that we have no plans to force anyone to do any extra work at all to migrate in any form or way. Whatever we do, I think it'll be a hard requirement that all the current workflows are supported for at least the next few years undisturbed, with absolutely no work required from tools devs - I'll add this to the evaluation criteria explicitly.

So to re-iterate:

  1. This evaluation is for an opt-in alternative to GridEngine, rather than a replacement. But it should be at least as good as gridengine - hence the confusion.
  2. Anything this does will absolutely not require any tools devs to change their workflow unless they want new features. All the current stuff will continue working without any changes to workflow for a long time. It'll just provide more opt-in features.
  3. If there is a change to the default in the future, it'll be done and tested in such a way that it requires no extra work from the tool developers themselves. The webservice -> webservice2 move didn't require any extra work from devs, and neither did the simplification of the webproxy architecture - this will be done similarly.

Hopefully that clarifies things a bit. Thanks for bringing it up pre-pitchforks :)

Some general comments on the discussion:

  • The current toollabs has a lot of deficiencies and stability issues. It is also based on near-abandonware that was never designed to run webservices.
  • I think the execution paradigm that both kubernetes and marathon offer is superior to what OGE offers, and they also provide increased isolation, which is a nice security bonus.

So the problem I would like to solve is "find a way out of grid engine and our complete dependency on NFS" and also "make it easier for users to manage their own tools".

I think that most tool authors would be happy to be able to push their code to a specific VCS remote branch and see it deployed automatically, to be guaranteed that no manual intervention is needed if a node goes down, and to be assured that one "rogue" tool (or labs instance) writing a gazillion bytes to NFS won't kill their own tool's performance. I also imagine there are people who like the way they operate now / don't want to be bothered learning new things, and we will have to account for them - hence I concur with @yuvipanda's statement: we won't dismiss OGE now or in the near future.

We are also open to considering the results of this experiment: we're trying to build a saner, more comfortable environment for tool authors, but we're quite open to accepting negative feedback and either trying to improve on it or declaring the experiment aborted.

Finally, while I don't think we want to go "git only" either now or in the future, we do want to offer tool authors some easy interface for administering their tool. I'm not sure sshing into a bastion is the best way to do that. It could be as easy as using a web interface or a CLI tool from your terminal, similar to what most remote computing environments offer their users.

Hi, I work on Kubernetes, and I'm new to the Wikimedia environment. I see your requirement for "proper user authentication / authorization that can tie into our existing system." Where could I find out how your existing system works, so I can perhaps work on a solution for you?

So the current setup's authn/authz is:

  1. Users ssh into a bastion host. Groups / users / keys are managed in LDAP (see the rough sketch after this list)
  2. The unit of authn in Grid Engine is the unix user. So once I've ssh'd in, I can manipulate all jobs submitted by my unix user (create, delete, get accounting info, etc.)
  3. All jobs are also run as the user who submitted them. This is important since NFS is used to share code and data between instances.
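
Purely as illustration of point 1 above, here is a rough sketch of the kind of LDAP lookups involved, using the ldap3 library. The server, base DNs, and entry names are hypothetical placeholders; the real Wikimedia schema may differ.

```
# Hedged sketch: look up a tool's membership and a user's unix identity / ssh keys in LDAP.
# Server, base DNs, and names are hypothetical placeholders.
from ldap3 import Server, Connection, ALL

server = Server("ldap://ldap.example.org", get_info=ALL)
conn = Connection(server, auto_bind=True)  # anonymous, read-only bind

# Who may act as a given tool (group membership):
conn.search(
    "ou=servicegroups,dc=example,dc=org",   # hypothetical base DN
    "(cn=tools.example-tool)",
    attributes=["member"],
)
print(conn.entries)

# A user's unix identity and ssh keys:
conn.search(
    "ou=people,dc=example,dc=org",          # hypothetical base DN
    "(uid=exampleuser)",
    attributes=["uidNumber", "sshPublicKey"],
)
print(conn.entries)
```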

Thanks for checking in, @Etune! Awesome to have you here :) Let's take the Kubernetes-specific conversation to T107993?

We have decided to go with Kubernetes. Announcement coming soon :)

chasemp claimed this task.