We need a Grid Engine replacement to schedule and manage arbitrary user applications in a flexible, user-friendly way.
At the moment, the candidates with substantial adoption, a healthy development community, and a complete feature set are (IMO):
- Mesos+Marathon+Chronos
- Kubernetes
The solution currently in use is OGE, so it will also be evaluated for comparison.
The chosen product should at least:
- Allow arbitrary processes to be executed on an arbitrary number of machines with specific resource requirements
- Respond to node crashes by rescheduling user processes on a different machine
- Provide configurable process isolation (memory, CPU, and networking)
- Provide proper user authentication / authorization that can tie into our existing system
- Offer an interface flexible enough to fully mimic our current Tool Labs workflows / setup
Bonuses:
- Allow running one-off tasks interactively
- Provide cron-like functionality to run user processes at specific times
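
To make the first two requirements concrete (placement by resource requirements, and rescheduling after a node crash), here is a minimal sketch of the behaviour we would expect from any candidate. It is only an illustration: the `Node`, `Proc`, `place` and `handle_crash` names, the node names and the first-fit strategy are all made up for this example and are not the API or algorithm of Mesos, Marathon, Kubernetes or OGE.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu: float            # free CPU cores (hypothetical unit)
    mem_mb: int           # free memory in MiB
    alive: bool = True
    procs: list = field(default_factory=list)


@dataclass
class Proc:
    name: str
    cpu: float            # requested CPU cores
    mem_mb: int           # requested memory in MiB


def place(proc, nodes):
    """First-fit: run proc on the first live node with enough free resources."""
    for node in nodes:
        if node.alive and node.cpu >= proc.cpu and node.mem_mb >= proc.mem_mb:
            node.cpu -= proc.cpu
            node.mem_mb -= proc.mem_mb
            node.procs.append(proc)
            return node
    return None  # nothing fits; a real scheduler would queue the process


def handle_crash(dead, nodes):
    """Mark a node as dead and try to place its processes on other nodes."""
    dead.alive = False
    orphans, dead.procs = dead.procs, []
    return {p.name: place(p, nodes) for p in orphans}


# Hypothetical two-node cluster and one user process.
nodes = [Node("exec-01", cpu=4, mem_mb=8192), Node("exec-02", cpu=4, mem_mb=8192)]
webservice = Proc("webservice", cpu=1, mem_mb=512)

first = place(webservice, nodes)    # lands on the first node that fits
moved = handle_crash(first, nodes)  # that node dies; the process moves
```

In this toy version the process simply moves to the next node with room. The real candidates would additionally need to enforce the isolation and authentication requirements listed above, which the sketch does not attempt.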