We need a gridengine replacement that schedules and manages arbitrary user applications in a flexible, user-friendly way.
Candidates that currently have substantial adoption, a healthy development community, and a reasonably complete feature set are (IMO):
# Mesos+Marathon+Chronos
# Kubernetes
The solution currently in use is OGE, so it will also be evaluated for comparison.
The chosen product should at least:
# Allow arbitrary processes with specific resource requirements to be executed on an arbitrary number of machines
# Respond to node crashes by rescheduling user processes on a different machine
# Provide configurable process isolation (memory, CPU and networking)
# Provide proper user authentication / authorization that can tie into our existing system
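
To make the first two requirements concrete, here is a minimal sketch of what "place a process by its resource requirements" and "reschedule on node crash" mean in practice. It is a toy first-fit model, not how Mesos, Kubernetes or OGE implement scheduling; all names (`Job`, `Node`, `schedule`, `handle_node_crash`) are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    # Hypothetical job description: a name plus resource requirements.
    name: str
    cpus: float
    mem_mb: int


@dataclass
class Node:
    # Hypothetical machine with a fixed capacity and a list of placed jobs.
    name: str
    cpus: float
    mem_mb: int
    alive: bool = True
    jobs: list = field(default_factory=list)

    def free_cpus(self):
        return self.cpus - sum(j.cpus for j in self.jobs)

    def free_mem_mb(self):
        return self.mem_mb - sum(j.mem_mb for j in self.jobs)

    def fits(self, job):
        # A job fits only on a live node with enough free CPU and memory.
        return (self.alive
                and job.cpus <= self.free_cpus()
                and job.mem_mb <= self.free_mem_mb())


def schedule(job, nodes):
    """Place the job on the first node where it fits (first-fit).

    Returns the chosen node, or None if no node has enough capacity.
    """
    for node in nodes:
        if node.fits(job):
            node.jobs.append(job)
            return node
    return None


def handle_node_crash(crashed, nodes):
    """Mark a node as dead and try to reschedule its jobs elsewhere.

    Returns the list of jobs that could not be placed on any other node.
    """
    crashed.alive = False
    orphaned, crashed.jobs = crashed.jobs, []
    return [job for job in orphaned if schedule(job, nodes) is None]


if __name__ == "__main__":
    nodes = [Node("n1", cpus=4, mem_mb=8192), Node("n2", cpus=4, mem_mb=8192)]
    for job in (Job("web", 2, 4096), Job("big", 3, 2048), Job("tiny", 1, 1024)):
        placed = schedule(job, nodes)
        print(job.name, "->", placed.name if placed else "unschedulable")
    unplaced = handle_node_crash(nodes[0], nodes)
    print("could not reschedule:", [j.name for j in unplaced])
```

Even this toy version shows a question worth asking of each candidate: what happens to jobs that cannot be rescheduled because the remaining nodes lack capacity.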
Bonuses:
# Support for running one-off tasks interactively
# An interface flexible enough to allow eventual transparent full migration from gridengine
# Cron-like functionality to run user processes at specific times