We've had some issues with expensive analytics jobs causing our stat hosts to become unresponsive. To keep a single expensive job from destabilizing the whole host, we should use cgroups to limit per-user resource usage.
Specifically, we want to limit users to "no more than x% of CPU" when the host is loaded, without limiting their CPU % during normal operating periods.
The current anaconda-wmf/conda-analytics environments use SystemdSpawner to create the Jupyter jobs, so we could leverage systemd to implement cgroups.
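One possible reading of the "x% only when the host is loaded" requirement is proportional weighting rather than a hard cap. In systemd, `CPUWeight=` sets the cgroup v2 `cpu.weight` (range 1 to 10000, default 100), which only matters when sibling cgroups compete for CPU, whereas `CPUQuota=` caps usage at all times. The sketch below is only arithmetic to help pick values; `cpu_share_percent` and the weights in the example are hypothetical and not taken from our configuration.

```shell
#!/bin/sh
# Hypothetical helper (not part of systemd or SystemdSpawner):
# estimate the CPU share, in whole percent, that one cgroup gets when it
# and all of its siblings are busy at the same time.
# Usage: cpu_share_percent MY_WEIGHT SIBLING_WEIGHT...
cpu_share_percent() {
    mine=$1
    total=0
    # "$@" includes our own weight, so total is the sum over all competitors.
    for w in "$@"; do
        total=$((total + w))
    done
    # Integer division: the result is rounded down.
    echo $((mine * 100 / total))
}

# Assumed example: one unit at weight 50 competing with three units
# left at the default weight of 100.
cpu_share_percent 50 100 100 100
```

With nothing else running, the weight-50 unit could still use all idle CPU. Whether such a property can be passed through SystemdSpawner (for example via its `unit_extra_properties` option, if the deployed version has it) would need to be checked as part of this ticket.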
Based on my reading of the Bullseye manpage and a quick check on stat1010, cgroups v2 is supported and thus we have a few more resource types we can use:
$ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma
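The same check could be scripted so it can be repeated on the other stat hosts. This is a minimal sketch: `has_controller` is a hypothetical helper, and the list of controllers to look for (`cpu memory io`) is an assumption about what we would end up using.

```shell
#!/bin/sh
# Hypothetical helper: succeed if controller NAME is listed in FILE,
# where FILE is a cgroup v2 cgroup.controllers file (space-separated names).
# Usage: has_controller FILE NAME
has_controller() {
    file=$1
    name=$2
    # Compare whole words so that "cpu" does not match "cpuset".
    for c in $(cat "$file" 2>/dev/null); do
        [ "$c" = "$name" ] && return 0
    done
    return 1
}

controllers=/sys/fs/cgroup/cgroup.controllers
# Assumed set of controllers we care about; adjust as needed.
for needed in cpu memory io; do
    if [ -r "$controllers" ] && has_controller "$controllers" "$needed"; then
        echo "$needed: available"
    else
        echo "$needed: missing"
    fi
done
```

On a host without a unified cgroup v2 hierarchy the file is absent, so every controller is reported as missing.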
Creating this ticket to:
- Decide on implementation strategy
- Implement