
Implement cgroups for users' JupyterHub environments in order to mitigate resource contention on the stat servers
Closed, Resolved · Public

Description

We've had issues with expensive analytics jobs causing our stat hosts to become unresponsive. To keep a single expensive job from destabilizing the whole host, we should implement cgroups.
Specifically, we want to limit users to no more than X% of CPU when the host is loaded, without limiting their CPU usage during normal operating periods.

The current anaconda-wmf/conda-analytics environments use SystemdSpawner to create the Jupyter jobs, so we could leverage systemd to implement cgroups.

Based on my reading of the Bullseye manpage and a quick check on stat1010, cgroups v2 is supported, which gives us a few more resource controllers to work with:

 $ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma

Creating this ticket to:

  • Decide on implementation strategy
  • Implement

Event Timeline

There are some configuration options for the JupyterHub SystemdSpawner, which we use here.

Options that strike me as immediately interesting include mem_limit, cpu_limit, and slice.

The template that we use to configure it is: jupyterhub_config.py.erb
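
For illustration, a minimal sketch of how those options might be set in a jupyterhub_config.py (the values below are placeholders for discussion, not proposed settings, and assume a systemdspawner version that supports the slice option):

  # Placeholder values for illustration only; real limits would need tuning.
  c.JupyterHub.spawner_class = 'systemdspawner.SystemdSpawner'
  c.SystemdSpawner.mem_limit = '8G'         # per-user memory cap
  c.SystemdSpawner.cpu_limit = 4.0          # per-user CPU cap via CPUQuota (always applies)
  c.SystemdSpawner.slice = 'jupyter.slice'  # parent slice for the spawned user units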

BTullis renamed this task from Implement cgroups for anaconda-wmf/conda-analytics to Implement cgroups for users' JupyterHub environments in order to mitigate resource contention on the stat servers. (Aug 13 2024, 4:49 PM)
Gehel triaged this task as High priority. (Aug 14 2024, 8:33 AM)
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.
Gehel moved this task from Scratch to Toil / Automation on the Data-Platform-SRE board.

Change #1063237 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] stat (dse) hosts: enable CPU performance governor

https://gerrit.wikimedia.org/r/1063237

Change #1063237 merged by Bking:

[operations/puppet@production] stat (dse) hosts: enable CPU performance governor

https://gerrit.wikimedia.org/r/1063237

We discussed this a bit at standup today. A couple of key points:

  • Enabling the performance governor in T362922 helped, but was not sufficient to solve our resource issues.
  • CPU seems to be the resource in short supply (although I/O could also be a concern).
  • It would help to have a better idea of the performance needs of each job. We should probably discuss this with developers and come up with an agreed-upon way to share our scarce CPU resources.

These points weren't explicitly discussed, but I think they follow:

  • This problem will still exist when we move to Kubernetes.
  • Doing the work now on the current infrastructure will help inform the "requests" and "limits" values we would set in future Kubernetes jobs.

@bking re: the IRC question about "output a process list in the alert email body": we don't have facilities for that. However, we do have per-unit resource utilization available, which will give you a breakdown of what Jupyter users are doing, for example:

topk(15, sum by (id) (cluster_id:container_cpu_usage_seconds_total:rate5m{id=~".*jupyter.*\\.service$",site="eqiad",cluster="analytics"}) > 0)

and

topk(15, sum by (id) (cluster_id:container_memory_rss:sum{id=~".*jupyter.*\\.service$",site="eqiad",cluster="analytics"}) > 0)

HTH

I found T340492 and its unused kafka-stretch hosts. Just a heads-up that I'll be using kafka-stretch2001 for some cgroups v2 testing. It's ideal as it has the same number of cores as some of the stat hosts, and spinning disks (which are getting pretty hard to come by in the WMF infra ;) )

Just an update as I've been doing quite a bit of research:

  • In terms of systemd properties, we want to set CPUWeight, which only takes effect during resource contention, as opposed to CPUQuota, which always applies. These behaviors correspond to Kubernetes CPU requests and limits as described here, since systemd and Kubernetes use the same underlying technology (cgroups).
  • SystemdSpawner does not natively support CPUWeight, only CPUQuota, which means we can't use SystemdSpawner to implement this change.
  • However, since SystemdSpawner already creates its jobs within each user's slice (cgroup hierarchy), we should be able to implement this using the strategy described in systemd's user@.service man page (see the sketch after the quote below):

The processes of a single user are collected under user-UID.slice. Resource limits for that user can be configured through drop-ins for that unit, e.g. /etc/systemd/system/user-1000.slice.d/resources.conf. If the limits should apply to all users instead, they may be configured through drop-ins for the truncated unit name, user-.slice. For example, configuration in /etc/systemd/system/user-.slice.d/resources.conf is included in all user-UID.slice units...
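
As a rough sketch of that approach (the weight value and exact file path here are placeholders, not necessarily what the Puppet patch below ends up using):

  # /etc/systemd/system/user-.slice.d/resources.conf
  # Applies to every user-UID.slice. CPUWeight only takes effect under CPU
  # contention; when the host is idle, users can still use all available CPU.
  [Slice]
  CPUWeight=100

The effective value on a running slice can then be checked with, e.g., systemctl show user-1000.slice -p CPUWeight.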

Change #1071238 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] statistics hosts: enable CPUWeight (cgroupsv2)

https://gerrit.wikimedia.org/r/1071238

Change #1071238 merged by Bking:

[operations/puppet@production] statistics hosts: enable CPUWeight (cgroupsv2)

https://gerrit.wikimedia.org/r/1071238

Per the above merge, cgroups are now active in production. I've communicated this in Slack via the #data-engineering-collab channel.

The next step is to notify active users via email. To that end, I've pulled a list of users who've logged in to the stat hosts in the last year using lastlog.
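
For reference, roughly the sort of command used on each host (a sketch; lastlog -t DAYS restricts output to accounts with a login within the last DAYS days):

  # List accounts that have logged in within the last year, skipping the header line.
  lastlog -t 365 | tail -n +2 | awk '{print $1}' | sort -u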

The next step is to match those user accounts to email addresses. Per an IRC conversation with @MoritzMuehlenhoff, this can probably be accomplished by modifying this script.

OK, the email communication went out yesterday. That satisfies the acceptance criteria of this ticket, so I'm closing it out. We can reopen if we start to see negative effects.