
Implement cgroups for users' JupyterHub environments in order to mitigate resource contention on the stat servers
Closed, Resolved · Public

Description

We've had issues with expensive analytics jobs causing our stat hosts to become unresponsive. To keep a single expensive job from destabilizing the whole host, we should implement cgroups.
Specifically, we want to limit users to no more than X% of CPU when the host is loaded, without limiting their CPU usage during normal operating periods.

The current anaconda-wmf/conda-analytics environments use SystemdSpawner to create the Jupyter jobs, so we could leverage systemd to implement cgroups.

Based on my reading of the Bullseye manpage and a quick check on stat1010, cgroups v2 is supported, which gives us a few more resource controllers to work with:

 $ cat /sys/fs/cgroup/cgroup.controllers
cpuset cpu io memory hugetlb pids rdma

Creating this ticket to:

  • Decide on implementation strategy
  • Implement

Event Timeline

There are some configuration options for the JupyterHub SystemdSpawner, which we use here.

Options that strike me as immediately interesting include mem_limit, cpu_limit, and slice.

The template that we use to configure it is: jupyterhub_config.py.erb
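
For illustration, a minimal sketch of how those options might be set in a jupyterhub_config.py (the values below are placeholders for discussion, not proposed settings, and assume a systemdspawner version that supports the slice option):

  # Placeholder values for illustration only; real limits would need tuning.
  c.JupyterHub.spawner_class = 'systemdspawner.SystemdSpawner'
  c.SystemdSpawner.mem_limit = '8G'         # per-user memory cap
  c.SystemdSpawner.cpu_limit = 4.0          # per-user CPU cap via CPUQuota (always applies)
  c.SystemdSpawner.slice = 'jupyter.slice'  # parent slice for the spawned user units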

BTullis renamed this task from Implement cgroups for anaconda-wmf/conda-analytics to Implement cgroups for users' JupyterHub environments in order to mitigate resource contention on the stat servers. (Aug 13 2024, 4:49 PM)
Gehel triaged this task as High priority. (Aug 14 2024, 8:33 AM)
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.
Gehel moved this task from Scratch to Toil / Automation on the Data-Platform-SRE board.

Change #1063237 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] stat (dse) hosts: enable CPU performance governor

https://gerrit.wikimedia.org/r/1063237

Change #1063237 merged by Bking:

[operations/puppet@production] stat (dse) hosts: enable CPU performance governor

https://gerrit.wikimedia.org/r/1063237

We discussed this a bit at standup today. A couple of key points:

  • Enabling the performance governor in T362922 helped, but was not sufficient to solve our resource issues.
  • CPU seems to be the resource in short supply (although I/O could also be a concern).
  • It would help to have a better idea of the performance needs of each job. We should probably discuss this with developers and come up with an agreed-upon way to share our scarce CPU resources.

These points weren't explicitly discussed, but I think they follow:

  • This problem will still exist when we move to Kubernetes.
  • Doing the work now on the current infrastructure will help inform the "requests" and "limits" values we would set in future Kubernetes jobs.

@bking re: the IRC question about "output a process list in the alert email body": we don't have facilities for that. However, we do have per-unit resource utilization available, which will give you a breakdown of what Jupyter users are doing, for example:

topk(15, sum by (id) (cluster_id:container_cpu_usage_seconds_total:rate5m{id=~".*jupyter.*\\.service$",site="eqiad",cluster="analytics"}) > 0)

and

topk(15, sum by (id) (cluster_id:container_memory_rss:sum{id=~".*jupyter.*\\.service$",site="eqiad",cluster="analytics"}) > 0)

HTH

I found T340492 and its unused kafka-stretch hosts. Just a heads-up that I'll be using kafka-stretch2001 for some cgroups v2 testing. It's ideal as it has the same number of cores as some of the stat hosts, and spinning disks (which are getting pretty hard to come by in the WMF infra ;) )

Just an update as I've been doing quite a bit of research:

  • In terms of systemd properties, we want to set CPUWeight, which only takes effect during resource contention, as opposed to CPUQuota, which always applies. These behaviors correspond to Kubernetes CPU requests and limits as described here, since systemd and Kubernetes use the same underlying technology (cgroups).
  • SystemdSpawner does not natively support CPUWeight, only CPUQuota, which means we can't use SystemdSpawner to implement this change.
  • However, since SystemdSpawner already creates its jobs within each user's slice (cgroup hierarchy), we should be able to implement this using the strategy described in systemd's user@.service man page (see the sketch after the quote below):

The processes of a single user are collected under user-UID.slice. Resource limits for that user can be configured through drop-ins for that unit, e.g. /etc/systemd/system/user-1000.slice.d/resources.conf. If the limits should apply to all users instead, they may be configured through drop-ins for the truncated unit name, user-.slice. For example, configuration in /etc/systemd/system/user-.slice.d/resources.conf is included in all user-UID.slice units...
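
As a rough sketch of that approach (the weight value and exact file path here are placeholders, not necessarily what the Puppet patch below ends up using):

  # /etc/systemd/system/user-.slice.d/resources.conf
  # Applies to every user-UID.slice. CPUWeight only takes effect under CPU
  # contention; when the host is idle, users can still use all available CPU.
  [Slice]
  CPUWeight=100

The effective value on a running slice can then be checked with, e.g., systemctl show user-1000.slice -p CPUWeight.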

Change #1071238 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] statistics hosts: enable CPUWeight (cgroupsv2)

https://gerrit.wikimedia.org/r/1071238

Change #1071238 merged by Bking:

[operations/puppet@production] statistics hosts: enable CPUWeight (cgroupsv2)

https://gerrit.wikimedia.org/r/1071238

Per the above merge, cgroups are now active in production. I've communicated this in Slack via the #data-engineering-collab channel.

The next step is to notify active users via email. To that end, I've pulled a list of users who've logged in to the stat hosts in the last year using lastlog.
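
For reference, roughly the sort of command used on each host (a sketch; lastlog -t DAYS restricts output to accounts with a login within the last DAYS days):

  # List accounts that have logged in within the last year, skipping the header line.
  lastlog -t 365 | tail -n +2 | awk '{print $1}' | sort -u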

The next step is to match those user accounts to email addresses. Per an IRC conversation with @MoritzMuehlenhoff, this can probably be accomplished by modifying this script.

OK, the email communication went out yesterday. That satisfies the acceptance criteria of this ticket, so I'm closing it out. We can reopen if we start to see negative effects.