
Improve developer experience on stat hosts part 2
Closed, Resolved, Public

Description

Per this Slack thread, stat1011 went unresponsive today. As such, we need to continue improving the developer experience on stat hosts (as started in T373446). This includes:

  • Tuning alert thresholds (we did not get an alert even after the host was unresponsive for some time). It might be a good idea to send alerts to our shared Slack channels as well.
  • Performance optimizations and workload protections via cgroups (resource controls).

Removed "Update dashboards to include user-specific I/O stats" as an acceptance criterion, because memory, not I/O, turned out to be the scarce resource.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title                          | Reference                                  | Author | Source Branch | Dest Branch
Commit prior state of playbook | repos/search-platform/sre/ansible-cgroup!2 | bking  | cgroup        | main

Event Timeline

Gehel triaged this task as High priority. Oct 4 2024, 8:33 AM

Change #1079021 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] data-platform: alert on load15 > 32

https://gerrit.wikimedia.org/r/1079021

Change #1079021 merged by jenkins-bot:

[operations/alerts@master] data-platform: alert on load15 > 32

https://gerrit.wikimedia.org/r/1079021
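
For reference, here is a minimal sketch of what an alert like this looks like in Prometheus alerting-rule syntax. The rule name, "for" duration, and labels are illustrative assumptions; the merged Gerrit change above is the authoritative definition.

groups:
  - name: data-platform
    rules:
      - alert: StatHostHighLoad
        # node_load15 is node-exporter's 15-minute load average
        expr: node_load15 > 32
        for: 10m
        labels:
          severity: warning
          team: data-platform
        annotations:
          summary: "15-minute load on {{ $labels.instance }} is above 32"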

Adding some observations from our Slack thread.

@mfossati mentioned there are 2 types of jobs (edited):

  • Type 1 jobs run on the cluster; these are typically PySpark scripts (see the sketch after this list).
  • Type 2 jobs run locally on a stat box: for example, model training (read: CPU/GPU usage) or dataset collection (read: network/disk I/O). I assume that resources for cluster jobs (e.g., executor memory/cores) won’t impact stat boxes, while local jobs are much more delicate, so I generally try to be conservative beforehand or reduce resources afterwards.
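
As an illustration, here is a hedged sketch of a Type 1 job: the executor resources it requests are allocated on the cluster, not on the stat box that submits it. The app name and values below are made up, not recommendations.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example-cluster-job")             # hypothetical job name
    .config("spark.executor.memory", "8g")      # per-executor memory, on the cluster
    .config("spark.executor.cores", "4")        # per-executor cores, on the cluster
    .config("spark.driver.memory", "2g")        # the driver can still run on the stat box
    .getOrCreate()
)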

I think these are reasonable expectations for our software engineers: they should be able to throw jobs at the stat boxes without much thought about resource usage. If a job asks for too many resources, it should fail cleanly without locking up the host. Our job as SREs is to implement this. ;)
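
One low-ceremony way to approximate that today, assuming nothing beyond stock systemd (the 4G cap and script name are arbitrary placeholders), is to launch a local job in a transient scope with an explicit memory ceiling:

# Hypothetical usage: an over-budget job gets OOM-killed cleanly
# instead of wedging the whole host.
systemd-run --user --scope -p MemoryMax=4G python my_training_job.py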

Update: per this Slack thread, @fkaelin helped us get a reproducer:

Log in to stat1011, then:

import polars as pl

# Eagerly load the whole test CSV into memory (roughly 50 GB once loaded).
df = pl.read_csv("/home/fab/page_history_test_small_csv_single/combined.csv")

Before I disabled NUMA, this step would freeze up the machine with high sys/wait utilization, without ever "giving data" to the user, i.e. usr utilization would stay near 0. After that, loading could complete without freezing the box (and the df is in memory, e.g. you can get random access to any of the 981170283 rows). It uses 50 GB, about 40% of the available memory.
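
For anyone re-running this, a quick way to confirm the frame's footprint, assuming a reasonably recent polars release (which provides estimated_size):

# Sanity checks on the loaded frame.
print(df.height)                # row count; ~981170283 in this reproducer
print(df.estimated_size("gb"))  # in-memory size; ~50 GB here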

Now, running a simple aggregation on the dataframe should only use memory (beware: if you try this, be ready to kill the PID, as it will freeze the box):

# Deduplicate on the wiki_db column; df is an eager DataFrame, so
# unique() runs immediately (no .collect() call is needed or available).
df.unique("wiki_db")

However, this strangely leads to a similar issue to the one described above: first the wait % spikes, then the sys % eventually hovers in the 90s.
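
To watch the usr/sys/wait split described above while the reproducer runs, a small helper like the following works; psutil is an assumption here, not part of the original reproducer:

# Hedged helper: print the CPU-time split once per second for 30 seconds.
# Requires psutil (pip install psutil).
import psutil

for _ in range(30):
    t = psutil.cpu_times_percent(interval=1)  # averaged over each 1 s window
    print(f"usr={t.user:5.1f}%  sys={t.system:5.1f}%  iowait={t.iowait:5.1f}%")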

Thanks to @BTullis for pointing out this Puppet code. I now believe that this code, not NUMA, was causing the hosts to seize up at 50% RAM utilization. Because of the large gap between MemoryHigh (when the system starts to aggressively reclaim memory) and MemoryMax (when it actually kills the process), the hosts could enter a state from which they were unable to recover. Turning off NUMA helped, but did not fix the root cause.

Since this patch was merged, we've set MemoryMax and MemoryHigh to the same value, 95%. This reserves 5% of memory for system processes (which rarely use more than 2%). Since the values are now the same, there shouldn't be much thrashing, just a clean kill of the offending user process. User jobs will also be able to use almost twice as much memory without affecting overall system stability. Note that this does not prevent resource contention between user processes; it just (hopefully) stops the freezing.
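
In systemd terms, the new policy amounts to something like the slice drop-in below. This is a sketch: the real change lives in Puppet, and the file path is illustrative.

# e.g. /etc/systemd/system/user-.slice.d/99-memory.conf (hypothetical path)
[Slice]
# Reclaim pressure and the OOM kill now trigger at the same threshold,
# so there is no thrash window between MemoryHigh and MemoryMax.
MemoryHigh=95%
MemoryMax=95%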

bking claimed this task.
bking updated the task description.

The AC has been satisfied, so I'm closing out this ticket. Work to improve the Jupyter/Analytics client experience continues in T378735...