
notebook/stat server(s) running out of memory
Open, Normal, Public

Description

notebook1004 (and probably other notebook servers) keeps running out of memory every once in a while when a user runs large jobs, for example R jobs (there was a comment that R's approach to memory management is not very efficient).

When it runs out of memory, this typically kills nagios-nrpe-server, which breaks all the monitoring checks that run via NRPE and leads to IRC spam like this:

18:52 <+icinga-wm> PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
18:52 <+icinga-wm> PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
18:52 <+icinga-wm> PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
18:52 <+icinga-wm> PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
18:52 <+icinga-wm> PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
18:52 <+icinga-wm> PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused

Restarting nagios-nrpe-server leads to recovery, until moments later the same thing happens again.

In this specific case I notified users with "echo | wall" and it actually worked, but it seems we need a permanent solution, with quotas or some other way to ensure users can't use all of the memory.
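
For reference, that kind of broadcast is a one-liner (the message text here is illustrative):

# Broadcast a warning to everyone currently logged in:
echo "notebook1004 is low on memory - please check/stop your large jobs" | wall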

Event Timeline

Dzahn created this task. · Jan 3 2019, 12:23 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jan 3 2019, 12:23 AM
Dzahn added a comment. · Jan 3 2019, 12:26 AM
Jan  2 22:33:15 notebook1004 kernel: [9646042.221155] R invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0-1, order=0, oom_score_adj=0

Jan  3 00:06:30 notebook1004 bash[4971]: # There is insufficient memory for the Java Runtime Environment to continue.

Jan  3 00:06:33 notebook1004 bash[4971]: # Native memory allocation (mmap) failed to map 44564480 bytes for committing reserved memory.

Jan  3 00:12:20 notebook1004 jupyterhub[22114]:     OSError: [Errno 12] Cannot allocate memory
Milimetric triaged this task as High priority. · Jan 3 2019, 6:03 PM
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric lowered the priority of this task from High to Normal. · Jan 3 2019, 6:10 PM
Milimetric added a subscriber: Milimetric.

A proper fix is to manage resources through containerization (kubernetes), so I'm lowering the priority for now, as the other solutions we could think of are a little hacky.

elukey added a comment. · Jan 3 2019, 6:15 PM

A couple of things that we discussed with the team:

  • this is the same problem that happens on the stat machines: sometimes users are not conservative in their usage of those hosts, consuming all the resources and triggering the OOM killer.
  • the nagios-nrpe-server process could have an adjusted OOM score so that it is not killed when the kernel is reclaiming memory (probably not really feasible to do).
  • Basic ulimits could prevent a process from eating up all the memory (this is usually the main issue); we could start with that and see how it goes (see the sketch after this list).
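
For illustration, here is a minimal sketch of the second and third bullets; the file paths and values are my assumptions, not something already deployed:

# Hypothetical systemd drop-in shielding the NRPE daemon from the OOM killer:
# /etc/systemd/system/nagios-nrpe-server.service.d/oom.conf
[Service]
# -1000 makes the process effectively exempt from OOM-killer selection
OOMScoreAdjust=-1000

# Hypothetical /etc/security/limits.conf entry capping per-process address
# space for all users at 64 GB (the value is in KB):
*    hard    as    67108864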
elukey renamed this task from "notebook server(s) running out of memory" to "notebook/stat server(s) running out of memory". · Jan 3 2019, 6:16 PM
elukey added a project: User-Elukey.

I am pretty ignorant about it, but would cgroups fit this use case? @MoritzMuehlenhoff?

Just as an extra data point: early in the morning of 2019-01-22, nagios-nrpe-server crashed on stat1007 with a "cannot allocate memory" error.

Mentioned in SAL (#wikimedia-operations) [2019-01-24T18:53:15Z] <mutante> notebook1003 - restarted nagios-nrpe-server... T212824

Mentioned in SAL (#wikimedia-operations) [2019-01-25T17:17:02Z] <chaomodus> notebook1003 restarted nagios-nrpe-server due to oom - T212824

Thanks a lot for all the work :)

I am leaning towards having a cgroup that limits CPU/memory usage of all processes to something like 80/85%, so that hopefully people can crunch data without risking triggering the OOM killer. I need to document and learn how to properly apply cgroups (and I am not sure whether we have anything generic/ready-to-use in puppet).

elukey claimed this task. · Jan 26 2019, 8:06 AM
elukey raised the priority of this task from Normal to High.

Quick idea that could alleviate this issue: if we create a cgroup like https://wiki.archlinux.org/index.php/cgroups#Matlab on the stat/notebook hosts, and ask the users to run their scripts either with something like `cgexec -g memory,cpuset:matlab` prepended or via a wrapper (like `wmf-group-exec mycommand -etc..`), we could place all computations in a cgroup that can consume at most 80/85% of RAM, avoiding saturating the host. We could show a warning in the motd that a cgroup needs to be used for computations, and follow up with whoever doesn't do it.
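
Following the Arch wiki example, the sketch could look roughly like this; the group name, the wrapper invocation and the byte value (~54 GB, i.e. roughly 85% of a 64 GB host) are assumptions:

# Hypothetical /etc/cgconfig.conf entry, in the spirit of the Arch wiki's
# Matlab example:
group usercrunch {
    memory {
        memory.limit_in_bytes = 58000000000;
    }
}

# Users would then prepend cgexec to their jobs, e.g.:
cgexec -g memory:usercrunch Rscript my_analysis.R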

elukey added a subscriber: aborrero. · Feb 2 2019, 7:37 PM

After a chat with Chase, it seems that @aborrero has already done a similar thing for the Toolforge hosts. Anything that we can share via puppet? Ideally, in my use case, if a user belongs to certain groups (say analytics-privatedata-users, etc..) their login session + processes should all go into a cgroup (created beforehand) that limits memory usage to something like 85% of the total (to leave space for OS daemons etc.. and hopefully avoid OOM-killer parties). So basically, have users of certain groups create their processes under the same cgroup, limited in memory usage (and maybe CPU cores used?).

Dzahn added a comment (edited). · Feb 3 2019, 12:45 AM

@aborrero has already done a similar thing for the Toolforge hosts. Anything that we can share with puppet?

This one looks interesting:

modules/cgred/init.pp

# Establishing cgroups and enforcing them using cgrules engine
#
# This module is oriented towards the workflow of defining
# cgroups and settings by logical unit and using the
# cgrulesengd to enforce.  The init script for cgrulesengd
# applies cgroups to this effect.
#

class cgred {
    package { [
        'cgroup-bin',
...

@aborrero has already done a similar thing for the Toolforge hosts. Anything that we can share with puppet?

This one looks interesting:
modules/cgred/init.pp

cgred is deprecated. You can find what I did for Toolforge here: T210098: Port cgroup restrictions and definitions to systemd/stretch for bastions, specifically T210098#4773756.

TL;DR: I just created this doc page: https://wikitech.wikimedia.org/wiki/Systemd_resource_control
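
To give a flavour of the kind of control the page covers (my own example, not taken from the page):

# Inspect, then cap, the slice of the user with uid 1000; the value is
# illustrative.
systemctl show -p MemoryLimit user-1000.slice
sudo systemctl set-property user-1000.slice MemoryLimit=8G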

elukey moved this task from Backlog to In Progress on the User-Elukey board. · Feb 4 2019, 5:51 PM
elukey added a comment (edited). · Feb 5 2019, 2:25 PM

@aborrero thanks a lot! As far as I can see the limits are applied to each user separately, but my use case is a bit different: I'd need to put a user's login session (and hence all its processes) into a cgroup, but only if the user belongs to certain groups. Basically I'd like to offer a playground for people to crunch data as they need, but in a way that doesn't saturate the whole machine and lead to OOMs. Do you think that is feasible?

EDIT: I am reading about user-.slice, which I missed at first pass; will see how it works first!
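
If user-.slice really is a template for the per-user user-<uid>.slice units, a drop-in might be all that's needed. A sketch under that assumption (the value is made up, and this requires a systemd recent enough to support drop-ins on slice templates):

# Hypothetical /etc/systemd/system/user-.slice.d/limits.conf
# Applies to every user-<uid>.slice instantiated from the template,
# i.e. a per-user cap. MemoryLimit= is the cgroup-v1 directive;
# on cgroup v2 it would be MemoryMax=.
[Slice]
MemoryLimit=40G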

Change 488077 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Introduce systemd::slice::user

https://gerrit.wikimedia.org/r/488077

Change 488078 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Introduce profile::analytics::cluster::limits::statistics

https://gerrit.wikimedia.org/r/488078

Change 488077 merged by Elukey:
[operations/puppet@production] Introduce systemd::slice::all_users

https://gerrit.wikimedia.org/r/488077

Mentioned in SAL (#wikimedia-operations) [2019-02-12T17:54:12Z] <chaomodus> notebook1003 - restarted nagios-nrpe-server T212824

Change 488078 abandoned by Elukey:
Introduce profile::analytics::cluster::limits::statistics

https://gerrit.wikimedia.org/r/488078

Today I checked notebook1003 using the command systemd-cgls memory, which shows how the cgroups for memory settings are related to each other. Some notes:

  • notebook and spark processes are under the system.slice, and they represent the bulk of the memory consumption of the host.
  • user-.slice is not mentioned (but it is if I run systemd-cgls without memory). IIUC it doesn't represent any cgroup, only a template.

My idea would be the following:

  • create a notebook-users.slice unit that comes after system.slice, and that the notebook systemd units need to include in their config (should be easy via puppet).
  • create memory and cpu limits for notebook-users.slice, which IIUC should apply to all the processes in the slice (so a sort of global limit for notebooks).

In this way I'd have a transparent way to handle a single cgroup that limits notebook usage. I still need to figure out how to make sure that spark processes follow the same path, but that could be a second step. @aborrero does this sound feasible, or did I get systemd slices wrong?
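
Concretely, the plan might translate to something like the following; unit names and the limit are placeholders for discussion, not a tested config:

# Hypothetical /etc/systemd/system/notebook-users.slice
[Unit]
Description=Resource-limited slice for notebook user computations

[Slice]
# ~55G is roughly 85% of a 64G host; on cgroup v2 this would be MemoryMax=
MemoryLimit=55G

# Hypothetical drop-in so the notebook service and its children land there:
# /etc/systemd/system/jupyterhub.service.d/slice.conf
[Service]
Slice=notebook-users.slice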

GTirloni added a subscriber: diego. · Mar 2 2019, 7:14 PM
mforns lowered the priority of this task from High to Normal. · Mar 25 2019, 4:29 PM

This is my bad, I should have followed up on this task. There are more variables to add in here since my last comment; sorry for the extra ping, Arturo :)

diego added a comment. · Mar 25 2019, 4:45 PM

Finally it's not just me squeezing notebooks memory :)

Today I checked notebook1003 using the command systemd-cgls memory, which shows how the cgroups for memory settings are related to each other. Some notes:

  • notebook and spark processes are under the system.slice, and they represent the bulk of the memory consumption of the host.
  • user-.slice is not mentioned (but it is if I run systemd-cgls without memory). IIUC it doesn't represent any cgroup, only a template.

My idea would be the following:

  • create a notebook-users.slice unit that comes after system.slice, and that the notebook systemd units need to include in their config (should be easy via puppet).
  • create memory and cpu limits for notebook-users.slice, which IIUC should apply to all the processes in the slice (so a sort of global limit for notebooks).

In this way I'd have a transparent way to handle a single cgroup that limits notebook usage. I still need to figure out how to make sure that spark processes follow the same path, but that could be a second step. @aborrero does this sound feasible, or did I get systemd slices wrong?

It does sound feasible :-) system.slice, user.slice, etc. are all default slices created by systemd. AFAIK you can create arbitrary additional slices and put stuff there. I don't have any example in our current puppet tree, but it shouldn't be difficult I think.

Sorry all, I overlooked this ping.
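
As a quick way to experiment with the arbitrary additional slices Arturo mentions, without writing unit files (my example, with made-up names and limits):

# The slice and its cgroup are created on the fly and vanish once empty.
sudo systemd-run --slice=crunch-test.slice -p MemoryLimit=4G \
    Rscript my_analysis.R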

elukey moved this task from In Progress to Backlog on the User-Elukey board. · Apr 16 2019, 11:02 AM