Quoth @CDanis:
Several times a month, someone asks on Slack for help with runaway processes on a (sic) stat hosts. Usually, the system will be heavily overcommitted on RAM and stuck in a livelock spin cycle...
To improve the user experience when this happens, we added oomd to the stat hosts via this Puppet change.
Specifically, the hope is that oomd will take care of killing processes automatically so that users do not need to interrupt their work to ping SREs when the hosts are under memory contention.
Creating this ticket to:
- Attempt to trigger oomd on the stat hosts
- Record results. Does oomd have a Prometheus exporter and/or logs we can examine via Logstash?
- Follow up as necessary. If it works: communicate the change to stat host users. If it doesn't: decide whether to tweak oomd settings, or ignore it and wait until we've reimaged on a newer Debian OS and can use the newer and more popular systemd-oomd for the same purpose.
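
For the first two items, a rough sketch of a test harness might look like the following. It is only an illustration, not something that has been run on the stat hosts. It assumes the kernel exposes PSI at /proc/pressure/memory (Linux 4.20+ with PSI enabled) and that the deployed oomd configuration acts on memory pressure; neither has been verified here. The function names, the 64 MiB chunk size, and the default 256 MiB total are all made up for the example.

```python
import sys
import time
from pathlib import Path

# Assumption: PSI is enabled on the host (Linux 4.20+). Not verified for stat hosts.
PSI_MEMORY = Path("/proc/pressure/memory")
PAGE_SIZE = 4096
MIB = 1024 * 1024


def parse_psi(text):
    """Parse PSI text into {"some": {"avg10": ..., ...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def allocate(total_mib, chunk_mib=64):
    """Allocate total_mib MiB in chunks, touching each page so it is really resident."""
    chunks = []
    remaining = total_mib
    while remaining > 0:
        size = min(chunk_mib, remaining)
        buf = bytearray(size * MIB)
        # Write one byte per page; untouched zero pages may never be committed.
        for offset in range(0, len(buf), PAGE_SIZE):
            buf[offset] = 1
        chunks.append(buf)
        remaining -= size
    return chunks


if __name__ == "__main__":
    # Hypothetical usage: python3 hog.py <total MiB> (default is deliberately small).
    total_mib = int(sys.argv[1]) if len(sys.argv) > 1 else 256
    held = []
    for _ in range(0, total_mib, 64):
        held.extend(allocate(min(64, total_mib - 64 * len(held))))
        if PSI_MEMORY.exists():
            psi = parse_psi(PSI_MEMORY.read_text())
            print(f"held={len(held) * 64} MiB some.avg10={psi['some']['avg10']}")
        time.sleep(1)
```

If something like this is used, it should start with a small total and be stepped up gradually, since the point is to see whether oomd kills the hog before the host livelocks. Whether the kill shows up in logs or metrics is still the open question from the list above.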