
Stat hosts: evaluate effectiveness of oomd
Closed, Declined (Public)

Description

Quoth @CDanis:

Several times a month, someone asks on Slack for help with runaway processes on a (sic) stat hosts. Usually, the system will be heavily overcommitted on RAM and stuck in a livelock spin cycle...

To improve the user experience when this happens, we added oomd to the stat hosts via this Puppet change.

Specifically, the hope is that oomd will take care of killing processes automatically so that users do not need to interrupt their work to ping SREs when the hosts are under memory contention.

Creating this ticket to:

  • Attempt to trigger oomd on the stat hosts
  • Record the results. Does oomd have a Prometheus exporter and/or logs we can examine via Logstash?
  • Follow up as necessary. If it works, communicate the change to stat host users. If it doesn't, decide whether to tweak the oomd settings or to leave it alone until we've reimaged onto a newer Debian release and can use the newer, more popular systemd-oomd for the same purpose.
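
One possible way to attempt the trigger is a small script that grabs memory in steps and holds on to it. The sketch below is only an illustration: the function name, chunk size, cap, and pause are all hypothetical, and whether oomd reacts at all depends on how it is configured on the host. The chunks are filled with random bytes on the assumption that incompressible data will put more real pressure on a host that also has compressed swap.

```python
import os
import time

MIB = 1024 * 1024


def allocate_in_steps(chunk_mib=64, max_mib=1024, pause_s=0.5):
    """Allocate random-filled chunks until max_mib is reached, then return them.

    All parameters are illustrative defaults, not values taken from the ticket.
    The caller must keep the returned list alive to keep the memory resident.
    """
    held = []
    allocated_mib = 0
    while allocated_mib + chunk_mib <= max_mib:
        # os.urandom returns fully materialised, incompressible bytes.
        held.append(os.urandom(chunk_mib * MIB))
        allocated_mib += chunk_mib
        # Pause so the growth is gradual and easy to watch in monitoring.
        time.sleep(pause_s)
    return held
```

Calling `allocate_in_steps(max_mib=...)` with a cap chosen for the host, and then keeping the result in a variable while watching the system logs, would show whether the process gets killed. The cap keeps the test from running away if nothing intervenes.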

Related Objects

Status | Subtype | Assigned | Task
Open | None
Declined | None

Event Timeline

Per this Puppet change, we are now allowing up to 50% of physical RAM to be used as zRAM swap. Again quoting cdanis:

zram doesn't consume any RAM until it is actually used, and its compression ratio is generally 2x-3x -- so this is just free.

The hope is that the host will now be able to stash far more anonymous pages in zRAM during periods of memory contention and thus remain more responsive under these conditions.
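
To make the arithmetic behind that quote concrete, here is a rough back-of-the-envelope sketch. It assumes the 50% figure is the uncompressed capacity of the zRAM device and that the whole device fills at a uniform compression ratio. The 512 GiB host size and the function name are hypothetical and used only for illustration.

```python
def zram_estimate(physical_ram_gib, swap_fraction=0.5, compression_ratio=2.0):
    """Estimate zRAM capacity and cost, all values in GiB.

    Returns (uncompressed capacity, RAM consumed when full, net extra capacity).
    Assumes swap_fraction describes the uncompressed size of the zRAM device.
    """
    capacity = physical_ram_gib * swap_fraction
    # Compressed pages still live in RAM, so a full device costs capacity / ratio.
    ram_used_when_full = capacity / compression_ratio
    net_gain = capacity - ram_used_when_full
    return capacity, ram_used_when_full, net_gain


# Hypothetical 512 GiB host at the low end of the quoted 2x-3x range.
print(zram_estimate(512, 0.5, 2.0))  # (256.0, 128.0, 128.0)
```

Under these assumptions, a full device on such a host would hold 256 GiB of pages in 128 GiB of RAM, for a net gain of 128 GiB. An unused device costs essentially nothing, which matches the point made in the quote.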

bking added a subscriber: Gehel.

Per a Slack conversation with @Gehel, I'm closing this out, as it doesn't look like we'll have time to do the comprehensive testing in the near future. We'll reopen if/when time permits.