
OOM livelock stalls
Open, Stalled, MediumPublic

Description

@dcaro asked me to file a task about this well-known issue in the Linux kernel after I ranted about it on IRC.

When demand for memory on a system exceeds the amount which is physically available, Linux will enter a livelock state, in which everything becomes pathologically slow. Not only the request causing the issue, but also metrics, sshd, rsyslogd, etc. The whole system can stall for ~30 minutes, so severely that it is indistinguishable from hardware failure.

I researched this issue. Since around 2018, engineers at Facebook have made progress in analysing and addressing it.

Swap

Chris Down suggests enabling swap. This helps to reduce excessive I/O due to reclaimed file caches, by allowing rarely-used anonymous pages to be swapped out instead. The descent into a stall is slower and more gentle, although the OOM killer may be invoked later.

I think I, and a lot of other older engineers, emerged from the era of spinning-rust swap and inadequate RAM, scarred by the experience and resolving never to repeat it. Once we had enough RAM, we turned off swap and hoped never to have to deal with it again. Chris Down does a good job of explaining why swap is better than it used to be and why it was never really the problem in the first place.
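For posterity, the mechanics of enabling it are simple. A sketch (the file path, size, and swappiness value here are illustrative choices, not recommendations):

```
# /etc/fstab -- hypothetical entry for a swap file created beforehand,
# e.g. with: fallocate -l 4G /swapfile; chmod 600 /swapfile; mkswap /swapfile
/swapfile none swap sw 0 0

# /etc/sysctl.d/99-swappiness.conf -- optional tuning; 60 is the kernel default.
# Lower values bias reclaim toward file caches, higher values toward swapping
# out anonymous pages.
vm.swappiness = 60
```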

oomd

Facebook engineers implemented improved metrics for detecting a livelock situation in the kernel, known as pressure stall information (PSI). They then implemented a userspace daemon called oomd (later adapted into systemd as systemd-oomd), which watches these metrics and acts on them by killing cgroups.
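The kernel exposes these metrics in /proc/pressure/memory. A minimal sketch of what such a daemon watches, assuming the documented PSI text format; the 10% threshold and the decision logic here are illustrative, not oomd's actual policy:

```python
# Parse pressure stall information (PSI) for memory, as exposed by the
# kernel in /proc/pressure/memory. Lines look like:
#   some avg10=1.23 avg60=0.50 avg300=0.10 total=12345
#   full avg10=0.45 avg60=0.20 avg300=0.05 total=6789
# "some" = at least one task stalled on memory; "full" = all tasks stalled.

def parse_psi_line(line):
    """Split one PSI line into its kind ('some'/'full') and a dict of
    the avg10/avg60/avg300/total fields as floats."""
    kind, *fields = line.split()
    values = {}
    for field in fields:
        key, _, raw = field.partition("=")
        values[key] = float(raw)
    return kind, values

def memory_under_pressure(psi_text, avg10_threshold=10.0):
    """Return True if the 'full' avg10 stall percentage exceeds the
    threshold -- roughly the kind of condition an oomd-style daemon
    would act on (threshold is an arbitrary example)."""
    for line in psi_text.splitlines():
        kind, values = parse_psi_line(line)
        if kind == "full" and values["avg10"] > avg10_threshold:
            return True
    return False

sample = (
    "some avg10=45.30 avg60=20.10 avg300=5.00 total=123456\n"
    "full avg10=27.50 avg60=11.20 avg300=2.30 total=65432"
)
print(memory_under_pressure(sample))  # a 27.5% "full" stall exceeds the 10% threshold
```

In a real daemon this would be read periodically from /proc/pressure/memory (per cgroup, under /sys/fs/cgroup/.../memory.pressure), with the kill decision based on sustained pressure rather than a single reading.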

Facebook says that after deploying this service, they "have seen 30-minute livelocks completely disappear."

According to the documentation, for this to work properly, swap should be enabled, and work should be divided into separate cgroups so that not too many processes are killed at a time.
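With systemd-oomd, the per-cgroup policy is set via resource-control directives in a unit drop-in. A sketch (the unit name, file path, and 50% limit are illustrative):

```
# /etc/systemd/system/some.service.d/oomd.conf  (hypothetical path)
[Service]
# Let systemd-oomd kill this unit's cgroup when its memory pressure
# stays above the limit for the monitoring window.
ManagedOOMMemoryPressure=kill
ManagedOOMMemoryPressureLimit=50%
```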

Application memory limits

Aside from the innovations by Facebook, I want to note that systemd and cgroups give us new ways to deal with this old problem. It is possible to use a systemd unit override file to set a memory limit for a service process and its descendants. This allows us to reserve some memory for essential system services like sshd and sssd (which may otherwise be killed as I noted in T324934).

On wsexport-prod02, I tried adding the following file at /etc/systemd/system/php8.2-fpm.service.d/limit.conf:

[Service]
MemoryMax=85%
OOMPolicy=continue
Restart=on-failure

In my testing, this prevented a full-system stall.

Event Timeline

Focusing on the swap part of the problem, for posterity:

I think it's a valid point for backend/async processing systems or systems that have a lot of noisy neighbours and are not latency-critical.

I don't think it holds any ground for systems involved in live responses or which have strict latency requirements in general.

For instance, enabling swap might save a database from being OOM killed, but it will slow the database to a halt, basically making it unusable for serving live requests.

That holds even more true for large load-balanced pools of applications (like mediawiki or the other services we run) where simply swapping doesn't make sense because our goal is not to prevent OOMs completely, but to prevent any kind of latency degradation.

I also want to note that on Kubernetes, memory is mostly managed by the k8s scheduler on top of the kernel's, so that we never overcommit memory: we OOM-kill containers (which are nothing more than cgroups) to make sure we never do. The k8s scheduler also takes care of never allocating the portion of memory we reserve for system components, which we already do today.

So what Tim is suggesting, using systemd to reserve a portion of memory for the system, is surely something I would recommend we look into.

@tstarling thanks for the task! :) I was sitting a couple of desks away from Chris in London when he wrote that post xd, it circulated widely among production engineering.

To clarify, this task is to request enabling it on CloudVPS instances by default, or to enable it in wiki production machines? (or both?)

For CloudVPS instances, where the usage is really heterogeneous, I think it might be worth it, yes. As the article says, it really depends on the workload and memory usage of the application running there, but for generic applications and servers running a mixture of services, having some swap might be useful performance-wise.

Now, for the OOM side, we can play with oomd and similar, but it might be more than what we can do at the CloudVPS level by default, and it needs tuning by the user of the platform, as they have to define the cgroups to match the services and such. So it's a half-solution in general, but having that option as a user would be nice, I think, so we can add some support for users to enable it if they want (though, as that's not used on the wikiland side, the modules would be maintained solely for cloud, so there are fewer eyes and less usage).

I don't think it holds any ground for systems involved in live responses or which have strict latency requirements in general.

For instance, enabling swap might save a database from being OOM killed, but it will slow the database to a halt, basically making it unusable for serving live requests.

That's not true, according to the article I linked by Chris Down. If a database server is completely out of memory, it's better to swap than to go into a livelock, a state so severely broken that power-cycling is seen as a reasonable solution. But either way, the problem is that you're running out of memory, not that you have swap.

Under light memory pressure, enabling swap allows more RAM to be used for file caches, improving performance.

That holds even more true for large load-balanced pools of applications (like mediawiki or the other services we run) where simply swapping doesn't make sense because our goal is not to prevent OOMs completely, but to prevent any kind of latency degradation.

The whole point of the article is that enabling swap reduces latency. He goes into a lot of detail about why this is. With spinning rust swap, he says, the benefit is smaller. With SSDs, the benefit is larger.

To clarify, this task is to request enabling it on CloudVPS instances by default, or to enable it in wiki production machines? (or both?)

I wanted to share my thoughts with both production and CloudVPS teams. You can use this task as a parent task for tasks proposing specific actions aimed at reducing OOM livelock stalls. I would suggest enabling swap on CloudVPS instances by default.

Now, for the OOM side, we can play with oomd and similar, but it might be more than what we can do at the CloudVPS level by default, and it needs tuning by the user of the platform, as they have to define the cgroups to match the services and such. So it's a half-solution in general, but having that option as a user would be nice, I think, so we can add some support for users to enable it if they want (though, as that's not used on the wikiland side, the modules would be maintained solely for cloud, so there are fewer eyes and less usage).

On wsexport-prod02 at least, it seems every service is in its own cgroup already (P58166). Systemd is doing that by default. So maybe it would just work without cgroup tuning. When there is an OOM related to a web request, the whole webserver would be killed, rather than just the request, but maybe that's OK as long as it's set to restart automatically. Better than a livelock, right?

wsexport-prod02 is a reliable OOM generator, and I'd be happy to test it there once we have swap.

I don't think it holds any ground for systems involved in live responses or which have strict latency requirements in general.

For instance, enabling swap might save a database from being OOM killed, but it will slow the database to a halt, basically making it unusable for serving live requests.

That's not true, according to the article I linked by Chris Down. If a database server is completely out of memory, it's better to swap than to go into a livelock, a state so severely broken that power-cycling is seen as a reasonable solution. But either way, the problem is that you're running out of memory, not that you have swap.

Under light memory pressure, enabling swap allows more RAM to be used for file caches, improving performance.

What the article said and what I said are not in contrast. The article (which is not revolutionary and doesn't explain anything new, btw...) just says that the slowdown is more severe if you have no swap and you end up under extreme memory pressure.

What I stated is that even a minor slowdown in a database serving live traffic will quickly devolve into a situation where the database server is unusable. If anything, having it crash more quickly will do us a good service.

We're more ok with having to take more drastic measures once a database is out of rotation, like a hard reboot, than to have prolonged periods of time in which it responds very slow.

That holds even more true for large load-balanced pools of applications (like mediawiki or the other services we run) where simply swapping doesn't make sense because our goal is not to prevent OOMs completely, but to prevent any kind of latency degradation.

The whole point of the article is that enabling swap reduces latency. He goes into a lot of detail about why this is. With spinning rust swap, he says, the benefit is smaller. With SSDs, the benefit is larger.

Again, the article talks about reducing the penalty incurred because of memory pressure, not avoiding it. When you're ok with losing a server, taking it out of rotation and just rebooting it, prolonging the slowdown with swap is doing yourself a disservice.

I fully agree that for cloud VPS and for single-server services enabling swap makes sense (and it made sense before we had fast disks too).

I'll give this one more go since I think we're pretty close to a shared understanding. Here's what I think the graph of latency versus memory pressure looks like:

oom sketch T358634.png (1×1 px, 88 KB)

I think you are saying we should stay on the left hand side of the line labelled "file cache pressure", at which the kernel begins to reclaim less recently used file caches or swap out anonymous pages. At this point, performance begins to degrade since caches which are still occasionally used are evicted in favour of more recently used pages.

I am saying that there is no point on the chart at which disabling swap is a rational choice, so we may as well enable it.

What I stated is that even a minor slowdown in a database serving live traffic will quickly devolve into a situation where the database server is unusable.

I think many of the database servers are already experiencing file cache pressure and would already benefit from having swap. 58 of 126 servers in the codfw mysql cluster have less than 1% free memory, indicating that they have already evicted file caches. (P58272) They are already past the divergence point in my illustration.

Evicting file caches is just normal operation. The site is not down.
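The "less than 1% free" check can be read straight off /proc/meminfo. A sketch with made-up sample numbers (not taken from the actual paste in P58272):

```python
# Estimate how much memory is genuinely free on a host from a
# /proc/meminfo-style sample. A low MemFree alongside a low
# MemAvailable suggests file caches have already been evicted.
# The numbers below are invented for illustration.

SAMPLE = """\
MemTotal:       131657024 kB
MemFree:          901120 kB
MemAvailable:    2514944 kB
Cached:          1310720 kB
"""

def parse_meminfo(text):
    """Map each /proc/meminfo key to its value in kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        info[key] = int(rest.split()[0])
    return info

def free_percent(info):
    return 100.0 * info["MemFree"] / info["MemTotal"]

def available_percent(info):
    return 100.0 * info["MemAvailable"] / info["MemTotal"]

info = parse_meminfo(SAMPLE)
print(round(free_percent(info), 2))       # under 1% free
print(round(available_percent(info), 2))  # little reclaimable cache left
```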

If anything, having it crash more quickly will do us a good service.

By "crash" do you mean the whole system becomes unresponsive, stops reporting metrics, stops allowing SSH logins, stops writing to the syslog? I would say that such a situation is not conducive to timely debugging of root causes.

Better to have swap in/out metrics be elevated while the server is still responsive enough to figure out what is going on.

jijiki changed the task status from Open to Stalled. Apr 2 2025, 11:46 AM
Andrew triaged this task as Medium priority. Apr 2 2025, 2:04 PM
fgiunchedi subscribed.

I'm untagging cloud VPS since this doesn't seem to be cloud vps specific, feel free to tweak as needed though