@dcaro asked me to file a task about this well-known issue in the Linux kernel after I ranted about it on IRC.
When demand for memory on a system exceeds the amount which is physically available, Linux will enter a live-lock state in which everything becomes pathologically slow: not only the request causing the issue, but also metrics, sshd, rsyslogd, etc. The whole system can stall for ~30 minutes, so severely that it is indistinguishable from hardware failure.
I researched this issue. In 2018, engineers at Facebook made progress in analysing and addressing it.
Swap
Chris Down suggests enabling swap. This helps to reduce excessive I/O due to reclaimed file caches, by allowing rarely-used anonymous pages to be swapped out instead. The descent into a stall is slower and more gentle, although the OOM killer may be invoked later.
I think I, and a lot of other older engineers, emerged from the era of spinning rust swap and inadequate RAM scarred by the experience and resolved never to repeat it. Once we had enough RAM, we turned off swap and hoped never to deal with it again. Chris Down does a good job of explaining why swap is better than it used to be and why it was never really the problem in the first place.
oomd
Facebook engineers added improved metrics to the kernel for detecting a live-lock situation, known as pressure stall information (PSI). They then implemented a userspace daemon called oomd, which watches these metrics and acts on them by killing cgroups. systemd ships its own variant, systemd-oomd.
Facebook says that after deploying this service, they "have seen 30-minute livelocks completely disappear."
According to the documentation, for this to work properly, swap should be enabled, and work should be divided into separate cgroups so that not too many processes are killed at a time.
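To make the metric concrete, here is a minimal Python sketch that parses the pressure stall information exposed at /proc/pressure/memory and flags a likely stall. The function names and the threshold of 10% are my own assumptions for illustration; this is not how oomd itself decides what to kill.

```python
from pathlib import Path


def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file into nested dicts.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]  # kind is "some" or "full"
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


def is_stalling(psi, threshold=10.0):
    """Return True if all tasks were stalled on memory for at least
    `threshold` percent of the last 10 seconds.

    The threshold is a hypothetical value chosen for illustration.
    """
    return psi.get("full", {}).get("avg10", 0.0) >= threshold


if __name__ == "__main__":
    path = Path("/proc/pressure/memory")
    try:
        psi = parse_psi(path.read_text())
        print(psi, "stalling" if is_stalling(psi) else "ok")
    except OSError:
        # PSI needs a 4.20+ kernel built with CONFIG_PSI and not disabled at boot.
        print("PSI is not available on this system")
```

The "some" line counts time in which at least one task was stalled on memory, while "full" counts time in which all non-idle tasks were stalled at once, which is closer to the live-lock described above.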
Application memory limits
Aside from innovations by Facebook, I want to note that systemd and cgroups give us new ways to deal with this old problem. It is possible to use a systemd unit override file to set a memory limit for a service process and its descendants. This allows us to reserve some memory for essential system services like sshd and sssd (which may otherwise be killed, as I noted in T324934).
On wsexport-prod02, I tried adding the following file at /etc/systemd/system/php8.2-fpm.service.d/limit.conf:
[Service]
MemoryMax=85%
OOMPolicy=continue
Restart=on-failure
In my testing, this prevented a full-system stall.
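For a sense of how much memory a percentage limit leaves for everything else: systemd documents a percentage MemoryMax as relative to the installed physical memory. The Python sketch below approximates that from MemTotal in /proc/meminfo. The function names are hypothetical, and MemTotal is slightly smaller than the installed RAM, so treat the result as an estimate rather than the exact value systemd writes to the cgroup.

```python
from pathlib import Path


def mem_total_bytes(meminfo_text):
    """Extract MemTotal from /proc/meminfo contents and return it in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The value is reported in kB, which here means units of 1024 bytes.
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def limit_from_percent(total_bytes, percent):
    """Approximate the byte limit for a setting like MemoryMax=85%."""
    return total_bytes * percent // 100


if __name__ == "__main__":
    try:
        total = mem_total_bytes(Path("/proc/meminfo").read_text())
        limit = limit_from_percent(total, 85)
        print(f"limit: {limit} bytes, left for other services: {total - limit} bytes")
    except (OSError, ValueError):
        print("could not read MemTotal from /proc/meminfo")
```

The effective limit can be checked against the memory.max file in the service's cgroup directory on a cgroup v2 system.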
