notebook1004 (and probably other notebook servers) keeps running out of memory every once in a while when a user runs some large jobs, for example R jobs. (There was a comment that R's approach to memory management is not very efficient.)
When it runs out of memory this typically kills nagios-nrpe-server and that leads to all the monitoring checks via NRPE being broken which leads to IRC spam like this:
18:52 <+icinga-wm> PROBLEM - dhclient process on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
18:52 <+icinga-wm> PROBLEM - MD RAID on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
18:52 <+icinga-wm> PROBLEM - puppet last run on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
18:52 <+icinga-wm> PROBLEM - Check systemd state on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
18:52 <+icinga-wm> PROBLEM - configured eth on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
18:52 <+icinga-wm> PROBLEM - DPKG on notebook1004 is CRITICAL: connect to address 10.64.36.107 port 5666: Connection refused
Restarting nagios-nrpe-server leads to recovery, but moments later the same thing happens again.
In this specific case I notified users with "echo | wall", and that actually worked, but it seems we need a permanent solution: quotas or some other way to ensure users can't use all of the memory.
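One possible direction (a sketch only, not tested on these hosts and depending on the systemd version and cgroup setup in use): cap per-user memory with a systemd slice drop-in, and lower NRPE's OOM score so the kernel prefers to kill user processes instead of the monitoring agent. The file paths and the 80% value below are illustrative, not a recommendation.

```ini
# /etc/systemd/system/user-.slice.d/memory-limit.conf (illustrative path)
# Caps the memory of each user's session slice via cgroups.
# MemoryMax= needs cgroup v2; on cgroup v1 the equivalent is MemoryLimit=.
[Slice]
MemoryAccounting=yes
MemoryMax=80%

# /etc/systemd/system/nagios-nrpe-server.service.d/oom.conf (illustrative path)
# Makes the OOM killer strongly prefer other processes over NRPE.
[Service]
OOMScoreAdjust=-900
```

After dropping in files like these one would run `systemctl daemon-reload` and restart the affected units; whether the `user-.slice` template drop-in applies to already-running sessions also depends on the systemd version, so this would need verification on stretch.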