Today on bast3002, Prometheus did something (possibly a very large query?) that got it OOM-killed:
Jun 27 20:05:32 bast3002 kernel: [812018.018273] Killed process 3745 (prometheus) total-vm:50336320kB, anon-rss:7298804kB, file-rss:0kB, shmem-rss:0kB
In the meantime, load on bast3002 was 120+ and it was unusable as a bastion.
Maybe we should run Prometheus on these machines inside a memory-limited cgroup? The limit should still be quite high, since Prometheus will sometimes need a lot of RAM, but getting the machine into OOM-killer state isn't ideal.
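
For a rough sense of how high such a limit would need to be, here is a small sketch that converts the sizes in the log line above to GiB. The helper name `parse_oom_kill` is made up for illustration, and it assumes the kernel's "kB" means 1024 bytes. It only reports what the process was using when it was killed; it does not say what the limit should be.

```python
import re

# The kernel log line quoted above, copied verbatim.
OOM_LINE = (
    "Jun 27 20:05:32 bast3002 kernel: [812018.018273] Killed process 3745 "
    "(prometheus) total-vm:50336320kB, anon-rss:7298804kB, file-rss:0kB, "
    "shmem-rss:0kB"
)


def parse_oom_kill(line):
    """Return {field: size in GiB} for every 'name:<n>kB' field in the line.

    Hypothetical helper, not part of any existing tooling.
    """
    fields = re.findall(r"([\w-]+):(\d+)kB", line)
    # Assumption: kernel "kB" is KiB, so 1024 * 1024 kB make one GiB.
    return {name: int(kb) / (1024 * 1024) for name, kb in fields}


sizes = parse_oom_kill(OOM_LINE)
for name, gib in sizes.items():
    print(f"{name}: {gib:.2f} GiB")
```

So Prometheus had roughly 7 GiB resident (anon-rss) and about 48 GiB of virtual address space mapped when it was killed.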