I was looking at the Toolforge alerts that are firing and noticed that ToolforgeKubernetesCapacity has been in warning for memory for ~2 days. I'm wondering, though, whether we're in actual trouble and whether any action is warranted at this time. Specifically, looking at the linked dashboard, requests are at 80% while actual usage is closer to ~40%.
Description
Comment Actions
Memory is tricky: if your job hits its limit, or the host has no more free memory (which can happen if we overcommitted and more than one job uses more than its request), then the pod is killed, and most tools (especially scheduled jobs) are not designed with that in mind.
I see some ways forward (just ideas, not exclusive):
- Reduce the default request
- Increase the number of workers (next ones should have 2x the mem iirc)
- Find a better signal to alert on
  - Iirc, something we talked about was using the number of pods in pending state (as in, waiting for allocation). This helps with overall cluster usage, though on the reservation side only.
  - Another signal might be the number of pods killed by OOM (e.g. to notice pods that have memory peaks). This would give an idea of the actual usage/limit side.
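
To make the two candidate signals above a bit more concrete, here is a minimal Python sketch. It assumes pod objects shaped like the JSON the Kubernetes API returns (for example from `kubectl get pods -A -o json`); the function names and the sample data are hypothetical, and in practice these counts would more likely come from the cluster's metrics than from a script like this.

```python
from typing import Any

Pod = dict[str, Any]


def count_unschedulable(pods: list[Pod]) -> int:
    """Count Pending pods whose PodScheduled condition is False,
    i.e. pods still waiting to be placed on a node (reservation side)."""
    total = 0
    for pod in pods:
        status = pod.get("status") or {}
        if status.get("phase") != "Pending":
            continue
        for cond in status.get("conditions") or []:
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                total += 1
                break
    return total


def count_oom_killed(pods: list[Pod]) -> int:
    """Count containers whose current or previous termination reason
    is OOMKilled (actual usage/limit side)."""
    total = 0
    for pod in pods:
        status = pod.get("status") or {}
        for cs in status.get("containerStatuses") or []:
            for key in ("state", "lastState"):
                terminated = (cs.get(key) or {}).get("terminated")
                if terminated and terminated.get("reason") == "OOMKilled":
                    total += 1
                    break  # count each container at most once
    return total


# Invented sample data, for illustration only.
SAMPLE_PODS: list[Pod] = [
    {"status": {"phase": "Pending",
                "conditions": [{"type": "PodScheduled", "status": "False",
                                "reason": "Unschedulable"}]}},
    {"status": {"phase": "Running",
                "containerStatuses": [
                    {"state": {"running": {}},
                     "lastState": {"terminated": {"reason": "OOMKilled"}}}]}},
    {"status": {"phase": "Running",
                "containerStatuses": [{"state": {"running": {}}, "lastState": {}}]}},
]

if __name__ == "__main__":
    print("unschedulable pods:", count_unschedulable(SAMPLE_PODS))
    print("OOM-killed containers:", count_oom_killed(SAMPLE_PODS))
```

The first count only reflects the reservation side (pods that cannot be placed given current requests), while the second reflects the usage/limit side, so the two would probably need separate thresholds.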
Comment Actions
Thank you @dcaro for the pointer to T404726! I went through it again and it was a good read. I'm resolving this one in favor of T414513: Add new alerts for Toolforge cluster high load, and will follow up there.
