
Add new alerts for Toolforge cluster high load
Open, Medium, Public

Description

Some new alerts, specifically for CPU, were proposed in T404726: [tools,infra,k8s] scale up the cluster, but never implemented. Splitting that proposal into a new task so I can resolve T404726.

Alerts proposal

  • page: If users can't schedule workloads
    • measured by something like:
sum (kube_pod_status_phase{job="k8s-pods", prometheus="k8s", phase="Pending"}) / sum (kube_pod_status_phase{job="k8s-pods", prometheus="k8s"}) > 0.1
  • page: If users' workloads are being widely killed
    • measured by the increase of kube_pod_container_status_terminated_reason over time (e.g. alert on a sustained peak; thresholds to be tuned with experience)
  • warning: If the overall cluster load (CPU/memory used) is very high for a long time
    • measured over the span of a day: if either of those goes over 80%, alert with the recommendation to double-check and scale the cluster up
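
For reference, the three proposals above could be sketched as Prometheus alerting rules roughly like this. This is only a sketch under assumptions: the group and alert names, the `for` durations and the termination threshold of 20 are placeholders made up for illustration, not agreed values. The node_* expressions assume node-exporter metrics are available in the same Prometheus, which this task does not confirm, and the job/prometheus selectors are copied from the query above and may need adjusting depending on where the rules are evaluated. Since kube_pod_container_status_terminated_reason is a gauge, the sketch compares it with its value one hour earlier instead of using increase().

```yaml
groups:
  - name: toolforge_k8s_capacity            # hypothetical group name
    rules:
      # page: users can't schedule workloads (query taken from the proposal)
      - alert: ToolforgeKubernetesPodsPending   # hypothetical alert name
        expr: |
          sum(kube_pod_status_phase{job="k8s-pods", prometheus="k8s", phase="Pending"})
            /
          sum(kube_pod_status_phase{job="k8s-pods", prometheus="k8s"})
            > 0.1
        for: 15m                              # placeholder duration
        labels:
          severity: page
        annotations:
          summary: More than 10% of pods are in Pending state

      # page: workloads are being widely killed.
      # Counts containers terminated for any reason other than "Completed"
      # and compares with the count one hour ago; "or vector(0)" keeps the
      # subtraction defined when one side has no series.
      - alert: ToolforgeKubernetesPodsKilled    # hypothetical alert name
        expr: |
          (sum(kube_pod_container_status_terminated_reason{job="k8s-pods", prometheus="k8s", reason!="Completed"}) or vector(0))
            -
          (sum(kube_pod_container_status_terminated_reason{job="k8s-pods", prometheus="k8s", reason!="Completed"} offset 1h) or vector(0))
            > 20                              # placeholder threshold, to tune with experience
        for: 30m                              # placeholder duration
        labels:
          severity: page
        annotations:
          summary: Many more containers are in a terminated state than one hour ago

      # warning: average CPU or memory usage over the last day above 80%
      # (assumes node-exporter metrics; averaged with a 1d subquery)
      - alert: ToolforgeKubernetesHighLoad      # hypothetical alert name
        expr: |
          avg_over_time((1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))[1d:5m]) > 0.8
          or
          avg_over_time((1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes))[1d:5m]) > 0.8
        labels:
          severity: warning
        annotations:
          summary: Cluster CPU or memory usage averaged over 80% for a day, double-check and consider scaling up
```

The warning uses a one-day average rather than `for: 1d`, so a short dip below 80% does not reset it; either choice would fit the "measured over a day" wording.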

Event Timeline


Following up from T419674: ToolforgeKubernetesCapacity alert actionability

Memory is tricky: if a job hits its limit, or the host runs out of free memory (because we overcommitted and more than one job uses more than its request), the pod is killed, and most tools (especially scheduled jobs) are not designed with that in mind.

I see some ways forward (just ideas, not exclusive):

  • Reduce the default request

It would also be interesting, I think, to see what the distribution of tool memory requests vs. actual usage looks like, to understand what a good default request would be.
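
One possible way to look at that distribution is sketched below as Prometheus recording rules. It is only a sketch under assumptions: it assumes the cAdvisor metric container_memory_working_set_bytes and the kube-state-metrics metric kube_pod_container_resource_requests are both scraped by the same Prometheus, and the group and rule names are made up. Each rule takes the per-container usage/request ratio and records a quantile of it across the cluster; containers without a memory request drop out of the division.

```yaml
groups:
  - name: toolforge_memory_request_vs_usage   # hypothetical group name
    rules:
      # Median usage/request ratio across all containers with a memory request.
      # A value well below 1 would suggest the default request is generous.
      - record: cluster:container_memory_usage_to_request:median   # hypothetical name
        expr: |
          quantile(0.5,
            sum by (namespace, pod, container) (container_memory_working_set_bytes{container!="", container!="POD"})
              /
            sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="memory"})
          )
      # 95th percentile of the same ratio, to see how far the heavy users go
      # above (or below) what they requested.
      - record: cluster:container_memory_usage_to_request:p95      # hypothetical name
        expr: |
          quantile(0.95,
            sum by (namespace, pod, container) (container_memory_working_set_bytes{container!="", container!="POD"})
              /
            sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="memory"})
          )
```

The same expressions can be run ad hoc first; recording only the quantiles keeps the stored series count at one per rule.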

  • Increase the number of workers (next ones should have 2x the mem iirc)

Given that memory requests creep up over time (75% a month ago, 81% now), and that once we hit 100% we can't schedule new workloads, I think adding more workers for headroom is the right thing to do in the immediate term.
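
As a rough check on the trend: going from 75% to 81% is 6 points in about a month, so a straight-line extrapolation of the remaining 19 points reaches 100% in roughly 19 / 6 ≈ 3 months. A sketch of how that could be tracked in Prometheus follows. It is a sketch under assumptions: it uses the kube-state-metrics metrics kube_pod_container_resource_requests and kube_node_status_allocatable, the rule and alert names are made up, the expression may need filtering down to running pods and worker nodes, and the 30-day window assumes that much retention is available.

```yaml
groups:
  - name: toolforge_k8s_memory_requests       # hypothetical group name
    rules:
      # Fraction of allocatable memory that is already requested by pods.
      - record: cluster:memory_requests_to_allocatable:ratio   # hypothetical name
        expr: |
          sum(kube_pod_container_resource_requests{resource="memory"})
            /
          sum(kube_node_status_allocatable{resource="memory"})
      # Warn when a linear fit over the last 30 days predicts the ratio
      # will pass 100% within the next 30 days.
      - alert: ToolforgeKubernetesMemoryRequestsFillingUp      # hypothetical alert name
        expr: predict_linear(cluster:memory_requests_to_allocatable:ratio[30d], 30 * 24 * 3600) > 1
        for: 1h                                # placeholder duration
        labels:
          severity: warning
        annotations:
          summary: Memory requests are on track to reach 100% of allocatable within 30 days
```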

  • Find a better signal to alert on
    • IIRC, something we talked about was using the number of pods in Pending state (as in, waiting for allocation; this helps with overall cluster usage, though only on the reservation side)
    • Another signal might be counting the number of pods killed by OOM (e.g. to notice pods that have memory peaks; this would give an idea of the actual usage/limit side)

I'm definitely +1 on at least trying both of these and seeing how much signal we get.

Something we can also improve there is that these metrics have high cardinality (they carry the pod, namespace and similar labels), so maybe we want to aggregate them somehow at the Prometheus level. That might also be useful for other reasons unrelated to this task, like storing the count of tools over a very long period.
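
A minimal sketch of what that aggregation could look like as Prometheus recording rules is below. The rule names are made up, the job/prometheus selectors are assumed to match the ones used in the task description for all kube-state-metrics series, and the tool count assumes each tool has its own namespace with a `tool-` prefix and that kube_namespace_created is being scraped.

```yaml
groups:
  - name: toolforge_k8s_aggregates            # hypothetical group name
    rules:
      # Pods per phase, with pod/namespace labels aggregated away;
      # the Pending series is the "waiting for allocation" signal.
      - record: cluster_phase:kube_pod_status_phase:sum                            # hypothetical name
        expr: sum by (phase) (kube_pod_status_phase{job="k8s-pods", prometheus="k8s"})
      # Containers per last termination reason; the OOMKilled series is the
      # "killed by OOM" signal.
      - record: cluster_reason:kube_pod_container_status_last_terminated_reason:sum   # hypothetical name
        expr: sum by (reason) (kube_pod_container_status_last_terminated_reason{job="k8s-pods", prometheus="k8s"})
      # Number of tool namespaces, cheap enough to keep for a long time.
      - record: cluster:toolforge_tools:count                                      # hypothetical name
        expr: count(kube_namespace_created{job="k8s-pods", prometheus="k8s", namespace=~"tool-.+"})
```

Alerts and long-term dashboards could then read the few aggregated series instead of the per-pod ones.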