Some new alerts were proposed in T404726 ([tools,infra,k8s] scale up the cluster, specifically CPU) but were never implemented. I'm splitting this proposal into a new task so I can resolve T404726.
Alerts proposal
- page: If users can't schedule workloads
- measured by something like:
sum(kube_pod_status_phase{job="k8s-pods", prometheus="k8s", phase="Pending"}) / sum(kube_pod_status_phase{job="k8s-pods", prometheus="k8s"}) > 0.1
- page: If users' workloads are being widely killed
- measured by the increase of kube_pod_container_status_terminated_reason over time (e.g. if there's a sustained peak; thresholds to be tweaked with experience)
- warning: If the overall cluster load (CPU/memory used) stays very high for a long time
- measured over the span of a day: if it goes over 80% for any of those, with the recommendation to double-check and scale the cluster up
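
As a starting point, the first alert above could be written as a Prometheus alerting rule roughly like the sketch below. The expression is the one from the proposal; the group name, alert name, `for` duration, `severity` label value and annotation text are all placeholders I made up, and would need to match whatever conventions the existing alerts use.

```yaml
groups:
  - name: k8s_workload_scheduling  # hypothetical group name
    rules:
      - alert: KubernetesPodsPendingRatioHigh  # hypothetical alert name
        # Fraction of pods in the Pending phase, as proposed above.
        expr: |
          sum(kube_pod_status_phase{job="k8s-pods", prometheus="k8s", phase="Pending"})
            /
          sum(kube_pod_status_phase{job="k8s-pods", prometheus="k8s"})
            > 0.1
        for: 15m  # assumed; only fire if the ratio stays high
        labels:
          severity: page  # assumed label name and value for paging alerts
        annotations:
          summary: "More than 10% of pods are Pending, users may be unable to schedule workloads"
```

The `for` clause is there so that a short burst of Pending pods (for example during a deploy) does not page; the 15m value is a guess and should be tuned with experience, like the other thresholds.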