We have been hitting the limit a couple of times over the last few days, so we should expand the cluster a bit.
We might also consider using a bigger VM flavor for the new workers, to give bigger jobs a better chance of running.
Things to clarify:
- Which flavor of nodes is hitting the limit
- How many workers to add
- What flavor/VM size to use
- Is it only CPU, or also memory? (Should we change the CPU/memory ratio for the worker VMs?)
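To help answer the CPU-vs-memory question, a minimal sketch of Prometheus recording rules, assuming kube-state-metrics v2 metric names are available (the group and rule names here are placeholders); graphing these per node flavor over the last few days should show which resource and which flavor is actually filling up. Older kube-state-metrics versions expose the same data as the _cpu_cores / _memory_bytes metric variants instead.

```yaml
# Sketch only: group and rule names are placeholders.
groups:
  - name: capacity-investigation
    rules:
      # Fraction of each node's allocatable CPU already claimed by requests.
      - record: node:cpu_requests:allocatable_ratio
        expr: |
          sum by (node) (kube_pod_container_resource_requests{resource="cpu"})
          /
          sum by (node) (kube_node_status_allocatable{resource="cpu"})
      # Same ratio for memory, to tell which resource fills up first.
      - record: node:memory_requests:allocatable_ratio
        expr: |
          sum by (node) (kube_pod_container_resource_requests{resource="memory"})
          /
          sum by (node) (kube_node_status_allocatable{resource="memory"})
```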
Limits proposal
For defaults:
- cpu/request -> 100m (applied already)
- cpu/limit -> 1cpu (applied already)
- memory/request -> 512Mi (current value)
- memory/limit -> 512Mi (current value)
For user-set values (they can only specify --cpu or --mem):
- cpu/request and cpu/limit = user-set value
- mem/request and mem/limit = user-set value
(request and limit are both set to the same user-provided value; the defaults above could be enforced with a LimitRange, sketched below)
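As a sketch of how the defaults could be enforced at the namespace level, a LimitRange like the one below (the namespace and object names are placeholders) fills in the 100m / 1 CPU and 512Mi values for any container that doesn't set its own. The user-set case would still have to be written into the pod spec by our tooling, since a LimitRange only applies when a field is missing.

```yaml
# Sketch only: metadata.name and namespace are placeholders.
apiVersion: v1
kind: LimitRange
metadata:
  name: worker-defaults
  namespace: user-workloads
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when the container sets no request
        cpu: 100m
        memory: 512Mi
      default:             # applied when the container sets no limit
        cpu: "1"
        memory: 512Mi
```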
Alerts proposal
- page: If users can't schedule workloads
  - measured by something like:
    sum(kube_pod_status_phase{job="k8s-pods", prometheus="k8s", phase="Pending"}) / sum(kube_pod_status_phase{job="k8s-pods", prometheus="k8s"}) > 0.1
- page: If users' workloads are being widely killed
  - measured by the increase of kube_pod_container_status_terminated_reason over time (e.g. a sustained peak; exact values to tweak with experience)
- warning: If the overall cluster load (CPU/memory used) is very high for a long time
  - measured over the span of a day; if either of them goes above 80%, the recommendation is to double-check and scale the cluster up (rule sketches for all three alerts below)
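A rough sketch of the three alerts as a Prometheus rule file, reusing the pending-pods expression above. The alert names, thresholds, `for` durations, severity labels, and the OOMKilled filter on the termination-reason metric are all placeholders to tune with experience, and the cluster-load expressions assume node-exporter metrics are scraped.

```yaml
# Sketch only: names, thresholds, durations and label filters are starting points.
groups:
  - name: user-workload-alerts
    rules:
      # Page: users can't schedule workloads (too large a share of Pending pods).
      - alert: TooManyPendingPods
        expr: |
          sum(kube_pod_status_phase{job="k8s-pods", prometheus="k8s", phase="Pending"})
          /
          sum(kube_pod_status_phase{job="k8s-pods", prometheus="k8s"})
          > 0.1
        for: 15m
        labels:
          severity: page
      # Page: workloads are being widely killed. Rough proxy: the metric is a
      # per-reason gauge, so the expression and threshold need tuning with experience.
      - alert: WorkloadsBeingKilled
        expr: |
          sum(increase(kube_pod_container_status_terminated_reason{reason="OOMKilled"}[30m])) > 10
        for: 30m
        labels:
          severity: page
      # Warning: overall cluster CPU or memory usage above 80% for a day.
      - alert: ClusterLoadHigh
        expr: |
          (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
          or
          (1 - sum(node_memory_MemAvailable_bytes) / sum(node_memory_MemTotal_bytes)) > 0.8
        for: 24h
        labels:
          severity: warning
```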
