Incident
On a recent deploy of patchdemo, the deployment stalled while trying to schedule the new pods. We saw the following error:
Warning FailedScheduling 41m default-scheduler 0/2 nodes are available: 1 Too many pods. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..
We resolved the issue by deleting all of the patchdemo-staging environments, along with the catalyst environments that were more than two weeks old and whose names began with test-. (Very sorry if you were using those 😅)
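The cleanup was done by hand, but it can be sketched as a script. This is an assumption-laden sketch: it assumes each test environment lives in its own namespace named `test-*`, and that the namespace's `creationTimestamp` reflects the environment's age.

```shell
#!/bin/sh
# Sketch: delete catalyst test environments older than two weeks.
# Assumes one namespace per environment, named "test-*" (hypothetical layout).
cutoff=$(date -d '14 days ago' +%s)   # GNU date
kubectl get ns -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' |
while read -r name created; do
  case "$name" in
    test-*)
      # Compare the namespace's creation time against the two-week cutoff.
      if [ "$(date -d "$created" +%s)" -lt "$cutoff" ]; then
        kubectl delete ns "$name"
      fi
      ;;
  esac
done
```

Deleting the namespace removes every pod in it, which is what frees up scheduling slots.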
Configured (default) limits
Our current pod limit for the main k3s node is set to the Kubernetes-recommended default of 110:
kubectl get node k3s -ojsonpath='{.status.capacity.pods}'
110

Implications
When we reach the 110 limit (again), new pods will not be scheduled until some pods are removed. This means that:
- patchdemo or catalyst cannot be upgraded/deployed
- catalyst will be unable to create new demos
- scheduled jobs (like the repo-pool updater and the expiry checker) will not run
Possible remediation
While this limit can be increased, the Kubernetes documentation recommends against doing so and suggests adding a new node to the cluster instead.
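For completeness, here is what raising the limit would look like, with the caveat above that it is the discouraged option. k3s reads kubelet arguments from its config file on the server; the 250 figure is a placeholder, not a recommendation.

```yaml
# /etc/rancher/k3s/config.yaml on the k3s server node.
# Raises the per-node pod limit (discouraged; shown only for completeness).
kubelet-arg:
  - "max-pods=250"   # placeholder value
```

The preferred remediation is joining an additional agent node, which in k3s is the documented install flow pointed at the existing server, e.g. `curl -sfL https://get.k3s.io | K3S_URL=https://<server>:6443 K3S_TOKEN=<token> sh -` (server address and token are placeholders).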