While investigating T333922 ([k8s,infra] k8s control plane freezing and other stability issues), we noticed that the Toolforge Kubernetes CoreDNS pods are using quite a bit of CPU on the control plane nodes. To spread out that load, we want to increase the number of replicas, either by hand or by implementing autoscaling (T239404).
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | None | T333922 [k8s,infra] k8s control plane freezing and other stability issues |
| Open | | None | T333934 [k8s,infra] scale up coredns replicas |
| Open | | None | T239404 [k8s,infra] evaluate DNS (coredns) autoscale options |
Event Timeline
We are currently having issues with DNS resolution. I suspect they are not load issues, but let me try scaling up manually and see if that helps (if so, we can set up autoscaling).
Tests seem to be passing \o/
So it might have been load after all. It was using 0.8 CPU and had no limit set, so maybe we have to increase the CPU request (currently 0.1 CPU). It is now using <0.6 CPU (with 4 replicas), so k8s might throttle it too. I think doing both would be best (increase the request and increase the replicas/autoscale).
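A rough sketch of what "both things" could look like as a single patch body, in case it helps the discussion. It assumes the kubeadm defaults (a Deployment named `coredns` in `kube-system` whose container is also named `coredns`); the helper name `build_coredns_patch` and the `500m` request value are hypothetical placeholders, not numbers measured here.

```python
import json


def build_coredns_patch(replicas: int, cpu_request: str) -> dict:
    """Return a strategic-merge-patch body for a Deployment.

    Sets the replica count and the CPU request of the container named
    "coredns" (assumed name, the kubeadm default).
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return {
        "spec": {
            "replicas": replicas,
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Strategic merge matches list entries by "name",
                            # so only this container's requests are changed.
                            "name": "coredns",
                            "resources": {"requests": {"cpu": cpu_request}},
                        }
                    ]
                }
            },
        }
    }


if __name__ == "__main__":
    # 4 replicas matches the manual scale-up; "500m" is a placeholder value.
    print(json.dumps(build_coredns_patch(4, "500m"), indent=2))
```

The printed JSON could then be saved to a file and applied with something like `kubectl -n kube-system patch deployment coredns --patch-file <file>`, assuming the deployment and container names above match what is actually running.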
Oh, tests started intermittently failing on the openapi.json endpoint, like in toolsbeta (failing to fetch it from builds-api):
< HTTP/1.1 500 Internal Server Error
< Server: nginx/1.21.0
< Date: Mon, 26 Aug 2024 08:26:36 GMT
< Content-Type: application/json
< Content-Length: 130
< Connection: keep-alive
<
{ [130 bytes data]
100 130 100 130 0 0 25 0 0:00:05 0:00:05 --:--:-- 34
* Connection #0 to host api.svc.tools.eqiad1.wikimedia.cloud left intact
{
"detail": "Connection error with backend API while fetching url https://builds-api.builds-api.svc.tools.local:8443/openapi.json."
}
Kubernetes 1.31 adds support for applying patches to the CoreDNS deployment, which could be used for this: https://github.com/kubernetes/kubeadm/issues/3033