While investigating T333922 ([k8s,infra] k8s control plane freezing and other stability issues), we noticed that the Toolforge Kubernetes coredns pods are using quite a bit of CPU on the control plane nodes. To spread out that load, we want to increase the number of replicas, either by hand or by implementing autoscaling (T239404).
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T333922 [k8s,infra] k8s control plane freezing and other stability issues
Open | | None | T333934 [k8s,infra] scale up coredns replicas
Open | | None | T239404 [k8s,infra] evaluate DNS (coredns) autoscale options
Event Timeline
We are currently having issues with DNS resolution, though I suspect they are not load issues. Let me try scaling up manually and see if that helps (and if so, we can set up autoscaling).
Tests seem to be passing \o/
So it might have been load, though it was using 0.8 CPU and had no limit. Maybe we have to increase the CPU request (currently 0.1 CPU): it is now using <0.6 CPU (with 4 replicas), so k8s might throttle it too. I think doing both would be best (increase the request + increase replicas/autoscale).
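For reference, a rough sketch of how the request size and the replica count interact if we went with a Horizontal Pod Autoscaler. This is only an illustration, not a decision on the autoscaling option (that is what T239404 is for). It uses the documented HPA rule, desiredReplicas = ceil(currentReplicas * currentMetric / desiredMetric), where CPU utilization is measured as a percentage of the pod's CPU request. Assumptions: the 0.6 CPU figure is read as per-pod usage, the 80% target and the 0.5 CPU request are hypothetical values, and the HPA tolerance band and min/max replica bounds are left out.

```python
import math


def desired_replicas(current_replicas: int,
                     usage_millicores: int,
                     request_millicores: int,
                     target_utilization_pct: int) -> int:
    """Core HPA scaling rule for a CPU utilization target.

    Utilization is usage as a percentage of the CPU *request*, so a
    small request makes the same usage look like a very high utilization.
    Tolerance and min/max replica bounds are intentionally omitted.
    """
    utilization_pct = 100 * usage_millicores / request_millicores
    return math.ceil(current_replicas * utilization_pct / target_utilization_pct)


# Numbers from the comment above: 4 replicas, ~0.6 CPU used, 0.1 CPU request.
# The 80% target is a hypothetical value chosen for illustration.
with_current_request = desired_replicas(4, 600, 100, 80)

# Same usage, but with a hypothetical request of 0.5 CPU.
with_bigger_request = desired_replicas(4, 600, 500, 80)

print(with_current_request, with_bigger_request)
```

With the current 0.1 CPU request the rule asks for many more replicas than with a larger request, which is one reason to adjust the request before (or together with) enabling autoscaling.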
Oh, tests started intermittently failing on the openapi.json endpoint, like in toolsbeta (failing to fetch it from builds-api):
< HTTP/1.1 500 Internal Server Error
< Server: nginx/1.21.0
< Date: Mon, 26 Aug 2024 08:26:36 GMT
< Content-Type: application/json
< Content-Length: 130
< Connection: keep-alive
<
{ [130 bytes data]
100 130 100 130 0 0 25 0 0:00:05 0:00:05 --:--:-- 34
* Connection #0 to host api.svc.tools.eqiad1.wikimedia.cloud left intact
{
  "detail": "Connection error with backend API while fetching url https://builds-api.builds-api.svc.tools.local:8443/openapi.json."
}