
[k8s,infra] scale up coredns replicas
Open, Medium, Public

Description

While investigating T333922: [k8s,infra] k8s control plane freezing and other stability issues, we noticed that the Toolforge Kubernetes coredns pods are using quite a bit of CPU on the control plane nodes. To spread out that load, we want to increase the number of replicas, either by hand or by implementing autoscaling (T239404).

Event Timeline

dcaro renamed this task from toolforge: scale up coredns replicas to [k8s,infra] scale up coredns replicas. Jun 13 2024, 9:49 AM

We are currently having issues with DNS resolution. I suspect they are not load issues, but let me try scaling up manually and see if that helps (if so, we can set up autoscaling).
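For reference, a minimal sketch of what the manual scale-up could look like as a patch. The namespace `kube-system` and the deployment name `coredns` are assumptions (upstream kubeadm defaults, not confirmed in this task), and the helper names are made up for illustration. The script only builds the patch body and the `kubectl patch` command line; it does not run anything.

```python
import json

# Assumed names: upstream kubeadm defaults, not confirmed in this task.
NAMESPACE = "kube-system"
DEPLOYMENT = "coredns"


def build_replicas_patch(replicas: int) -> str:
    """Return a JSON merge-patch body that sets spec.replicas on a Deployment."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return json.dumps({"spec": {"replicas": replicas}})


def kubectl_patch_command(replicas: int) -> list[str]:
    """Return the kubectl argv that would apply the patch (nothing is executed)."""
    return [
        "kubectl", "--namespace", NAMESPACE,
        "patch", "deployment", DEPLOYMENT,
        "--type", "merge",
        "--patch", build_replicas_patch(replicas),
    ]


if __name__ == "__main__":
    # Print the command for review before running it by hand.
    print(" ".join(kubectl_patch_command(4)))
```

Keeping the change as a reviewable patch makes it easy to carry the same value over into whatever manages the deployment, if manual scaling turns out to help.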

CPU usage lowered from ~0.8 to ~0.5, running tests

Tests seem to be passing \o/

So it might have been load, though it was only using 0.8 CPU and had no limit. Maybe we have to increase the CPU request (0.1 CPU): it is currently using <0.6 CPU (with 4 replicas), so k8s might throttle it too. I think doing both would be best (increase the request, and increase the replicas or autoscale).
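A rough sketch of why the request and the replica count interact if we go the autoscaling route. It uses the replica formula from the Kubernetes HorizontalPodAutoscaler documentation, desired = ceil(current × utilization / target), where CPU utilization is measured relative to the request. The numbers are assumptions: 0.6 CPU is read as per-pod usage (the upper bound mentioned above), the 80% target and the 0.5 CPU request are hypothetical, and the HPA tolerance and stabilization windows are ignored.

```python
def desired_replicas(current_replicas: int, usage_millicores: int,
                     request_millicores: int, target_percent: int) -> int:
    """HPA-style estimate: ceil(current * utilization / target).

    Utilization is usage relative to the CPU request. Integer millicores
    are used so the ceiling is not affected by floating-point rounding.
    """
    numerator = current_replicas * usage_millicores * 100
    denominator = request_millicores * target_percent
    return -(-numerator // denominator)  # ceiling division


if __name__ == "__main__":
    # Current request (0.1 CPU): 600m usage is 600% of the request.
    print(desired_replicas(4, 600, 100, 80))  # 30
    # Hypothetical request of 0.5 CPU: 600m usage is 120% of the request.
    print(desired_replicas(4, 600, 500, 80))  # 6
```

Under these assumptions, an autoscaler on top of the current 0.1 CPU request would ask for far more replicas than the load seems to justify, which is one more reason to raise the request together with the replica count.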

Oh, tests started intermittently failing on the openapi.json endpoint, like in toolsbeta (failing to fetch it from builds-api):

   < HTTP/1.1 500 Internal Server Error
   < Server: nginx/1.21.0
   < Date: Mon, 26 Aug 2024 08:26:36 GMT
   < Content-Type: application/json
   < Content-Length: 130
   < Connection: keep-alive
   <
   { [130 bytes data]
   100   130  100   130    0     0     25      0  0:00:05  0:00:05 --:--:--    34
   * Connection #0 to host api.svc.tools.eqiad1.wikimedia.cloud left intact
   {
     "detail": "Connection error with backend API while fetching url https://builds-api.builds-api.svc.tools.local:8443/openapi.json."
   }
aborrero triaged this task as Medium priority. Sep 18 2024, 2:40 PM