
toolforge: new k8s: evaluate ingress controller reload behaviour
Closed, Resolved · Public

Description

@Bstorm has some concerns about the reload times of our current ingress implementation (nginx-ingress) in the new k8s cluster:

My biggest concern about the migration is that an ingress for each web service will cause problems in the controller on reload. At that point, we need to either use dynamic proxy + calico routing or a different ingress that scales better (supposedly haproxy and Traefik might). The scaling I worry about is purely the reload time. It might get to a point where the reload time is very long and leads to some kind of downtime. It may also be no problem at all!

then:

Also I realized we cannot dynamically autoscale the controllers without something like ECMP Anycast routing using Calico and labeling (which would basically turn the services into a legit basic load balancer; if it can't be made dynamic enough, then just statically scale it out a bit and still use ECMP Anycast across them). So overall, I think testing calico BGP magic at the proxy level in toolsbeta will be a good idea. I might jump on that, but maybe we should try creating a bunch of tools with Bryan's script there *first* and see how the ingress responds with more web services running and restarting, etc. You could jump on that in the next couple days even if you want.

I suggest the first step is to actually measure how bad reload times get with a large number of services and ingresses.

Actionables:

  • measure config reload times
  • check whether web traffic is still served during a reload
  • check pod scaling behaviour for nginx-ingress

Event Timeline

aborrero triaged this task as Medium priority.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Some initial simple tests with a single nginx-ingress pod and fake ingress objects that point to nowhere (non-existent svc/endpoint); a sketch of how such objects can be bulk-created follows the list:

  • with 100 ingress objects, a config reload takes ~0.5s
  • with 1k ingress objects, a config reload takes ~2s
  • with 5k ingress objects, a config reload takes ~10s
  • with 10k ingress objects, a config reload takes ~20s
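
A minimal sketch of how such throwaway ingress objects can be bulk-created (the names, namespace, host pattern and backend service below are made up for illustration; adjust the count in seq as needed):

# Sketch: bulk-create N fake Ingress objects pointing at a non-existent backend.
# All names below (namespace, host, serviceName) are hypothetical.
for i in $(seq 1 100); do
  cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: fake-ingress-${i}
  namespace: default
spec:
  rules:
  - host: fake-${i}.example.org
    http:
      paths:
      - path: /
        backend:
          serviceName: does-not-exist
          servicePort: 8000
EOF
done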

I measured these reload times by reading log entries:

$ kubectl logs -n ingress-nginx nginx-ingress-5dbf7cb65c-zkdqk | grep reload
[..]
I1129 16:05:44.038412       8 controller.go:133] Configuration changes detected, backend reload required.
I1129 16:06:02.655508       8 controller.go:149] Backend successfully reloaded. 

This suggests reload times grow roughly linearly with the number of ingress objects (around 2 ms per object).

Next I will test whether having more nginx-ingress pods, or an actual destination svc/pod, makes any difference, and also what happens while the config is being reloaded (is ingress traffic affected?).

With an actual destination svc/pod we get similar reload times. I can confirm there is no downtime for current traffic while the reload is taking place, i.e.:

I1129 16:46:03.178284       6 controller.go:133] Configuration changes detected, backend reload required.
192.168.44.192 - [192.168.44.192] - - [29/Nov/2019:16:46:03 +0000] "GET /fourohfour HTTP/1.1" 503 203 "-" "curl/7.64.0" 168 0.000 [tool-fourohfour-fourohfour-8000] [] - - - - 2193fc47bc7a1f042506b319557a2bc0
I1129 16:46:06.154714       6 controller.go:149] Backend successfully reloaded.
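
One simple way to double-check this is to poll an existing tool endpoint in a loop while a reload is triggered and watch for unexpected status codes; a minimal sketch (the URL is a placeholder, not a real tool address):

# Sketch: poll a tool endpoint while the ingress config reloads.
# The URL below is a placeholder.
while true; do
  curl -s -o /dev/null -w '%{http_code}\n' https://tools.example.org/fourohfour/
  sleep 0.5
done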

Regarding the number of nginx-ingress pods, I have some comments. We may not need any autoscaling mechanism. Let's assume we use the magic static number of 5 nginx-ingress pods.

  • If we have very little usage on the cluster, this is no problem. We are not paying anything per use, so we can afford having 5 pods doing nothing. No need to downscale; the magic number works.
  • If we suddenly have a very high webservice load in the cluster, that is indicative of other kinds of problems. We may need to manually scale the cluster itself (workers), not just the number of nginx-ingress pods; adjusting the magic number is just a very minor step in that scaling workflow. So the magic number works in this case too.
  • We currently have 30 worker nodes in the legacy cluster. A magic number of 5 nginx-ingress pods means 1 pod per 6 worker nodes. I think nginx can handle this.

This is to say, investigating fancy autoscaling mechanisms may not be worth it in our particular case.

If we end up using the cluster-proportional autoscaler for DNS, we could consider it for ingress as well, but it sounds like we are going to be in good shape.
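
For the record, adjusting the static replica count is a one-line operation; a sketch, assuming the deployment and namespace names seen in the logs above:

# Scale the ingress controller deployment to the chosen static replica count.
kubectl scale deployment nginx-ingress --replicas=5 -n ingress-nginx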

Mentioned in SAL (#wikimedia-cloud) [2019-12-02T10:34:16Z] <arturo> manually scale nginx-ingress deployment to 5 replicas (T239405)

Oh, some more news. I scaled the deployment to 5 pods and created 20k ingress objects, and the whole thing crashed: I was not able to curl the test tool we have, and I was unable to fetch the ingress-nginx logs. Pods crashed and were re-created by k8s.

Some information about the failure was present in the logs but I had a hard time capturing it:

root@toolsbeta-test-k8s-control-3:~# kubectl logs nginx-ingress-5d586d964b-n5k2c -n ingress-nginx --since=10m -f | grep reload 
I1202 11:46:12.544600       6 controller.go:133] Configuration changes detected, backend reload required.
I1202 11:47:50.292152       6 controller.go:149] Backend successfully reloaded.
I1202 11:48:17.187262       6 controller.go:133] Configuration changes detected, backend reload required.
E1202 11:55:35.832255       6 controller.go:145] Unexpected failure reloading the backend:

I had to inspect files by hand:

{"log":"W1202 11:57:40.872288       7 template.go:120] unexpected error cleaning template: signal: terminated\n","stream":"stderr","time":"2019-12-02T11:57:41.023062045Z"}
{"log":"E1202 11:57:41.029289       7 controller.go:145] Unexpected failure reloading the backend:\n","stream":"stderr","time":"2019-12-02T11:57:41.059781651Z"}
{"log":"\n","stream":"stderr","time":"2019-12-02T11:57:41.060110095Z"}
{"log":"-------------------------------------------------------------------------------\n","stream":"stderr","time":"2019-12-02T11:57:41.06011734Z"}
{"log":"Error: exit status 1\n","stream":"stderr","time":"2019-12-02T11:57:41.060436884Z"}
{"log":"2019/12/02 11:57:41 [emerg] 39#39: \"proxy_next_upstream_timeout\" directive is not allowed here in /tmp/nginx-cfg048620309:1\n","stream":"stderr","time":"2019-12-02T11:57:41.060445243Z"}
{"log":"nginx: [emerg] \"proxy_next_upstream_timeout\" directive is not allowed here in /tmp/nginx-cfg048620309:1\n","stream":"stderr","time":"2019-12-02T11:57:41.060451114Z"}
{"log":"nginx: configuration file /tmp/nginx-cfg048620309 test failed\n","stream":"stderr","time":"2019-12-02T11:57:41.060456615Z"}
{"log":"\n","stream":"stderr","time":"2019-12-02T11:57:41.060461807Z"}
{"log":"-------------------------------------------------------------------------------\n","stream":"stderr","time":"2019-12-02T11:57:41.06046672Z"}

I noticed the worker nodes were totally overwhelmed in terms of CPU (load average: 15.78, 21.53, 19.47). Note the toolsbeta workers are smaller than the tools workers (2 vCPU in toolsbeta vs 4 vCPU in tools).

Will keep investigating.

Good news: after downscaling nginx-ingress to 2 pod replicas and creating 20k ingress objects again, the issue above didn't show up. Our magic number could perhaps be 2.

BTW, the reload time with 20k ingress objects and 2 pods is ~90s:

I1202 12:54:04.426763       6 controller.go:133] Configuration changes detected, backend reload required.
I1202 12:55:30.994381       6 controller.go:149] Backend successfully reloaded.

I just gave this another test: 3 pods with 20k ingress objects. It works pretty well. I'm using that as the replica count for now.
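
The puppet change below amounts to pinning the replica count in the controller Deployment spec; an illustrative fragment only, not the actual diff (apiVersion and names are assumptions based on the logs above):

# Illustrative fragment, not the real modules/toolforge/files/k8s/nginx-ingress.yaml diff.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress
  namespace: ingress-nginx
spec:
  replicas: 3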

Change 556184 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toolforge: new k8s: set up 3 nginx-ingress pod replicas

https://gerrit.wikimedia.org/r/556184

Change 556184 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toolforge: new k8s: set up 3 nginx-ingress pod replicas

https://gerrit.wikimedia.org/r/556184

Mentioned in SAL (#wikimedia-cloud) [2019-12-10T13:59:45Z] <arturo> set pod replicas to 3 in the new k8s cluster (T239405)

@Bstorm please give this a final review.

I'm proposing to use 3 nginx-ingress pod replicas for the initial iteration. We can see how it behaves when we start having actual workload and ingress objects. I'm more or less confident with this magic number.

If you agree with everything, we can close this task and move on.

There's something going on in toolsbeta that is relatively fatal: worker-1 is offline and the pods are having issues.

2m45s       Normal    Scheduled                pod/nginx-ingress-5d586d964b-2tj9b    Successfully assigned ingress-nginx/nginx-ingress-5d586d964b-2tj9b to toolsbeta-test-k8s-worker-2
2m10s       Normal    Pulled                   pod/nginx-ingress-5d586d964b-2tj9b    Container image "docker-registry.tools.wmflabs.org/nginx-ingress-controller:0.25.1" already present on machine
2m7s        Normal    Created                  pod/nginx-ingress-5d586d964b-2tj9b    Created container nginx-ingress-controller
111s        Normal    Started                  pod/nginx-ingress-5d586d964b-2tj9b    Started container nginx-ingress-controller
86s         Warning   Unhealthy                pod/nginx-ingress-5d586d964b-2tj9b    Readiness probe failed: HTTP probe failed with statuscode: 500
83s         Warning   Unhealthy                pod/nginx-ingress-5d586d964b-2tj9b    Liveness probe failed: HTTP probe failed with statuscode: 500
2m45s       Normal    Scheduled                pod/nginx-ingress-5d586d964b-7z9vm    Successfully assigned ingress-nginx/nginx-ingress-5d586d964b-7z9vm to toolsbeta-test-k8s-worker-2
55s         Normal    Pulled                   pod/nginx-ingress-5d586d964b-7z9vm    Container image "docker-registry.tools.wmflabs.org/nginx-ingress-controller:0.25.1" already present on machine
53s         Normal    Created                  pod/nginx-ingress-5d586d964b-7z9vm    Created container nginx-ingress-controller
49s         Normal    Started                  pod/nginx-ingress-5d586d964b-7z9vm    Started container nginx-ingress-controller
3s          Warning   Unhealthy                pod/nginx-ingress-5d586d964b-7z9vm    Readiness probe failed: HTTP probe failed with statuscode: 500
14s         Warning   Unhealthy                pod/nginx-ingress-5d586d964b-7z9vm    Liveness probe failed: HTTP probe failed with statuscode: 500
14s         Normal    Killing                  pod/nginx-ingress-5d586d964b-7z9vm    Container nginx-ingress-controller failed liveness probe, will be restarted
9m11s       Warning   Unhealthy                pod/nginx-ingress-5d586d964b-fzskx    Liveness probe failed: Get http://192.168.44.230:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I'm seeing it in a few places and I'm not sure why. A calico-node daemonset pod also failed the same way. It may just be a bad node somehow? Working on it. It could also just be that this scale is too big for the toolsbeta cluster. I haven't checked tools yet.

Yes, everything is fine in tools. This is an issue with the toolsbeta cluster only :)

Looking at the console of toolsbeta-test-k8s-worker-1, the reason for the issue is obvious:

[3995835.659115] Out of memory: Kill process 4208 (nginx) score 1190 or sacrifice child
[3995835.673139] Killed process 29388 (nginx) total-vm:846988kB, anon-rss:767948kB, file-rss:0kB, shmem-rss:0kB
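
The same kernel messages can usually be retrieved from the node itself without console access, for example:

# On the affected worker node: look for recent OOM-killer activity.
dmesg -T | grep -i 'out of memory'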

So this is kind of cool. We are able to see how it behaves when things are not great.

I'm spinning up a large size node so that it can cope with this scale.

That fixed the cluster :) So that was kind of neat to see and fix.

I will say that the nginx pods are pretty hungry little monsters. We may want to eventually set up a node affinity for them with some reserved workers just for them. For now, it seems like it is all working. Btw, toolsbeta may have been ok when you left it. I started up another tool for testing something else, and that may have been enough to take down a node and start a bad loop of crashing.

I am going to take a look at what the nginx controller pods ask for in terms of CPU and mem. They may be better off if they ask for more to begin with, thus not deploying unless there is more capacity available.

<returns from checking puppet> That's the problem. The deployment doesn't list any resource request. We should update modules/toolforge/files/k8s/nginx-ingress.yaml to include a request for cpu and memory that we guess is correct. That would prevent the pod from being deployed if the minimum request isn't available.

Something like this in the containers spec:

resources:
  requests:
    cpu: "0.5"
    memory: "1Gi"

Hrm. We can't run kubectl top to find out what a container is using because heapster/metrics-server isn't set up. That's interesting, and we really should have it set up. We might be able to get that from the pod or the host directly. Basically, though, if we set this, the scheduler will make sure the pods only land on nodes that can actually accommodate them.
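
Two ways to get those numbers, sketched below: the first assumes metrics-server gets deployed, the second reads usage from the worker node and assumes Docker is the container runtime (as the json log files above suggest):

# Once metrics-server is running in the cluster:
kubectl top pod -n ingress-nginx

# Workaround from the worker node hosting the pods (Docker runtime assumed):
sudo docker stats --no-stream $(sudo docker ps -q --filter name=nginx-ingress)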

Everything else looks awesome, and I think we should be great with three pods in the deployment.

Based on the metrics-server output, it looks like I guessed a pretty good "request" value (and might try a patch) :) That's at rest and at small scale, of course. It wouldn't hurt to give it an upper limit as well, but I don't know what a good limit would be, and maybe that's where the affinity would be smart.
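
If we do want a ceiling later, it would just be a limits block next to the existing requests; a sketch with made-up numbers (real limit values would need load testing):

resources:
  requests:
    cpu: "0.5"
    memory: "1Gi"
  limits:
    # made-up example values; real limits would need load testing
    cpu: "1"
    memory: "2Gi"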

Change 556404 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] nginx-ingress: Have ingress pods request realistic resting resources

https://gerrit.wikimedia.org/r/556404

Change 556404 merged by Bstorm:
[operations/puppet@production] nginx-ingress: Have ingress pods request realistic resting resources

https://gerrit.wikimedia.org/r/556404

I think this can be closed. Please reopen if required.