
Proposal: simplify set up of a new load-balanced service on kubernetes
Open, MediumPublic

Description

To deploy a new service in production on kubernetes right now, there is a set of things that need to be done. Each is marked as [SRE] or [service-owner] in the list below.

Group A: deployment of the service on kubernetes

  1. set up the appropriate set of values in helmfiles in deployment-charts [service-owner]
  2. set up the user token/credentials and other private data in the private puppet repository [SRE]
  3. set up the corresponding namespace on all clusters [SRE]
  4. deploy the service to all clusters [service-owner]

Group B is all SRE actions.

Group B: Setting up LVS (load balancer)

  1. Add the new service to every kube worker in conftool-data, discovery
  2. Add the service IP to all the kube workers on loopback
  3. Add the service to DNS records (both normal and discovery)
  4. Fill in the service metadata to change the LVS configuration
  5. Restart all the relevant LVS
  6. Switch the monitoring of these endpoints to critical: true in puppet to add paging

While optimizing / eliminating any of these steps is nice, the biggest time-sink is without doubt the setup of LVS. It's long, complicated, failure prone, and only a handful of SREs are confident around the process.

How can we make that process better? In the remainder of this task I'll describe a few possible approaches.

Set up an ingress

This means allowing service owners to set up kubernetes resources that are specifically tailored to load-balancing and routing traffic coming in from outside the cluster. It provides a pretty simple interface to manage externally. Unlike the other proposals below, this is an L7 load balancer, meaning it understands TLS, virtualhosts, HTTP etc.
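To make that interface a bit more concrete, here is a minimal sketch of the kind of resource a service owner might write, built as a Python dict and printed as JSON (which kubectl accepts as well as YAML). It assumes the `networking.k8s.io/v1` Ingress schema; the resource name, namespace, hostname, backend Service name and port are all hypothetical.

```python
import json


def make_ingress(name, namespace, host, service_name, service_port):
    """Build a minimal Ingress manifest routing one virtual host to one Service."""
    return {
        "apiVersion": "networking.k8s.io/v1",  # assumed API version
        "kind": "Ingress",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": service_name,
                                        "port": {"number": service_port},
                                    }
                                },
                            }
                        ]
                    },
                }
            ]
        },
    }


# Hypothetical values, for illustration only.
ingress = make_ingress(
    "example-service",
    "example",
    "example-service.example.org",
    "example-service-tls",
    4030,
)
print(json.dumps(ingress, indent=2))
```

Which controller would act on such a resource, and where it would run, is exactly what the two paths below are about.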

Depending on where the implementation software is installed we have the following two paths. Both are valid approaches.

Inside kubernetes cluster

It means a request coming from the public would go as follows:

client => LB (pybal) => kube worker => kube-proxy (not a real hop, but does DNAT) => ingress => pods

Outside the kubernetes cluster

client => LB (pybal) => ingress => pods

This would add one node to the chain of proxying, and more moving parts. We would need to investigate Ingress solutions once more, as we did in T170121 about 2.5 years ago. Things have changed since then.

What would Group B actions look like
  1. Add a cname to the ingress.
  2. Add some monitoring/alerting

That's it. A simple puppet patch and a simple dns patch. For very large services, maybe add a per-namespace setup.

Pros/cons

pros:

  • Integrated with kubernetes
  • industry "standard"
  • the rest of the infrastructure is left unmodified
  • L7 functionality.

cons

  • more moving parts
  • Ingress quality/stability at scale needs to be evaluated.
  • yet more complexity in our charts
  • Adding a potential SPOF (that is in some aspects addressable) as well as a potential chokepoint.
  • Essentially HTTP only. Specific implementations may support more protocols, but overall the Ingress resource wasn't designed for this.

Modify pybal to autoconfigure pools from k8s

We already have a pybal patch ensuring we can fetch which workers are active from k8s instead of etcd, but we could expand it further to read /all/ data pybal needs from k8s, including pods. We would still need a consistent way to add IPs to the Load balancers and the k8s nodes, but that can be mostly done with some additional improvements [citation needed].

The flow of requests would be client => LB (pybal) => pod (via kube-proxy)

What would Group B actions look like
  1. Add the dns record for the new service
  2. Add the realserver IP on the kubernetes workers and the load balancers
  3. Let pybal add the configuration when the service is properly annotated.
  4. Add monitoring/alerting.

Three relatively simple patches (one to DNS, two to puppet). Some coordination is needed.
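As a rough illustration of step 3, the sketch below derives load-balancer pools from annotated services and Ready nodes. It is not pybal code: the annotation name `example.org/lvs-pool` is hypothetical, and the input dicts are simplified stand-ins for what would really be read from the k8s API.

```python
POOL_ANNOTATION = "example.org/lvs-pool"  # hypothetical annotation name


def build_pools(services, nodes):
    """Derive load-balancer pools from annotated services and Ready nodes."""
    # Every Ready worker is a backend, since kube-proxy runs on all of them.
    backends = sorted(n["name"] for n in nodes if n.get("ready"))
    pools = {}
    for svc in services:
        annotations = svc.get("annotations", {})
        if annotations.get(POOL_ANNOTATION) != "true":
            continue  # not opted in, so the load balancer ignores it
        pools[svc["name"]] = {
            "ip": svc["service_ip"],
            "port": svc["port"],
            "backends": backends,
        }
    return pools


# Simplified, made-up inputs.
services = [
    {"name": "svc-a", "service_ip": "10.2.2.10", "port": 4030,
     "annotations": {POOL_ANNOTATION: "true"}},
    {"name": "svc-b", "service_ip": "10.2.2.11", "port": 8080,
     "annotations": {}},
]
nodes = [
    {"name": "worker1", "ready": True},
    {"name": "worker2", "ready": False},
    {"name": "worker3", "ready": True},
]
pools = build_pools(services, nodes)
print(pools)
```

Only the annotated service ends up in a pool, and only Ready workers are listed as backends. Adding the service IP to the backends and the load balancers (step 2) is deliberately not covered here.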

Pros/cons

pros:

  • no change to our current setup
  • Known unknowns. Pybal is mostly 'boring'
  • LVS-DR

cons:

  • Still an invented-here solution
  • Not fully automated service addition, will still need to add IPs to the backends somehow.
  • Will need significant dev effort
  • Lack of L7 support

kube-proxy + bird

In this hypothesis, we'd have kube-proxy doing all the load-balancing, and announcing the LVS IPs via bird directly.

In this case, we'd have the simplest request flow:
client => kube-proxy => pod

In this hypothesis, we would configure some BGP daemon depending on which IPs we have configured on k8s, and run it as a sidekick of kube-proxy. One complication is that calico relies on running bird on each kubernetes node, so we'd either have to set up kube-proxy+bird outside the cluster (mostly ending up resembling LVS), or figure out how to augment calico's bird configuration if we want to host it on the workers. This is essentially a variant of the pybal approach above.

What would Group B actions look like
  1. Add the dns record for the new service
  2. Add the realserver IP on the kubernetes workers (this can probably be automated by using annotations in the k8s api, but is it worth it?)
  3. Add monitoring/alerting.

Three relatively simple patches (one to DNS, two to puppet).
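One way to think about "which IPs to announce" is as a filter over the service IPs known to k8s. The sketch below is only that filter, with no BGP involved: the function name and the service range `10.192.76.0/24` are assumptions made for illustration, and the input list stands in for whatever would be read from the k8s API.

```python
import ipaddress


def prefixes_to_announce(cluster_ips, service_range):
    """Return host routes (/32 or /128) for service IPs inside the allowed range."""
    allowed = ipaddress.ip_network(service_range)
    prefixes = set()
    for raw in cluster_ips:
        if raw in (None, "", "None"):
            continue  # headless services have no cluster IP
        ip = ipaddress.ip_address(raw)
        if ip.version == allowed.version and ip in allowed:
            prefixes.add(ipaddress.ip_network(f"{ip}/{ip.max_prefixlen}"))
    return [str(p) for p in sorted(prefixes)]


# Hypothetical range and inputs.
announce = prefixes_to_announce(
    ["10.192.76.197", "None", "192.0.2.5"],
    "10.192.76.0/24",
)
print(announce)  # ['10.192.76.197/32']
```

The resulting list is what a BGP daemon running next to kube-proxy would need as input. How that list gets into bird's configuration is the open question noted in the cons below.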

Pros/cons

pros:

  • least hops for a request
  • No additional moving parts besides bird
  • Overall the simplest configuration

cons:

  • Unknown cost of working on a solid bgp announcement system.
  • Might need additional configuration to know which IPs to serve
  • Lack of L7 support
  • No LVS-DR

Refactor all the setup of LVS across dns and puppet

It's probably possible to simplify the steps to set up a load-balanced service by rationalizing the puppet code around it (for example, synchronizing systems across various stages, or allowing a new service to be added in a single patch rather than in 3 different ones).

pros

  • no new technology would be introduced in production

cons

  • No clear implementation idea
  • Might never achieve a fully streamlined solution

Event Timeline

Joe triaged this task as Medium priority. Nov 22 2019, 9:42 AM
Joe created this task.

First, thank you for getting the ball rolling on this proposal! A question: do all the proposed approaches target group B actions only, or would some also tackle group A? Also, I think it would be helpful if the approaches (or only the most promising ones?) had an outline of what group B actions would turn into.

From my POV there's great value in having a single solution for load balancing across the infra (i.e. pybal + ipvs-dr today) in terms of routine operations (e.g. pool/depool) and cognitive load (all services are load balanced the same way).

Since we'll keep provisioning/decommissioning load balanced services (that live both inside and outside k8s) via puppet, I think it makes sense to focus on a combination of "refactor all the setup of lvs across dns and puppet" and "modify pybal to autoconfigure pools from k8s". The idea being to bring closer together what we're currently doing for load balanced services in puppet and the abstractions/ideas that k8s has.

From my POV there's great value in having a single solution for load balancing across the infra (i.e. pybal + ipvs-dr today) in terms of routine operations (e.g. pool/depool) and cognitive load (all services are load balanced the same way).

In practice, pybal does little for kubernetes nodes and OTOH it's mostly just adding a level of indirection. Most real load-balancing is actually performed by kubernetes itself inside the cluster. But I get your point in general.

Since we'll keep provisioning/decommissioning load balanced services (that live both inside and outside k8s) via puppet, I think it makes sense to focus on a combination of "refactor all the setup of lvs across dns and puppet" and "modify pybal to autoconfigure pools from k8s". The idea being to bring closer together what we're currently doing for load balanced services in puppet and the abstractions/ideas that k8s has.

My hope is to detach what lives in k8s completely from puppet in the long run, except for some configuration files whose data can be filled from puppet. My point is (more or less) that pybal adds no value to load-balancing across the pods in the kubernetes world, so we might as well search for alternatives.

Also to clarify further: Pybal does none of the meaningful load-balancing. Load-balancing between pods is done by kube-proxy in any case.

The reason why eliminating another layer of load-balancing is attractive is because under the guise of uniformity, the way things work in kubernetes-land is always different.

I agree - it seems that PyBal adds no real value here, because it's essentially load balancing the k8s load balancers. Why couldn't our caching layer do that directly, and know about all the k8s proxies/nodes directly and do health checks for them?

As Alex pointed out on IRC, the main thing PyBal adds right now is BGP announcement of the service IPs. But this could easily be replaced by any other BGP daemon like Calico itself. And note that this is probably not needed for external traffic (coming in through our caching layer), and merely used by internal services (outside k8s?).

I agree - it seems that PyBal adds no real value here, because it's essentially load balancing the k8s load balancers.

That's true.

Why couldn't our caching layer do that directly, and know about all the k8s proxies/nodes directly and do health checks for them?

Up to now it's mostly internal services: applications talk directly to services in kubernetes, and few things go via the caching layer.

As Alex pointed out on IRC, the main thing PyBal adds right now is BGP announcement of the service IPs.

Yup. That and depooling a kubernetes node as a backend of said service IPs in case of failure. Which, as pointed out below, can happen at the BGP level as well.

But this could easily be replaced by any other BGP daemon like Calico itself. And note that this is probably not needed for external traffic (coming in through our caching layer), and merely used by internal services (outside k8s?).

True. We could investigate how to configure calico to announce those IPs (but upgrade calico first to a newer and supported version) if possible.

True. We could investigate how to configure calico to announce those IPs (but upgrade calico first to a newer and supported version) if possible.

https://github.com/projectcalico/calico/issues/1008 seems relevant.

Keep in mind that with a naive BGP announcement, one of the nodes would 'win' and receive all that traffic (just like one PyBal node does now). There are ways to spread the traffic (essentially anycast), but that will come with other drawbacks/problems.

True. We could investigate how to configure calico to announce those IPs (but upgrade calico first to a newer and supported version) if possible.

https://github.com/projectcalico/calico/issues/1008 seems relevant.

Yup, but this is for V3.4.0. We are at 2.20, hence my comment that we first need to upgrade calico.

Keep in mind that with a naive BGP announcement, one of the nodes would 'win' and receive all that traffic (just like one PyBal node does now). There are ways to spread the traffic (essentially anycast), but that will come with other drawbacks/problems.

Good point. Perhaps ECMP would work too, though with the obvious drawbacks.
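For readers less familiar with the difference, here is a toy model of per-flow ECMP: the flow's 5-tuple is hashed and mapped onto one of the nodes announcing the same service IP. The hash function, node names and flow values are all made up for illustration; real routers use their own hashing.

```python
import hashlib
from collections import Counter


def pick_next_hop(flow, next_hops):
    """Per-flow ECMP: hash the 5-tuple and map it onto one of the equal-cost next hops."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:8], "big") % len(next_hops)]


# Hypothetical k8s nodes all announcing the same service IP.
next_hops = ["node1", "node2", "node3"]

# 300 flows from different source ports to the same service IP and port.
flows = [("10.0.0.1", 40000 + i, "10.192.76.197", 443, "tcp") for i in range(300)]
spread = Counter(pick_next_hop(f, next_hops) for f in flows)
print(spread)

# Dropping a node changes the modulus, so many existing flows move to another node.
moved = sum(
    pick_next_hop(f, next_hops) != pick_next_hop(f, next_hops[:2]) for f in flows
)
print(moved)
```

A given flow always lands on the same node while the set of next hops is stable, and different flows spread across nodes. With the plain modulo mapping used in this sketch, removing a node also remaps flows that were not on the removed node, which is one of the drawbacks alluded to above.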

Adding https://metallb.universe.tf/ as a potential solution as well.

Would metallb add anything that we can't already get from calico?

Adding https://metallb.universe.tf/ as a potential solution as well.

Would metallb add anything that we can't already get from calico?

It will take some reading and comparing to fully figure out. From a high-level viewpoint, I think we can indeed achieve the same thing with calico. metallb better fits the mental model of using type: LoadBalancer that some may be used to. It also seems to allow sharing IP addresses between multiple services, but I don't see much point in it. The metallb.universe.tf/address-pool tagging scheme seems interesting as well. It does require more moving parts, and it does seem like we could get most of what we want with just calico and no need for MetalLB, but it wouldn't be a fair evaluation of our various alternatives if we did not include it.

Change 681470 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] staging-codfw: Enable masquarade_all

https://gerrit.wikimedia.org/r/681470

Change 681472 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/homer/public@master] Add kubernetes service IP ranges to prefix list

https://gerrit.wikimedia.org/r/681472

Change 681473 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] staging-codfw: Advertise service cluster IPs

https://gerrit.wikimedia.org/r/681473

Change 681470 merged by Alexandros Kosiaris:

[operations/puppet@production] staging-codfw: Enable masquarade_all

https://gerrit.wikimedia.org/r/681470

Change 681473 merged by jenkins-bot:

[operations/deployment-charts@master] staging-codfw: Advertise service cluster IPs

https://gerrit.wikimedia.org/r/681473

Change 681472 merged by jenkins-bot:

[operations/homer/public@master] Add kubernetes service IP ranges to prefix list

https://gerrit.wikimedia.org/r/681472

And with the merge and deploy of the above we got:

akosiaris@deploy1002:~$ kube_env proton staging-codfw
akosiaris@deploy1002:~$ kubectl get svc
NAME                                     TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
chromium-render-production-tls-service   NodePort   10.192.76.197   <none>        4030:4030/TCP   53d
akosiaris@deploy1002:~$ curl -k https://10.192.76.197:4030/_info
{"name":"proton","version":"1.0.0","description":"A service for converting HTML to PDF using headless Chromium","home":"https://github.com/wikimedia/mediawiki-services-chromium-render"}

And in fact, 4030 there isn't necessary. One small edit in the svc and we got

chromium-render-production-tls-service   NodePort   10.192.76.197   <none>        443:4030/TCP   53d

and curl -k https://10.192.76.197/_info works as well
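To spell out what that edit changes, the sketch below renders a service's ports roughly the way the PORT(S) column above shows them, assuming the column is `port:nodePort/protocol` for NodePort services. Only `port` changes and `nodePort` stays at 4030. `targetPort` is not visible in the output above, so it is left out here.

```python
def ports_column(service_type, ports):
    """Render ports roughly the way `kubectl get svc` prints the PORT(S) column."""
    cells = []
    for p in ports:
        proto = p.get("protocol", "TCP")
        if service_type in ("NodePort", "LoadBalancer") and "nodePort" in p:
            cells.append(f"{p['port']}:{p['nodePort']}/{proto}")
        else:
            cells.append(f"{p['port']}/{proto}")
    return ",".join(cells)


before = [{"port": 4030, "nodePort": 4030, "protocol": "TCP"}]
after = [{"port": 443, "nodePort": 4030, "protocol": "TCP"}]

print(ports_column("NodePort", before))  # 4030:4030/TCP
print(ports_column("NodePort", after))   # 443:4030/TCP
```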

As a PoC this works well. There are a couple of things to dive into a bit:

  • Setup ECMP so that load is spread across all nodes
  • Look into switching to "externalTrafficPolicy":"Local" in order to avoid the 2 layer load balancing
  • Simulate node failures and record/evaluate recovery times
  • Figure out whether there is a DNS scheme that would allow for a nice transition to this.
  • Anything else that I'll find during this.

Many thanks to @ayounsi for his help!

Change 681789 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/homer/public@master] Enable per flow ECMP for kubernetes/kubestage

https://gerrit.wikimedia.org/r/681789

Change 681789 merged by jenkins-bot:

[operations/homer/public@master] Enable per flow ECMP for kubernetes/kubestage

https://gerrit.wikimedia.org/r/681789

Very cool!

  • Look into switching to "externalTrafficPolicy":"Local" in order to avoid the 2 layer load balancing

Curious to hear about that. AIUI this only works for type NodePort or LoadBalancer services. Not ClusterIP ones but I'm unsure if it works with a NodePort service talked to via the ClusterIP.

Very cool!

  • Look into switching to "externalTrafficPolicy":"Local" in order to avoid the 2 layer load balancing

Curious to hear about that. AIUI this only works for type NodePort or LoadBalancer services. Not ClusterIP ones but I'm unsure if it works with a NodePort service talked to via the ClusterIP.

That's my understanding as well. It's one of the things I want to test. It will probably help us inform a decision of whether we want to stick with NodePort or not in our charts.
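For the test itself, the change boils down to a small merge patch on the service. The sketch below builds that patch and refuses to do so for service types where, as far as we understand, the field does not apply. The helper name is made up, and the sketch does not answer the open question of whether the policy has any effect on traffic arriving via the ClusterIP.

```python
import json

ALLOWED_TYPES = {"NodePort", "LoadBalancer"}


def local_traffic_policy_patch(service):
    """Return a JSON merge patch setting externalTrafficPolicy to Local.

    Raises ValueError when the service type does not accept the field.
    """
    svc_type = service.get("spec", {}).get("type", "ClusterIP")
    if svc_type not in ALLOWED_TYPES:
        raise ValueError(f"externalTrafficPolicy is not applicable to type {svc_type}")
    return {"spec": {"externalTrafficPolicy": "Local"}}


svc = {
    "metadata": {"name": "chromium-render-production-tls-service"},
    "spec": {"type": "NodePort"},
}
patch = local_traffic_policy_patch(svc)
print(json.dumps(patch))
```

The printed JSON could then be applied with something like `kubectl patch svc chromium-render-production-tls-service --type merge -p '<json>'` on the staging cluster.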

During testing today, we had some sideline issues because calico-node was dying (as we brought down the network interface on one k8s node for testing). This led to two action items/nice to haves:

  • Mark nodes as NotReady when critical DaemonSets are not ready (like calico-node)
    • Unfortunately this is nothing we can do with builtin methods. The discussions about this (Node Readiness Gates) seem to always end with the recommendation to start all nodes with a specific taint that is then removed by the mandatory daemonset or an additional controller (like https://github.com/wish/nodetaint)
  • Run calico-node without exponential backoff on CrashLoop. This removes potentially long wait times until a node comes back up when calico-node has failed a couple of times (i.e. the exponential backoff time is quite high)
  • Simulate node failures and record/evaluate recovery times

We've looked into this with @JMeybohm. We noticed that after a simulated node failure (via ip link set down) the router would take up to 2.5m to retract the routes from the FIB. That created a blackhole in our tests, one that lasted a significant amount of time. Thanks to @ayounsi we replicated the test with ECMP turned off (just in case it skewed the results somehow); the delay is related to the graceful restart nature of the BGP connection between the router and the nodes. This however is a feature of calico (actually bird), not configurable by the looks of it and, most importantly, useful: it allows restarting the calico-node component without messing with the data plane, so traffic continues to flow to a node that is undergoing changes to its calico configuration (e.g. because we just deployed a small config change via helmfile and all calico-nodes require a configuration refresh).

The proper way out of this seems to be BFD, but calico doesn't seem to support it. That being said, it doesn't look that difficult to add. Plus, we already have an existing use case for it in the form of centrallog (which does use BFD already).
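To give a feel for the difference BFD could make, here is a back-of-the-envelope sketch. It uses the simplified asynchronous-mode rule that a peer is declared down after the detect multiplier times the agreed interval has passed without a packet. The 300 ms interval and multiplier of 3 are hypothetical timers, not values we have configured anywhere.

```python
def bfd_detection_time_ms(interval_ms, detect_multiplier):
    """Simplified BFD detection time: agreed interval times the detect multiplier."""
    return interval_ms * detect_multiplier


# "Up to 2.5m" blackhole observed in the graceful restart test above.
observed_blackhole_ms = 2.5 * 60 * 1000

# Hypothetical timers, for illustration only.
example = bfd_detection_time_ms(interval_ms=300, detect_multiplier=3)

print(example)                          # 900
print(observed_blackhole_ms / example)  # how many times longer the observed blackhole was
```

With timers in that range, a dead node would be detected in under a second instead of minutes, while graceful restart could stay in place for planned calico-node restarts.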

Upstream calico issue at https://github.com/projectcalico/calico/issues/4607

I am also working on a PR; I'll post here once it's ready.

PR is at https://github.com/projectcalico/confd/pull/515, waiting for review now. It's been tested locally in a couple of bird containers doing a full mesh with each other.