
Proposal: simplify set up of a new load-balanced service on kubernetes
Open, Medium, Public

Description

To deploy a new service in production on kubernetes right now, there is a set of things that need to be done, marked as [SRE] or [service owner] in the lists below.

Group A: deployment of the service on kubernetes

  1. set up the appropriate set of values in helmfiles in deployment-charts [service owner] (a sketch follows this list)
  2. set up the user token/credentials and other private data in the private puppet repository [SRE]
  3. set up the corresponding namespace on all clusters [SRE]
  4. deploy the service to all clusters [service-owner]
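
As an illustration of step 1, a values file in deployment-charts might look roughly like this (the path and keys below are made up for the example; the real chart defines the actual schema):

```yaml
# helmfile.d/services/new-service/values.yaml -- hypothetical path and keys
main_app:
  image: wikimedia/new-service     # example image name
  version: "2019-11-22-000000"     # example tag
  port: 8080
resources:
  replicas: 4
```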

Group B is all SRE actions.

Group B: Setting up LVS (load balancer)

  1. Add the new service to every kube worker in conftool-data and in discovery (see the sketch after this list)
  2. Add the service IP to all the kube workers on loopback
  3. Add the service to DNS records (both normal and discovery)
  4. Fill in the service metadata to change the LVS configuration
  5. Restart all the relevant LVS
  6. Switch the monitoring of these endpoints to critical: true in puppet to add paging
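
To make step 1 concrete, the conftool-data change amounts to listing the new service against every kube worker; a rough sketch (the real file layout may differ):

```yaml
# conftool-data sketch; the layout is approximated for the example
eqiad:
  kubernetes:
    kubernetes1001.eqiad.wmnet: [new-service]
    kubernetes1002.eqiad.wmnet: [new-service]
    # ...and so on for every kube worker, in every cluster,
    # which is part of what makes this step tedious
```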

While optimizing or eliminating any of these steps would be nice, the biggest time-sink is without doubt the LVS setup. It's long, complicated, failure-prone, and only a handful of SREs are confident with the process.

How can we make that process better? In the remainder of this task I'll describe a few possible approaches.

Set up an ingress

This means allowing service owners to set up kubernetes resources specifically tailored to load-balancing and routing traffic coming from outside the cluster. It provides a pretty simple interface to manage externally. Unlike the other proposals below, this is an L7 load balancer, meaning it understands TLS, virtual hosts, HTTP, etc.
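
For illustration, a minimal Ingress object of the kind a service owner could ship in their chart might look like this (the host, names and port are invented for the example):

```yaml
# Minimal Ingress sketch; all names are hypothetical
apiVersion: networking.k8s.io/v1beta1   # or extensions/v1beta1 on older clusters
kind: Ingress
metadata:
  name: new-service
  namespace: new-service
spec:
  rules:
    - host: new-service.svc.eqiad.wmnet
      http:
        paths:
          - path: /
            backend:
              serviceName: new-service
              servicePort: 8080
```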

Depending on where the implementation software is installed, we have the following two paths. Both are valid approaches.

Inside the kubernetes cluster

It means a request coming from the public would go as follows:

client => LB (pybal) => kube worker => kube-proxy (not a real hop, but does DNAT) => ingress => pods

Outside the kubernetes cluster

client => LB (pybal) => ingress => pods

This would add one node to the proxying chain, and more moving parts. We would need to investigate Ingress solutions once more, as we did in T170121, which was ~2.5 years ago. Things have changed since then.

How would Group B actions look
  1. Add a cname to the ingress.
  2. Add some monitoring/alerting

That's it: a simple puppet patch and a simple dns patch. For very large services, maybe add a per-namespace setup.

Pros/cons

pros:

  • Integrated with kubernetes
  • Industry "standard"
  • The rest of the infrastructure is left unmodified
  • L7 functionality.

cons:

  • More moving parts
  • Ingress quality/stability at scale needs to be evaluated.
  • Yet more complexity in our charts
  • Adding a potential SPOF (that is in some respects addressable) as well as a potential chokepoint.
  • Essentially HTTP only. Specific implementations may support more protocols, but overall the Ingress resource wasn't designed for this.

Modify pybal to autoconfigure pools from k8s

We already have a pybal patch ensuring we can fetch which workers are active from k8s instead of etcd, and we could expand it further to read /all/ the data pybal needs from k8s, including pods. We would still need a consistent way to add IPs to the load balancers and the k8s nodes, but that can mostly be done with some additional improvements [citation needed].

The flow of requests would be client => LB (pybal) => pod (via kube-proxy)

How would Group B actions look
  1. Add the dns record for the new service
  2. Add the realserver IP on the kubernetes workers and the load balancers
  3. Let pybal add the configuration when the service is properly annotated (see the sketch below).
  4. Add monitoring/alerting.

Three relatively simple patches (one to DNS, two to puppet). Some coordination is needed.
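
To sketch what "properly annotated" could mean: pybal could watch Services carrying an agreed-upon annotation and build its pool configuration from them. The annotation keys and values below are invented for the example:

```yaml
# Sketch of a Service pybal could auto-discover; annotation keys and
# values are invented for this example
apiVersion: v1
kind: Service
metadata:
  name: new-service
  namespace: new-service
  annotations:
    pybal.wikimedia.org/service-ip: "10.2.2.42"   # hypothetical LVS service IP
    pybal.wikimedia.org/scheduler: "wrr"
spec:
  type: NodePort
  selector:
    app: new-service
  ports:
    - port: 8080
      nodePort: 30842   # pybal would balance traffic to this port on each worker
```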

Pros/cons

pros:

  • no change to our current setup
  • Known unknowns. Pybal is mostly 'boring'
  • LVS-DR

cons:

  • Still an invented-here solution
  • Not a fully automated service addition; we would still need to add IPs to the backends somehow.
  • Will need significant dev effort
  • Lack of L7 support

kube-proxy + bird

In this hypothesis, we'd have kube-proxy doing all the load-balancing, with the LVS IPs announced directly via bird.

In this case, we'd have the simplest request flow:
client => kube-proxy => pod

Here we would configure some BGP daemon depending on which IPs we have configured on k8s, and run it as a sidecar of kube-proxy. One complication is that calico relies on running bird on each kubernetes node, so we'd either have to set up kube-proxy+bird outside the cluster (mostly ending up resembling LVS), or we would need to figure out how to augment calico's bird configuration if we want to host it on the workers. This is essentially a variant of the pybal approach above.
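
To make the hypothesis concrete: kube-proxy already sets up DNAT for a Service's externalIPs on every node, so the missing piece is only the BGP announcement of those IPs. A sketch (all names and IPs are invented):

```yaml
# Sketch: kube-proxy programs DNAT rules for externalIPs on every worker;
# a bird/BGP daemon alongside it would then announce 10.2.2.42/32 upstream.
apiVersion: v1
kind: Service
metadata:
  name: new-service
  namespace: new-service
spec:
  selector:
    app: new-service
  ports:
    - port: 8080
  externalIPs:
    - 10.2.2.42
```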

How would Group B actions look
  1. Add the dns record for the new service
  2. Add the realserver IP on the kubernetes workers (this can probably be automated by using annotations in the k8s api, but is it worth it?)
  3. Add monitoring/alerting.

Three relatively simple patches (one to DNS, two to puppet).

Pros/cons

pros:

  • Fewest hops for a request
  • No additional moving parts besides bird
  • Overall the simplest configuration

cons:

  • Unknown cost of working on a solid BGP announcement system.
  • Might need additional configuration to know which IPs to serve
  • Lack of L7 support
  • No LVS-DR

Refactor all the setup of LVS across dns and puppet

It's probably possible to simplify the steps for setting up a load-balanced service by rationalizing the puppet code around it (for example, by synchronizing systems across the various stages, or by allowing a new service to be added in a single patch rather than three different ones).

pros:

  • no new technology would be introduced in production

cons:

  • No clear implementation idea
  • Might never achieve a fully streamlined solution

Event Timeline

Joe triaged this task as Medium priority. Nov 22 2019, 9:42 AM
Joe created this task.

First, thank you for getting the ball rolling on this proposal! A question: are all the proposed approaches targeting group B actions only, or would some of them also tackle group A? Also, I think it would be helpful if the approaches (maybe only the most promising ones?) had an outline of what the group B actions would turn into.

From my POV there's great value in having a single solution for load balancing across the infra (i.e. pybal + ipvs-dr today) in terms of routine operations (e.g. pool/depool) and cognitive load (all services are load balanced the same way).

Since we'll keep provisioning/decommissioning load balanced services (that live both inside and outside k8s) via puppet, I think it makes sense to focus on a combination of "refactor all the setup of lvs across dns and puppet" and "modify pybal to autoconfigure pools from k8s". The idea being to bring closer together what we're currently doing for load balanced services in puppet and the abstractions/ideas that k8s has.

Joe added a comment. Edited Nov 26 2019, 2:46 PM

> From my POV there's great value in having a single solution for load balancing across the infra (i.e. pybal + ipvs-dr today) in terms of routine operations (e.g. pool/depool) and cognitive load (all services are load balanced the same way).

In practice, pybal does little for kubernetes nodes; it's mostly just adding a level of indirection. Most real load-balancing is actually performed by kubernetes itself inside the cluster. But I get your point in general.

> Since we'll keep provisioning/decommissioning load balanced services (that live both inside and outside k8s) via puppet, I think it makes sense to focus on a combination of "refactor all the setup of lvs across dns and puppet" and "modify pybal to autoconfigure pools from k8s". The idea being to bring closer together what we're currently doing for load balanced services in puppet and the abstractions/ideas that k8s has.

My hope is to detach what lives in k8s completely from puppet in the long run, except perhaps for some configuration files whose data can be filled in from puppet. My point is (more or less) that pybal adds no value to load-balancing across the pods in the kubernetes world, so we might as well look for alternatives.

Joe added a comment. Nov 28 2019, 7:12 AM

Also to clarify further: Pybal does none of the meaningful load-balancing. Load-balancing between pods is done by kube-proxy in any case.

The reason why eliminating another layer of load-balancing is attractive is that, under the guise of uniformity, the way things work in kubernetes-land is always different anyway.

mark added a subscriber: mark. Dec 10 2019, 11:43 AM

I agree - it seems that PyBal adds no real value here, because it's essentially load balancing the k8s load balancers. Why couldn't our caching layer do that directly, and know about all the k8s proxies/nodes directly and do health checks for them?

As Alex pointed out on IRC, the main thing PyBal adds right now is BGP announcement of the service IPs. But this could easily be replaced by any other BGP daemon like Calico itself. And note that this is probably not needed for external traffic (coming in through our caching layer), and merely used by internal services (outside k8s?).

> I agree - it seems that PyBal adds no real value here, because it's essentially load balancing the k8s load balancers.

That's true.

> Why couldn't our caching layer do that directly, and know about all the k8s proxies/nodes directly and do health checks for them?

Up to now it's mostly internal services: applications talk directly to services in kubernetes, and few things go via the caching layer.

> As Alex pointed out on IRC, the main thing PyBal adds right now is BGP announcement of the service IPs.

Yup. That, and depooling a kubernetes node from the backends of said service IPs in case of failure. Which, as pointed out below, can happen at the BGP level as well.

> But this could easily be replaced by any other BGP daemon like Calico itself. And note that this is probably not needed for external traffic (coming in through our caching layer), and merely used by internal services (outside k8s?).

True. We could investigate how to configure calico to announce those IPs, if possible (but we'd have to upgrade calico to a newer, supported version first).

mark added a comment. Edited Dec 10 2019, 12:01 PM

> True. We could investigate how to configure calico to announce those IPs, if possible (but we'd have to upgrade calico to a newer, supported version first).

https://github.com/projectcalico/calico/issues/1008 seems relevant.

Keep in mind that with a naive BGP announcement, one of the nodes would 'win' and receive all that traffic (just like one PyBal node does now). There are ways to spread the traffic (essentially anycast), but that will come with other drawbacks/problems.

>> True. We could investigate how to configure calico to announce those IPs, if possible (but we'd have to upgrade calico to a newer, supported version first).

> https://github.com/projectcalico/calico/issues/1008 seems relevant.

Yup, but this is for v3.4.0. We are at 2.2.0, hence my comment that we first need to upgrade calico.
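
For reference, in newer Calico the advertisement would look roughly like this (a sketch based on upstream docs; v3.4 exposed this via an environment variable on calico/node, while later releases use a BGPConfiguration resource, so the exact shape depends on the version we upgrade to):

```yaml
# Rough sketch of service IP advertisement in a newer Calico release;
# field names follow upstream's BGPConfiguration, values are examples
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  serviceClusterIPs:
    - cidr: 10.96.0.0/12    # example cluster service CIDR
  serviceExternalIPs:
    - cidr: 10.2.2.0/24     # example LVS-style service IP range
```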

> Keep in mind that with a naive BGP announcement, one of the nodes would 'win' and receive all that traffic (just like one PyBal node does now). There are ways to spread the traffic (essentially anycast), but that will come with other drawbacks/problems.

Good point. Perhaps ECMP would work too, but again with the obvious drawbacks.

ema moved this task from Triage to LoadBalancer on the Traffic board. Dec 13 2019, 8:46 AM