
Experiment with a TLS proxy/router for pods
Closed, Resolved · Public

Description

We would like to have TLS termination (and initiation) right from the start for all services running in our kubernetes clusters. That being said, expecting every service to reliably implement TLS itself (both inbound and outbound) is not the best path forward, since it duplicates a lot of effort, exposes a lot of the rather complicated internals of TLS to the application, and encourages divergence of configurations and codebases. Instead we could use a forward+reverse proxy scheme, where a well-tested and trusted piece of software with a single configuration handles these tasks. The forward part covers outbound traffic from applications to other applications, and the reverse proxy is the usual tlsproxy-like implementation we already have. This is a field still being worked on, with ideas like istio [1], envoy [2] and linkerd [3].

We should experiment with those ideas.

[1] https://istio.io/
[2] https://envoyproxy.github.io/
[3] https://linkerd.io/

Event Timeline

I am experimenting a bit with envoy, and while it's interesting, and probably something we want to implement in the long run, I'm not sure we're happy to delegate routing/health checking to it for service-to-service calls at the moment. It seems both like a good idea and too large a shift in how we do things. Since all of our other services are outside of kubernetes, I'd start by configuring it as a simple HTTP reverse proxy with TLS termination.

The advantage over using nginx for this role is that we'll start getting our feet wet with envoy right away, and we can enable other functions at a later time.
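
To make that concrete, a minimal envoy (v2 API) bootstrap for this role could look roughly like the sketch below; all names, ports and file paths are placeholders rather than our actual setup:

  static_resources:
    listeners:
    - name: ingress_tls
      address:
        socket_address: { address: 0.0.0.0, port_value: 8443 }    # the port exposed outside the pod
      filter_chains:
      - tls_context:                                               # TLS termination happens here
          common_tls_context:
            tls_certificates:
            - certificate_chain: { filename: /etc/envoy/ssl/service.crt }
              private_key: { filename: /etc/envoy/ssl/service.key }
        filters:
        - name: envoy.http_connection_manager
          config:
            stat_prefix: ingress_http
            route_config:
              virtual_hosts:
              - name: local_service
                domains: ["*"]
                routes:
                - match: { prefix: "/" }
                  route: { cluster: local_service }
            http_filters:
            - name: envoy.router
    clusters:
    - name: local_service                                          # plain HTTP to the app in the same pod
      connect_timeout: 0.25s
      type: STATIC
      lb_policy: ROUND_ROBIN
      hosts:
      - socket_address: { address: 127.0.0.1, port_value: 8080 }
  admin:
    access_log_path: /dev/null
    address:
      socket_address: { address: 127.0.0.1, port_value: 9901 }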

To recap my experiments:

  • I built envoy following our container build guidelines, but I am blocked on building it in production as it fails to build when behind a firewall/proxy, see https://github.com/envoyproxy/envoy/issues/2180
  • I tested it with a self-signed cert for TLS termination on my own minikube installation, and it seems to work well (but I just proxied to a local echoserver)
  • I could see the metrics and the admin endpoint, but I'm not sure how we could make use of the admin endpoint, or why we would.
  • I had the time to study the load-balancing/routing that envoy is able to do, and it's on par with (if not superior to) pybal in terms of features. See https://www.envoyproxy.io/docs/envoy/latest/api-v2/cds.proto (the outlier detection part is particularly interesting).
  • I experimented with circuit breaking https://www.envoyproxy.io/docs/envoy/latest/api-v2/cds.proto#circuitbreakers in a very simple form (so, a client being circuit-broken) and it works as advertised; I'm not 100% sure how we would implement it in our case (a config sketch follows this list).
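
For reference, the circuit-breaking part is just a circuit_breakers block on the upstream cluster definition; the thresholds below are made-up numbers, only meant to show the shape of the config:

  clusters:
  - name: upstream_service            # hypothetical upstream service
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_connections: 100          # stop opening new connections past this
        max_pending_requests: 50      # requests allowed to queue waiting for a connection
        max_requests: 200             # concurrent requests (HTTP/2)
        max_retries: 3                # concurrent retries
    hosts:
    - socket_address: { address: upstream.example.internal, port_value: 8080 }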

Overall, while I'm not sold on complex environments like istio (a lot of moving parts, to be honest), envoy seems simple enough in itself to have the potential to be a perfect frontend for our pods.

My 1000ft overview of how a service would work:

Phase 1

  • every service has an envoy sidecar in the pod
  • envoy will take care of exposing the local_service to the rest of kubernetes

This can be easily done by tweaking our current helm chart scaffolding, plus some work for generating TLS certs. This will allow us to have encrypted communications within kubernetes and outside of it (if we want to/can), and give us reliable telemetry.
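
In pod terms, the chart scaffolding would end up rendering something along these lines (a hand-written sketch with made-up names; in reality this would be the pod template of a Deployment):

  apiVersion: v1
  kind: Pod
  metadata:
    name: example-service
  spec:
    containers:
    - name: local-service                 # the application, listening on localhost only
      image: example/local-service:latest
      ports:
      - containerPort: 8080
    - name: envoy                         # TLS-terminating sidecar, the only port exposed to the cluster
      image: example/envoy:latest
      ports:
      - containerPort: 8443
      volumeMounts:
      - name: envoy-config
        mountPath: /etc/envoy
      - name: envoy-tls
        mountPath: /etc/envoy/ssl
        readOnly: true
    volumes:
    - name: envoy-config
      configMap:
        name: example-service-envoy
    - name: envoy-tls
      secret:
        secretName: example-service-tls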

Phase 2

  • every service is configured to reach the remote services it needs via envoy
  • envoy is configured with simple, static routes to the service (within k8s) or the load balancer of the service.
  • envoy is configured with circuit breakers for all the remote services, and for the local service as well.

This will probably be interesting to pursue once we have our first services on kubernetes (so after next quarter).
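
In config terms, phase 2 would add something like the sketch below to the sidecar (names, ports and addresses are invented): a plain-HTTP egress listener on localhost, routing to a statically configured cluster, which is also where the circuit breakers would live:

  static_resources:
    listeners:
    - name: egress_http
      address:
        socket_address: { address: 127.0.0.1, port_value: 6060 }   # only reachable from inside the pod
      filter_chains:
      - filters:
        - name: envoy.http_connection_manager
          config:
            stat_prefix: egress_http
            route_config:
              virtual_hosts:
              - name: remote_service
                domains: ["*"]
                routes:
                - match: { prefix: "/" }
                  route: { cluster: remote_service }
            http_filters:
            - name: envoy.router
    clusters:
    - name: remote_service
      connect_timeout: 0.25s
      type: STATIC                       # "static route": the backend list lives in the config file
      lb_policy: ROUND_ROBIN
      circuit_breakers:
        thresholds:
        - max_connections: 100
          max_pending_requests: 50
      hosts:
      - socket_address: { address: 10.2.1.42, port_value: 10042 }  # k8s service IP or the LVS address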

Phase 3

  • Instead of relying on static things like k8s service IP/port pairs, envoy could interact with the k8s api to maintain a dynamic list of backends, for k8s things, and with another api (pybal's?) for the other services. A sketch of one option follows.
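
A low-effort intermediate step (assuming the remote service has a resolvable DNS name, e.g. a headless k8s Service) would be to let envoy keep re-resolving the backend list itself; fully dynamic discovery against the k8s API or pybal would mean moving to envoy's EDS instead. A sketch of the DNS-based variant:

  clusters:
  - name: remote_service
    connect_timeout: 0.25s
    type: STRICT_DNS                 # envoy re-resolves the name and tracks every returned IP as a backend
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    hosts:
    - socket_address: { address: remote-service.default.svc.cluster.local, port_value: 10042 }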

To recap my experiments:

  • I built envoy following our container build guidelines, but I am blocked on building it in production as it fails to build when behind a firewall/proxy, see https://github.com/envoyproxy/envoy/issues/2180
  • I tested it with a self-signed cert for TLS termination on my own minikube installation, and it seems to work well (but I just proxied to a local echoserver)
  • I could see the metrics and the admin endpoint, but I'm not sure how we could make use of the admin endpoint, or why we would.
  • I had the time to study the load-balancing/routing that envoy is able to do, and it's on par with (if not superior to) pybal in terms of features. See https://www.envoyproxy.io/docs/envoy/latest/api-v2/cds.proto (the outlier detection part is particularly interesting).
  • I experimented with circuit breaking https://www.envoyproxy.io/docs/envoy/latest/api-v2/cds.proto#circuitbreakers in a very simple form (so, a client being circuit-broken) and it works as advertised; I'm not 100% sure how we would implement it in our case.

Nice!

Overall, while I'm not sold on complex environments like istio (a lot of moving parts, to be honest), envoy seems simple enough in itself to have the potential to be a perfect frontend for our pods.

Also a fan of the KISS principle; let's add complexity in small, incremental steps. Kubernetes already adds a ton of complexity for operations, and we want this to remain debuggable, especially at the beginning.

My 1000ft overview of how a service would work:

Phase 1

  • every service has an envoy sidecar in the pod
  • envoy will take care of exposing the local_service to the rest of kubernetes

This can be easily done by tweaking our current helm chart scaffolding, plus some work for generating TLS certs. This will allow us to have encrypted communications within kubernetes and outside of it (if we want to/can), and give us reliable telemetry.

There are a few questions, things like:

  • Should we go with the TLS certs right from the beginning? Maybe it makes sense to avoid them in the very first phase.
  • Should we expose both the service and the envoy endpoint in the beginning? It could make debugging/experimentation easier at first. Later on we can stop the practice, of course.
  • If we do the TLS certs dance, how will we do it? We want it to be as painless as possible.

Phase 2

  • every service is configured to reach the remote services it needs via envoy
  • envoy is configured with simple, static routes to the service (within k8s) or the load balancer of the service.
  • envoy is configured with circuit breakers for all the remote services, and for the local service as well.

I am guessing the first bullet point is just setting the http_proxy variable, right?

I am lost a bit on the simple static routes part. Does this refer to some envoy terminology, or does it actually mean IP routes?

This will probably be interesting to pursue once we have our first services on kubernetes (so after next quarter).

Phase 3

  • Instead of relying on static things like k8s service IP/port pairs, envoy could interact with the k8s api to maintain a dynamic list of backends, for k8s things, and with another api (pybal's?) for the other services.

This is further down the road. It sounds nice, but we are far from it yet.

There are a few questions, things like:

  • Should we go with the TLS certs right from the beginning? Maybe it makes sense to avoid them in the very first phase.

As long as we have one TLS cert per deployment, it should be easy to do. But yeah, our first deployment should probably avoid going the full TLS way.

  • Should we expose both the service and the envoy endpoint in the beginning? It could make debugging/experimentation easier at first. Later on we can stop the practice, of course.

Agreed, at least as a first iteration.

  • If we do the TLS certs dance, how will we do it? We want it to be as painless as possible.

I think in production we just generate a private key/cert pair from the puppet CA for every service, then distribute them as a secret.
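
Assuming we use a standard kubernetes TLS Secret for this (names are made up), it would look roughly like the following, and the envoy sidecar mounts it as a volume:

  apiVersion: v1
  kind: Secret
  metadata:
    name: example-service-tls
  type: kubernetes.io/tls
  data:
    tls.crt: <base64 of the puppet-CA-signed certificate>
    tls.key: <base64 of the private key>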

Phase 2

  • every service is configured to reach the remote services it needs via envoy
  • envoy is configured with simple, static routes to the service (within k8s) or the load balancer of the service.
  • envoy is configured with circuit breakers for all the remote services, and for the local service as well.

I am guessing the first bullet point is just setting the http_proxy variable, right?

No, the idea would be that, say, RESTbase calls Mathoid on mathoid.proxy:<envoy-port> locally, which resolves to localhost:<envoy-port>. Envoy then manages the connections to the remote service, and is able to do quite a few nice things along the way.
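
One way this could be wired up (an assumption on my part: routing on the Host header, with made-up names and ports) is a single local egress listener whose route config dispatches to one cluster per remote service:

  route_config:
    virtual_hosts:
    - name: mathoid
      domains: ["mathoid.proxy", "mathoid.proxy:6060"]     # Host header as sent by RESTbase
      routes:
      - match: { prefix: "/" }
        route: { cluster: mathoid }
    - name: other_service
      domains: ["other-service.proxy", "other-service.proxy:6060"]
      routes:
      - match: { prefix: "/" }
        route: { cluster: other_service }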

I am lost a bit on the simple static routes part. Does this refer to some envoy terminology, or does it actually mean IP routes?

No sorry, it's envoy terminology. A static route means you pass envoy a list of servers to contact for a given service statically in the config file, as opposed to configuring envoy to get the list of servers dynamically (either via DNS or a discovery API).

Phase 3

  • Instead of relying on static things like k8s service IP/port pairs, envoy could interact with the k8s api to maintain a dynamic list of backends, for k8s things, and with another api (pybal's?) for the other services.

This is further down the road. It sounds nice, but we are far from it yet.

Agreed. I still want to think of where we want to get. But it might well be that by then istio will be proven enough as a technology that we can confidently adopt it.

There are a few questions, things like:

  • Should we go with the TLS certs right from the beginning? Maybe it makes sense to avoid them in the very first phase.

As long as we have one TLS cert per deployment, it should be easy to do. But yeah, our first deployment should probably avoid going the full TLS way.

OK agreed.

  • Should we expose both the service and the envoy endpoint in the beginning? It could make debugging/experimentation easier at first. Later on we can stop the practice, of course.

Agreed, at least as a first iteration.

  • If we do the TLS certs dance, how will we do it? We want it to be as painless as possible.

I think in production we just generate a private key/cert pair from the puppet CA for every service, then distribute them as a secret.

OK, sounds sane enough.

Phase 2

  • every service is configured to reach the remote services it needs via envoy
  • envoy is configured with simple, static routes to the service (within k8s) or the load balancer of the service.
  • envoy is configured with circuit breakers for all the remote services, and for the local service as well.

I am guessing the first bullet point is just setting the http_proxy variable, right?

No, the idea would be that, say, RESTbase calls Mathoid on mathoid.proxy:<envoy-port> locally, which resolves to localhost:<envoy-port>. Envoy then manages the connections to the remote service, and is able to do quite a few nice things along the way.

I need to read up more about this. I had a quick look at that last quarter and somehow wasn't sure about it back then.

I am lost a bit on the simple static routes part. Does this refer to some envoy terminology, or does it actually mean IP routes?

No sorry, it's envoy terminology. A static route means you pass envoy a list of servers to contact for a given service statically in the config file, as opposed to configuring envoy to get the list of servers dynamically (either via DNS or a discovery API).

Ah nice, OK, thanks for the clarification.

Phase 3

  • Instead of relying on static things like k8s service IP/port pairs, envoy could interact with the k8s api to maintain a dynamic list of backends, for k8s things, and with another api (pybal's?) for the other services.

This is further down the road. It sounds nice, but we are far from it yet.

Agreed. I still want to think of where we want to get. But it might well be that by then istio will be proven enough as a technology that we can confidently adopt it.

OK.

I think we can call this done?

Indeed, we have an implementation path, the dockerfiles for building the software, and some plans for the future. We can call this done.