
Evaluate istio as an ingress for production usage
Closed, Resolved · Public

Description

While we don't use istio as a full service mesh, we make ample use of istio as a gateway in the ml cluster for knative. So, while istio would be too complex as a mere ingress, given we're already using it we should indeed evaluate it as an alternative ingress.

Specifically, for each ingress we're answering the following questions:

  • What is the general architecture?
  • How can we deploy it on bare metal?
  • Do we need to build and maintain docker images ourselves?
  • How can it be configured to proxy various services with easy parametrization? There are three different ways, see Configuration below.
  • How do we operate on it?
  • Is it easy to collect metrics?

    Metrics can easily be collected with prometheus - in fact, istio ships with the correct annotations and thus should easily be picked up by our prometheus without adding any new rule.
  • How do we collect logs?

    Access logs and other logs are easy to collect as well.

Thankfully, given we already use istio for ML, we know the answer to at least the first few questions.

Configuration

There are three ways to configure istio-ingressgateway(s), with different levels of granularity and different feature sets.

Kubernetes Ingress (Simple)

https://istio.io/v1.9/docs/tasks/traffic-management/ingress/kubernetes-ingress/

Does not meet requirements, see: T287007#7431081

  • Uses Kubernetes default Ingress objects for configuration: https://people.wikimedia.org/~jayme/k8s-docs/v1.16/docs/concepts/services-networking/ingress/
  • But supports sharing of hostnames between namespaces (due to the central nature of istio-ingressgateway)
  • L7 (HTTP(S)) only
  • TLS certificates need to be placed in the namespace of istio-ingressgateway
  • Just plain host and path prefix matching
  • No advanced features like weight, header matching
  • No "role model" or binding restrictions (e.g. allowing namespaces to use a specific hostname only, enforce default policies ...)
  • Least number of API objects involved

Kubernetes Gateway API (Medium)

https://istio.io/v1.9/docs/tasks/traffic-management/ingress/gateway-api/

High risk of needing much maintenance/migration work in the near future, as well as adding a hard dependency on specific Istio (and therefore Kubernetes) versions, see: T287007#7431081

  • Uses the "new standard" Kubernetes API which is deemed the successor of Ingress
  • L7 and L4 routing
  • TLS certificates need to be placed in the namespace of istio-ingressgateway
  • Host and path (prefix and exact) matching
  • Advanced features like weight, header matching and modification,...
  • "Role model" or binding restrictions (e.g. allowing namespaces to use a specific hostname only, enforce default policies ...).
  • API still in v1alpha1 (3rd release), soon to be v1alpha2
  • Needs to be added (CRD) to the cluster
  • May be subject to breaking changes in the future (not completely clear to me, but I think we should assume so)
  • Already implemented by some other popular ingress controllers (like ambassador, contour, traefik and HAProxy)
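
To get a feel for the shape of this API, here is a minimal sketch along the lines of the v1alpha1 examples in the linked istio docs (hostname, object names and backend service are hypothetical, and the field names are v1alpha1 ones that may change in v1alpha2):

apiVersion: networking.x-k8s.io/v1alpha1
kind: Gateway
metadata:
  name: example-gateway            # hypothetical name
  namespace: istio-system
spec:
  gatewayClassName: istio
  listeners:
  - protocol: HTTP
    port: 80
    routes:
      kind: HTTPRoute
      selector:
        matchLabels:
          gateway: example-gateway # bind routes carrying this label
      namespaces:
        from: All                  # allow routes from any namespace
---
apiVersion: networking.x-k8s.io/v1alpha1
kind: HTTPRoute
metadata:
  name: example-route              # hypothetical name
  labels:
    gateway: example-gateway
spec:
  hostnames:
  - example.discovery.wmnet        # hypothetical hostname
  rules:
  - matches:
    - path:
        type: Prefix
        value: /api
    forwardTo:
    - serviceName: example-service # hypothetical backend service
      port: 8080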

Ingress Gateways (Advanced)

https://istio.io/v1.9/docs/tasks/traffic-management/ingress/ingress-control/

Proposed method to continue with!

  • Uses Istio specific API
  • L7 and L4 routing
  • TLS certificates can be placed anywhere
  • Host and path (prefix and exact) matching
  • No "role model" or binding restrictions (e.g. allowing namespaces to use a specific hostname only, enforce default policies ...).
  • Advanced features like weight, header matching and modification,...
  • Very advanced features like fault injection, circuit breaking, mirroring etc.
  • Needs to be added (CRD) to the cluster
  • Specific to istio
  • High number of API objects involved
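
For comparison, a minimal Gateway plus VirtualService sketch for this variant (hostname, namespace and backend service are hypothetical):

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: example-gateway              # hypothetical name
  namespace: example-ns              # can live in the service's namespace
spec:
  selector:
    istio: ingressgateway            # bind to the istio-ingressgateway pods
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - example.discovery.wmnet        # hypothetical hostname
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: example-routes               # hypothetical name
  namespace: example-ns
spec:
  hosts:
  - example.discovery.wmnet
  gateways:
  - example-gateway
  http:
  - match:
    - uri:
        prefix: /api                 # prefix match; exact is possible too
    route:
    - destination:
        host: example-service        # hypothetical backend service
        port:
          number: 8080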

Event Timeline

Istio can be configured with native ingress resources, using the annotation:

kubernetes.io/ingress.class: istio

see https://istio.io/latest/docs/tasks/traffic-management/ingress/kubernetes-ingress/
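
A minimal sketch of such an Ingress (hostname and backend service are hypothetical; the apiVersion matches the v1.16-era Ingress docs linked above):

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: example-ingress                # hypothetical name
  annotations:
    kubernetes.io/ingress.class: istio # hand this Ingress to istio
spec:
  rules:
  - host: example.discovery.wmnet      # hypothetical hostname
    http:
      paths:
      - path: /api                     # plain prefix matching only
        backend:
          serviceName: example-service # hypothetical backend service
          servicePort: 8080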

The other quirk is that secrets added to the ingress will need to be located in the istio namespace (which isn't that surprising, but is a deviation from the norm, and also means we need root intervention whenever we want to add a TLS-terminated ingress).

Of course, istio also offers its own custom resource definitions for a richer configuration:

the istio gateway (https://istio.io/latest/docs/reference/config/networking/gateway/) provides what looks like a simplified configuration for envoy, more or less, using three types of CRDs:

  • Gateway to expose the ports we want to listen to
  • VirtualService to indicate the routing prefixes
  • DestinationRule to set traffic policies towards the upstream backends

but I don't think we really need it at this point in time; we might as well wait for the new api gateway interface in newer kubernetes installations, which will probably add support for more features.

Metrics can easily be collected with prometheus - in fact, istio ships with the correct annotations and thus should easily be picked up by our prometheus without adding any new rule.
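
For reference, the scrape annotations istio attaches to its pods look roughly like this (port and path are, as far as I know, the istio defaults for the merged metrics endpoint):

prometheus.io/scrape: "true"
prometheus.io/port: "15020"
prometheus.io/path: "/stats/prometheus"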

Overall istio looks a lot like the other envoy-based ingresses I evaluated, just quite a bit more complicated, because istio can do much, much more than just being an api gateway.

The advantages are, though, the reason why I'm not ruling it out as a viable alternative:

  • no unknown unknowns
  • Reduce the number of technologies we run on top of kubernetes
  • An industry standard that will be maintained for the foreseeable future
  • Lower cost of startup for us than any other ingress technology, including nginx.

As it stands, I think we should pick istio as an ingress, piggybacking on the work done by the machine-learning team.

Change 719265 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] custom_deploy: Add istio manifest for main clusters

https://gerrit.wikimedia.org/r/719265

Change 719265 merged by jenkins-bot:

[operations/deployment-charts@master] custom_deploy: Add istio manifest for main clusters

https://gerrit.wikimedia.org/r/719265

A few questions/points:

  • I see the bullet point "TLS certificates need to be placed in the namespace of istio-ingressgateway" and a comment by @Joe above, but it doesn't clarify it for me. Does this mean that certs AND private keys get placed in the namespace as Secrets/Configmaps?
  • Care to add an example of "Just plain host and path prefix matching" (1st case) vs. "Host and path (prefix and exact) matching" (2nd and 3rd case)?
  • Interesting that the "Medium" has "Role model" for binding restrictions while "Advanced" doesn't. Is that correct?
  • Piggybacking on the work already done by the ML-team is indeed a huge contributing factor for choosing istio over other ingresses that have way smaller complexity (e.g. linkerd, contour, etc).
  • We've talked about this already in some channels, but stating it here as well for completeness. My take is that we go as simple as possible on this one. Debugging on kubernetes can already be a difficult task due to the complexities of the architecture; adding istio will only add to that complexity. Let's keep the extra things added as lean as possible. So, I guess that means choice #1 for now.

A few questions/points:

  • I see the bullet point "TLS certificates need to be placed in the namespace of istio-ingressgateway" and a comment by @Joe above, but it doesn't clarify it for me. Does this mean that certs AND private keys get placed in the namespace as Secrets/Configmaps?

Yes, exactly that. Cert and private key for all hostnames we want to use via Ingress need to be placed in the istio-ingressgateway Namespace as secrets. We already have those as ConfigMaps in the namespaces of the services, though.
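
For illustration, placing such a cert/key pair would look something like this (file names, secret name and namespace are hypothetical):

kubectl create secret tls example-wmnet-tls \
  --cert=example.discovery.wmnet.crt \
  --key=example.discovery.wmnet.key \
  --namespace=istio-system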

  • Care to add an example of "Just plain host and path prefix matching" (1st case) vs. "Host and path (prefix and exact) matching" (2nd and 3rd case)?

For the 1st case, a definition like path: /foo will always be treated as path: /foo.*. For the 2nd and 3rd cases we may choose to have an exact match instead (path: /foo != path: /foo.+); see the snippet below.
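
To make the contrast concrete, a hypothetical snippet of each (field names per the respective APIs):

# 1st case, Kubernetes Ingress: prefix matching only
paths:
- path: /foo          # also matches /foobar and /foo/bar

# 3rd case, istio VirtualService: exact matching is available
http:
- match:
  - uri:
      exact: /foo     # matches /foo and nothing else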

  • Interesting that the "Medium" has "Role model" for binding restrictions while "Advanced" doesn't. Is that correct?

I was surprised as well, but I did not find anything regarding that in Istio.

  • We've talked about this already in some channels, but stating it here as well for completeness. My take is that we go as simple as possible on this one. Debugging on kubernetes can already be a difficult task due to the complexities of the architecture; adding istio will only add to that complexity. Let's keep the extra things added as lean as possible. So, I guess that means choice #1 for now.

Agreed. It's also important to add that we could mix and match (assuming we add the required CRDs for the 2nd option) if we see the need. Although I would suggest not doing that too much, as it would make the way services are configured diverge quite heavily (adding even more complexity).

JMeybohm reopened this task as Open. Edited Oct 15 2021, 10:09 AM
JMeybohm claimed this task.

I'll reopen this one as it has more context on the topic of "which API to use for configuration".

I've created a simple istio setup in staging-codfw (T290966) to further evaluate/verify what we discussed here. Unfortunately that came to a hard stop pretty early when trying to create the first ingress for a "live" microservice (shellbox-score).

The problem is that the Kubernetes Ingress API does not natively support HTTPS connections to upstream backends (the microservices). NGINX ingress tackles this problem with a special annotation[1] for which Istio does not seem to have an equivalent. I tried to hack around that by convincing istio that mTLS is possible with the backend, but I think (did not dig further) the ingress-gateway only trusts istio's internal CA in that case (which would make sense ofc).
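
For reference, the NGINX annotation in question (from [1]) is:

nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"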

I now took a closer look at the Kubernetes Gateway API, which supports this scenario in v1alpha1 using a BackendPolicy [2]. Unfortunately that concept got removed in v1alpha2 [3]. While the functionality should be replaced by Policy Attachments, it is currently unclear to me how that works and the docs are still missing (the final v1alpha2 release is only 12 hours old, so something might follow). Anyways: Istio currently only supports v1alpha1 of that API, and v1alpha2 changes quite a lot without an automatic/semi-automatic upgrade path, so I'm not overly optimistic in terms of the follow-up work this would create. It's also pretty unclear to me if and when Istio will support v1alpha2, but it is obvious that this would only be supported in newer Istio versions, which potentially require newer Kubernetes versions.

To complete the journey, I tried to verify this can be done via the Ingress Gateway API (the istio native one), as I had doubts now as well. For that, one can specify DestinationRules [4] for upstream backends and enable TLS. Thankfully that seems to automatically configure envoy to use the system CA for verification as well. In short: that works.
Using the rewrite feature of that API I was able to hook shellbox-score and shellbox-constraints behind the ingress with "just" three additional API objects (a sketch of the DestinationRule follows after the curl examples below).

curl -i -HHost:shellbox.discovery.wmnet "http://kubestage2002.codfw.wmnet:30080/constraints/healthz"
curl -i -HHost:shellbox.discovery.wmnet "http://kubestage2002.codfw.wmnet:30080/score/healthz"
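
For illustration, the DestinationRule part of that setup might look roughly like this (names are hypothetical; the tls mode is per [4]):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: shellbox-tls                         # hypothetical name
spec:
  host: shellbox.shellbox.svc.cluster.local  # hypothetical upstream service
  trafficPolicy:
    tls:
      mode: SIMPLE   # plain TLS to the backend; envoy seems to verify against the system CA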

I have now also completed tests with TLS termination at the ingress-gateway, with no further surprises.

[1] https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#backend-protocol
[2] https://gateway-api.sigs.k8s.io/v1alpha1/guides/tls/#upstream-tls
[3] https://github.com/kubernetes-sigs/gateway-api/pull/732
[4] https://istio.io/latest/docs/reference/config/networking/destination-rule/#DestinationRule