Page MenuHomePhabricator

Add an envoy proxy sidecar to Kserve inference pods
Closed, ResolvedPublic

Description

We are currently fetching data from the MW API in our revscoring inference pods, but the ServiveOps team suggests to use a special sidecar to implement basic policies (for example, to create circuit breaking thresholds etc..). We should investigate if it is possible in our setup, so that a burst in traffic for our endpoint will not necessarily translate into hammering other APIs.

Event Timeline

https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/741092 is an example of something that may work (needs to be tested), to add a sidecar proxy to all our pods.

A better and cleaner option, in my opinion, could be an istio egress gateway (https://istio.io/latest/docs/tasks/traffic-management/egress/egress-gateway/) but I am wondering if it works for a mesh where mTLS is disabled.

With the following I was able to have something almost working:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: https-endpoints
spec:
  hosts:
  - api-ro.discovery.wmnet
  ports:
  - number: 80
    name: http
    protocol: HTTP
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS

---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: istio-egressgateway
spec:
  selector:
    istio: egressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*.wikipedia.org"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: http-egress-gateway
spec:
  hosts:
  - "*.wikipedia.org"
  gateways:
  - istio-egressgateway
  http:
  - match:
    - gateways:
      - istio-egressgateway
      port: 80
    route:
    - destination:
        host: api-ro.discovery.wmnet
        port:
          number: 443

And the following to allow the egress gw to contact the api:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: istio-egressgateway
  namespace: istio-system
spec:
  egress:
    - ports:
      - port: 443
        protocol: TCP
      to:
      - ipBlock:
          cidr: 10.2.1.54/32
    - ports:
      - port: 443
        protocol: TCP
      to:
      - ipBlock:
          cidr: 10.2.2.54/32
    - ports:
      - port: 80
        protocol: TCP
      - port: 443
        protocol: TCP
      to:
      - ipBlock:
          cidr: 10.2.2.22/32
    - ports:
      - port: 80
        protocol: TCP
      - port: 443
        protocol: TCP
      to:
      - ipBlock:
          cidr: 10.2.1.22/32
  ingress:
  - ports:
    - port: 8080
      protocol: TCP
    - port: 8443
      protocol: TCP
  - from:
    - podSelector:
        matchLabels:
          istio: pilot
    ports:
    - port: 15012
      protocol: TCP
  podSelector:
    matchLabels:
      istio: egressgateway
  policyTypes:
  - Ingress
  - Egress

The above still doesn't work with TLS, and some more work needs to be done, but it is nice that each egress gateway (set of) pod(s) has a k8s service in front of it. We could follow multiple roads:

  1. One single egress gateway (maybe multiple pods replicated), all the namespaces will use it as proxy to external services. Easier to maintain, but difficult to manage if we want to apply different rules to the namespaces (like allowing to access some APIs only to some inference services and not others).
  1. One egress gateway for each namespace, in order to control what each namespace can reach. More granular than the above option, but a little bit more cumbersome to maintain (every time we create a namespace we need a separate istioctl egress config etc..).

I was able to use the following config to do allow pods to call via HTTP the egress gateway and force it to use https to connect to the MW api:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: https-endpoints
spec:
  hosts:
  - api-ro.discovery.wmnet
  ports:
  - number: 80
    name: http
    protocol: HTTP
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS

---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: istio-egressgateway
spec:
  selector:
    istio: egressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*.wikipedia.org"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: http-egress-gateway
spec:
  hosts:
  - "*.wikipedia.org"
  gateways:
  - istio-egressgateway
  http:
  - match:
    - gateways:
      - istio-egressgateway
      port: 80
    route:
    - destination:
        host: api-ro.discovery.wmnet
        port:
          number: 443
    headers:
      request:
        remove:
          - x-forwarded-proto
        set:
          x-forwarded-proto: https
          x-forwarded-port: "443"
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: https-endpoints-api-ro
spec:
  host: api-ro.discovery.wmnet
  trafficPolicy:
    portLevelSettings:
    - port:
        number: 443
      tls:
        mode: SIMPLE

I am not saying that we shouldn't use https from pods to egress as well, but just that the above works (and it was simpler to test!).

Change 742979 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Test istio egress gateway endpoint for ml-services

https://gerrit.wikimedia.org/r/742979

Change 742979 merged by Elukey:

[operations/deployment-charts@master] Test istio egress gateway endpoint for ml-services

https://gerrit.wikimedia.org/r/742979

I added the following bit to an inference service and it worked!

- name: WIKI_URL
  value: "http://istio-egressgateway.istio-system.svc.cluster.local"

Change 743438 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: refactor istio helmfile config to allow egress gateways

https://gerrit.wikimedia.org/r/743438

Change 743438 merged by Elukey:

[operations/deployment-charts@master] admin_ng: refactor istio helmfile config to allow egress gateways

https://gerrit.wikimedia.org/r/743438

Change 746804 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] custom_deploy.d: add egress gateway settings to the ml-serve's config

https://gerrit.wikimedia.org/r/746804

Change 746804 merged by Elukey:

[operations/deployment-charts@master] custom_deploy.d: add egress gateway settings to the ml-serve's config

https://gerrit.wikimedia.org/r/746804

Change 747153 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: Add Istio Egress config for ml-serve clusters

https://gerrit.wikimedia.org/r/747153

Change 747156 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: Configure all ml-services to use the Istio egress gw

https://gerrit.wikimedia.org/r/747156

Change 747153 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: Add Istio Egress config for ml-serve clusters

https://gerrit.wikimedia.org/r/747153

Change 747461 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: refactor istio egress gateway configuration

https://gerrit.wikimedia.org/r/747461

Change 747461 merged by Elukey:

[operations/deployment-charts@master] knative-serving: refactor istio egress gateway configuration

https://gerrit.wikimedia.org/r/747461

Change 747156 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: Configure all ml-services to use the Istio egress gw

https://gerrit.wikimedia.org/r/747156

We chose to use an istio egress gateway instead of local sidecars. We have deployed it in eqiad and codfw, the last step is to test how well it protects endpoints like the mw api from a burst of requests.

As far as I got SRE uses Envoy's defaults for this (without any specific circuit breaking settings). We should verify our use case, and tune the Istio Gateway config if needed.

Change 757675 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: add circuit breaking settings for ml-serve's egress

https://gerrit.wikimedia.org/r/757675

Change 757675 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: add circuit breaking settings for ml-serve's egress

https://gerrit.wikimedia.org/r/757675