Add an envoy proxy sidecar to Kserve inference pods
Closed, Resolved, Public

Assigned To

	elukey

Authored By

	elukey
	Oct 27 2021, 8:26 AM

Description

We are currently fetching data from the MW API in our revscoring inference pods, but the ServiceOps team suggests using a dedicated sidecar to implement basic policies (for example, circuit breaking thresholds). We should investigate whether this is possible in our setup, so that a burst of traffic to our endpoint does not necessarily translate into hammering other APIs.

Details

Subject	Repo	Branch	Lines +/-
helmfile.d: add circuit breaking settings for ml-serve's egress	operations/deployment-charts	master	+15 -1
helmfile.d: Configure all ml-services to use the Istio egress gw	operations/deployment-charts	master	+9 -48
knative-serving: refactor istio egress gateway configuration	operations/deployment-charts	master	+200 -170
helmfile.d: Add Istio Egress config for ml-serve clusters	operations/deployment-charts	master	+85 -1
custom_deploy.d: add egress gateway settings to the ml-serve's config	operations/deployment-charts	master	+3 -0
admin_ng: refactor istio helmfile config to allow egress gateways	operations/deployment-charts	master	+45 -17
Test istio egress gateway endpoint for ml-services	operations/deployment-charts	master	+2 -2

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		None	T272917 Lift Wing proof of concept
		Resolved		elukey	T294414 Add an envoy proxy sidecar to Kserve inference pods

Event Timeline

elukey created this task. Oct 27 2021, 8:26 AM

elukey mentioned this in T294419: Factor out feature retrieve functionality to a transformer. Nov 2 2021, 2:47 PM

https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/741092 is an example of something that may work (needs to be tested), to add a sidecar proxy to all our pods.

A better and cleaner option, in my opinion, could be an istio egress gateway (https://istio.io/latest/docs/tasks/traffic-management/egress/egress-gateway/) but I am wondering if it works for a mesh where mTLS is disabled.

With the following I was able to have something almost working:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: https-endpoints
spec:
  hosts:
  - api-ro.discovery.wmnet
  ports:
  - number: 80
    name: http
    protocol: HTTP
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS

---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: istio-egressgateway
spec:
  selector:
    istio: egressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*.wikipedia.org"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: http-egress-gateway
spec:
  hosts:
  - "*.wikipedia.org"
  gateways:
  - istio-egressgateway
  http:
  - match:
    - gateways:
      - istio-egressgateway
      port: 80
    route:
    - destination:
        host: api-ro.discovery.wmnet
        port:
          number: 443

And the following to allow the egress gw to contact the api:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: istio-egressgateway
  namespace: istio-system
spec:
  egress:
    - ports:
      - port: 443
        protocol: TCP
      to:
      - ipBlock:
          cidr: 10.2.1.54/32
    - ports:
      - port: 443
        protocol: TCP
      to:
      - ipBlock:
          cidr: 10.2.2.54/32
    - ports:
      - port: 80
        protocol: TCP
      - port: 443
        protocol: TCP
      to:
      - ipBlock:
          cidr: 10.2.2.22/32
    - ports:
      - port: 80
        protocol: TCP
      - port: 443
        protocol: TCP
      to:
      - ipBlock:
          cidr: 10.2.1.22/32
  ingress:
  - ports:
    - port: 8080
      protocol: TCP
    - port: 8443
      protocol: TCP
  - from:
    - podSelector:
        matchLabels:
          istio: pilot
    ports:
    - port: 15012
      protocol: TCP
  podSelector:
    matchLabels:
      istio: egressgateway
  policyTypes:
  - Ingress
  - Egress

The above still doesn't work with TLS and needs more work, but it is nice that each egress gateway (a set of one or more pods) has a k8s Service in front of it. We could follow multiple roads:

One single egress gateway (possibly replicated across multiple pods) that all namespaces use as a proxy to external services. Easier to maintain, but difficult to manage if we want to apply different rules per namespace (for example, allowing only some inference services to access certain APIs).

One egress gateway for each namespace, in order to control what each namespace can reach. More granular than the option above, but a bit more cumbersome to maintain (every time we create a namespace we need a separate istioctl egress config, and so on).
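
As a rough sketch of the single-gateway option, a per-namespace NetworkPolicy could allow inference pods to reach the shared egress gateway. The namespace name `example-inference-ns` and the policy name are hypothetical, the `kubernetes.io/metadata.name` label assumes a Kubernetes version that sets it automatically on namespaces, and the ports mirror the 8080/8443 ingress ports in the policy above. Note that this only controls which pods can reach the gateway, not which APIs each namespace can call through it, which is exactly the limitation described for that option.

```yaml
# Sketch only: lets pods in one (hypothetical) namespace reach the shared egress gateway.
# DNS and any other egress traffic would need their own rules.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-to-istio-egressgateway  # hypothetical name
  namespace: example-inference-ns            # hypothetical namespace
spec:
  podSelector: {}  # applies to every pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    # namespaceSelector and podSelector in the same entry are ANDed:
    # only egress gateway pods inside istio-system are allowed.
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: istio-system
      podSelector:
        matchLabels:
          istio: egressgateway
    ports:
    - port: 8080
      protocol: TCP
    - port: 8443
      protocol: TCP
```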

I was able to use the following config to allow pods to call the egress gateway via HTTP and force it to use HTTPS to connect to the MW API:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: https-endpoints
spec:
  hosts:
  - api-ro.discovery.wmnet
  ports:
  - number: 80
    name: http
    protocol: HTTP
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS

---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: istio-egressgateway
spec:
  selector:
    istio: egressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*.wikipedia.org"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: http-egress-gateway
spec:
  hosts:
  - "*.wikipedia.org"
  gateways:
  - istio-egressgateway
  http:
  - match:
    - gateways:
      - istio-egressgateway
      port: 80
    route:
    - destination:
        host: api-ro.discovery.wmnet
        port:
          number: 443
    headers:
      request:
        remove:
          - x-forwarded-proto
        set:
          x-forwarded-proto: https
          x-forwarded-port: "443"
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: https-endpoints-api-ro
spec:
  host: api-ro.discovery.wmnet
  trafficPolicy:
    portLevelSettings:
    - port:
        number: 443
      tls:
        mode: SIMPLE

I am not saying that we shouldn't use https from pods to egress as well, but just that the above works (and it was simpler to test!).

Change 742979 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] Test istio egress gateway endpoint for ml-services

https://gerrit.wikimedia.org/r/742979

gerritbot added a project: Patch-For-Review. Dec 1 2021, 3:44 PM

Change 742979 merged by Elukey:

[operations/deployment-charts@master] Test istio egress gateway endpoint for ml-services

https://gerrit.wikimedia.org/r/742979

Maintenance_bot removed a project: Patch-For-Review. Dec 1 2021, 4:10 PM

I added the following bit to an inference service and it worked!

- name: WIKI_URL
  value: "http://istio-egressgateway.istio-system.svc.cluster.local"
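
For context, here is a minimal sketch of where that variable might sit in a KServe InferenceService. The resource name and image are hypothetical and the rest of the spec is omitted; only the `WIKI_URL` entry comes from the test above. Since the Gateway and VirtualService shown earlier match `*.wikipedia.org`, my reading is that the request sent to the gateway would need a matching Host header, but that is an assumption about how the model server builds its requests.

```yaml
# Sketch only: placement of WIKI_URL inside an InferenceService predictor.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-revscoring-model  # hypothetical name
spec:
  predictor:
    containers:
    - name: kserve-container
      image: example.org/revscoring-model:latest  # hypothetical image
      env:
      # Point MW API calls at the egress gateway Service instead of the API itself.
      - name: WIKI_URL
        value: "http://istio-egressgateway.istio-system.svc.cluster.local"
```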

Change 743438 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: refactor istio helmfile config to allow egress gateways

https://gerrit.wikimedia.org/r/743438

gerritbot added a project: Patch-For-Review. Dec 3 2021, 3:50 PM

Change 743438 merged by Elukey:

[operations/deployment-charts@master] admin_ng: refactor istio helmfile config to allow egress gateways

https://gerrit.wikimedia.org/r/743438

Maintenance_bot removed a project: Patch-For-Review. Dec 13 2021, 9:10 AM

Change 746804 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] custom_deploy.d: add egress gateway settings to the ml-serve's config

https://gerrit.wikimedia.org/r/746804

gerritbot added a project: Patch-For-Review. Dec 13 2021, 9:51 AM

Change 746804 merged by Elukey:

[operations/deployment-charts@master] custom_deploy.d: add egress gateway settings to the ml-serve's config

https://gerrit.wikimedia.org/r/746804

Maintenance_bot removed a project: Patch-For-Review. Dec 13 2021, 10:10 AM

elukey mentioned this in T297612: Experiment with the Istio TLS mesh. Dec 13 2021, 2:47 PM

Change 747153 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: Add Istio Egress config for ml-serve clusters

https://gerrit.wikimedia.org/r/747153

gerritbot added a project: Patch-For-Review. Dec 14 2021, 4:12 PM

Change 747156 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: Configure all ml-services to use the Istio egress gw

https://gerrit.wikimedia.org/r/747156

Change 747153 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: Add Istio Egress config for ml-serve clusters

https://gerrit.wikimedia.org/r/747153

Change 747461 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: refactor istio egress gateway configuration

https://gerrit.wikimedia.org/r/747461

Change 747461 merged by Elukey:

[operations/deployment-charts@master] knative-serving: refactor istio egress gateway configuration

https://gerrit.wikimedia.org/r/747461

Change 747156 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: Configure all ml-services to use the Istio egress gw

https://gerrit.wikimedia.org/r/747156

Maintenance_bot removed a project: Patch-For-Review. Dec 15 2021, 10:10 AM

We chose to use an Istio egress gateway instead of local sidecars. We have deployed it in eqiad and codfw; the last step is to test how well it protects endpoints like the MW API from a burst of requests.

As far as I understand, SRE uses Envoy's defaults for this (without any specific circuit breaking settings). We should verify our use case, and tune the Istio Gateway config if needed.
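
If tuning turns out to be needed, circuit breaking in Istio is configured on the DestinationRule. Below is a minimal sketch that extends the `https-endpoints-api-ro` rule shown earlier. All numeric values are illustrative placeholders and are not taken from any merged patch. The settings live inside the same `portLevelSettings` entry as the TLS config because port-level settings override the destination-level `trafficPolicy`.

```yaml
# Sketch only: circuit breaking thresholds for the MW API upstream (placeholder values).
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: https-endpoints-api-ro
spec:
  host: api-ro.discovery.wmnet
  trafficPolicy:
    portLevelSettings:
    - port:
        number: 443
      tls:
        mode: SIMPLE
      connectionPool:
        tcp:
          maxConnections: 100          # cap on connections to the upstream
        http:
          http1MaxPendingRequests: 50  # cap on requests queued while waiting for a connection
          maxRequestsPerConnection: 10 # recycle connections after this many requests
      outlierDetection:
        consecutive5xxErrors: 5        # eject a host after this many consecutive 5xx responses
        interval: 30s                  # how often hosts are evaluated
        baseEjectionTime: 60s          # minimum time a host stays ejected
```

Requests above these limits should be rejected by the gateway rather than forwarded, so a burst towards an inference endpoint would be absorbed before reaching the MW API.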

calbon moved this task from Parked to Backlog on the Machine-Learning-Team (Active Tasks) board. Jan 11 2022, 7:37 PM

calbon moved this task from Backlog to Parked on the Machine-Learning-Team (Active Tasks) board.

calbon moved this task from Parked to In Progress on the Machine-Learning-Team (Active Tasks) board. Jan 19 2022, 6:26 PM

elukey claimed this task. Jan 19 2022, 6:31 PM

Change 757675 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: add circuit breaking settings for ml-serve's egress

https://gerrit.wikimedia.org/r/757675

gerritbot added a project: Patch-For-Review. Jan 27 2022, 3:53 PM

Really nice document to read: https://tech.olx.com/demystifying-istio-circuit-breaking-27a69cac2ce4

Change 757675 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: add circuit breaking settings for ml-serve's egress

https://gerrit.wikimedia.org/r/757675

Maintenance_bot removed a project: Patch-For-Review. Feb 3 2022, 4:10 PM

elukey moved this task from In Progress to Parked on the Machine-Learning-Team (Active Tasks) board. Mar 7 2022, 6:11 PM

Done as part of T297612
