We are currently fetching data from the MW API in our revscoring inference pods, but the ServiveOps team suggests to use a special sidecar to implement basic policies (for example, to create circuit breaking thresholds etc..). We should investigate if it is possible in our setup, so that a burst in traffic for our endpoint will not necessarily translate into hammering other APIs.
Description
Details
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | None | T272917 Lift Wing proof of concept | |||
Resolved | elukey | T294414 Add an envoy proxy sidecar to Kserve inference pods |
Event Timeline
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/741092 is an example of something that may work (needs to be tested), to add a sidecar proxy to all our pods.
A better and cleaner option, in my opinion, could be an istio egress gateway (https://istio.io/latest/docs/tasks/traffic-management/egress/egress-gateway/) but I am wondering if it works for a mesh where mTLS is disabled.
With the following I was able to have something almost working:
apiVersion: networking.istio.io/v1alpha3 kind: ServiceEntry metadata: name: https-endpoints spec: hosts: - api-ro.discovery.wmnet ports: - number: 80 name: http protocol: HTTP - number: 443 name: https protocol: HTTPS resolution: DNS --- apiVersion: networking.istio.io/v1alpha3 kind: Gateway metadata: name: istio-egressgateway spec: selector: istio: egressgateway servers: - port: number: 80 name: http protocol: HTTP hosts: - "*.wikipedia.org" --- apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: http-egress-gateway spec: hosts: - "*.wikipedia.org" gateways: - istio-egressgateway http: - match: - gateways: - istio-egressgateway port: 80 route: - destination: host: api-ro.discovery.wmnet port: number: 443
And the following to allow the egress gw to contact the api:
apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: istio-egressgateway namespace: istio-system spec: egress: - ports: - port: 443 protocol: TCP to: - ipBlock: cidr: 10.2.1.54/32 - ports: - port: 443 protocol: TCP to: - ipBlock: cidr: 10.2.2.54/32 - ports: - port: 80 protocol: TCP - port: 443 protocol: TCP to: - ipBlock: cidr: 10.2.2.22/32 - ports: - port: 80 protocol: TCP - port: 443 protocol: TCP to: - ipBlock: cidr: 10.2.1.22/32 ingress: - ports: - port: 8080 protocol: TCP - port: 8443 protocol: TCP - from: - podSelector: matchLabels: istio: pilot ports: - port: 15012 protocol: TCP podSelector: matchLabels: istio: egressgateway policyTypes: - Ingress - Egress
The above still doesn't work with TLS, and some more work needs to be done, but it is nice that each egress gateway (set of) pod(s) has a k8s service in front of it. We could follow multiple roads:
- One single egress gateway (maybe multiple pods replicated), all the namespaces will use it as proxy to external services. Easier to maintain, but difficult to manage if we want to apply different rules to the namespaces (like allowing to access some APIs only to some inference services and not others).
- One egress gateway for each namespace, in order to control what each namespace can reach. More granular than the above option, but a little bit more cumbersome to maintain (every time we create a namespace we need a separate istioctl egress config etc..).
I was able to use the following config to do allow pods to call via HTTP the egress gateway and force it to use https to connect to the MW api:
apiVersion: networking.istio.io/v1alpha3 kind: ServiceEntry metadata: name: https-endpoints spec: hosts: - api-ro.discovery.wmnet ports: - number: 80 name: http protocol: HTTP - number: 443 name: https protocol: HTTPS resolution: DNS --- apiVersion: networking.istio.io/v1alpha3 kind: Gateway metadata: name: istio-egressgateway spec: selector: istio: egressgateway servers: - port: number: 80 name: http protocol: HTTP hosts: - "*.wikipedia.org" --- apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: http-egress-gateway spec: hosts: - "*.wikipedia.org" gateways: - istio-egressgateway http: - match: - gateways: - istio-egressgateway port: 80 route: - destination: host: api-ro.discovery.wmnet port: number: 443 headers: request: remove: - x-forwarded-proto set: x-forwarded-proto: https x-forwarded-port: "443" --- apiVersion: networking.istio.io/v1beta1 kind: DestinationRule metadata: name: https-endpoints-api-ro spec: host: api-ro.discovery.wmnet trafficPolicy: portLevelSettings: - port: number: 443 tls: mode: SIMPLE
I am not saying that we shouldn't use https from pods to egress as well, but just that the above works (and it was simpler to test!).
Change 742979 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] Test istio egress gateway endpoint for ml-services
Change 742979 merged by Elukey:
[operations/deployment-charts@master] Test istio egress gateway endpoint for ml-services
I added the following bit to an inference service and it worked!
- name: WIKI_URL value: "http://istio-egressgateway.istio-system.svc.cluster.local"
Change 743438 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] admin_ng: refactor istio helmfile config to allow egress gateways
Change 743438 merged by Elukey:
[operations/deployment-charts@master] admin_ng: refactor istio helmfile config to allow egress gateways
Change 746804 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] custom_deploy.d: add egress gateway settings to the ml-serve's config
Change 746804 merged by Elukey:
[operations/deployment-charts@master] custom_deploy.d: add egress gateway settings to the ml-serve's config
Change 747153 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] helmfile.d: Add Istio Egress config for ml-serve clusters
Change 747156 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] helmfile.d: Configure all ml-services to use the Istio egress gw
Change 747153 merged by Elukey:
[operations/deployment-charts@master] helmfile.d: Add Istio Egress config for ml-serve clusters
Change 747461 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] knative-serving: refactor istio egress gateway configuration
Change 747461 merged by Elukey:
[operations/deployment-charts@master] knative-serving: refactor istio egress gateway configuration
Change 747156 merged by Elukey:
[operations/deployment-charts@master] helmfile.d: Configure all ml-services to use the Istio egress gw
We chose to use an istio egress gateway instead of local sidecars. We have deployed it in eqiad and codfw, the last step is to test how well it protects endpoints like the mw api from a burst of requests.
As far as I got SRE uses Envoy's defaults for this (without any specific circuit breaking settings). We should verify our use case, and tune the Istio Gateway config if needed.
Change 757675 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] helmfile.d: add circuit breaking settings for ml-serve's egress
Really nice document to read: https://tech.olx.com/demystifying-istio-circuit-breaking-27a69cac2ce4
Change 757675 merged by Elukey:
[operations/deployment-charts@master] helmfile.d: add circuit breaking settings for ml-serve's egress