Page MenuHomePhabricator

Improve Istio's mesh traffic transparent proxy capabilities for external domains accessed by Lift Wing
Open, Needs TriagePublic10 Estimated Story Points

Description

At the moment when we need access an "external domain" like en.wikipedia.org in a Lift Wing's mesh model server we need to explicitly proxy traffic to api-ro.discovery.wmnet. It would be nice to make this proxy action transparent, so that calling en.wikipedia.org inside an isvc would work without any specific mention of api-ro (via WIKI_URL for example).

Istio seems to be capable of doing it, but we never made it work. There are some things that I discovered today that could help, writing them down as reference for the future.

The idea would be to enable the Istio DNS proxy resolution, for example testing it on a specific isvc/pod via these annotations:

proxy.istio.io/config: |
  proxyMetadata:
    ISTIO_META_DNS_CAPTURE: "true"
    ISTIO_META_DNS_AUTO_ALLOCATE: "true"

Then we could add something like the following to the mesh config:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-svc-proxy
spec:
  hosts:
  - wikipedia.org
  - "*.wikipedia.org"
  location: MESH_EXTERNAL
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: NONE

When DNS proxying is enabled, all DNS traffic generated by a pod goes through a special Istio resolver (within the pod) that is capable of adding extra rules in place, for example from the Service Entry config. In the above example we specify that a DNS resolution for all the wikipedia domains should not be performed, leaving to Virtual Services and Destination rules the duty of handling that traffic. In our current config (Dec 2023) the VS and DR configs are already supporting this, but due to the DNS resolution being unaware of Service Entries, the code tries to resolve Wikipedia's domains ending up in trying to connect to external domains (sometimes they may fail due to IPv6 resolution etc..).

Why don't we enable DNS proxy now? Due to the following:

https://istio.io/latest/docs/setup/additional-setup/cni/#compatibility-with-application-init-containers

The Kserve's storage-initializer container doesn't work anymore if we enable it, but it may if we move its Docker image to uid 1337.

Event Timeline

A first step/test could be to modify the KServe's storage initializer Docker image with uuid 1337, deploy it to staging and see if it works or if it causes troubles with other containers. If it works, we could also remove the following bit from the puppet private's config:

traffic.sidecar.istio.io/excludeOutboundIPRanges: 10.2.2.54/32,10.2.1.54/32

We use it to allow traffic to Thanos Swift to avoid being redirected to the local istio proxy, that is not ready yet when the storage initializer runs.

Then we'd be free to test with Isto DNS proxying :)

Change 984141 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Set ipv6dualstack for ml-staging-codfw

https://gerrit.wikimedia.org/r/984141

I was able to make Revert Risk agnostic to work with WIKI_URL=http://en.wikipedia.org instead of WIKI_URL=api-ro.discovery.wmnet. The only bit that I had to add is rewrite stop type AAAA A to the coredns' config since this happens:

  • Python code tries to resolve en.wikipedia.org via coredns
  • An AAAA record is returned
  • a TCP connection is made, and everything hangs (namely I see a SYN in netstat's output)

If the record returned by coredns is A, everything works. After a chat with serviceops the issue may be https://gerrit.wikimedia.org/r/984141

To make the above WIKI_URL to work I simply added the following Service Entry:

apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-svc-redirect
spec:
  hosts:
  - wikipedia.org
  - "*.wikipedia.org"
  location: MESH_EXTERNAL
  ports:
  - number: 80
    name: http
    protocol: HTTP
  resolution: NONE

Change 984144 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] revert-risk: add the option to force HTTP traffic without WIKI_URL set

https://gerrit.wikimedia.org/r/984144

Change 984141 merged by Elukey:

[operations/puppet@production] Set ipv6dualstack for ml-staging-codfw

https://gerrit.wikimedia.org/r/984141

Change 984144 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revert-risk: add the option to force HTTP traffic without WIKI_URL set

https://gerrit.wikimedia.org/r/984144

Change 984214 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: set new Istio Service Entry for ml-staging-codfw

https://gerrit.wikimedia.org/r/984214

Change 984215 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: force HTTP in revert-risk agnostic staging

https://gerrit.wikimedia.org/r/984215

calbon moved this task from Watching to Blocked on the Machine-Learning-Team board.
calbon moved this task from Blocked to Backlog/Lift Wing on the Machine-Learning-Team board.
calbon set the point value for this task to 1.

Thanks to T353705 we have now the correct ipv6 ranges in puppet, and I was able to enable dual stack in staging. Tried to kill the RR agnostic pod (to allow the kubelet to re-create everything), but I still see the SYN issue mentioned in T353622#9415171

Then I found: https://github.com/istio/istio/issues/46625

The fix seems really close to what we are seeing, and we run 1.15.x (the fixed version is 1.18.x). So I guess that the only step that we have to unblock this is to force coredns to use A records.

Change 984250 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: force coredns to resolve to A records in ml-staging

https://gerrit.wikimedia.org/r/984250

I was able to make Revert Risk agnostic to work without WIKI_URL set to api-ro.discovery.wmnet, with the changes stated in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984214 (and the next ones in the same chain).

This is how Istio would work after those changes in an isvc:

  • A request for en.wikipedia.org is made, and since we don't use istio-dns-proxy a DNS query to CoreDNS would happen, and a public IP address would be returned (say 208.80.153.224).
  • Then the Python code (via something like aiohttp) would start a TCP connection to 208.80.153.224:80 (we use the 80 port since we want the envoy/istio proxy to be able to inspect the HTTP traffic and take routing decisions etc..).
  • Due to the new Service Entry mediawiki-api-proxied-domains, the istio/envoy sidecar/proxy would catch the request, and using the settings in the Virtual Service it would proxy to api-ro.discovery.wmnet.

Issues:

  • Due to https://github.com/istio/istio/issues/46625, if en.wikipedia.org resolves to a AAAA record (IPv6) the TCP connection wouldn't be caught by envoy and it would hang indefinitely in SYN state.
  • The above solution is still not complete, since it doesn't work for the https redirect use cases outlined in T352958 (but with some work it could).

Change 985347 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] revert-risk: add missing force_http arg in batch model

https://gerrit.wikimedia.org/r/985347

Change 984214 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: set new Istio Service Entry for ml-staging-codfw

https://gerrit.wikimedia.org/r/984214

Change 984250 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: force coredns to resolve to A records in ml-staging

https://gerrit.wikimedia.org/r/984250

Both changes have been merged and pushed to ml-staging.

Change 985347 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] revert-risk: add missing force_http arg in batch model

https://gerrit.wikimedia.org/r/985347

Change #1020808 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: move net_istio configs to a dict

https://gerrit.wikimedia.org/r/1020808

Today I found out https://github.com/istio/istio/issues/21914 after a lot of debugging in staging for T362316. The main issue that I was trying to solve was that after applying the new ServiceEntry/DestinationRule/VirtualService config for T362316, everything worked like a charm, but suddenly stopped if https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984214 was reverted (manually deleting the extra ServiceEntry).

The only thing that changes from what we are currently doing is the port, hence I found the github issue and I have reproduced the problem in staging: setting port 4680 to api-ro.discovery.wmnet (the old/currently-working-in-prod endpoint configured but on port 80) ended up into the same issue, namely the istio sidecar returning 502 bad gateway (in the istio-proxy log it was clear that the 502 was related to the Blackhole route, that is used for all the services not defined in the istio registry via Service Entries etc..).

After some reading of istio docs, I have re-discovered a note that says:

Requests are routed based on the port and Host header, rather than port and IP. This means the destination IP address is effectively ignored. For example, curl 8.8.8.8 -H "Host: productpage.default.svc.cluster.local", would be routed to the productpage Service.

Source: https://istio.io/latest/docs/ops/configuration/traffic-management/traffic-routing/

What we do with MW API is the following:

Since we use plain HTTP (to have sidecar metrics) then the Host header takes the priority, and in our case with mw-api-int-ro it fails since there is no ServiceEntry for it (like *.wikipedia.org etc..). The sidecar is not allowed to contact those domains and it returns 502. It used to work fine with api-ro.discovery.wmnet since it used the port 80, ending up in https://github.com/istio/istio/issues/21914 or similar.

Thanks to the work done in T353622#9419330, we can use this preliminary work to add a workaround and unblock the migration to the k8s endpoint. After that we may want to transition to a fully transparent setup, something like not specifying WIKI_URL but only forcing HTTP with aiohttp. Since the latter is probably wrong, I'd go with the former first.

Overall steps:

  1. Revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1020738 since it is a big difference with prod that we don't need.
  2. Recheck staging and confirm everything works
  3. Prepare the upgrade of the codfw cluster, that seems the right target.

The upgrade will look like this:

  1. The cluster needs to be depooled first from traffic.
  2. We need to apply coredns changes like https://gerrit.wikimedia.org/r/984250
  3. Then we need to apply the new Service Entry for proxied domains https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/984214 and the other configs for mw-api-int-ro.
  4. We deploy all model servers to switch endpoint (only codfw)
  5. We test all of them via httpbb or similar
  6. We repool traffic

Then after a day or two we'll move to eqiad.

After all the above move, we'll going to check if a more transparent proxy solution is needed.

Change #1020808 merged by Elukey:

[operations/deployment-charts@master] knative-serving: move net_istio configs to a dict

https://gerrit.wikimedia.org/r/1020808

Change #1021490 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: move Istio configs to mw-api-int-ro for ml-serve

https://gerrit.wikimedia.org/r/1021490

Change #1021981 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve-inference: allow transparent proxy mode for revscoring isvcs

https://gerrit.wikimedia.org/r/1021981

Change #984215 abandoned by Elukey:

[operations/deployment-charts@master] ml-services: force HTTP in revert-risk agnostic staging

Reason:

https://gerrit.wikimedia.org/r/984215

After some tests I was able to find an istio configuration to support both transparent and non-transparent proxy settings (namely, setting WIKI_URL or similar to a discovery endpoint or just using http://en.wikipedia.org).

Change #1023475 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: remove WIKI_URL from revscoring isvcs

https://gerrit.wikimedia.org/r/1023475

Change #1021490 merged by Elukey:

[operations/deployment-charts@master] admin_ng: move Istio configs to mw-api-int-ro for ml-serve

https://gerrit.wikimedia.org/r/1021490

Change #1021981 merged by Elukey:

[operations/deployment-charts@master] kserve-inference: allow transparent proxy mode for revscoring isvcs

https://gerrit.wikimedia.org/r/1021981

Change #1023475 merged by Elukey:

[operations/deployment-charts@master] ml-services: remove WIKI_URL from revscoring isvcs

https://gerrit.wikimedia.org/r/1023475

Change #1024661 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] kserve-inference: improve transparent proxy settings for revscoring

https://gerrit.wikimedia.org/r/1024661

Change #1024662 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: set host for wikidata's isvc in revscoring staging pods

https://gerrit.wikimedia.org/r/1024662

Change #1024661 merged by Elukey:

[operations/deployment-charts@master] kserve-inference: improve transparent proxy settings for revscoring

https://gerrit.wikimedia.org/r/1024661

Change #1024662 merged by Elukey:

[operations/deployment-charts@master] ml-services: set host for wikidata's isvc in revscoring staging pods

https://gerrit.wikimedia.org/r/1024662

Test in staging has been done, and it was successful! All the revscoring services are now running without WIKI_URL set explicitly.

Something still to fix: if upstream returns a HTTP 301 containing a Location header with https:// everything stops working, since https is not allowed within the istio configs. We should find a workaround in our code, like fixing the Location header if needed.

Opened T363725 for the redirects, as it can be tackled separately.

elukey changed the point value for this task from 1 to 10.May 6 2024, 1:31 PM

The changes have been successfully deployed on all Lift Wing clusters.