
Reduce DNS queries from istio-proxies to coredns on ML clusters
Closed, Resolved (Public)

Description

The issue originated in T313915#8242495. This task tracks the remaining work to lower the DNS query pressure on the ML coredns pods.

Relevant issues:

https://github.com/istio/istio/issues/31809
https://github.com/istio/istio/issues/13710

Event Timeline

Change 836692 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] istio: disable zipkin and tracing for ml-serve clusters

https://gerrit.wikimedia.org/r/836692

Change 836692 merged by Elukey:

[operations/deployment-charts@master] istio: disable zipkin and tracing for ml-serve clusters

https://gerrit.wikimedia.org/r/836692

Change 836698 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: raise resource quotas for ml-serve clusters

https://gerrit.wikimedia.org/r/836698

Change 836698 merged by Elukey:

[operations/deployment-charts@master] admin_ng: raise resource quotas for ml-serve clusters

https://gerrit.wikimedia.org/r/836698

Change 836734 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] istio: add option to disable dns queries for zipkin on ml-serve

https://gerrit.wikimedia.org/r/836734

Change 836734 merged by Elukey:

[operations/deployment-charts@master] istio: add option to disable dns queries for zipkin on ml-serve

https://gerrit.wikimedia.org/r/836734

Change 836811 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] coredns: add rewrite actions to the config map

https://gerrit.wikimedia.org/r/836811

Change 837069 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] knative-serving: allow dnsConfig settings for autoscaler

https://gerrit.wikimedia.org/r/837069

Change 837069 merged by Elukey:

[operations/deployment-charts@master] knative-serving: allow dnsConfig settings for autoscaler

https://gerrit.wikimedia.org/r/837069

Change 837073 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] admin_ng: add custom DNS ttl rewrites for ml-serve clusters

https://gerrit.wikimedia.org/r/837073

After a bit of digging we should have the correct picture. Let's take cluster-local-gateway.istio-system.svc.cluster.local. as an example: a record with a 5 second DNS TTL that is requested by every pod running an istio-proxy.

The cluster-local-gateway.istio-system.svc.cluster.local. record is configured as an Envoy Cluster of type STRICT_DNS, so Envoy re-resolves it every $TTL_SECONDS as part of its Service Discovery workflow.

Envoy is also instructed to respect the DNS TTL of the record; it doesn't set a refresh interval of its own. The cluster-local-gateway.istio-system.svc.cluster.local. endpoint is a regular k8s svc, created by Istio when bootstrapping.

The CoreDNS docs state that a record gets a 5s TTL unless configured otherwise. So in our case, we should configure CoreDNS to rewrite the DNS TTLs to something higher where needed, like 30s (see the patches above).
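To make the effect of the TTL concrete, here is a rough back-of-the-envelope sketch. It assumes every istio-proxy re-resolves each STRICT_DNS hostname exactly once per TTL; the proxy count, hostname count, and the `resolutions_per_second` helper are hypothetical and not taken from the clusters.

```python
def resolutions_per_second(proxies: int, hostnames: int, ttl_seconds: float,
                           record_types: int = 1) -> float:
    """Approximate steady-state DNS query rate when every proxy
    re-resolves each hostname once per TTL (simplified model)."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return proxies * hostnames * record_types / ttl_seconds


# Hypothetical numbers, for illustration only.
PROXIES = 500    # pods running an istio-proxy sidecar
HOSTNAMES = 4    # STRICT_DNS hostnames each proxy keeps refreshing

rate_5s = resolutions_per_second(PROXIES, HOSTNAMES, ttl_seconds=5)
rate_30s = resolutions_per_second(PROXIES, HOSTNAMES, ttl_seconds=30)

print(f"TTL 5s:  {rate_5s:.0f} queries/s")
print(f"TTL 30s: {rate_30s:.0f} queries/s")
print(f"reduction: {rate_5s / rate_30s:.0f}x")
```

Under this model the query rate is inversely proportional to the TTL, so moving from 5s to 30s cuts this class of queries by a factor of six regardless of the cluster size.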

Change 836811 merged by Elukey:

[operations/deployment-charts@master] coredns: add rewrite actions to the config map

https://gerrit.wikimedia.org/r/836811

Change 837073 merged by Elukey:

[operations/deployment-charts@master] admin_ng: add custom DNS ttl rewrites for ml-serve clusters

https://gerrit.wikimedia.org/r/837073

Change 837117 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] Disable zipkin and tracing for wikikube clusters

https://gerrit.wikimedia.org/r/837117

Rolled out the coredns rewrites to all clusters, way better now!

elukey claimed this task.

Queries in both ml-serve clusters are now at around 600 rps, down from around 15k rps when we started (!!). Kubernetes API latencies are also better now, with fewer spikes and no more alarms reported by icinga.

List of things that we did to keep archives happy:

  1. Lowered resolv.conf's ndots from 5 to 2 on all Inference Services pods.
  2. Properly disabled zipkin to avoid DNS queries for service discovery.
  3. Lowered resolv.conf's ndots from 5 to 2 on Knative's autoscaler pods (to avoid service discovery queries).
  4. Added coredns settings to increase the default k8s service TTL of 5s to 30s for some domains widely used by kserve/knative/istio.
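Why the ndots change helps can be sketched with a simplified model of how a stub resolver expands a name through the resolv.conf search list. The search domains below are the usual Kubernetes pod defaults for a hypothetical namespace `my-namespace`, and the helper functions are illustrative; real resolvers also issue separate A and AAAA lookups, which this sketch ignores.

```python
def candidate_names(name: str, search: list[str], ndots: int) -> list[str]:
    """Return the FQDNs a resolver would try, in order (simplified)."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search list expansion.
        return [name]
    absolute = name + "."
    via_search = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + via_search
    # Too few dots: walk the search list first, the absolute name last.
    return via_search + [absolute]


def queries_until_answer(name: str, search: list[str], ndots: int,
                         existing: set[str]) -> int:
    """Count lookups issued until the first candidate that exists."""
    candidates = candidate_names(name, search, ndots)
    for count, fqdn in enumerate(candidates, start=1):
        if fqdn in existing:
            return count
    return len(candidates)


# Assumed search list for a pod in the hypothetical namespace "my-namespace".
SEARCH = ["my-namespace.svc.cluster.local", "svc.cluster.local", "cluster.local"]
NAME = "cluster-local-gateway.istio-system.svc.cluster.local"  # 4 dots
EXISTING = {NAME + "."}

print("ndots=5:", queries_until_answer(NAME, SEARCH, 5, EXISTING), "lookups")
print("ndots=2:", queries_until_answer(NAME, SEARCH, 2, EXISTING), "lookups")
```

With ndots set to 5 the four-dot name is first tried against every search domain, so three lookups fail before the absolute name resolves; with ndots set to 2 the first lookup already succeeds.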

Change 837117 abandoned by JMeybohm:

[operations/deployment-charts@master] Disable zipkin and tracing for wikikube clusters

Reason:

This does not have any effect in wikikube as it seems to only be evaluated for/by mesh proxy containers.

https://gerrit.wikimedia.org/r/837117