This issue originated in T313915#8242495. This task tracks the remaining effort to lower the DNS query pressure on the ML coredns pods.
Relevant issues:
https://github.com/istio/istio/issues/31809
https://github.com/istio/istio/issues/13710
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | None | T272917 Lift Wing proof of concept
Resolved | | elukey | T318814 Reduce DNS queries from istio-proxies to coredns on ML clusters
Change 836692 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] istio: disable zipkin and tracing for ml-serve clusters
Change 836692 merged by Elukey:
[operations/deployment-charts@master] istio: disable zipkin and tracing for ml-serve clusters
Change 836698 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] admin_ng: raise resource quotas for ml-serve clusters
Change 836698 merged by Elukey:
[operations/deployment-charts@master] admin_ng: raise resource quotas for ml-serve clusters
Change 836734 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] istio: add option to disable dns queries for zipkin on ml-serve
Change 836734 merged by Elukey:
[operations/deployment-charts@master] istio: add option to disable dns queries for zipkin on ml-serve
Change 836811 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] coredns: add rewrite actions to the config map
Change 837069 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] knative-serving: allow dnsConfig settings for autoscaler
Change 837069 merged by Elukey:
[operations/deployment-charts@master] knative-serving: allow dnsConfig settings for autoscaler
Change 837073 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] admin_ng: add custom DNS ttl rewrites for ml-serve clusters
After a bit of digging we should have the correct picture. Take cluster-local-gateway.istio-system.svc.cluster.local. as an example: a record with a 5-second DNS TTL that is requested by every pod running an istio-proxy.
The cluster-local-gateway.istio-system.svc.cluster.local. record is configured as an Envoy cluster of type STRICT_DNS, so Envoy re-resolves it every $TTL_SECONDS as part of its service discovery workflow.
Envoy is also instructed to respect the record's DNS TTL; it does not set one by itself. The cluster-local-gateway.istio-system.svc.cluster.local. endpoint is a regular k8s svc, created by Istio when bootstrapping.
The CoreDNS docs state that, unless configured otherwise, a record gets a 5s TTL. So in our case we should configure CoreDNS to rewrite the DNS TTLs to something higher where needed, like 30s (see the patches above).
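For the archives, a minimal sketch of what such a TTL rewrite could look like in a Corefile, using the `rewrite ttl` action of the CoreDNS rewrite plugin. This is an illustration only, not the content of the merged patches: the server block layout and the surrounding plugins are assumptions, and only the record name and the 30s value come from the discussion above.

```
.:53 {
    errors
    # Assumption: one exact-match rule per record. The TTL in the response
    # is raised from the 5s default to 30s.
    rewrite ttl exact cluster-local-gateway.istio-system.svc.cluster.local. 30
    kubernetes cluster.local in-addr.arpa ip6.arpa
    forward . /etc/resolv.conf
}
```

Since Envoy respects the TTL it receives, a rule like this should cut the re-resolution rate for that record by roughly a factor of six (one query every 30s instead of every 5s per proxy).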
Change 836811 merged by Elukey:
[operations/deployment-charts@master] coredns: add rewrite actions to the config map
Change 837073 merged by Elukey:
[operations/deployment-charts@master] admin_ng: add custom DNS ttl rewrites for ml-serve clusters
Change 837117 had a related patch set uploaded (by JMeybohm; author: JMeybohm):
[operations/deployment-charts@master] Disable zipkin and tracing for wikikube clusters
Queries in both ml-serve clusters are now at around 600 rps, down from around 15k rps when we started (!!). Kubernetes API latencies are also better now, with fewer spikes and no more alarms reported by icinga.
List of things that we did to keep archives happy:
Change 837117 abandoned by JMeybohm:
[operations/deployment-charts@master] Disable zipkin and tracing for wikikube clusters
Reason:
This does not have any effect in wikikube, as it seems to only be evaluated for/by mesh proxy containers.