Page MenuHomePhabricator

Investigate high latencies registered by the ml-serve api control plane
Open, Needs TriagePublic

Description

Icinga often alerts like the following:

13:10 +<icinga-wm> PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 
                   verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes 
                   https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27

It happens regularly/daily for both eqiad and codfw. From the graphs it seems that some operations, like LIST, took a long time to complete. From the api-server's logs:

Jun 01 23:04:43 ml-serve-ctrl1001 kube-apiserver[441]: I0601 23:04:43.822180     441 trace.go:116] Trace[1927236208]: "Get" url:/api/v1/namespaces/kube-system/configmaps/cert-manager-cainjector-leader-election (started: 2022-06-01 23:04:43.160531154 +0000 UTC m=+2298798.910262962) (total time: 661.572139ms):

Jun 07 11:25:30 ml-serve-ctrl1001 kube-apiserver[441]: I0607 11:25:30.905215     441 trace.go:116] Trace[949378876]: "Get" url:/apis/coordination.k8s.io/v1/namespaces/knative-serving/leases/webhook.configmapwebhook.00-of-01 (started: 2022-06-07 11:25:30.264235608 +0000 UTC m=+2775246.013967460) (total time: 640.881894ms):
...
Jun 04 03:49:49 ml-serve-ctrl1001 kube-apiserver[441]: I0604 03:49:49.738868     441 trace.go:116] Trace[2120771801]: "Get" url:/apis/coordination.k8s.io/v1/namespaces/knative-serving/leases/istio-webhook.defaultingwebhook.00-of-01 (started: 2022-06-04 03:49:48.853333504 +0000 UTC m=+2488704.603065301) (total time: 885.502384ms):
...
Jun 06 14:55:15 ml-serve-ctrl1001 kube-apiserver[441]: I0606 14:55:15.572804     441 trace.go:116] Trace[174728748]: "Call mutating webhook" configuration:inferenceservice.serving.kserve.io,webhook:inferenceservice.kserve-webhook-server.defaulter,resource:serving.kserve.io/v1beta1, Resource=inferenceservices,subresource:,operation:UPDATE,UID:edd91fb4-1ae9-48c7-8472-a04ceda51a42 (started: 2022-06-06 14:55:14.255533552 +0000 UTC m=+2701430.005265387) (total time: 1.317200595s):
...

At the beginning I thought it was a problem related to DRDB and etcd (so higher latencies when fsyncing etc..) but after moving the ml-etcd1* cluster to a no replication scheme in Ganeti (so removing DRDB), the alerts are still firing.

We should investigate why this is happening.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-06-09T07:53:38Z] <elukey> drop DRDB disk template from ml-etcd2* nodes - T310073

After unsetting DRDB for ml-etcd in eqiad I noticed that more CRITICALs were related to codfw, so I turned off DRDB on ml-etcd codfw as well.

elukey@ml-etcd2001:~$ sudo etcdctl -C https://ml-etcd2001.codfw.wmnet:2379 cluster-health
member 3eaef5f31c9d4f07 is healthy: got healthy result from https://ml-etcd2001.codfw.wmnet:2379
member 6ec81f119df22c02 is healthy: got healthy result from https://ml-etcd2003.codfw.wmnet:2379
member 8ac1f758490b9cfc is healthy: got healthy result from https://ml-etcd2002.codfw.wmnet:2379
cluster is healthy

Something related may be T303184, but in our case it seems that the LIST operations are on avg more expensive (in terms of latency) compared to other clusters.

There are two use cases to investigate, in my opinion:

  • On avg, operations like LIST takes tens of ms and causes warnings in icinga. The values are higher from other clusters, but it may be due to istio/knative/kserve's high number of resources. If so, we should probably try to come up with ad-hoc latency bands for ml-serve and discuss with ServiceOps if they make sense.
  • Spikes of latencies due to isolated events. These causes CRITICALs to be issued, and we should try to figure out what happens.

To note: https://gerrit.wikimedia.org/r/c/operations/puppet/+/808012

The service ops team increased the baseline of tolerance for k8s API latencies, that will apply also to our clusters. Let's see how it goes.