Icinga often fires alerts like the following:
13:10 +<icinga-wm> PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
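For reference, the latency behind the alert can also be queried straight from Prometheus. This is just a sketch: it assumes the standard apiserver_request_duration_seconds histogram and a generic prometheus:9090 endpoint, both of which may differ from what the actual alert rule uses:

# p99 LIST latency per API server instance over 5m windows
# (assumed metric name and Prometheus endpoint)
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (instance, le) (rate(apiserver_request_duration_seconds_bucket{verb="LIST"}[5m])))'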
It happens regularly (roughly daily) for both eqiad and codfw. From the graphs it seems that some operations, like LIST, take a long time to complete. From the API server's logs:
Jun 01 23:04:43 ml-serve-ctrl1001 kube-apiserver[441]: I0601 23:04:43.822180 441 trace.go:116] Trace[1927236208]: "Get" url:/api/v1/namespaces/kube-system/configmaps/cert-manager-cainjector-leader-election (started: 2022-06-01 23:04:43.160531154 +0000 UTC m=+2298798.910262962) (total time: 661.572139ms):
Jun 07 11:25:30 ml-serve-ctrl1001 kube-apiserver[441]: I0607 11:25:30.905215 441 trace.go:116] Trace[949378876]: "Get" url:/apis/coordination.k8s.io/v1/namespaces/knative-serving/leases/webhook.configmapwebhook.00-of-01 (started: 2022-06-07 11:25:30.264235608 +0000 UTC m=+2775246.013967460) (total time: 640.881894ms):
...
Jun 04 03:49:49 ml-serve-ctrl1001 kube-apiserver[441]: I0604 03:49:49.738868 441 trace.go:116] Trace[2120771801]: "Get" url:/apis/coordination.k8s.io/v1/namespaces/knative-serving/leases/istio-webhook.defaultingwebhook.00-of-01 (started: 2022-06-04 03:49:48.853333504 +0000 UTC m=+2488704.603065301) (total time: 885.502384ms):
...
Jun 06 14:55:15 ml-serve-ctrl1001 kube-apiserver[441]: I0606 14:55:15.572804 441 trace.go:116] Trace[174728748]: "Call mutating webhook" configuration:inferenceservice.serving.kserve.io,webhook:inferenceservice.kserve-webhook-server.defaulter,resource:serving.kserve.io/v1beta1, Resource=inferenceservices,subresource:,operation:UPDATE,UID:edd91fb4-1ae9-48c7-8472-a04ceda51a42 (started: 2022-06-06 14:55:14.255533552 +0000 UTC m=+2701430.005265387) (total time: 1.317200595s):
...
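More slow requests can be pulled out of journald with something like this (assuming the unit is named kube-apiserver, as the lines above suggest):

# Extract all traced slow requests and their total time from the last week
sudo journalctl -u kube-apiserver --since "7 days ago" \
  | grep -E 'trace\.go.*\(total time: [0-9.]+(ms|s)\)'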
At the beginning I thought it was a problem related to DRBD and etcd (i.e. higher latencies when fsyncing, etc.), but after moving the ml-etcd1* cluster to a non-replicated disk scheme in Ganeti (so removing DRBD), the alerts are still firing.
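To double-check that etcd disk latency is really out of the picture now that DRBD is gone, the standard etcd fsync/commit histograms can be queried (same assumed Prometheus endpoint as above; the etcd docs consider a sustained p99 WAL fsync above roughly 10ms a problem):

# p99 etcd WAL fsync and backend commit latencies over 5m windows
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))'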
We should investigate why this is happening.
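One possible starting point is to break average request latency down by resource and verb, to see whether the slow LISTs are concentrated on a specific resource (again a sketch, with the same metric and endpoint assumptions as above):

# Top 10 (resource, verb) pairs by average request latency over 5m windows
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=topk(10, sum by (resource, verb) (rate(apiserver_request_duration_seconds_sum[5m])) / sum by (resource, verb) (rate(apiserver_request_duration_seconds_count[5m])))'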