[07.03.22 14:42] <icinga-wm> PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
LIST calls to resources in istio API groups reported high latency: Grafana
kubemaster1001 started tracing requests that took longer than 3s:
Mar 7 14:24:58 kubemaster1001 kube-apiserver[493]: I0307 14:24:58.398676 493 trace.go:116] Trace[1185775778]: "List" url:/api/v1/secrets (started: 2022-03-07 14:24:57.669915292 +0000 UTC m=+3303833.739207327) (total time: 728.713224ms):
Mar 7 14:24:58 kubemaster1001 kube-apiserver[493]: Trace[1185775778]: [728.712147ms] [727.166459ms] Writing http response done count:372
Mar 7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.349581 493 trace.go:116] Trace[659950693]: "cacher list" type:*unstructured.Unstructured (started: 2022-03-07 14:24:57.349263071 +0000 UTC m=+3303833.418555111) (total time: 3.00024998s):
Mar 7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[659950693]: [3.00024998s] [3.000186434s] END
Mar 7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.355050 493 trace.go:116] Trace[2009449441]: "List" url:/apis/networking.istio.io/v1alpha3/workloadgroups (started: 2022-03-07 14:24:57.348984747 +0000 UTC m=+3303833.418276786) (total time: 3.006000333s):
Mar 7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[2009449441]: [3.006000333s] [3.005816051s] END
Mar 7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.400609 493 trace.go:116] Trace[824252976]: "cacher list" type:*unstructured.Unstructured (started: 2022-03-07 14:24:57.400227197 +0000 UTC m=+3303833.469519183) (total time: 3.000317636s):
Mar 7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[824252976]: [3.000317636s] [3.000309191s] END
Mar 7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.402158 493 trace.go:116] Trace[1125381388]: "List" url:/apis/security.istio.io/v1beta1/requestauthentications (started: 2022-03-07 14:24:57.400172745 +0000 UTC m=+3303833.469464718) (total time: 3.001942141s):
Mar 7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[1125381388]: [3.001942141s] [3.001896904s] END
Mar 7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.490035 493 trace.go:116] Trace[520237178]: "cacher list" type:*unstructured.Unstructured (started: 2022-03-07 14:24:57.489798833 +0000 UTC m=+3303833.559090809) (total time: 3.000178657s):
Mar 7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[520237178]: [3.000178657s] [3.000172598s] END
Mar 7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.491527 493 trace.go:116] Trace[718436094]: "List" url:/apis/networking.istio.io/v1alpha3/sidecars (started: 2022-03-07 14:24:57.489718219 +0000 UTC m=+3303833.559010194) (total time: 3.001774056s):
Mar 7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[718436094]: [3.001774056s] [3.001713145s] END
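The linked Grafana panel tracks apiserver request latency per verb; a PromQL query along these lines (an assumption on my side, not necessarily the panel's exact definition) surfaces which resources are slow for LIST calls. The metric name assumes Kubernetes >= 1.17; older releases exposed apiserver_request_latencies instead:

```promql
# p99 apiserver LIST latency per resource over a 5m window (sketch)
histogram_quantile(0.99,
  sum by (le, resource) (
    rate(apiserver_request_duration_seconds_bucket{verb="LIST"}[5m])
  )
)
```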
kubemaster1002 logged a denied admission webhook request:
Mar 7 14:26:26 kubemaster1002 kube-apiserver[550]: W0307 14:26:26.355945 550 dispatcher.go:141] rejected by webhook "validation.istio.io": &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"admission webhook \"validation.istio.io\" denied the request: configuration is invalid: gateway must have at least one server", Reason:"", Details:(*v1.StatusDetails)(nil), Code:400}}
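The rejected request matches what Istio's validating webhook returns for a Gateway whose spec.servers list is empty. A minimal object that would be denied with exactly this message (names are illustrative, not from the incident):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: example-gateway     # illustrative name
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers: []               # empty: "gateway must have at least one server"
```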
We had a similar issue again today (2022-03-24), this time caused by LIST operations for objects from cert-manager API groups. As last time, it started with a connectivity issue to the apiserver (this time from a cert-manager pod, rather than istio):
E0324 13:40:02.622643 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Certificate: Get "https://kubernetes.default.svc.cluster.local:443/apis/cert-manager.io/v1/certificates?allowWatchBookmarks=true&resourceVersion=178338902&timeout=7m25s&timeoutSeconds=445&watch=true": dial tcp: lookup kubernetes.default.svc.cluster.local: Temporary failure in name resolution
...
E0324 13:40:02.622880 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Challenge: Get "https://kubernetes.default.svc.cluster.local:443/apis/acme.cert-manager.io/v1/challenges?allowWatchBookmarks=true&resourceVersion=179062568&timeout=7m18s&timeoutSeconds=438&watch=true": dial tcp: lookup kubernetes.default.svc.cluster.local: Temporary failure in name resolution
I0324 13:40:47.245159 1 trace.go:205] Trace[155471295]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167 (24-Mar-2022 13:40:03.534) (total time: 43710ms):
Trace[155471295]: [43.710232348s] [43.710232348s] END
E0324 13:40:47.245215 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Order: failed to list *v1.Order: Timeout: Too large resource version: 179062547, current: 173837174
I0324 13:40:47.439111 1 trace.go:205] Trace[2126438358]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167 (24-Mar-2022 13:40:03.703) (total time: 43735ms):
Trace[2126438358]: [43.735780788s] [43.735780788s] END
The "Timeout: Too large resource version" messages kept appearing over and over, which made me suspicious, as I had seen them last time (coming from istio) as well. Unfortunately they never made it into the task and are only reachable via a logstash link. Going from that similarity I went looking again and found: Kubelet gets "Timeout: Too large resource version" error from the API server after network outage #91073, which seems to have been mitigated in k8s clients >=1.19 (Fix bug in reflector not recovering from "Too large resource version"… #92537). The following KEP was also created (I have not looked at it in detail) to improve the situation on the apiserver side: KEP: Efficient watch resumption after kube-apiserver reboot #1878
Follow ups:
- Provide better insights on latency (groups/resources) in https://grafana.wikimedia.org/d/000000435/kubernetes-api
- Improve the istio-control-plane dashboard; the issue was not visible there