High API server request latencies (LIST)
Open, Medium, Public

Description

[07.03.22 14:42] <icinga-wm> PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27

List calls to resources in istio API groups report high latency: Grafana

kubemaster1001 starts to trace actions taking longer than 3s:

Mar  7 14:24:58 kubemaster1001 kube-apiserver[493]: I0307 14:24:58.398676     493 trace.go:116] Trace[1185775778]: "List" url:/api/v1/secrets (started: 2022-03-07 14:24:57.669915292 +0000 UTC m=+3303833.739207327) (total time: 728.713224ms):
Mar  7 14:24:58 kubemaster1001 kube-apiserver[493]: Trace[1185775778]: [728.712147ms] [727.166459ms] Writing http response done count:372
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.349581     493 trace.go:116] Trace[659950693]: "cacher list" type:*unstructured.Unstructured (started: 2022-03-07 14:24:57.349263071 +0000 UTC m=+3303833.418555111) (total time: 3.00024998s):
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[659950693]: [3.00024998s] [3.000186434s] END
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.355050     493 trace.go:116] Trace[2009449441]: "List" url:/apis/networking.istio.io/v1alpha3/workloadgroups (started: 2022-03-07 14:24:57.348984747 +0000 UTC m=+3303833.418276786) (total time: 3.006000333s):
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[2009449441]: [3.006000333s] [3.005816051s] END
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.400609     493 trace.go:116] Trace[824252976]: "cacher list" type:*unstructured.Unstructured (started: 2022-03-07 14:24:57.400227197 +0000 UTC m=+3303833.469519183) (total time: 3.000317636s):
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[824252976]: [3.000317636s] [3.000309191s] END
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.402158     493 trace.go:116] Trace[1125381388]: "List" url:/apis/security.istio.io/v1beta1/requestauthentications (started: 2022-03-07 14:24:57.400172745 +0000 UTC m=+3303833.469464718) (total time: 3.001942141s):
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[1125381388]: [3.001942141s] [3.001896904s] END
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.490035     493 trace.go:116] Trace[520237178]: "cacher list" type:*unstructured.Unstructured (started: 2022-03-07 14:24:57.489798833 +0000 UTC m=+3303833.559090809) (total time: 3.000178657s):
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[520237178]: [3.000178657s] [3.000172598s] END
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.491527     493 trace.go:116] Trace[718436094]: "List" url:/apis/networking.istio.io/v1alpha3/sidecars (started: 2022-03-07 14:24:57.489718219 +0000 UTC m=+3303833.559010194) (total time: 3.001774056s):
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[718436094]: [3.001774056s] [3.001713145s] END
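
A rough way to confirm this outside of Grafana (the alert is presumably based on the apiserver request duration histograms) is to time the same LIST paths from the traces directly; a minimal sketch, assuming admin credentials for the cluster:

# Time LISTs of the istio resources that show up as slow in the traces above
time kubectl get --raw '/apis/networking.istio.io/v1alpha3/sidecars' >/dev/null
time kubectl get --raw '/apis/security.istio.io/v1beta1/requestauthentications' >/dev/null

# For comparison, the /api/v1/secrets LIST from the first trace (paginated to keep it cheap)
time kubectl get --raw '/api/v1/secrets?limit=500' >/dev/null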

kubemaster1002 logged a denied admission webhook request:

Mar  7 14:26:26 kubemaster1002 kube-apiserver[550]: W0307 14:26:26.355945     550 dispatcher.go:141] rejected by webhook "validation.istio.io": &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"admission webhook \"validation.istio.io\" denied the request: configuration is invalid: gateway must have at least one server", Reason:"", Details:(*v1.StatusDetails)(nil), Code:400}}
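
The denial itself comes from istio's validating webhook (named validation.istio.io in the log). If needed, the webhook registration and the Gateway objects it complained about can be looked at along these lines (a sketch; the exact ValidatingWebhookConfiguration name depends on the istio version):

# Webhook registrations; one of them carries the validation.istio.io webhook from the log
kubectl get validatingwebhookconfigurations

# Gateway objects, to spot the one that was submitted without a servers section
kubectl get gateways.networking.istio.io -A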

We had a similar issue again today (2022-03-24), this time caused by LIST operations for objects from cert-manager API groups. Like last time, it started with a connectivity issue to the apiserver (this time from a cert-manager pod, rather than istio):

E0324 13:40:02.622643       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Certificate: Get "https://kubernetes.default.svc.cluster.local:443/apis/cert-manager.io/v1/certificates?allowWatchBookmarks=true&resourceVersion=178338902&timeout=7m25s&timeoutSeconds=445&watch=true": dial tcp: lookup kubernetes.default.svc.cluster.local: Temporary failure in name resolution
...
E0324 13:40:02.622880       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Challenge: Get "https://kubernetes.default.svc.cluster.local:443/apis/acme.cert-manager.io/v1/challenges?allowWatchBookmarks=true&resourceVersion=179062568&timeout=7m18s&timeoutSeconds=438&watch=true": dial tcp: lookup kubernetes.default.svc.cluster.local: Temporary failure in name resolution
I0324 13:40:47.245159       1 trace.go:205] Trace[155471295]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167 (24-Mar-2022 13:40:03.534) (total time: 43710ms):
Trace[155471295]: [43.710232348s] [43.710232348s] END
E0324 13:40:47.245215       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Order: failed to list *v1.Order: Timeout: Too large resource version: 179062547, current: 173837174
I0324 13:40:47.439111       1 trace.go:205] Trace[2126438358]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167 (24-Mar-2022 13:40:03.703) (total time: 43735ms):
Trace[2126438358]: [43.735780788s] [43.735780788s] END

The "Timeout: Too large resource version" messages then kept appearing over and over and made me suspicious, as I had seen them the last time (coming from istio) as well. Unfortunately they never made it to the task and are only hidden behind a logstash link. Going from that similarity I went looking again and found Kubelet gets "Timeout: Too large resource version" error from the API server after network outage #91073, which seems to have been mitigated in k8s client >=1.19 (Fix bug in reflector not recovering from "Too large resource version"… #92537). The following KEP was also created (have not looked at it in detail) to improve the situation on the apiserver side: KEP: Efficient watch resumption after kube-apiserver reboot #1878
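
To see whether a reflector is asking for a resourceVersion the apiserver's watch cache has not caught up to yet (the 179062547 vs. 173837174 mismatch above), the cache state can be compared against a quorum read: a LIST with resourceVersion=0 is served from the watch cache, while a LIST without a resourceVersion goes to etcd. A sketch, using one of the affected cert-manager collections from the log (requires jq):

# resourceVersion as served from the apiserver watch cache
kubectl get --raw '/apis/acme.cert-manager.io/v1/challenges?resourceVersion=0' | jq -r '.metadata.resourceVersion'

# resourceVersion from a quorum read against etcd (no resourceVersion parameter)
kubectl get --raw '/apis/acme.cert-manager.io/v1/challenges' | jq -r '.metadata.resourceVersion'

# If the first lags far behind the second (and behind what the reflector requested),
# the watch cache is behind and "Timeout: Too large resource version" is expected.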

Follow ups:

Event Timeline

Two interesting things logged on one of the istiod pods:

{"level":"error","time":"2022-03-07T14:24:56.360678Z","scope":"klog","msg":"error retrieving resource lock istio-system/istio-leader: Get \"https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/istio-system/configmaps/istio-leader\": http2: client connection lost"}
{"level":"info","time":"2022-03-07T14:24:56.361427Z","scope":"klog","msg":"k8s.io/client-go@v0.20.1/tools/cache/reflector.go:167: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"}
[...last one repeated a lot of times...]

It appears as if the istiod control plane got stuck in some corner case; maybe this is a known bug that can be triggered when connectivity between the control plane and the kube-apiserver is lost?

Logstash link for that: https://logstash.wikimedia.org/goto/84547de9db1985533f571171075fa88a

From there it seems as if istiod-69d679d8b5-hm64j went sideways and istiod-69d679d8b5-m6klg lost the leader election:

Mar 7, 2022 @ 14:31:33.534 istiod-69d679d8b5-m6klg
ADS: "10.64.71.36:41866" router~10.64.71.36~istio-ingressgateway-n6cnv.istio-system~istio-system.svc.cluster.local-3742 terminated rpc error: code = Canceled desc = context canceled
Mar 7, 2022 @ 14:26:22.150 istiod-69d679d8b5-m6klg
Failed to release lock: resource name may not be empty
Mar 7, 2022 @ 14:26:22.107 istiod-69d679d8b5-m6klg
Leader election cycle 1 lost. Trying again
Mar 7, 2022 @ 14:26:22.106 istiod-69d679d8b5-m6klg
error retrieving resource lock istio-system/istio-validation-controller-election: Get "https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/istio-system/configmaps/istio-validation-controller-election": context deadline exceeded
Mar 7, 2022 @ 14:26:22.076 istiod-69d679d8b5-m6klg
Leader election cycle 1 lost. Trying again
Mar 7, 2022 @ 14:26:22.038 istiod-69d679d8b5-m6klg
Error waiting for ingress
Mar 7, 2022 @ 14:26:22.037 istiod-69d679d8b5-m6klg
Failed to release lock: resource name may not be empty
Mar 7, 2022 @ 14:26:22.033 istiod-69d679d8b5-m6klg
error retrieving resource lock istio-system/istio-leader: Get "https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/istio-system/configmaps/istio-leader": context deadline exceeded
Mar 7, 2022 @ 14:26:21.370 istiod-69d679d8b5-m6klg
Leader election cycle 1 lost. Trying again
Mar 7, 2022 @ 14:26:21.370 istiod-69d679d8b5-m6klg
error retrieving resource lock istio-system/istio-namespace-controller-election: Get "https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/istio-system/configmaps/istio-namespace-controller-election": context deadline exceeded
Mar 7, 2022 @ 14:26:21.328 istiod-69d679d8b5-m6klg
Failed to release lock: resource name may not be empty

But the election ConfigMaps state that istiod-69d679d8b5-m6klg has been the holder since 2022-02-23T12:24:06Z (if I don't misinterpret acquireTime).
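
The leader-election record is kept in an annotation on those ConfigMaps, so holderIdentity and acquireTime can be read directly; a sketch (the annotation key is the standard client-go one, ConfigMap names taken from the logs above):

# Leader election records used by istiod
for cm in istio-leader istio-namespace-controller-election istio-validation-controller-election; do
  kubectl -n istio-system get configmap "$cm" \
    -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}'
  echo
done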

Mentioned in SAL (#wikimedia-operations) [2022-03-07T15:56:48Z] <jayme> eqiad: kubectl -n istio-system delete po istiod-69d679d8b5-hm64j - T303184

Killing istiod-69d679d8b5-hm64j actually brought the latency down again.

Unfortunately I don't see istio pilot metrics exposing the issue.
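
For completeness: pilot exposes its metrics on istiod's monitoring port (15014 by default), so they can be spot-checked via a port-forward like the one below; as said, nothing there pointed at the problem (a sketch, metric name prefixes may vary between istio versions):

# Forward istiod's monitoring port and grep a couple of pilot metrics
kubectl -n istio-system port-forward deploy/istiod 15014:15014 &
sleep 2
curl -s localhost:15014/metrics | grep -E '^pilot_(proxy_convergence_time|xds)'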

I will resolve this, although I'm not sure what the cause was. The webhook rejection is a red herring, because the istio pilot (webhook) was not able to list the endpoints at that time.
Visibility of this issue was okay (we got alerted via the k8s API latency) and service was never interrupted (e.g. the ingressgateway was happy all the time).

JMeybohm renamed this task from High API server request latencies (LIST) for istio API groups to High API server request latencies (LIST). Mar 24 2022, 3:26 PM
JMeybohm updated the task description. (Show Details)

Reopening, as this is not resolved and will most likely hit us again until we update istio/cert-manager/kubernetes.

JMeybohm triaged this task as Medium priority. Mar 24 2022, 3:28 PM