
High API server request latencies (LIST)
Closed, Resolved (Public)

Description

[07.03.22 14:42] <icinga-wm> PROBLEM - k8s API server requests latencies on kubemaster1001 is CRITICAL: instance=10.64.0.117 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27

List calls to resources in istio API groups report high latency: Grafana

kubemaster1001 started tracing actions taking longer than 3s:

Mar  7 14:24:58 kubemaster1001 kube-apiserver[493]: I0307 14:24:58.398676     493 trace.go:116] Trace[1185775778]: "List" url:/api/v1/secrets (started: 2022-03-07 14:24:57.669915292 +0000 UTC m=+3303833.739207327) (total time: 728.713224ms):
Mar  7 14:24:58 kubemaster1001 kube-apiserver[493]: Trace[1185775778]: [728.712147ms] [727.166459ms] Writing http response done count:372
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.349581     493 trace.go:116] Trace[659950693]: "cacher list" type:*unstructured.Unstructured (started: 2022-03-07 14:24:57.349263071 +0000 UTC m=+3303833.418555111) (total time: 3.00024998s):
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[659950693]: [3.00024998s] [3.000186434s] END
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.355050     493 trace.go:116] Trace[2009449441]: "List" url:/apis/networking.istio.io/v1alpha3/workloadgroups (started: 2022-03-07 14:24:57.348984747 +0000 UTC m=+3303833.418276786) (total time: 3.006000333s):
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[2009449441]: [3.006000333s] [3.005816051s] END
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.400609     493 trace.go:116] Trace[824252976]: "cacher list" type:*unstructured.Unstructured (started: 2022-03-07 14:24:57.400227197 +0000 UTC m=+3303833.469519183) (total time: 3.000317636s):
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[824252976]: [3.000317636s] [3.000309191s] END
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.402158     493 trace.go:116] Trace[1125381388]: "List" url:/apis/security.istio.io/v1beta1/requestauthentications (started: 2022-03-07 14:24:57.400172745 +0000 UTC m=+3303833.469464718) (total time: 3.001942141s):
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[1125381388]: [3.001942141s] [3.001896904s] END
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.490035     493 trace.go:116] Trace[520237178]: "cacher list" type:*unstructured.Unstructured (started: 2022-03-07 14:24:57.489798833 +0000 UTC m=+3303833.559090809) (total time: 3.000178657s):
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[520237178]: [3.000178657s] [3.000172598s] END
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: I0307 14:25:00.491527     493 trace.go:116] Trace[718436094]: "List" url:/apis/networking.istio.io/v1alpha3/sidecars (started: 2022-03-07 14:24:57.489718219 +0000 UTC m=+3303833.559010194) (total time: 3.001774056s):
Mar  7 14:25:00 kubemaster1001 kube-apiserver[493]: Trace[718436094]: [3.001774056s] [3.001713145s] END

kubemaster1002 logged a denied admission webhook request:

Mar  7 14:26:26 kubemaster1002 kube-apiserver[550]: W0307 14:26:26.355945     550 dispatcher.go:141] rejected by webhook "validation.istio.io": &errors.StatusError{ErrStatus:v1.Status{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ListMeta:v1.ListMeta{SelfLink:"", ResourceVersion:"", Continue:"", RemainingItemCount:(*int64)(nil)}, Status:"Failure", Message:"admission webhook \"validation.istio.io\" denied the request: configuration is invalid: gateway must have at least one server", Reason:"", Details:(*v1.StatusDetails)(nil), Code:400}}

We had a similar issue again today (2022-03-24), this time caused by LIST operations for objects from cert-manager API groups. Like the last time, it started due to a connectivity issue to the apiserver (this time from a cert-manager pod, rather than istio):

E0324 13:40:02.622643       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Certificate: Get "https://kubernetes.default.svc.cluster.local:443/apis/cert-manager.io/v1/certificates?allowWatchBookmarks=true&resourceVersion=178338902&timeout=7m25s&timeoutSeconds=445&watch=true": dial tcp: lookup kubernetes.default.svc.cluster.local: Temporary failure in name resolution
...
E0324 13:40:02.622880       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Challenge: Get "https://kubernetes.default.svc.cluster.local:443/apis/acme.cert-manager.io/v1/challenges?allowWatchBookmarks=true&resourceVersion=179062568&timeout=7m18s&timeoutSeconds=438&watch=true": dial tcp: lookup kubernetes.default.svc.cluster.local: Temporary failure in name resolution
I0324 13:40:47.245159       1 trace.go:205] Trace[155471295]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167 (24-Mar-2022 13:40:03.534) (total time: 43710ms):
Trace[155471295]: [43.710232348s] [43.710232348s] END
E0324 13:40:47.245215       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167: Failed to watch *v1.Order: failed to list *v1.Order: Timeout: Too large resource version: 179062547, current: 173837174
I0324 13:40:47.439111       1 trace.go:205] Trace[2126438358]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.21.3/tools/cache/reflector.go:167 (24-Mar-2022 13:40:03.703) (total time: 43735ms):
Trace[2126438358]: [43.735780788s] [43.735780788s] END

The "Timeout: Too large resource version" messages then kept appearing over and over, which made me suspicious as I had seen them the last time (coming from istio) as well. Unfortunately they never made it into the task back then and are only hidden behind a logstash link. Going from that similarity I went looking again and found Kubelet gets "Timeout: Too large resource version" error from the API server after network outage #91073, which seems to have been mitigated in k8s clients >=1.19 (Fix bug in reflector not recovering from "Too large resource version"… #92537). The following KEP was also created (I have not looked at it in detail) to improve the situation on the apiserver side: KEP: Efficient watch resumption after kube-apiserver reboot #1878
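For context, a minimal sketch of the recovery behaviour that fix introduces in clients >=1.19 (falling back to a LIST with an unset resourceVersion, i.e. a quorum read, when the cached resourceVersion is ahead of what this apiserver can serve). This is illustrative client-go-style code around a hypothetical listSecretsWithFallback helper, not the actual reflector implementation:

package listfallback

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listSecretsWithFallback first lists relative to a resourceVersion we already know
// and falls back to a quorum read if that resourceVersion is too new for this apiserver.
func listSecretsWithFallback(ctx context.Context, cs kubernetes.Interface, lastSeenRV string) (*corev1.SecretList, error) {
	list, err := cs.CoreV1().Secrets("").List(ctx, metav1.ListOptions{ResourceVersion: lastSeenRV})
	if err == nil {
		return list, nil
	}
	// Older clients kept retrying the same call here and got stuck on
	// "Timeout: Too large resource version". Falling back to an unset
	// resourceVersion forces a consistent read from etcd instead.
	if apierrors.IsTimeout(err) || apierrors.IsResourceExpired(err) {
		fmt.Printf("falling back to quorum LIST: %v\n", err)
		return cs.CoreV1().Secrets("").List(ctx, metav1.ListOptions{ResourceVersion: ""})
	}
	return nil, err
}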

Follow ups:

Event Timeline

Two interesting things logged on one of the istiod pods:

{"level":"error","time":"2022-03-07T14:24:56.360678Z","scope":"klog","msg":"error retrieving resource lock istio-system/istio-leader: Get \"https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/istio-system/configmaps/istio-leader\": http2: client connection lost"}
{"level":"info","time":"2022-03-07T14:24:56.361427Z","scope":"klog","msg":"k8s.io/client-go@v0.20.1/tools/cache/reflector.go:167: watch of *v1.Namespace ended with: an error on the server (\"unable to decode an event from the watch stream: http2: client connection lost\") has prevented the request from succeeding"}
[...last one repeated a lot of times...]

It appears as if the istiod control plane was stuck in some corner case; maybe this is a known bug that can be triggered when connectivity between the control plane and the kube-api is lost?

Logstash link for that: https://logstash.wikimedia.org/goto/84547de9db1985533f571171075fa88a

From there it seems as if istiod-69d679d8b5-hm64j went sideways, and istiod-69d679d8b5-m6klg lost the leader election.

Mar 7, 2022 @ 14:31:33.534 istiod-69d679d8b5-m6klg
ADS: "10.64.71.36:41866" router~10.64.71.36~istio-ingressgateway-n6cnv.istio-system~istio-system.svc.cluster.local-3742 terminated rpc error: code = Canceled desc = context canceled
Mar 7, 2022 @ 14:26:22.150 istiod-69d679d8b5-m6klg
Failed to release lock: resource name may not be empty
Mar 7, 2022 @ 14:26:22.107 istiod-69d679d8b5-m6klg
Leader election cycle 1 lost. Trying again
Mar 7, 2022 @ 14:26:22.106 istiod-69d679d8b5-m6klg
error retrieving resource lock istio-system/istio-validation-controller-election: Get "https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/istio-system/configmaps/istio-validation-controller-election": context deadline exceeded
Mar 7, 2022 @ 14:26:22.076 istiod-69d679d8b5-m6klg
Leader election cycle 1 lost. Trying again
Mar 7, 2022 @ 14:26:22.038 istiod-69d679d8b5-m6klg
Error waiting for ingress
Mar 7, 2022 @ 14:26:22.037 istiod-69d679d8b5-m6klg
Failed to release lock: resource name may not be empty
Mar 7, 2022 @ 14:26:22.033 istiod-69d679d8b5-m6klg
error retrieving resource lock istio-system/istio-leader: Get "https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/istio-system/configmaps/istio-leader": context deadline exceeded
Mar 7, 2022 @ 14:26:21.370 istiod-69d679d8b5-m6klg
Leader election cycle 1 lost. Trying again
Mar 7, 2022 @ 14:26:21.370 istiod-69d679d8b5-m6klg
error retrieving resource lock istio-system/istio-namespace-controller-election: Get "https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/istio-system/configmaps/istio-namespace-controller-election": context deadline exceeded
Mar 7, 2022 @ 14:26:21.328 istiod-69d679d8b5-m6klg
Failed to release lock: resource name may not be empty

But the election config maps state that istiod-69d679d8b5-m6klg has been the holder since 2022-02-23T12:24:06Z (if I'm not misinterpreting acquireTime).

Mentioned in SAL (#wikimedia-operations) [2022-03-07T15:56:48Z] <jayme> eqiad: kubectl -n istio-system delete po istiod-69d679d8b5-hm64j - T303184

Killing istiod-69d679d8b5-hm64j actually brought the latency down again

Unfortunately I don't see istio pilot metrics exposing the issue.

I will resolve this, although I'm not sure what the cause of this was. The webhook rejection is a red herring because the istio pilot (webhook) was not able to list the endpoints at that time.
Visibility of this issue was okay (we got alerted via the k8s API latency) and service was never interrupted (e.g. the ingressgateway was happy all the time).

JMeybohm renamed this task from High API server request latencies (LIST) for istio API groups to High API server request latencies (LIST).Mar 24 2022, 3:26 PM
JMeybohm updated the task description. (Show Details)

Reopening as this is not resolved and will most likely hit us again until we update istio/cert-manager/kubernetes.

JMeybohm triaged this task as Medium priority.Mar 24 2022, 3:28 PM

Started looking into this again because we saw elevated LIST latencies once more, starting from 2022-08-31 ~22:20Z in codfw and 2022-09-01 02:55Z in eqiad.

There are multiple alerts:

  • Elevated etcd request latencies (as seen by the kubernetes-apiserver): this is still <50ms on avg5m, which I guess is basically okay, but it's still more than double what we usually have: Grafana
  • Elevated Kubernetes API request duration (LIST) avg5m: this jumps from ~150ms to ~1s: Grafana

The prime suspect for this is helm-state-metrics (T310714: Detect and alert on helm releases in unclean state), which does a LIST call for all helm secrets in the cluster. Those secrets are rather big (as they contain a tgz of the helm release as base64) and need to be at least unmarshalled and extracted by the client, which takes some time (~5s p99 measured at the client; a rough sketch of such a call follows the log excerpt below). The kubernetes-apiserver logs confirm those calls taking ~1.3s measured on the server side:

Sep  2 16:24:38 kubemaster1002 kube-apiserver[437]: I0902 16:24:38.235643     437 trace.go:116] Trace[386325264]: "List" url:/api/v1/secrets (started: 2022-09-02 16:24:36.866868481 +0000 UTC m=+960628.458305335) (total time: 1.368709147s):
---
Sep  2 14:55:02 kubemaster1001 kube-apiserver[295]: I0902 14:55:02.542797     295 trace.go:116] Trace[1583766295]: "List" url:/api/v1/secrets (started: 2022-09-02 14:55:01.395807851 +0000 UTC m=+956078.810277433) (total time: 1.146926633s):
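For illustration, this is roughly the kind of call helm-state-metrics issues. The sketch below assumes the helm v3 storage format (a base64-encoded, gzipped JSON blob under the "release" key of each secret) and is not the actual helm-state-metrics code:

package main

import (
	"bytes"
	"compress/gzip"
	"context"
	"encoding/base64"
	"fmt"
	"io"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(config)

	start := time.Now()
	// One LIST across all namespaces: every helm release secret is transferred
	// and unmarshalled in this single call.
	secrets, err := cs.CoreV1().Secrets("").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "owner=helm"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("LIST of %d helm secrets took %s\n", len(secrets.Items), time.Since(start))

	for _, s := range secrets.Items {
		// Secret data is already base64-decoded by client-go; the helm payload is
		// base64-encoded gzipped JSON on top of that.
		raw, err := base64.StdEncoding.DecodeString(string(s.Data["release"]))
		if err != nil {
			continue
		}
		zr, err := gzip.NewReader(bytes.NewReader(raw))
		if err != nil {
			continue
		}
		decoded, _ := io.ReadAll(zr)
		zr.Close()
		fmt.Printf("%s/%s: %d bytes of release JSON\n", s.Namespace, s.Name, len(decoded))
	}
}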

While this is unarguably a long time, I think it's not really an issue as the kubernetes-apiserver still responds quickly to other LIST calls. Interestingly, these big spikes of avg5m only happen when helm-state-metrics is connected to kubemaster[1-2]002. The *001 equivalents, while reporting similar times in syslog, still report an avg5m of ~150ms.
Utilization and saturation on both servers are similar, and the same goes for the etcd backends. This is reproducible by switching helm-state-metrics back and forth between kubemaster*001 and *002, which can be seen in https://w.wiki/5f3T

FTR:

root@deploy1002:~# kube_env admin eqiad
root@deploy1002:~# time kubectl get secrets -A -l owner=helm -o yaml >/dev/null

real    0m4.919s
user    0m4.750s
sys     0m0.333s
root@deploy1002:~# time helm list -A > /dev/null

real    0m2.426s
user    0m2.155s
sys     0m0.169s
root@deploy1002:~# kube_env admin codfw
root@deploy1002:~# time kubectl get secrets -A -l owner=helm -o yaml >/dev/null

real    0m5.166s
user    0m4.640s
sys     0m0.308s
root@deploy1002:~# time helm list -A > /dev/null

real    0m2.611s
user    0m2.263s
sys     0m0.157s

So the culprit here is more the way we use/sum up the apiserver_request_latencies metric. The reason for the big spike in avg5m on specific api servers is that one of them serves the kube-controller-manager leader, which issues a baseline of ~0.258 LIST req/s; on the other api server the LIST baseline is only ~0.02 req/s. So when helm-state-metrics comes around and adds another 0.033 LIST req/s, this has way more impact on the avg5m of the server that is not serving the kube-controller-manager leader (see the rough numbers below).
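To illustrate the dilution effect with rough numbers (the ~20ms baseline LIST latency is an assumption purely for illustration; the ~1.3s for the helm secret LISTs is taken from the traces above), the request-weighted average works out to roughly:

apiserver serving the kube-controller-manager leader:
  (0.258 req/s * 0.02 s + 0.033 req/s * 1.3 s) / (0.258 + 0.033) req/s ≈ 0.17 s
other apiserver:
  (0.020 req/s * 0.02 s + 0.033 req/s * 1.3 s) / (0.020 + 0.033) req/s ≈ 0.82 s

which is in the same ballpark as the observed ~150ms vs. ~1s avg5m.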

So the appropriate change is to no longer use an avg5m over all LIST requests, as the latency differs quite a bit per api resource/object type. Instead I think we should use the p99 per resource. Not exactly sure about proper thresholds, though. I will refactor the metrics we alert on as part of T311251: Migrate kubernetes alerts away from icinga - we need to touch that anyways because the metrics we currently use will be replaced in upcoming k8s versions.

I've also added a short TTL cache to helm-state-metrics so that the two scrapes per minute (from our two prometheus instances) no longer issue two k8s API calls.
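A minimal sketch of that TTL-cache idea (the type name and the TTL handling are illustrative, not the actual helm-state-metrics implementation):

package exporter

import (
	"context"
	"sync"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cachedSecretLister serves repeated scrapes from one LIST result until the TTL expires.
type cachedSecretLister struct {
	cs  kubernetes.Interface
	ttl time.Duration

	mu      sync.Mutex
	fetched time.Time
	cached  *corev1.SecretList
}

// List returns the cached result if it is younger than ttl, otherwise it issues a
// fresh LIST for the helm release secrets against the apiserver.
func (c *cachedSecretLister) List(ctx context.Context) (*corev1.SecretList, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.cached != nil && time.Since(c.fetched) < c.ttl {
		return c.cached, nil
	}
	list, err := c.cs.CoreV1().Secrets("").List(ctx, metav1.ListOptions{LabelSelector: "owner=helm"})
	if err != nil {
		return nil, err
	}
	c.cached, c.fetched = list, time.Now()
	return list, nil
}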

Still keeping this open until updated k8s client libraries are used in cert-manager and istio to fix the "Too large resource version" issue.

I'm going to resolve this as we're updating all of the said clients with T307943: Update Kubernetes clusters to v1.23

Mentioned in SAL (#wikimedia-operations) [2022-12-15T10:34:08Z] <jayme> restarted istiod pods in aux-k8s because of T303184