
Migrate kubernetes alerts away from icinga
Closed, ResolvedPublic

Description

At least the apiserver_request_latencies_summary metric has been turned off with k8s 1.17. We should take the chance to revisit the need for the recording rules in modules/profile/files/prometheus/rules_k8s.yml and migrate the monitoring::check_prometheus alerts from profile::kubernetes::master to prometheus alerting rules.

The following alerts are based on check_prometheus and will benefit from being ported to alertmanager / alerts.git (cf. the docs at https://wikitech.wikimedia.org/wiki/Alertmanager#Create_alerts, and really the whole page).
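
A rule in alerts.git is plain Prometheus alerting-rule YAML. As a rough sketch of the target shape (alert name, labels and values here are purely illustrative; the wikitech page above documents the conventions alerts.git actually expects):

  groups:
    - name: kubernetes
      rules:
        - alert: ExampleKubernetesAlert     # hypothetical name
          expr: some_metric > 1             # any PromQL expression
          for: 5m                           # how long the condition must hold before firing
          labels:
            team: sre                       # routing/severity labels, see the docs above
            severity: warning
          annotations:
            summary: "Short human-readable description"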

instance_operation_type:kubelet_runtime_operations_latency_microseconds:avg5m{instance=\"${::fqdn}\"}

Recording rule:

sum by(instance) (rate(kubelet_runtime_operations_latency_microseconds_sum{job="k8s-node", operation_type!="exec_sync"}[5m]))
  / sum by(instance) (rate(kubelet_runtime_operations_latency_microseconds_count{job="k8s-node", operation_type!="exec_sync"}[5m]))

For: 5m
Thresholds: W: 0.4s, C: 0.85s
New metric name: kubelet_runtime_operations_duration_seconds
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830228
Note: I don't understand why exec_sync is excluded here as that does not go above 240ms over the last 90 days.
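
A direct port could keep the average-based expression of the old recording rule, just with the new metric name; a sketch (thresholds taken from above, job label assumed unchanged; the actual patch linked above may well differ):

  - alert: KubeletOperationsLatency     # hypothetical name
    expr: >
      sum by (instance) (rate(kubelet_runtime_operations_duration_seconds_sum{job="k8s-node"}[5m]))
        / sum by (instance) (rate(kubelet_runtime_operations_duration_seconds_count{job="k8s-node"}[5m]))
      > 0.85    # critical threshold from above; 0.4 for warning. exec_sync exclusion dropped, see note.
    for: 5m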

scalar(sum(rate(apiserver_request_count{instance=\"${::ipaddress}:6443\"}[5m])))

For: 5m
Thresholds: W: 50, C: 100
New metric name: apiserver_request_total
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830624
Note: I'm not sure this makes sense at all as an alert. mlserve is constantly in warning as that's around 70 req/s - and that does not even sound like much. I would suggest we instead alert for an elevated error rate.

  • Replace usage in Grafana
  • Replace usage in monitoring::check_prometheus: modules/profile/manifests/kubernetes/master.pp
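
A sketch of what an error-rate based alert could look like instead, using the code label on apiserver_request_total (threshold and grouping are illustrative, not necessarily what the linked change does):

  - alert: KubernetesAPIErrorRate     # hypothetical name
    expr: >
      sum by (instance) (rate(apiserver_request_total{job="k8s-api", code=~"5.."}[5m]))
        / sum by (instance) (rate(apiserver_request_total{job="k8s-api"}[5m]))
      > 0.05    # e.g. more than 5% of requests failing
    for: 5m
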
instance_verb:apiserver_request_latencies_summary:avg5m{verb\\!~\"(CONNECT|WATCH|WATCHLIST)\",instance=\"${::ipaddress}:6443\"}

Recording rule:

sum by(instance, verb) (rate(apiserver_request_latencies_summary_sum{job="k8s-api"}[5m]))
  / sum by(instance, verb) (rate(apiserver_request_latencies_summary_count{job="k8s-api"}[5m]))

For: 5m
Thresholds: W: 0.2, C: 0.3
New metric name: apiserver_request_duration_seconds
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830637

  • Replace usage in Grafana
  • Replace usage in monitoring::check_prometheus: modules/profile/manifests/kubernetes/master.pp
  • Remove recording rule from modules/profile/files/prometheus/rules_k8s.yml
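
Since apiserver_request_duration_seconds is a histogram, the port can alert on a latency quantile rather than the old average; a sketch along those lines (the expression, quantile and grouping are assumptions, only the alert name matches the alerts seen later in this task):

  - alert: KubernetesAPILatency
    expr: >
      histogram_quantile(0.99,
        sum by (verb, resource, le) (
          rate(apiserver_request_duration_seconds_bucket{job="k8s-api", verb!~"CONNECT|WATCH|WATCHLIST"}[5m])))
      > 0.3    # critical threshold from above; 0.2 for warning
    for: 5m
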
instance_operation:etcd_request_latencies_summary:avg5m{instance=\"${::ipaddress}:6443\"}

Recording rule:

sum by(instance, operation) (rate(etcd_request_latencies_summary_sum{job="k8s-api"}[5m]))
  / sum by(instance, operation) (rate(etcd_request_latencies_summary_count{job="k8s-api"}[5m]))

For: 5m
Thresholds: W: 0.3, C: 0.5
New metric name: etcd_request_duration_seconds
Suggestion: I'm inclined to drop this from alerting altogether. apiserver_request_duration_seconds is the "user facing" metric here and it ultimately includes etcd_request_duration_seconds.

  • Replace usage in Grafana
  • Remove from monitoring::check_prometheus: modules/profile/manifests/kubernetes/master.pp
  • Remove recording rule from modules/profile/files/prometheus/rules_k8s.yml

Event Timeline

JMeybohm renamed this task from Migrate alerts away from icinga to Migrate kubernetes apiserver alerts away from icinga. Jul 5 2022, 1:47 PM
JMeybohm renamed this task from Migrate kubernetes apiserver alerts away from icinga to Migrate kubernetes alerts away from icinga. Jul 11 2022, 1:48 PM
JMeybohm updated the task description.
JMeybohm added subscribers: fgiunchedi, lmata.

I will take a look at this now as we need to review/refactor the alerts as part of T303184: High API server request latencies (LIST) anyways.

Change 830228 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Alert on high latency of kubelet operations

https://gerrit.wikimedia.org/r/830228

Change 830624 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Alert on high Kubernetes API error rate

https://gerrit.wikimedia.org/r/830624

Change 830637 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Alert on high Kubernetes API latency

https://gerrit.wikimedia.org/r/830637

> Note: I don't understand why exec_sync is excluded here as that does not go above 240ms over the last 90 days.

This goes back to an old eventgate-related readinessProbe alert. We excluded exec_sync so we would not have to bump the thresholds. The thresholds have since been bumped, so the exclusion no longer applies (and has not for a long while).

> Note: I'm not sure this makes sense at all as an alert. mlserve is constantly in warning as that's around 70 req/s - and that does not even sound like much. I would suggest we instead alert for an elevated error rate.

+1 on the suggestion. No point in alerting for what should organically increase as the platform is adopted more.

> Suggestion: I'm inclined to drop this from alerting altogether. apiserver_request_duration_seconds is the "user facing" metric here and it ultimately includes etcd_request_duration_seconds.

+1

Change 830643 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes: Remove obsolete monitoring::check_prometheus resources

https://gerrit.wikimedia.org/r/830643

Change 830644 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] prometheus: Remove obsolete recording rules

https://gerrit.wikimedia.org/r/830644

Change 830228 merged by jenkins-bot:

[operations/alerts@master] Alert on high latency of kubelet operations

https://gerrit.wikimedia.org/r/830228

Change 830624 merged by jenkins-bot:

[operations/alerts@master] Alert on high Kubernetes API error rate

https://gerrit.wikimedia.org/r/830624

Change 830637 merged by jenkins-bot:

[operations/alerts@master] Alert on high Kubernetes API latency

https://gerrit.wikimedia.org/r/830637

Change 830643 merged by JMeybohm:

[operations/puppet@production] kubernetes: Remove obsolete monitoring::check_prometheus resources

https://gerrit.wikimedia.org/r/830643

Change 830644 merged by JMeybohm:

[operations/puppet@production] prometheus: Remove obsolete recording rules

https://gerrit.wikimedia.org/r/830644

JMeybohm updated the task description.

Unfortunately KubernetesAPILatency fires from time to time. Ordered by cluster and timestamp:

[2022-09-13 14:09:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw
[2022-09-13 13:21:28] <jinxer-wm> (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw
[2022-09-13 13:14:13] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw
[2022-09-13 13:08:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw
[2022-09-13 12:26:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw
[2022-09-13 11:12:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw
[2022-09-13 11:06:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw
[2022-09-13 10:43:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw
[2022-09-13 08:32:58] <jinxer-wm> (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw
[2022-09-13 08:27:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw
[2022-09-13 03:09:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw

[2022-09-13 10:31:58] <jinxer-wm> (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST jobs) on k8s@codfw
[2022-09-13 10:26:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s@codfw
[2022-09-12 00:12:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw

[2022-09-12 12:46:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s@eqiad
[2022-09-11 21:18:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s@eqiad
[2022-09-10 13:37:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad
[2022-09-10 09:16:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad

On k8s-mlstaging@codfw there is probably something going on; I have not yet figured out what.
The k8s@codfw alerts coincide with a cert-manager reconcile run taking much longer than usual (due to heavy throttling, I suppose) plus an OOM kill of the cert-manager cainjector (following leader election etc.).
I have not checked k8s@eqiad in detail.

As the spikes are pretty high (on the order of several seconds), raising the duration threshold is probably not a good idea. So we could:

  • Increase the evaluation period (from 5m to 10m?)
  • Use p95 instead of p99 for alerting

Either of these changes would have avoided the above alerts (including most of the alerts for ml-staging, but not all). Opinions?
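
For reference, the second option is effectively a one-token change to the expression, while the first would only touch the for: clause; a sketch of the p95 variant, based on the assumed expression from the task description (the real alert and its thresholds live in operations/alerts):

  expr: >
    histogram_quantile(0.95,    # lowered from 0.99
      sum by (verb, resource, le) (
        rate(apiserver_request_duration_seconds_bucket{job="k8s-api", verb!~"CONNECT|WATCH|WATCHLIST"}[5m])))
    > 0.3
  for: 5m    # raising this to 10m would be the other option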

Change 835637 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Use p95 instead of p99 for KubernetesAPILatency alerts

https://gerrit.wikimedia.org/r/835637

Changed the alerts from using p99 to using p95, resolving this again.

Change 835637 merged by jenkins-bot:

[operations/alerts@master] Use p95 instead of p99 for KubernetesAPILatency alerts

https://gerrit.wikimedia.org/r/835637

Change 840837 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Bump threshold for LIST secrets from 1.7s to 2s

https://gerrit.wikimedia.org/r/840837

Change 840837 merged by jenkins-bot:

[operations/alerts@master] Bump threshold for LIST secrets from 1.7s to 2s

https://gerrit.wikimedia.org/r/840837