
Migrate kubernetes alerts away from icinga
Closed, ResolvedPublic

Description

At least the apiserver_request_latencies_summary metric has been turned off with k8s 1.17. We should take the chance to revisit the need for the recording rules in modules/profile/files/prometheus/rules_k8s.yml and migrate the monitoring::check_prometheus alerts from profile::kubernetes::master to prometheus alerting rules.

The following alerts are based on check_prometheus and will benefit from being ported to alertmanager / alerts.git (cf. the docs at https://wikitech.wikimedia.org/wiki/Alertmanager#Create_alerts, and really the whole page).
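
A rule in alerts.git is plain Prometheus alerting-rule YAML. As a rough sketch of the target shape (alert name, labels and values here are purely illustrative; the wikitech page above documents the conventions alerts.git actually expects):

  groups:
    - name: kubernetes
      rules:
        - alert: ExampleKubernetesAlert     # hypothetical name
          expr: some_metric > 1             # any PromQL expression
          for: 5m                           # how long the condition must hold before firing
          labels:
            team: sre                       # routing/severity labels, see the docs above
            severity: warning
          annotations:
            summary: "Short human-readable description"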

instance_operation_type:kubelet_runtime_operations_latency_microseconds:avg5m{instance=\"${::fqdn}\"}

Recording rule:

sum by(instance) (rate(kubelet_runtime_operations_latency_microseconds_sum{job="k8s-node", operation_type!="exec_sync"}[5m]))
  / sum by(instance) (rate(kubelet_runtime_operations_latency_microseconds_count{job="k8s-node", operation_type!="exec_sync"}[5m]))

For: 5m
Thresholds: W: 0.4s, C: 0.85s
New metric name: kubelet_runtime_operations_duration_seconds
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830228
Note: I don't understand why exec_sync is excluded here as that does not go above 240ms over the last 90 days.
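
A direct port could keep the average-based expression of the old recording rule, just with the new metric name; a sketch (thresholds taken from above, job label assumed unchanged; the actual patch linked above may well differ):

  - alert: KubeletOperationsLatency     # hypothetical name
    expr: >
      sum by (instance) (rate(kubelet_runtime_operations_duration_seconds_sum{job="k8s-node"}[5m]))
        / sum by (instance) (rate(kubelet_runtime_operations_duration_seconds_count{job="k8s-node"}[5m]))
      > 0.85    # critical threshold from above; 0.4 for warning. exec_sync exclusion dropped, see note.
    for: 5m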

scalar(sum(rate(apiserver_request_count{instance=\"${::ipaddress}:6443\"}[5m])))

For: 5m
Thresholds: W: 50, C: 100
New metric name: apiserver_request_total
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830624
Note: I'm not sure this makes sense at all as an alert. mlserve is constantly in warning as that's around 70 req/s - and that does not even sound like much. I would suggest we instead alert for an elevated error rate.

  • Replace usage in Grafana
  • Replace usage in monitoring::check_prometheus: modules/profile/manifests/kubernetes/master.pp
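
A sketch of what an error-rate based alert could look like instead, using the code label on apiserver_request_total (threshold and grouping are illustrative, not necessarily what the linked change does):

  - alert: KubernetesAPIErrorRate     # hypothetical name
    expr: >
      sum by (instance) (rate(apiserver_request_total{job="k8s-api", code=~"5.."}[5m]))
        / sum by (instance) (rate(apiserver_request_total{job="k8s-api"}[5m]))
      > 0.05    # e.g. more than 5% of requests failing
    for: 5m
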
instance_verb:apiserver_request_latencies_summary:avg5m{verb\\!~\"(CONNECT|WATCH|WATCHLIST)\",instance=\"${::ipaddress}:6443\"}

Recording rule:

sum by(instance, verb) (rate(apiserver_request_latencies_summary_sum{job="k8s-api"}[5m]))
  / sum by(instance, verb) (rate(apiserver_request_latencies_summary_count{job="k8s-api"}[5m]))

For: 5m
Thresholds: W: 0.2, C: 0.3
New metric name: apiserver_request_duration_seconds
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830637

  • Replace usage in Grafana
  • Replace usage in monitoring::check_prometheus: modules/profile/manifests/kubernetes/master.pp
  • Remove recording rule from modules/profile/files/prometheus/rules_k8s.yml
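
Since apiserver_request_duration_seconds is a histogram, the port can alert on a latency quantile rather than the old average; a sketch along those lines (the expression, quantile and grouping are assumptions, only the alert name matches the alerts seen later in this task):

  - alert: KubernetesAPILatency
    expr: >
      histogram_quantile(0.99,
        sum by (verb, resource, le) (
          rate(apiserver_request_duration_seconds_bucket{job="k8s-api", verb!~"CONNECT|WATCH|WATCHLIST"}[5m])))
      > 0.3    # critical threshold from above; 0.2 for warning
    for: 5m
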
instance_operation:etcd_request_latencies_summary:avg5m{instance=\"${::ipaddress}:6443\"}

Recording rule:

sum by(instance, operation) (rate(etcd_request_latencies_summary_sum{job="k8s-api"}[5m]))
  / sum by(instance, operation) (rate(etcd_request_latencies_summary_count{job="k8s-api"}[5m]))

For: 5m
Thresholds: W: 0.3, C: 0.5
New metric name: etcd_request_duration_seconds
Suggestion: I'm inclined to drop this from alerting altogether. apiserver_request_duration_seconds is the "user facing" metric here and it ultimately includes etcd_request_duration_seconds.

  • Replace usage in Grafana
  • Remove from monitoring::check_prometheus: modules/profile/manifests/kubernetes/master.pp
  • Remove recording rule from modules/profile/files/prometheus/rules_k8s.yml

Event Timeline

JMeybohm renamed this task from Migrate alerts away from icinga to Migrate kubernetes apiserver alerts away from icinga. Jul 5 2022, 1:47 PM
JMeybohm renamed this task from Migrate kubernetes apiserver alerts away from icinga to Migrate kubernetes alerts away from icinga. Jul 11 2022, 1:48 PM
JMeybohm updated the task description.
JMeybohm added subscribers: fgiunchedi, lmata.

I will take a look at this now as we need to review/refactor the alerts as part of T303184: High API server request latencies (LIST) anyways.

Change 830228 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Alert on high latency of kubelet operations

https://gerrit.wikimedia.org/r/830228

Change 830624 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Alert on high Kubernetes API error rate

https://gerrit.wikimedia.org/r/830624

Change 830637 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Alert on high Kubernetes API latency

https://gerrit.wikimedia.org/r/830637

> Note: I don't understand why exec_sync is excluded here as that does not go above 240ms over the last 90 days.

This goes back to an old eventgate-related readinessProbe alert. We excluded exec_sync so we would not have to bump the thresholds. The thresholds have since been bumped, so the exclusion no longer applies (and has not for a long while).

> Note: I'm not sure this makes sense at all as an alert. mlserve is constantly in warning as that's around 70 req/s - and that does not even sound like much. I would suggest we instead alert for an elevated error rate.

+1 on the suggestion. No point in alerting for what should organically increase as the platform is adopted more.

> Suggestion: I'm inclined to drop this from alerting altogether. apiserver_request_duration_seconds is the "user facing" metric here and it ultimately includes etcd_request_duration_seconds.

+1

Change 830643 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] kubernetes: Remove obsolete monitoring::check_prometheus resources

https://gerrit.wikimedia.org/r/830643

Change 830644 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] prometheus: Remove obsolete recording rules

https://gerrit.wikimedia.org/r/830644

Change 830228 merged by jenkins-bot:

[operations/alerts@master] Alert on high latency of kubelet operations

https://gerrit.wikimedia.org/r/830228

Change 830624 merged by jenkins-bot:

[operations/alerts@master] Alert on high Kubernetes API error rate

https://gerrit.wikimedia.org/r/830624

Change 830637 merged by jenkins-bot:

[operations/alerts@master] Alert on high Kubernetes API latency

https://gerrit.wikimedia.org/r/830637

Change 830643 merged by JMeybohm:

[operations/puppet@production] kubernetes: Remove obsolete monitoring::check_prometheus resources

https://gerrit.wikimedia.org/r/830643

Change 830644 merged by JMeybohm:

[operations/puppet@production] prometheus: Remove obsolete recording rules

https://gerrit.wikimedia.org/r/830644

JMeybohm updated the task description.

Unfortunately KubernetesAPILatency fires from time to time. Ordered by cluster and timestamp:

[2022-09-13 14:09:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw
[2022-09-13 13:21:28] <jinxer-wm> (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw
[2022-09-13 13:14:13] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw
[2022-09-13 13:08:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s-mlstaging@codfw
[2022-09-13 12:26:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw
[2022-09-13 11:12:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw
[2022-09-13 11:06:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw
[2022-09-13 10:43:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET deployments) on k8s-mlstaging@codfw
[2022-09-13 08:32:58] <jinxer-wm> (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET namespaces) on k8s-mlstaging@codfw
[2022-09-13 08:27:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw
[2022-09-13 03:09:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET services) on k8s-mlstaging@codfw

[2022-09-13 10:31:58] <jinxer-wm> (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST jobs) on k8s@codfw
[2022-09-13 10:26:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on k8s@codfw
[2022-09-12 00:12:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw

[2022-09-12 12:46:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s@eqiad
[2022-09-11 21:18:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (GET namespaces) on k8s@eqiad
[2022-09-10 13:37:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad
[2022-09-10 09:16:58] <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad

On k8s-mlstaging@codfw there is probably something going on; I have not yet figured out what.
The k8s@codfw alerts coincide with a cert-manager reconcile run taking much longer than usual (due to heavy throttling, I suppose) plus an OOM kill of the cert-manager cainjector (following leader election etc.).
I have not checked k8s@eqiad in detail.

As the spikes are pretty high (on the order of several seconds), raising the duration threshold is probably not a good idea. So we could:

  • Increase the evaluation period (from 5m to 10m?)
  • Use p95 instead of p99 for alerting

Either of these changes would have avoided the above alerts (including most of the alerts for ml-staging, but not all). Opinions?
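
For reference, the second option is effectively a one-token change to the expression, while the first would only touch the for: clause; a sketch of the p95 variant, based on the assumed expression from the task description (the real alert and its thresholds live in operations/alerts):

  expr: >
    histogram_quantile(0.95,    # lowered from 0.99
      sum by (verb, resource, le) (
        rate(apiserver_request_duration_seconds_bucket{job="k8s-api", verb!~"CONNECT|WATCH|WATCHLIST"}[5m])))
    > 0.3
  for: 5m    # raising this to 10m would be the other option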

Change 835637 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Use p95 instead of p99 for KubernetesAPILatency alerts

https://gerrit.wikimedia.org/r/835637

Changed the alerts from using p99 to using p95, resolving this again.

Change 835637 merged by jenkins-bot:

[operations/alerts@master] Use p95 instead of p99 for KubernetesAPILatency alerts

https://gerrit.wikimedia.org/r/835637

Change 840837 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/alerts@master] Bump threshold for LIST secrets from 1.7s to 2s

https://gerrit.wikimedia.org/r/840837

Change 840837 merged by jenkins-bot:

[operations/alerts@master] Bump threshold for LIST secrets from 1.7s to 2s

https://gerrit.wikimedia.org/r/840837