Change Details

At least the `apiserver_request_latencies_summary` metric has been turned off with k8s 1.17. We should take the chance to revisit the need for the recording rules in modules/profile/files/prometheus/rules_k8s.yml and migrate the `monitoring::check_prometheus` alerts from `profile::kubernetes::master` to prometheus alerting rules.

The following alerts are based on check_prometheus and will benefit from being ported to alertmanager / alerts.git. (cfr docs at https://wikitech.wikimedia.org/wiki/Alertmanager#Create_alerts (and the whole page really))

===== instance_operation_type:kubelet_runtime_operations_latency_microseconds:avg5m{instance="${::fqdn}"} =====

Recording rule:
```
sum by(instance) (rate(kubelet_runtime_operations_latency_microseconds_sum{job="k8s-node", operation_type!="exec_sync"}[5m]))
/
sum by(instance) (rate(kubelet_runtime_operations_latency_microseconds_count{job="k8s-node", operation_type!="exec_sync"}[5m]))
```
For: 5m
Thresholds: W: 0.4s, C: 0.85s
New metric name: `kubelet_runtime_operations_duration_seconds`

I don't understand why exec_sync is excluded here as that does not go above 240ms over the last 90 days.

Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830228

[x] Replace usage in Grafana
  * Matched https://grafana.wikimedia.org/d/000000472/kubernetes-kubelets (Kubernetes Kubelets)
  * Matched https://grafana.wikimedia.org/d/G8zPL7-Wz/kubernetes-node (Kubernetes node)
[] Replace usage in monitoring::check_prometheus: `modules/profile/manifests/kubernetes/node.pp`

===== scalar(sum(rate(apiserver_request_count{instance="${::ipaddress}:6443"}[5m]))) =====

For: 5m
Thresholds: W: 50, C: 100
New metric name: `apiserver_request_total`

I'm not sure this makes sense at all as an alert. mlserve is constantly in warning as that's around 70 req/s - and that does not even sound like much. I would suggest we instead alert for an elevated error rate.

Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830624

[x] Replace usage in Grafana
[] Replace usage in monitoring::check_prometheus: `modules/profile/manifests/kubernetes/master.pp`

===== instance_verb:apiserver_request_latencies_summary:avg5m{verb!~"(CONNECT|WATCH|WATCHLIST)",instance="${::ipaddress}:6443"} =====

Recording rule:
```
sum by(instance, verb) (rate(apiserver_request_latencies_summary_sum{job="k8s-api"}[5m]))
/
sum by(instance, verb) (rate(apiserver_request_latencies_summary_count{job="k8s-api"}[5m]))
```
New metric name: `apiserver_request_duration_seconds`

[] Replace usage in Grafana
[] Replace usage in monitoring::check_prometheus: `modules/profile/manifests/kubernetes/master.pp`

===== instance_operation:etcd_request_latencies_summary:avg5m{instance="${::ipaddress}:6443"} =====

Recording rule:
```
sum by(instance, operation) (rate(etcd_request_latencies_summary_sum{job="k8s-api"}[5m]))
/
sum by(instance, operation) (rate(etcd_request_latencies_summary_count{job="k8s-api"}[5m]))
```
New metric name: `etcd_request_duration_seconds`

[] Replace usage in Grafana
[] Replace usage in monitoring::check_prometheus: modules/profile/manifests/kubernetes/master.pp

At least the `apiserver_request_latencies_summary` metric has been turned off with k8s 1.17. We should take the chance to revisit the need for the recording rules in modules/profile/files/prometheus/rules_k8s.yml and migrate the `monitoring::check_prometheus` alerts from `profile::kubernetes::master` to prometheus alerting rules.

The following alerts are based on check_prometheus and will benefit from being ported to alertmanager / alerts.git. (cfr docs at https://wikitech.wikimedia.org/wiki/Alertmanager#Create_alerts (and the whole page really))

===== instance_operation_type:kubelet_runtime_operations_latency_microseconds:avg5m{instance="${::fqdn}"} =====

Recording rule:
```
sum by(instance) (rate(kubelet_runtime_operations_latency_microseconds_sum{job="k8s-node", operation_type!="exec_sync"}[5m]))
/
sum by(instance) (rate(kubelet_runtime_operations_latency_microseconds_count{job="k8s-node", operation_type!="exec_sync"}[5m]))
```
For: 5m
Thresholds: W: 0.4s, C: 0.85s
New metric name: `kubelet_runtime_operations_duration_seconds`
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830228
Note: I don't understand why exec_sync is excluded here as that does not go above 240ms over the last 90 days.

[x] Replace usage in Grafana
  * Matched https://grafana.wikimedia.org/d/000000472/kubernetes-kubelets (Kubernetes Kubelets)
  * Matched https://grafana.wikimedia.org/d/G8zPL7-Wz/kubernetes-node (Kubernetes node)
[] Replace usage in monitoring::check_prometheus: `modules/profile/manifests/kubernetes/node.pp`
[] Remove recording rule from `modules/profile/files/prometheus/rules_k8s.yml`

===== scalar(sum(rate(apiserver_request_count{instance="${::ipaddress}:6443"}[5m]))) =====

For: 5m
Thresholds: W: 50, C: 100
New metric name: `apiserver_request_total`
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830624
Note: I'm not sure this makes sense at all as an alert. mlserve is constantly in warning as that's around 70 req/s - and that does not even sound like much. I would suggest we instead alert for an elevated error rate.

[x] Replace usage in Grafana
[] Replace usage in monitoring::check_prometheus: `modules/profile/manifests/kubernetes/master.pp`

===== instance_verb:apiserver_request_latencies_summary:avg5m{verb!~"(CONNECT|WATCH|WATCHLIST)",instance="${::ipaddress}:6443"} =====

Recording rule:
```
sum by(instance, verb) (rate(apiserver_request_latencies_summary_sum{job="k8s-api"}[5m]))
/
sum by(instance, verb) (rate(apiserver_request_latencies_summary_count{job="k8s-api"}[5m]))
```
For: 5m
Thresholds: W: 0.2, C: 0.3
New metric name: `apiserver_request_duration_seconds`
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830637

[x] Replace usage in Grafana
[] Replace usage in monitoring::check_prometheus: `modules/profile/manifests/kubernetes/master.pp`
[] Remove recording rule from `modules/profile/files/prometheus/rules_k8s.yml`

===== instance_operation:etcd_request_latencies_summary:avg5m{instance="${::ipaddress}:6443"} =====

Recording rule:
```
sum by(instance, operation) (rate(etcd_request_latencies_summary_sum{job="k8s-api"}[5m]))
/
sum by(instance, operation) (rate(etcd_request_latencies_summary_count{job="k8s-api"}[5m]))
```
For: 5m
Thresholds: W: 0.3, C: 0.5
New metric name: `etcd_request_duration_seconds`
Suggestion: I'm inclined to drop this from alerting altogether. `apiserver_request_duration_seconds` is the "user facing" metric here and it ultimately includes `etcd_request_duration_seconds`.

[x] Replace usage in Grafana
[] Remove from monitoring::check_prometheus: `modules/profile/manifests/kubernetes/master.pp`
[] Remove recording rule from `modules/profile/files/prometheus/rules_k8s.yml`
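
To make the proposed migration more concrete, here is a rough sketch of how two of these checks might look as Prometheus alerting rules. It is not the content of the linked Gerrit changes. The group name, alert names, `severity` label values, and the 5% error-rate threshold are all assumptions made for illustration. The metric names, job labels, `for: 5m`, and the 0.4s / 0.85s thresholds are taken from the description above. The sketch assumes the new `_duration_seconds` metric exposes `_sum` and `_count` series, as Prometheus histograms do, and it keeps the `exec_sync` exclusion from the existing recording rule even though the note above questions it.

```yaml
# Sketch only: group name, alert names, severity labels and the 0.05
# error-rate threshold are illustrative assumptions.
groups:
  - name: kubernetes_sketch
    rules:
      # Average kubelet runtime operation latency per instance.
      # The new metric is already in seconds, so the 0.4s / 0.85s
      # thresholds from check_prometheus apply directly.
      - alert: KubeletOperationalLatency
        expr: |
          sum by (instance) (rate(kubelet_runtime_operations_duration_seconds_sum{job="k8s-node", operation_type!="exec_sync"}[5m]))
          /
          sum by (instance) (rate(kubelet_runtime_operations_duration_seconds_count{job="k8s-node", operation_type!="exec_sync"}[5m]))
          > 0.4
        for: 5m
        labels:
          severity: warning
      - alert: KubeletOperationalLatency
        expr: |
          sum by (instance) (rate(kubelet_runtime_operations_duration_seconds_sum{job="k8s-node", operation_type!="exec_sync"}[5m]))
          /
          sum by (instance) (rate(kubelet_runtime_operations_duration_seconds_count{job="k8s-node", operation_type!="exec_sync"}[5m]))
          > 0.85
        for: 5m
        labels:
          severity: critical
      # Elevated error rate instead of raw request rate: share of
      # requests answered with a 5xx status code, per API server.
      - alert: KubernetesAPIErrorRate
        expr: |
          sum by (instance) (rate(apiserver_request_total{job="k8s-api", code=~"5.."}[5m]))
          /
          sum by (instance) (rate(apiserver_request_total{job="k8s-api"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
```

Computing the averages inline like this is what would allow the recording rules in `rules_k8s.yml` to be removed. A ratio-based error alert also avoids the problem noted above, where a cluster with a steady 70 req/s sits permanently in warning.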
