With Kubernetes 1.17, at least the apiserver_request_latencies_summary metric has been removed. We should take the chance to revisit the need for the recording rules in modules/profile/files/prometheus/rules_k8s.yml and migrate the monitoring::check_prometheus alerts from profile::kubernetes::master to prometheus alerting rules.
The following alerts are based on check_prometheus and would benefit from being ported to alertmanager / alerts.git (cf. the docs at https://wikitech.wikimedia.org/wiki/Alertmanager#Create_alerts, and the whole page really).
instance_operation_type:kubelet_runtime_operations_latency_microseconds:avg5m{instance="${::fqdn}"}
Recording rule:
sum by(instance) (rate(kubelet_runtime_operations_latency_microseconds_sum{job="k8s-node", operation_type!="exec_sync"}[5m])) / sum by(instance) (rate(kubelet_runtime_operations_latency_microseconds_count{job="k8s-node", operation_type!="exec_sync"}[5m]))
For: 5m
Thresholds: W: 0.4s, C: 0.85s
New metric name: kubelet_runtime_operations_duration_seconds
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830228
Note: I don't understand why exec_sync is excluded here, as it has not gone above 240ms over the last 90 days.
- Replace usage in Grafana
- Matched https://grafana.wikimedia.org/d/000000472/kubernetes-kubelets (Kubernetes Kubelets)
- Matched https://grafana.wikimedia.org/d/G8zPL7-Wz/kubernetes-node (Kubernetes node)
- Replace usage in monitoring::check_prometheus: modules/profile/manifests/kubernetes/node.pp
- Remove recording rule from modules/profile/files/prometheus/rules_k8s.yml
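A minimal sketch of what the ported alerts.git rule could look like for this one (alert name, severity labels, and annotation text here are made up for illustration; the actual proposal is the linked Gerrit change). It inlines the recording rule's sum/count division against the renamed histogram metric, keeping the existing exec_sync exclusion and the 0.4s warning threshold:

```yaml
groups:
  - name: kubelet
    rules:
      - alert: KubeletOperationLatency
        # avg operation latency per instance over 5m, mirroring the old
        # recording rule but using the new seconds-based histogram metric
        expr: >
          sum by (instance) (rate(kubelet_runtime_operations_duration_seconds_sum{job="k8s-node", operation_type!="exec_sync"}[5m]))
          /
          sum by (instance) (rate(kubelet_runtime_operations_duration_seconds_count{job="k8s-node", operation_type!="exec_sync"}[5m]))
          > 0.4
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Kubelet {{ $labels.instance }} operation latency above 0.4s"
```

A second rule with `> 0.85` and `severity: critical` would cover the critical threshold.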
scalar(sum(rate(apiserver_request_count{instance="${::ipaddress}:6443"}[5m])))
For: 5m
Thresholds: W: 50, C: 100
New metric name: apiserver_request_total
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830624
Note: I'm not sure this makes sense as an alert at all. mlserve is constantly in warning as it sits around 70 req/s, and that does not even sound like much. I would suggest we alert on an elevated error rate instead.
- Replace usage in Grafana
- Replace usage in monitoring::check_prometheus: modules/profile/manifests/kubernetes/master.pp
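To illustrate the error-rate alternative suggested in the note above (alert name, the 5% threshold, and the job label are assumptions for the sketch, not taken from the linked patch), using the `code` label of the new apiserver_request_total counter:

```yaml
groups:
  - name: kube-apiserver
    rules:
      - alert: KubernetesAPIServerErrorRate
        # ratio of 5xx responses to all requests per apiserver instance
        expr: >
          sum by (instance) (rate(apiserver_request_total{job="k8s-api", code=~"5.."}[5m]))
          /
          sum by (instance) (rate(apiserver_request_total{job="k8s-api"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API server {{ $labels.instance }} 5xx error rate above 5%"
```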
instance_verb:apiserver_request_latencies_summary:avg5m{verb!~"(CONNECT|WATCH|WATCHLIST)",instance="${::ipaddress}:6443"}
Recording rule:
sum by(instance, verb) (rate(apiserver_request_latencies_summary_sum{job="k8s-api"}[5m])) / sum by(instance, verb) (rate(apiserver_request_latencies_summary_count{job="k8s-api"}[5m]))
For: 5m
Thresholds: W: 0.2, C: 0.3
New metric name: apiserver_request_duration_seconds
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830637
- Replace usage in Grafana
- Replace usage in monitoring::check_prometheus: modules/profile/manifests/kubernetes/master.pp
- Remove recording rule from modules/profile/files/prometheus/rules_k8s.yml
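A sketch of the equivalent alert against the new histogram metric (the real proposal is the Gerrit change above; the alert name is invented, and the verb exclusion is carried over from the old check_prometheus query rather than the recording rule):

```yaml
groups:
  - name: kube-apiserver
    rules:
      - alert: KubernetesAPIServerLatency
        # avg request latency per instance and verb over 5m, excluding
        # long-lived verbs that would skew the average
        expr: >
          sum by (instance, verb) (rate(apiserver_request_duration_seconds_sum{job="k8s-api", verb!~"(CONNECT|WATCH|WATCHLIST)"}[5m]))
          /
          sum by (instance, verb) (rate(apiserver_request_duration_seconds_count{job="k8s-api", verb!~"(CONNECT|WATCH|WATCHLIST)"}[5m]))
          > 0.2
        for: 5m
        labels:
          severity: warning
```

As with the kubelet rule, a duplicate with `> 0.3` and `severity: critical` would cover the critical threshold.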
instance_operation:etcd_request_latencies_summary:avg5m{instance="${::ipaddress}:6443"}
Recording rule:
sum by(instance, operation) (rate(etcd_request_latencies_summary_sum{job="k8s-api"}[5m])) / sum by(instance, operation) (rate(etcd_request_latencies_summary_count{job="k8s-api"}[5m]))
For: 5m
Thresholds: W: 0.3, C: 0.5
New metric name: etcd_request_duration_seconds
Suggestion: I'm inclined to drop this from alerting altogether. apiserver_request_duration_seconds is the "user facing" metric here and it ultimately includes etcd_request_duration_seconds.
- Replace usage in Grafana
- Remove from monitoring::check_prometheus: modules/profile/manifests/kubernetes/master.pp
- Remove recording rule from modules/profile/files/prometheus/rules_k8s.yml