At least the `apiserver_request_latencies_summary` metric has been turned off with k8s 1.17. We should take the chance to revisit the need for the recording rules in `modules/profile/files/prometheus/rules_k8s.yml` and migrate the `monitoring::check_prometheus` alerts from `profile::kubernetes::master` to Prometheus alerting rules.
The following alerts are based on check_prometheus and will benefit from being ported to alertmanager / alerts.git. See the docs at https://wikitech.wikimedia.org/wiki/Alertmanager#Create_alerts (and the whole page, really).
===== instance_operation_type:kubelet_runtime_operations_latency_microseconds:avg5m{instance=\"${::fqdn}\"} =====
Recording rule:
```
sum by(instance) (rate(kubelet_runtime_operations_latency_microseconds_sum{job="k8s-node", operation_type!="exec_sync"}[5m]))
/ sum by(instance) (rate(kubelet_runtime_operations_latency_microseconds_count{job="k8s-node", operation_type!="exec_sync"}[5m]))
```
For: 5m
Thresholds: W: 0.4s, C: 0.85s
New metric name: `kubelet_runtime_operations_duration_seconds`
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830228
Note: I don't understand why exec_sync is excluded here, as it has not gone above 240ms over the last 90 days.
[x] Replace usage in Grafana
* Matched https://grafana.wikimedia.org/d/000000472/kubernetes-kubelets (Kubernetes Kubelets)
* Matched https://grafana.wikimedia.org/d/G8zPL7-Wz/kubernetes-node (Kubernetes node)
[] Replace usage in monitoring::check_prometheus: `modules/profile/manifests/kubernetes/node.pp`
[] Remove recording rule from `modules/profile/files/prometheus/rules_k8s.yml`
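For reference, a rough sketch of what the equivalent expression could look like on the new metric. This assumes `kubelet_runtime_operations_duration_seconds` still carries the `operation_type` label and is scraped under the same `job="k8s-node"`. The linked Gerrit change is authoritative and may look different.
```
# Sketch only: average runtime operation duration per instance over 5m.
# The new metric is in seconds (per its name), so this compares directly
# against the 0.4s warning threshold listed above.
sum by(instance) (rate(kubelet_runtime_operations_duration_seconds_sum{job="k8s-node", operation_type!="exec_sync"}[5m]))
/ sum by(instance) (rate(kubelet_runtime_operations_duration_seconds_count{job="k8s-node", operation_type!="exec_sync"}[5m]))
> 0.4
```
If exec_sync turns out not to need excluding, the `operation_type!="exec_sync"` matcher can simply be dropped from both selectors.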
===== scalar(sum(rate(apiserver_request_count{instance=\"${::ipaddress}:6443\"}[5m]))) =====
For: 5m
Thresholds: W: 50, C: 100
New metric name: `apiserver_request_total`
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830624
Note: I'm not sure this makes sense at all as an alert. mlserve is constantly in warning as that's around 70 req/s - and that does not even sound like much. I would suggest we instead alert for an elevated error rate.
[x] Replace usage in Grafana
[] Replace usage in monitoring::check_prometheus: `modules/profile/manifests/kubernetes/master.pp`
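To illustrate the error rate idea from the note above, something along these lines might work. It is only a sketch: it assumes `job="k8s-api"` as used by the recording rules in this task and treats 5xx response codes (via the `code` label on `apiserver_request_total`) as errors. It deliberately leaves out thresholds, since those would have to be picked from real data.
```
# Sketch only: fraction of apiserver requests answered with a 5xx code, per instance.
sum by(instance) (rate(apiserver_request_total{job="k8s-api", code=~"5.."}[5m]))
/ sum by(instance) (rate(apiserver_request_total{job="k8s-api"}[5m]))
```
Being a ratio, this does not depend on how busy a cluster is, which avoids the problem of mlserve sitting permanently in warning.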
===== instance_verb:apiserver_request_latencies_summary:avg5m{verb\\!~\"(CONNECT|WATCH|WATCHLIST)\",instance=\"${::ipaddress}:6443\"} =====
Recording rule:
```
sum by(instance, verb) (rate(apiserver_request_latencies_summary_sum{job="k8s-api"}[5m]))
/ sum by(instance, verb) (rate(apiserver_request_latencies_summary_count{job="k8s-api"}[5m]))
```
For: 5m
Thresholds: W: 0.2, C: 0.3
New metric name: `apiserver_request_duration_seconds`
Suggestion: https://gerrit.wikimedia.org/r/c/operations/alerts/+/830637
[x] Replace usage in Grafana
[] Replace usage in monitoring::check_prometheus: `modules/profile/manifests/kubernetes/master.pp`
[] Remove recording rule from `modules/profile/files/prometheus/rules_k8s.yml`
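A possible shape for the replacement expression, as a sketch. It assumes the `verb` label and `job="k8s-api"` carry over unchanged to `apiserver_request_duration_seconds`, and keeps the verb exclusion from the current check. The linked Gerrit change is authoritative and may differ.
```
# Sketch only: average request duration per instance and verb over 5m,
# excluding long-running verbs as the current check does.
sum by(instance, verb) (rate(apiserver_request_duration_seconds_sum{job="k8s-api", verb!~"CONNECT|WATCH|WATCHLIST"}[5m]))
/ sum by(instance, verb) (rate(apiserver_request_duration_seconds_count{job="k8s-api", verb!~"CONNECT|WATCH|WATCHLIST"}[5m]))
```
The new metric is in seconds (per its name), so the existing W/C thresholds should be checked against that unit before being reused.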
===== instance_operation:etcd_request_latencies_summary:avg5m{instance=\"${::ipaddress}:6443\"} =====
Recording rule:
```
sum by(instance, operation) (rate(etcd_request_latencies_summary_sum{job="k8s-api"}[5m]))
/ sum by(instance, operation) (rate(etcd_request_latencies_summary_count{job="k8s-api"}[5m]))
```
For: 5m
Thresholds: W: 0.3, C: 0.5
New metric name: `etcd_request_duration_seconds`
Suggestion: I'm inclined to drop this from alerting altogether. `apiserver_request_duration_seconds` is the "user facing" metric here and it ultimately includes `etcd_request_duration_seconds`.
[x] Replace usage in Grafana
[] Remove from monitoring::check_prometheus: `modules/profile/manifests/kubernetes/master.pp`
[] Remove recording rule from `modules/profile/files/prometheus/rules_k8s.yml`