This is a follow-up to T419457, from an alerting/monitoring perspective.
<+jinxer-wm> FIRING: [3x] ProbeDown: Service dse-k8s-ctrl1001:6443 has failed probes (http_dse_k8s_eqiad_kube_apiserver_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
<+jinxer-wm> FIRING: [4x] ProbeDown: Service dse-k8s-ctrl1001:6443 has failed probes (http_dse_k8s_eqiad_kube_apiserver_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
This is a paging alert which, depending on the time and shift, can wake up the on-call SRE. It seems that @Volans and @bking worked together to resolve this (please correct me if I am wrong).
We discussed this briefly in the SRE meeting today and have two immediate follow-ups:
- Should this have been a paging alert?
- If yes, who owns this alert? Most (all?) non-DP SREs probably do not have the requisite knowledge to debug this alert or implement a fix for it. Given that DP SREs are not part of on-call, the same ownership question applies to other such alerts that are paging from DP's perspective but need to be handled by the on-call SRE(s).
I am taking the liberty of adding @Gehel and @bking from DP SRE for the initial discussion, as the manager and the responder on the last incident respectively.