Data Platform SRE paging alerts and on-call SRE response
Open, Medium, Public

Description

This is a follow up on T419457, from an alerting/monitoring perspective.

<+jinxer-wm> FIRING: [3x] ProbeDown: Service dse-k8s-ctrl1001:6443 has failed probes (http_dse_k8s_eqiad_kube_apiserver_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
<+jinxer-wm> FIRING: [4x] ProbeDown: Service dse-k8s-ctrl1001:6443 has failed probes (http_dse_k8s_eqiad_kube_apiserver_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown

This is a paging alert, which, depending on the time/shift, can also wake up the on-call SRE. It seems that @Volans and @bking worked together on resolving this (please correct me if I am wrong).

We discussed this briefly in the SRE meeting today and have two immediate follow-ups:

  1. Should this have been a paging alert?
  2. If yes, who owns this alert? Most (all?) non-DP SREs probably do not have the requisite knowledge to either debug the problem or implement a fix. Given that DP SREs are not part of on-call, the same question applies to other alerts that are paging from DP's perspective but need to be handled by the on-call SRE(s).

I am taking the liberty to add @Gehel and @bking for the initial discussion, as the manager and the responder on the last incident respectively, from DP SRE.

Event Timeline

ssingh triaged this task as Medium priority. Mar 16 2026, 7:02 PM

I don't think this alert should have been paging. The workloads we run on k8s are all supposed to be able to be down for extended periods.

With Turnilo being moved to dse-k8s, this doesn't appear to be true any longer?

One more axis to consider: Best-practices-wise, for alerting on Kubernetes platforms, there's a distinction between control plane and data plane.

If the API server is unavailable (as it was here) you can't schedule new work, but everything that's running will keep on serving. You wouldn't want to stay that way indefinitely -- for example, if a worker node also goes down subsequently, its workload won't get rescheduled to another node until the API server comes back. But it's not uncommon for the Kubernetes control plane to have a lower availability SLO than some of the work scheduled on it.

(In this case, with that work mostly carrying a 95% SLO, we probably don't want to go lower. But we might not need to go higher.)
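To put the 95% figure in perspective, here is a minimal sketch of the downtime such an SLO allows. The 30-day window and the 99% / 99.9% comparison values are assumptions chosen for illustration; they are not taken from any actual SLO definition for dse-k8s.

```python
def allowed_downtime_hours(slo: float, window_days: int = 30) -> float:
    """Return the hours of downtime an availability SLO permits.

    `slo` is a fraction (0.95 means 95%). `window_days` is an assumed
    measurement window, not one taken from this task.
    """
    if not 0.0 <= slo <= 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24


if __name__ == "__main__":
    # 95% comes from the comment above; the other values are for comparison only.
    for slo in (0.95, 0.99, 0.999):
        hours = allowed_downtime_hours(slo)
        print(f"{slo:.1%} SLO allows about {hours:.1f} h of downtime per 30 days")
```

Under these assumptions a 95% SLO leaves roughly a day and a half of allowed downtime per 30 days, which is worth weighing when deciding whether a control plane outage needs to page.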

Thanks ever so much for getting this conversation started. I think that it's really important for us to get a good consensus on this, as well as a good technical solution.

It strikes me that there is some cross-over with the discussions that we're having on T398073 (Ensure DPE SRE can receive alerts for applications hosted in wikikube), and with some possible improvements to the ways in which we currently route alerts.

I think that there are two aspects under discussion on that ticket, which are relevant:

  1. Default routing of kubernetes related alerts to the team who owns that cluster.
  2. Routing of specific services on any given cluster to a team other than the team who owns that cluster.

So in the case of this incident about the kubernetes control plane for the dse-k8s-eqiad cluster, I don't think that the alert should have been routed to team-sre by alertmanager.
I think it should have been routed to team-data-platform.

However, the alert is currently defined in puppet:///profile::kubernetes::master and the team is set to sre for all clusters.

This is similar to the problem where we have alerts defined in alerts/team-sre but wish to override the destination team to team-data-platform.
It's not currently easy to do that without duplicating the alert definition and filtering on the cluster we want.

So I think that these are all instances of 1) above. I think that this is the most important to get addressed.
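The two aspects above can be sketched as a small piece of decision logic. This is only an illustration in Python, not how Alertmanager or the puppet-defined alerts are actually configured. The label names (`cluster`, `service`, `team`), the ownership tables and the service name `example-dpe-app` are all hypothetical.

```python
from typing import Mapping

# Hypothetical ownership data, for illustration only.
# Aspect 1: the team that owns each cluster.
CLUSTER_OWNERS = {"dse-k8s-eqiad": "data-platform"}
# Aspect 2: specific services owned by a team other than the cluster owner.
SERVICE_OWNERS = {("wikikube", "example-dpe-app"): "data-platform"}
# Fallback when nothing else applies.
DEFAULT_TEAM = "sre"


def resolve_team(labels: Mapping[str, str]) -> str:
    """Pick the destination team for an alert from its labels."""
    cluster = labels.get("cluster", "")
    service = labels.get("service", "")

    # A per-service override is the most specific rule, so it wins.
    if (cluster, service) in SERVICE_OWNERS:
        return SERVICE_OWNERS[(cluster, service)]

    # Otherwise default to the team that owns the cluster.
    if cluster in CLUSTER_OWNERS:
        return CLUSTER_OWNERS[cluster]

    # Otherwise keep the team set in the alert definition itself.
    return labels.get("team", DEFAULT_TEAM)


if __name__ == "__main__":
    # The alert is defined with team "sre" but fires on dse-k8s-eqiad.
    print(resolve_team({"cluster": "dse-k8s-eqiad", "team": "sre"}))
```

The point of the sketch is that the destination is derived from the cluster (and optionally the service) rather than hard-coded in the alert, so a shared alert definition would not need to be duplicated per team.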

Change #1256287 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Route dse-k8s API blackbox checks to team-data-platform

https://gerrit.wikimedia.org/r/1256287

Change #1256287 merged by Btullis:

[operations/puppet@production] Route dse-k8s API blackbox checks to team-data-platform

https://gerrit.wikimedia.org/r/1256287