
[Epic] Review alerting strategy for Data Platform SRE
Open, High, Public

Description

The newly formed Data Platform SRE team has alerts spread out across numerous systems, without a clear strategy for how to move forward. In particular, alerts are sent via different channels (email, dashboards, IRC, etc.) and are routed without clear intent (sometimes to DPE SRE, sometimes to the teams we support).

AC:

  • list the various types of alerts that we currently have
  • document our intentions around alerting (see the Alerts Review Google doc)
  • identify which alerts need to be routed where
  • ensure that all DPE SREs are subscribed to the appropriate channels
  • align our reality with whatever we decide

Related Objects

Event Timeline

Per today's Data Platform SRE meeting, I've committed to lead this effort. To that end, I've added a sheet to the Data Platform SRE Google Sheet (sorry if this is the wrong place, we can always move it later though).

I'll solicit some further feedback, create some subtasks, and update this task.

lmata subscribed.

Moving to radar to keep an eye out in case you need our help. Thanks!

The essay "My Philosophy on Alerting" was linked in the Linux Foundation training @RKemper and I are taking. I haven't read it yet, but it's probably worth reading if one of the creators of Prometheus thinks it's important.

Gehel triaged this task as High priority.Oct 11 2023, 8:47 AM
Gehel moved this task from Incoming to Epics on the Data-Platform-SRE board.

This approach was discussed with @Gehel at our 1:1 meeting:

  • Start with an audit of current alerts - specifics not required, just types of alerts
  • Tie those to SLOs / SLIs - and decide whether we want SLOs / SLIs at all (see the sketch after this list)
  • How to route alerts to the right people
  • General guidelines: team X wants to be notified by Y
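
As a concrete illustration of the SLO / SLI point, the sketch below shows the kind of symptom-based alert we could end up with once SLIs are defined, instead of per-host checks. Everything in it (metric names, service name, thresholds, team label) is hypothetical and only meant to show the shape of such a rule:

```yaml
# Hypothetical sketch of an SLI-driven alert; metric names and thresholds are made up.
groups:
  - name: data_platform_slo_sketch
    rules:
      - alert: SearchUpdatePipelineErrorBudgetBurn
        # Fires when more than 1% of update events have failed over the last hour.
        expr: |
          sum(rate(pipeline_events_failed_total{service="search-update"}[1h]))
            /
          sum(rate(pipeline_events_total{service="search-update"}[1h]))
          > 0.01
        for: 15m
        labels:
          severity: critical
          team: data-platform
        annotations:
          summary: "search-update pipeline is burning its error budget"
```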

The "Alerts review" tab of this Google Sheet has some examples of what we are trying to do.

Change 974652 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Send recovery emails to data-engineering-alerts

https://gerrit.wikimedia.org/r/974652

Change 974652 merged by Btullis:

[operations/puppet@production] Send recovery emails to data-engineering-alerts

https://gerrit.wikimedia.org/r/974652

Since we started receiving alerts for every puppet failure in #wikimedia-analytics, the channel is rapidly becoming bot/alerts-only, which is impacting the whole of Data Engineering.

Here's my knee-jerk proposal to remediate that in the short term, which also tries to feed two birds with one scone and address the #wikimedia-analytics / #wikimedia-search split-brain situation.

  • we create the #wikimedia-data-platform-sre channel (alternative names: #wikimedia-dpe-sre, #wikimedia-data-platform; @Gehel please feel free to weigh in so we make a consistent choice), henceforth referred to as #<chosen-name>
  • we create the #<chosen-name>-alerts channel
  • we deprecate both #wikimedia-search and #wikimedia-analytics in favour of #<chosen-name> for human conversations (and possibly ticket updates?), redirecting readers to the new channel in the header of both deprecated channels
  • we move all the alerts (search & analytics related) to #<chosen-name>-alerts, which is for alerts only
  • we create a mailing list <chosen-name>-alerts@lists.wikimedia.org that will receive the email versions of the alerts
  • we create a VictorOps target called <chosen-name> for the urgent pages sent to the team (a rough receiver sketch follows this list)
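
To make those targets a bit more concrete, here is a rough Alertmanager-style sketch of what a receiver for the proposed destinations could look like. The channel name, relay URL, list address and API key are placeholders for illustration only; the real wiring lives in the configuration managed by SRE Observability:

```yaml
# Illustrative receiver sketch only; all names, URLs and keys are placeholders.
receivers:
  - name: 'data-platform'
    webhook_configs:
      # IRC relay posting to #<chosen-name>-alerts
      - url: 'http://irc-relay.example.org:9119/wikimedia-data-platform-alerts'
    email_configs:
      - to: 'data-platform-alerts@lists.wikimedia.org'
        send_resolved: true            # send recovery emails as well as problem emails
    victorops_configs:
      # urgent pages routed to the team's VictorOps routing key
      - routing_key: 'data-platform'
        api_key: '<victorops-api-key>'
```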

WDYT?

Sounds good. Here are my own knee-jerk responses :-)
Personally, I would vote for #wikimedia-data-platform - the channel needn't be so focused on SREs that we put "SRE" in the name.
For the mailing list, I would go for: data-platform-alerts@lists.wikimedia.org

we deprecate both #wikimedia-search and #wikimedia-analytics

I think that there will be some people who would still like to come to these channels to talk about these topics, but it will be much easier for them to do so if the channels are not overrun by alerts and ticket updates.

we move all the alerts (search & analytics related) to #<chosen-name>-alerts for alerts only

This is where I think we need to be careful to split up the data-engineering-alerts and data-platform-alerts - so not migrating everything wholesale, but being careful about which should be for the attention of everyone who is on the Data Engineering Ops Week rota.

This is where I think we need to be careful to split up the data-engineering-alerts and data-platform-alerts - so not migrating everything wholesale, but being careful about which should be for the attention of everyone who is on the Data Engineering Ops Week rota.

Wholeheartedly agreed.

I've created a subtask for the IRC suggestions. The VictorOps suggestion is mentioned in T344202, but we also need to complete T342578 (adding contact groups) before we can do anything with email or paging. I'll get to work on those.

A recent change by SRE Observability has made this even more pressing, because the data-engineering-alerts@lists.wikimedia.org list is being flooded with messages that are increasingly irrelevant to the data engineers.

This change, merged on Feb 6th, increases the severity of the SystemdUnitFailed Prometheus check from warning to critical.
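
For context, the change amounts to bumping the severity label on the SystemdUnitFailed rule. The snippet below is only an illustrative sketch of that kind of change; the real rule lives in the operations/alerts repository and its exact expression may differ:

```yaml
# Illustrative sketch of a severity bump on a systemd-unit alert (not the real rule).
- alert: SystemdUnitFailed
  expr: node_systemd_unit_state{state="failed"} == 1
  for: 5m
  labels:
    severity: critical   # previously: warning
  annotations:
    summary: "systemd unit {{ $labels.name }} failed on {{ $labels.instance }}"
```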

The Alertmanager configuration for the data-engineering team can be seen here.

It shows the following routing (a rough sketch in Alertmanager terms follows the list):

  • warning notifications are currently routed to the #wikimedia-analytics IRC channel
  • critical notifications are currently routed to the #wikimedia-analytics IRC channel and the data-engineering-alerts@lists.wikimedia.org mailing list
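
In Alertmanager terms, that routing is roughly equivalent to the following sketch (receiver names are placeholders, and the critical receiver bundles both the IRC and email destinations):

```yaml
# Rough sketch of the current routing for team=data-engineering alerts.
route:
  routes:
    - match:
        team: data-engineering
        severity: warning
      receiver: irc-wikimedia-analytics            # IRC only
    - match:
        team: data-engineering
        severity: critical
      receiver: irc-analytics-plus-de-alerts-email # IRC + data-engineering-alerts@ list
```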

Looking at the emails, we can see that the most recent SystemdUnitFailed alert contains 20 failed systemd units across the fleet of servers for which the role contact is data-engineering.

In my opinion, none of these alerts should have gone to the data-engineering-alerts mailing list.

The only two systemd alerts that are relevant to, and actionable by, the Data Engineering team were sent in a separate email directly to the list, because they are based on a failed pipeline run from a systemd timer.

I believe that we need to start making some changes to the Icinga and Alertmanager configuration right away, in order not to put the data pipelines at risk by overwhelming the Ops Week engineers with noise.

Change 999561 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Change all role contacts for Data Engineering -> Data Platform

https://gerrit.wikimedia.org/r/999561

Change 999561 merged by Btullis:

[operations/puppet@production] Change all role contacts for Data Engineering -> Data Platform

https://gerrit.wikimedia.org/r/999561
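
For reference, a role's contact team is set via a hiera key in the puppet tree, so this change boils down to flipping that value for each affected role. The key name and file path below reflect my understanding of the convention and should be treated as assumptions:

```yaml
# Assumed hiera key and example path; the exact convention in operations/puppet may differ.
# hieradata/role/common/analytics_cluster/hadoop/master.yaml
profile::contacts::role_contacts: ['Data Platform']   # previously: ['Data Engineering']
```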

Change #1030189 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Move some of the data-engineering alerts to data-platform

https://gerrit.wikimedia.org/r/1030189

Change #1030189 merged by jenkins-bot:

[operations/alerts@master] Move some of the data-engineering alerts to data-platform

https://gerrit.wikimedia.org/r/1030189
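
In the operations/alerts repository, "moving" an alert between teams essentially means changing its team label (and, where applicable, its team directory) so that Alertmanager routes it to the other team's receivers. The snippet below is a hypothetical sketch of such a move; the alert name and expression are placeholders:

```yaml
# Hypothetical example of moving an alert from data-engineering to data-platform.
- alert: AirflowDagImportErrors          # placeholder alert name
  expr: pipeline_task_lag_seconds > 3600 # placeholder expression
  for: 30m
  labels:
    severity: warning
    team: data-platform                  # previously: team: data-engineering
```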