
[Epic] Review alerting strategy for Data Platform SRE
Open, High, Public

Description

The newly formed Data Platform SRE team has alerts spread out across numerous systems, without a clear strategy for how to move forward. In particular, alerts are sent via different channels (email, dashboards, IRC, etc.) and are routed without clear intent (sometimes to DPE SRE, sometimes to the teams we support).

AC:

  • list the various types of alerts that we currently have
  • document our intentions around alerting (see the Alerts Review Google doc)
  • identify which alerts need to be routed where
  • ensure that all DPE SREs are subscribed to the appropriate channels
  • align our reality with whatever we decide

Related Objects

Event Timeline

Per today's Data Platform SRE meeting, I've committed to lead this effort. To that end, I've added a sheet to the Data Platform SRE Google Sheet (sorry if this is the wrong place, we can always move it later though).

I'll solicit some further feedback, create some subtasks, and update this task.

lmata subscribed.

Moving to radar to keep an eye out in case you need our help. Thanks!

The essay "My Philosophy on Alerting" was linked in the Linux Foundation training @RKemper and I are taking. I haven't read it yet, but it's probably worth reading if one of the creators of Prometheus thinks it's important.

Gehel triaged this task as High priority.Oct 11 2023, 8:47 AM
Gehel moved this task from Incoming to Epics on the Data-Platform-SRE board.

This approach was discussed with @Gehel at our 1:1 meeting:

  • Start with an audit of current alerts - specifics not required, just types of alerts
  • Tie those to SLOs / SLIs - and decide whether we want SLOs / SLIs at all (see the sketch after this list)
  • How to route alerts to the right people
  • General guidelines: team X wants to be notified by Y
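
As a concrete illustration of the SLO / SLI point, the sketch below shows the kind of symptom-based alert we could end up with once SLIs are defined, instead of per-host checks. Everything in it (metric names, service name, thresholds, team label) is hypothetical and only meant to show the shape of such a rule:

```yaml
# Hypothetical sketch of an SLI-driven alert; metric names and thresholds are made up.
groups:
  - name: data_platform_slo_sketch
    rules:
      - alert: SearchUpdatePipelineErrorBudgetBurn
        # Fires when more than 1% of update events have failed over the last hour.
        expr: |
          sum(rate(pipeline_events_failed_total{service="search-update"}[1h]))
            /
          sum(rate(pipeline_events_total{service="search-update"}[1h]))
          > 0.01
        for: 15m
        labels:
          severity: critical
          team: data-platform
        annotations:
          summary: "search-update pipeline is burning its error budget"
```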

The "Alerts review" tab of this Google Sheet has some examples of what we are trying to do.

Change 974652 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Send recovery emails to data-engineering-alerts

https://gerrit.wikimedia.org/r/974652

Change 974652 merged by Btullis:

[operations/puppet@production] Send recovery emails to data-engineering-alerts

https://gerrit.wikimedia.org/r/974652

Since we started receiving alerts for every puppet failure in #wikimedia-analytics, the channel is rapidly becoming bot/alerts-only, which is impacting the whole of Data Engineering.

Here's my knee-jerk proposal to remediate that in the short term, which also tries to feed two birds with one scone and address the #wikimedia-analytics / #wikimedia-search split-brain situation.

  • we create the #wikimedia-data-platform-sre channel (alternative names: #wikimedia-dpe-sre, #wikimedia-data-platform; @Gehel please feel free to weigh in so we make a consistent choice), henceforth referred to as #<chosen-name>
  • we create the #<chosen-name>-alerts channel
  • we deprecate both #wikimedia-search and #wikimedia-analytics in favour of #<chosen-name> for human conversations (and possibly ticket updates?), redirecting readers to the new channel in the header of both deprecated channels
  • we move all the alerts (search & analytics related) to #<chosen-name>-alerts, which is for alerts only
  • we create a mailing list <chosen-name>-alerts@lists.wikimedia.org that will receive the email versions of the alerts
  • we create a VictorOps target called <chosen-name> for the urgent pages sent to the team (a rough receiver sketch follows this list)
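
To make those targets a bit more concrete, here is a rough Alertmanager-style sketch of what a receiver for the proposed destinations could look like. The channel name, relay URL, list address and API key are placeholders for illustration only; the real wiring lives in the configuration managed by SRE Observability:

```yaml
# Illustrative receiver sketch only; all names, URLs and keys are placeholders.
receivers:
  - name: 'data-platform'
    webhook_configs:
      # IRC relay posting to #<chosen-name>-alerts
      - url: 'http://irc-relay.example.org:9119/wikimedia-data-platform-alerts'
    email_configs:
      - to: 'data-platform-alerts@lists.wikimedia.org'
        send_resolved: true            # send recovery emails as well as problem emails
    victorops_configs:
      # urgent pages routed to the team's VictorOps routing key
      - routing_key: 'data-platform'
        api_key: '<victorops-api-key>'
```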

WDYT?

Sounds good. Here are my own knee-jerk responses :-)
Personally, I would vote for #wikimedia-data-platform - the channel needn't be so focused on SREs that we put "SRE" in the name.
For the mailing list, I would go for: data-platform-alerts@lists.wikimedia.org

we deprecate both #wikimedia-search and #wikimedia-analytics

I think that there will be some people who would still like to come to these channels to talk about these topics, but it will be much easier for them to do so if the channels are not overrun by alerts and ticket updates.

we move all the alerts (search & analytics related) to #<chosen-name>-alerts for alerts only

This is where I think we need to be careful to split up the data-engineering-alerts and data-platform-alerts - so not migrating everything wholesale, but being careful about which should be for the attention of everyone who is on the Data Engineering Ops Week rota.

This is where I think we need to be careful to split up the data-engineering-alerts and data-platform-alerts - so not migrating everything wholesale, but being careful about which should be for the attention of everyone who is on the Data Engineering Ops Week rota.

Wholeheartedly agreed.

I've created a subtask for the IRC suggestions. The VictorOps suggestion is mentioned in T344202, but we also need to complete T342578 (adding contact groups) before we can do anything with email or paging. I'll get to work on those.

A recent change by SRE Observability has made this even more pressing, because the data-engineering-alerts@lists.wikimedia.org list is being flooded with messages that are increasingly irrelevant to the data engineers.

This change, merged on Feb 6th, increases the severity of the SystemdUnitFailed Prometheus check from warning to critical.
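
For context, the change amounts to bumping the severity label on the SystemdUnitFailed rule. The snippet below is only an illustrative sketch of that kind of change; the real rule lives in the operations/alerts repository and its exact expression may differ:

```yaml
# Illustrative sketch of a severity bump on a systemd-unit alert (not the real rule).
- alert: SystemdUnitFailed
  expr: node_systemd_unit_state{state="failed"} == 1
  for: 5m
  labels:
    severity: critical   # previously: warning
  annotations:
    summary: "systemd unit {{ $labels.name }} failed on {{ $labels.instance }}"
```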

The Alertmanager configuration for the data-engineering team can be seen here.

It shows the following routing (a rough sketch in Alertmanager terms follows the list):

  • warning notifications are currently routed to the #wikimedia-analytics IRC channel
  • critical notifications are currently routed to the #wikimedia-analytics IRC channel and the data-engineering-alerts@lists.wikimedia.org mailing list
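
In Alertmanager terms, that routing is roughly equivalent to the following sketch (receiver names are placeholders, and the critical receiver bundles both the IRC and email destinations):

```yaml
# Rough sketch of the current routing for team=data-engineering alerts.
route:
  routes:
    - match:
        team: data-engineering
        severity: warning
      receiver: irc-wikimedia-analytics            # IRC only
    - match:
        team: data-engineering
        severity: critical
      receiver: irc-analytics-plus-de-alerts-email # IRC + data-engineering-alerts@ list
```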

Looking at the emails, we can see that the most recent SystemdUnitFailed alert contains 20 failed systemd units across the fleet of servers for which the role contact is data-engineering.

In my opinion, none of these alerts should have gone to the data-engineering-alerts mailing list.

The only two systemd alerts that are relevant to, and actionable by, the Data Engineering team were sent in a separate email directly to the list, because they are based on a failed pipeline run from a systemd timer.

I believe that we need to start making some changes to the Icinga and Alertmanager configuration right away, in order not to put the data pipelines at risk by overwhelming the Ops Week engineers with noise.

Change 999561 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Change all role contacts for Data Engineering -> Data Platform

https://gerrit.wikimedia.org/r/999561

Change 999561 merged by Btullis:

[operations/puppet@production] Change all role contacts for Data Engineering -> Data Platform

https://gerrit.wikimedia.org/r/999561
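
For reference, a role's contact team is set via a hiera key in the puppet tree, so this change boils down to flipping that value for each affected role. The key name and file path below reflect my understanding of the convention and should be treated as assumptions:

```yaml
# Assumed hiera key and example path; the exact convention in operations/puppet may differ.
# hieradata/role/common/analytics_cluster/hadoop/master.yaml
profile::contacts::role_contacts: ['Data Platform']   # previously: ['Data Engineering']
```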

Change #1030189 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/alerts@master] Move some of the data-engineering alerts to data-platform

https://gerrit.wikimedia.org/r/1030189

Change #1030189 merged by jenkins-bot:

[operations/alerts@master] Move some of the data-engineering alerts to data-platform

https://gerrit.wikimedia.org/r/1030189
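
In the operations/alerts repository, "moving" an alert between teams essentially means changing its team label (and, where applicable, its team directory) so that Alertmanager routes it to the other team's receivers. The snippet below is a hypothetical sketch of such a move; the alert name and expression are placeholders:

```yaml
# Hypothetical example of moving an alert from data-engineering to data-platform.
- alert: AirflowDagImportErrors          # placeholder alert name
  expr: pipeline_task_lag_seconds > 3600 # placeholder expression
  for: 30m
  labels:
    severity: warning
    team: data-platform                  # previously: team: data-engineering
```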