
Alerts Review: determine if we can use Prometheus to alert based on historical datasets
Closed, ResolvedPublic

Description

Creating this ticket to determine whether we can use Prometheus to alert based on historical datasets.

This conversation started in the Alerts Review doc. I've edited the remarks a bit; feel free to add more context by editing the task if you feel it's necessary.

@Ottomata :

There is a use case that Prometheus does not support that makes up a big portion of alerts that DPE has to respond to: Alerts on historical datasets.

The Refine alerts are the best example, as are SLA alerts about Airflow-scheduled and -generated datasets.

We want the ability to alert on the generation, availability, and freshness of datasets.

@gmodena :

Could a workflow like this work?

  1. ETL job fails
  2. An alert / notification is triggered (no details, just a warn that something broke)
  3. A Phabricator task is created by Airflow with details on the error and how to recover (e.g. run this command on this machine).
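The steps above could be sketched as an Airflow-style failure callback. This is a minimal illustration, not the actual implementation: the helper names and context keys are hypothetical, and the call that would file the task via Phabricator's Conduit API is left as a comment.

```python
# Sketch of step 3: turn a failed run's context into a Phabricator task.
# build_phab_task and the context keys below are illustrative assumptions.

def build_phab_task(dag_id, task_id, execution_date, error):
    """Build the title/description for a task about one failed run."""
    title = f"[{dag_id}] {task_id} failed for {execution_date}"
    description = (
        f"Run {execution_date} of {dag_id}.{task_id} failed: {error}\n\n"
        f"To recover, re-run the task for this partition, e.g.:\n"
        f"  airflow tasks run {dag_id} {task_id} {execution_date}"
    )
    return {"title": title, "description": description}

def on_failure_callback(context):
    # Airflow passes a context dict with task-instance details to
    # on_failure_callback; only the payload construction is shown here.
    payload = build_phab_task(
        context["dag_id"],
        context["task_id"],
        context["execution_date"],
        context["exception"],
    )
    # file_phab_task(payload)  # hypothetical call to Conduit's maniphest.createtask
    return payload
```

Because the description embeds the execution date, each failed partition gets its own task rather than one generic "job broken" signal.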

@Ottomata:

Probably, but what happens when the next run of the ETL job succeeds for a different hour?

@gmodena:

If we treat each run as a standalone service / job, wouldn't this avoid overwriting history?
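One way to realize "each run as a standalone job" with Prometheus is to push each run's outcome to a Pushgateway under a grouping key that includes the partition, so a later hour's success does not overwrite an earlier hour's failure. A sketch of the URL layout only (the job and label names are illustrative), following the Pushgateway convention of `/metrics/job/<job>/<label>/<value>`:

```python
# Sketch: each (job, partition) pair gets its own Pushgateway metric group,
# so per-run history is preserved instead of being overwritten.
# The job/label names are assumptions for illustration.

def pushgateway_path(job: str, partition: str) -> str:
    """Build the Pushgateway push path for one run of one partition."""
    return f"/metrics/job/{job}/partition/{partition}"

print(pushgateway_path("refine_webrequest", "2024-02-27T15"))
```

The trade-off is cardinality: one metric group per partition grows without bound, which is part of why this question is worth settling before building the alerts.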

Event Timeline

bking changed the task status from Open to In Progress. Feb 27 2024, 4:22 PM
bking claimed this task.

Based on last week's discussion, we believe it is possible to alert using Prometheus metrics, so I'm closing out this ticket. Work to create the alerts is tracked in T359056.

Oh cool! @bking I read the linked notes, but I'm missing how it's going to work. How can you alert that dataset $X for partition $N is failing? Is there a way to make partition or hour or datetime or whatever a label?
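For context on the question above: using a raw datetime as a label value creates unbounded cardinality, so a common Prometheus pattern is instead a per-dataset "last success" gauge, with the alert firing on staleness. A minimal sketch, assuming illustrative metric and dataset names:

```python
# Sketch: expose freshness as a gauge (timestamp of the latest successful
# partition) in Prometheus exposition format. Metric/dataset names are
# illustrative assumptions, not the names used in the actual alerts.
import time

def render_freshness_metric(dataset: str, last_success_ts: float) -> str:
    """Render a per-dataset last-success gauge in exposition format."""
    return (
        "dataset_last_success_timestamp_seconds"
        f'{{dataset="{dataset}"}} {last_success_ts}\n'
    )

# A PromQL alert rule (shown as a comment) would then fire on staleness:
#   time() - dataset_last_success_timestamp_seconds{dataset="refine_webrequest"} > 7200
print(render_freshness_metric("refine_webrequest", time.time()))
```

This answers freshness and availability; identifying *which* partition failed would still need something like the per-run task-filing discussed earlier in the thread.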