
Alerts Review: determine if we can use Prometheus to alert based on historical datasets
Closed, ResolvedPublic

Description

Creating this ticket to determine whether we can use Prometheus to alert based on historical datasets.

This conversation started in the Alerts Review doc. I've edited the remarks a bit; feel free to add more context by editing the task if you feel it's necessary.

@Ottomata :

There is a use case that Prometheus does not support that makes up a big portion of alerts that DPE has to respond to: Alerts on historical datasets.

The Refine alerts are the best example, as are SLA alerts about Airflow-scheduled and -generated datasets.

We want the ability to alert on the generation, availability, and freshness of datasets.

@gmodena :

Could a workflow like this work?

  1. ETL job fails
  2. An alert / notification is triggered (no details, just a warn that something broke)
  3. A Phabricator task is created by Airflow with details on the error and how to recover (e.g. run this command on this machine).
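The steps above could be sketched as an Airflow-style failure callback. This is a minimal illustration, not the actual implementation: the helper names and context keys are hypothetical, and the call that would file the task via Phabricator's Conduit API is left as a comment.

```python
# Sketch of step 3: turn a failed run's context into a Phabricator task.
# build_phab_task and the context keys below are illustrative assumptions.

def build_phab_task(dag_id, task_id, execution_date, error):
    """Build the title/description for a task about one failed run."""
    title = f"[{dag_id}] {task_id} failed for {execution_date}"
    description = (
        f"Run {execution_date} of {dag_id}.{task_id} failed: {error}\n\n"
        f"To recover, re-run the task for this partition, e.g.:\n"
        f"  airflow tasks run {dag_id} {task_id} {execution_date}"
    )
    return {"title": title, "description": description}

def on_failure_callback(context):
    # Airflow passes a context dict with task-instance details to
    # on_failure_callback; only the payload construction is shown here.
    payload = build_phab_task(
        context["dag_id"],
        context["task_id"],
        context["execution_date"],
        context["exception"],
    )
    # file_phab_task(payload)  # hypothetical call to Conduit's maniphest.createtask
    return payload
```

Because the description embeds the execution date, each failed partition gets its own task rather than one generic "job broken" signal.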

@Ottomata:

Probably, but what happens when the next run of the ETL job succeeds for a different hour?

@gmodena:

If we treat each run as a standalone service / job, wouldn't this avoid overwriting history?
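One way to realize "each run as a standalone job" with Prometheus is to push each run's outcome to a Pushgateway under a grouping key that includes the partition, so a later hour's success does not overwrite an earlier hour's failure. A sketch of the URL layout only (the job and label names are illustrative), following the Pushgateway convention of `/metrics/job/<job>/<label>/<value>`:

```python
# Sketch: each (job, partition) pair gets its own Pushgateway metric group,
# so per-run history is preserved instead of being overwritten.
# The job/label names are assumptions for illustration.

def pushgateway_path(job: str, partition: str) -> str:
    """Build the Pushgateway push path for one run of one partition."""
    return f"/metrics/job/{job}/partition/{partition}"

print(pushgateway_path("refine_webrequest", "2024-02-27T15"))
```

The trade-off is cardinality: one metric group per partition grows without bound, which is part of why this question is worth settling before building the alerts.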

Event Timeline

bking changed the task status from Open to In Progress. Feb 27 2024, 4:22 PM
bking claimed this task.

Based on last week's discussion, we believe it is possible to alert using Prometheus metrics, so I'm closing out this ticket. Work to create the alerts is tracked in T359056.

Oh cool! @bking I read the linked notes, but I'm missing how it's going to work. How can you alert that dataset $X for partition $N is failing? Is there a way to make partition or hour or datetime or whatever a label?
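For context on the question above: using a raw datetime as a label value creates unbounded cardinality, so a common Prometheus pattern is instead a per-dataset "last success" gauge, with the alert firing on staleness. A minimal sketch, assuming illustrative metric and dataset names:

```python
# Sketch: expose freshness as a gauge (timestamp of the latest successful
# partition) in Prometheus exposition format. Metric/dataset names are
# illustrative assumptions, not the names used in the actual alerts.
import time

def render_freshness_metric(dataset: str, last_success_ts: float) -> str:
    """Render a per-dataset last-success gauge in exposition format."""
    return (
        "dataset_last_success_timestamp_seconds"
        f'{{dataset="{dataset}"}} {last_success_ts}\n'
    )

# A PromQL alert rule (shown as a comment) would then fire on staleness:
#   time() - dataset_last_success_timestamp_seconds{dataset="refine_webrequest"} > 7200
print(render_freshness_metric("refine_webrequest", time.time()))
```

This answers freshness and availability; identifying *which* partition failed would still need something like the per-run task-filing discussed earlier in the thread.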