Creating this ticket to determine whether we can use Prometheus to alert based on historical datasets.
This conversation started in the Alerts Review doc. I've edited the remarks a bit; feel free to add more context by editing the task if you think it's necessary.
There is a use case that Prometheus does not support, and it makes up a big portion of the alerts DPE has to respond to: alerts on historical datasets.
The Refine alerts are the best example, as are SLA alerts for Airflow-scheduled and Airflow-generated datasets.
We want the ability to alert on the generation, availability, and freshness of datasets.
@gmodena :
Could a workflow like this work?
- ETL job fails
- An alert / notification is triggered (no details, just a warn that something broke)
- A phab is created by Airflow with details on the error and how to recover (e.g. run this command on this machine).
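A minimal sketch of the last step: an Airflow `on_failure_callback` could build a payload for Phabricator's Conduit `maniphest.createtask` method and POST it. The function and all the dataset/command names below are hypothetical and purely illustrative; the actual error details and recovery instructions would come from the failing task.

```python
# Hypothetical sketch: build the Conduit payload that an Airflow
# on_failure_callback could send to maniphest.createtask.
# All concrete names (DAG, command) below are made up for illustration.

def build_failure_task(dag_id, task_id, execution_date, error, recovery_cmd):
    """Return a maniphest.createtask-style payload describing a failed run."""
    title = f"[airflow] {dag_id}.{task_id} failed for {execution_date}"
    description = (
        f"Run `{dag_id}.{task_id}` for `{execution_date}` failed.\n\n"
        f"Error:\n{error}\n\n"
        f"To recover, run:\n{recovery_cmd}"
    )
    return {"title": title, "description": description}

payload = build_failure_task(
    "refine_webrequest",
    "refine",
    "2023-01-01T03:00:00",
    "HiveException: partition not found",
    "sudo -u analytics refine --since 2023-01-01T03",
)
```

The callback itself would then submit this payload with the API token configured for the bot account; error handling around the HTTP call is omitted here.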
Probably, but what happens when the next run of the ETL job succeeds for a different hour?
If we treat each run as a standalone service / job, wouldn't this avoid overwriting history?
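The idea above can be sketched in a few lines: if failure state is keyed by (dataset, run hour) rather than by dataset alone, a success for a later hour clears only its own entry, so an earlier hour's failure is not overwritten. This is a toy in-memory sketch, not a proposal for a specific backend; the dataset and hour values are made up.

```python
# Toy sketch: track failure state per (dataset, run_hour) so each run
# is its own standalone "job" and later successes don't erase history.
failures = {}

def record_run(dataset, run_hour, ok):
    key = (dataset, run_hour)
    if ok:
        # A success only clears the failure for *this* hour, if any.
        failures.pop(key, None)
    else:
        failures[key] = "failed"

record_run("webrequest", "2023-01-01T03", ok=False)  # hour 03 fails
record_run("webrequest", "2023-01-01T04", ok=True)   # hour 04 succeeds
# hour 03's failure is still tracked despite hour 04 succeeding
```

In Prometheus terms this roughly corresponds to labeling the metric with the run hour instead of exporting a single per-dataset gauge that the next run overwrites.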