We already get alerted if a DAG fails (or if a single task fails and exhausts its maximum retries). This is the standard Airflow monitoring framework, which is already in place.
We have several (potentially complementary) options for more sophisticated monitoring:
- we could define an SLA for each global dump DAG, probably differentiating between full and partial dumps (see the SLA sketch below)
- we could also define a DAG that runs a sensor for each wiki, verifying that the various dump files made it to the clouddumps servers and are publicly visible on the internet (see the sensor sketch below)
- we could set up a DAG that performs a daily end-to-end dump of a wiki and checks whether the whole pipeline functions correctly (see the canary sketch below).
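As a rough illustration of the first option, here is a minimal sketch of attaching an SLA to a dump DAG, assuming Airflow 2.x; the DAG id `wiki_dump_full`, the schedule, and the three-day SLA value are placeholders, not actual configuration:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator


def on_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called by the scheduler when tasks in this DAG miss their SLA;
    # alerting (e-mail, IRC, etc.) would hook in here.
    print(f"SLA missed in {dag.dag_id}: {task_list}")


with DAG(
    dag_id="wiki_dump_full",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",
    sla_miss_callback=on_sla_miss,
    # SLAs in Airflow are per task, measured from the DAG run's scheduled
    # start; a partial-dump DAG would presumably use a tighter timedelta.
    default_args={"sla": timedelta(days=3)},
) as dag:
    EmptyOperator(task_id="dump_placeholder")
```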
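For the second option, a sketch along these lines could fan out one sensor per wiki, assuming the apache-airflow-providers-http package; the connection id `dumps_public`, the URL path layout, and the wiki list are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.http.sensors.http import HttpSensor

WIKIS = ["enwiki", "dewiki", "frwiki"]  # illustrative subset of wikis

with DAG(
    dag_id="dump_availability_check",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    for wiki in WIKIS:
        HttpSensor(
            task_id=f"check_{wiki}",
            http_conn_id="dumps_public",  # assumed connection to the public dumps host
            endpoint=f"{wiki}/latest/",   # assumed path layout on the server
            response_check=lambda response: response.status_code == 200,
            poke_interval=600,            # re-check every 10 minutes
            timeout=6 * 3600,             # give up (and alert) after 6 hours
            mode="reschedule",            # free the worker slot between pokes
        )
```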
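For the third option, the end-to-end check could be a small "canary" DAG that runs a dump and then asserts the output exists; both shell commands below are hypothetical placeholders for the real dump and validation steps:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dump_canary_e2e",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_dump = BashOperator(
        task_id="run_dump",
        bash_command="dump-wiki testwiki --output /tmp/canary",  # placeholder for the real dump command
    )
    verify = BashOperator(
        task_id="verify_output",
        # fail the task (and trigger the standard alerting) if the
        # expected file is missing or empty
        bash_command="test -s /tmp/canary/testwiki-pages-articles.xml.bz2",
    )
    run_dump >> verify
```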
There is also considerable cross-over with T343234: Explore the use of Airflow notifiers for more flexible DAG failure handling and T356416: [M] Consider improving e-mail alerts sent by Airflow DAGs.