As a search user, I want to search up to date documents. In case of stalled indices, I want those to be fixed promptly.
As a maintainer of the search data pipeline I want to be alerted if some DAGs are not being scheduled so that I can re-enable them.
When doing hadoop maintenance it might happen that the DAGs are disabled (and/or the airflow scheduler is stopped) to help drain the YARN cluster. If we forget to re-enable those after the maintenance is over we get no alerts, airflow SLAs are being checked since they depend on the fact that the DAG can be executed.
There should be external monitoring making sure that airflow is not in this "maintenance mode" for too long (48h?).
- alert when active DAGs are off for too long
- alert when the airflow scheduler is off for too long.