
Alert when DAGs are inactive or the Airflow scheduler is down for too long
Open, High, Public

Description

As a search user, I want to search up-to-date documents. If indices stall, I want them to be fixed promptly.

As a maintainer of the search data pipeline, I want to be alerted if some DAGs are not being scheduled so that I can re-enable them.

During Hadoop maintenance, the DAGs may be disabled (and/or the Airflow scheduler stopped) to help drain the YARN cluster. If we forget to re-enable them once the maintenance is over, we get no alerts: Airflow SLAs are not being checked, since they depend on the DAG being executed.
There should be external monitoring making sure that Airflow is not in this "maintenance mode" for too long (48h?).

AC:

  • alert when active DAGs are off for too long
  • alert when the Airflow scheduler is off for too long
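
One possible shape for the external check described above, as a minimal sketch rather than a proposed implementation. It assumes a monitor that polls periodically and is handed, on each round, the paused state of every DAG that is expected to be active plus a flag saying whether the scheduler is running. How those values are obtained, and how the tracker state is persisted between runs, is left open. The names `OffTracker`, `check_round`, `paused_by_dag` and `scheduler_running` are hypothetical, and the 48h threshold is the value floated in the description.

```python
from datetime import datetime, timedelta

# Threshold floated in the task description ("48h?"); an assumption, not a decision.
MAX_OFF = timedelta(hours=48)


class OffTracker:
    """Remembers when each monitored item was first observed as 'off'."""

    def __init__(self, max_off: timedelta = MAX_OFF):
        self.max_off = max_off
        self.off_since = {}  # item name -> datetime of the first 'off' observation

    def observe(self, name: str, is_off: bool, now: datetime) -> bool:
        """Record one observation; return True if `name` has been off too long."""
        if not is_off:
            # Back on: forget any earlier 'off' observation.
            self.off_since.pop(name, None)
            return False
        first_seen = self.off_since.setdefault(name, now)
        return now - first_seen >= self.max_off


def check_round(tracker, paused_by_dag, scheduler_running, now):
    """Return the alert messages for one polling round.

    paused_by_dag: dict of DAG id -> True if the DAG is currently paused
                   (hypothetical input, restricted to DAGs expected to be active).
    scheduler_running: True if the scheduler is currently up (hypothetical input).
    """
    hours = int(tracker.max_off.total_seconds() // 3600)
    alerts = []
    for dag_id, paused in sorted(paused_by_dag.items()):
        if tracker.observe(f"dag:{dag_id}", paused, now):
            alerts.append(f"DAG {dag_id} has been off for {hours}h or more")
    if tracker.observe("scheduler", not scheduler_running, now):
        alerts.append(f"scheduler has been off for {hours}h or more")
    return alerts
```

The tracker measures how long an item has been continuously observed as off, so a short maintenance window raises nothing and re-enabling a DAG resets its clock. Because the check runs outside Airflow, it does not depend on the scheduler being up.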

Event Timeline

MPhamWMF moved this task from needs triage to ML & Data Pipeline on the Discovery-Search board.
Gehel raised the priority of this task from Needs Triage to High. Jul 19 2021, 3:20 PM
Gehel updated the task description.