
Investigate why airflow sensor tasks fail without sending errors
Closed, Resolved | Public | 5 Estimated Story Points

Description

Issue Description:

A number of Airflow tasks had failed, but we received no error email. It seems that all the failed tasks were sensors.

Impact:

This could lead to a critical failure going unnoticed.

Tasks:
  • Investigate why these failures are not triggering a notification
  • Develop a fix

Details

Other Assignee
JAllemandou

Event Timeline

EChetty triaged this task as High priority.
EChetty updated the task description. (Show Details)
EChetty added a project: Data Pipelines.
EChetty updated Other Assignee, added: JAllemandou.
EChetty set the point value for this task to 5.
Jul 7 2022, 8:53 AM

Possible solutions:

  • Upgrade Airflow
  • Possible cause: not enough disk space
    • Clear some disk space on the machines? Needs to be investigated
  • Move Airflow to a VM and test it in an isolated environment

This issue has not recurred since we deleted some Airflow logs under an-launcher1002:/srv/analytics-airflow/logs.
Also, every time a sensor failed silently, at least one of that sensor's log files was missing (Airflow couldn't find it).
From this, and from some team conversations, we suspect that the silent failures could be caused by the logs filling up the /srv disk on an-launcher1002.
While looking at the logs, I found that an Airflow bug is polluting them (and breaking SLAs); see: https://phabricator.wikimedia.org/T314181#8116392
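To test the disk-space hypothesis, a small check like the following could be run periodically on the host. This is only a sketch: the function names and the 90% threshold are assumptions for illustration, not something taken from the existing setup.

```python
import shutil


def disk_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_disk_nearly_full(path, threshold=0.9):
    """Return True when the used fraction reaches `threshold`.

    The 0.9 default is an arbitrary assumption; pick a value that leaves
    enough room for Airflow to keep writing task logs.
    """
    return disk_usage_fraction(path) >= threshold
```

On an-launcher1002 this would be called as `is_disk_nearly_full("/srv")`, and a True result could be correlated with the times of the silent sensor failures.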

Things we can do:

  • It seems we urgently need an Airflow upgrade.
  • We can also set up some Airflow scheduler log rotation or deletion.
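
The log deletion option could be sketched roughly as below. This is a minimal illustration, not an existing script: the function names and the retention period are hypothetical, and it defaults to a dry run so nothing is removed by accident.

```python
import os
import time


def find_stale_logs(log_root, max_age_days, now=None):
    """Return sorted paths of files under `log_root` not modified for `max_age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(log_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)


def delete_stale_logs(log_root, max_age_days, dry_run=True):
    """Delete stale log files and return the list of affected paths.

    With dry_run=True (the default) the files are only listed, not deleted.
    """
    stale = find_stale_logs(log_root, max_age_days)
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale
```

For example, `delete_stale_logs("/srv/analytics-airflow/logs", 30)` would list the files older than 30 days under the log directory mentioned above; the 30-day retention is an assumption and should be chosen so that logs still needed for debugging are kept.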

The problem happened again today.
I'm in favor of creating a dedicated alarm for Airflow failed tasks: if any dagRun of an instance has a failed task, the alarm fires.
This would be feasible using the Airflow REST API:
curl 'http://an-launcher1002.eqiad.wmnet:8600/api/v1/dags/~/dagRuns?state=failed'
However, the state filter was only added in version 2.3.0 of Airflow.
We therefore need to upgrade before we can set this up - ping T315580!
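
Once the upgrade is done, the alarm check could look roughly like the sketch below. It assumes the REST API is reachable without authentication, that the response carries a `dag_runs` list whose entries have `dag_id`, `dag_run_id` and `state` fields, and that the first page of results is enough; the function names are hypothetical. Note that it checks for failed dagRuns, as the curl call does, as a proxy for failed tasks.

```python
import json
import urllib.request


def extract_failed_runs(payload):
    """Return (dag_id, dag_run_id) pairs for the failed runs in a dagRuns response."""
    return [
        (run["dag_id"], run["dag_run_id"])
        for run in payload.get("dag_runs", [])
        if run.get("state") == "failed"
    ]


def fetch_failed_runs(base_url, timeout=10):
    """Query the dagRuns endpoint for failed runs across all DAGs.

    Assumes no authentication is required; add credentials if the
    instance is configured with an auth backend.
    """
    url = f"{base_url}/api/v1/dags/~/dagRuns?state=failed"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return extract_failed_runs(payload)


def alarm_should_fire(failed_runs):
    """The alarm fires as soon as at least one failed run is reported."""
    return len(failed_runs) > 0
```

A monitoring job would call `fetch_failed_runs("http://an-launcher1002.eqiad.wmnet:8600")` on a schedule and raise the alarm when `alarm_should_fire` returns True. Because this does not depend on Airflow's own email callbacks, it would also catch the silent sensor failures described in this task.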

lbowmaker subscribed.

Resolving ticket as we have moved on with Airflow versions.