
Spike: Investigate lineage from Airflow and Spark
Closed, Resolved · Public

Event Timeline

This was just a quick research task. My conclusion is that where we track lineage is partly style but mostly strategy.

In Airflow, we would emit lineage information directly from the DAG definitions. In some cases, the jobs we launch from Airflow are complex and create intermediate datasets; lineage for those could instead be emitted from Spark, or from a combination of Spark and Airflow. So the question is basically:
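As a rough illustration of what "emitting lineage from Airflow" means, here is a minimal sketch. The event shape and names below are invented for the example (a real deployment would likely use a standard such as OpenLineage and POST events to a lineage service):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical event shape -- illustrative only, not a real
# Airflow or OpenLineage API.
@dataclass
class LineageEvent:
    job: str                                          # e.g. the Airflow task id
    inputs: List[str] = field(default_factory=list)   # upstream datasets
    outputs: List[str] = field(default_factory=list)  # datasets the job produces

def emit_lineage(event: LineageEvent) -> dict:
    """Serialize the event; a real emitter would send this payload
    to a lineage backend instead of returning it."""
    return {"job": event.job, "inputs": event.inputs, "outputs": event.outputs}

# The scheduler already knows these paths -- the same information
# that drives our sensors and operators.
event = LineageEvent(
    job="webrequest_load",
    inputs=["hdfs:///wmf/data/raw/webrequest"],
    outputs=["hdfs:///wmf/data/wmf/webrequest"],
)
print(emit_lineage(event))
```

The point is only that the payload is small and entirely derivable from what the DAG already declares.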

Do we emit from only Airflow, only Spark, or a combination?

- Only Airflow: easiest to make this a shared best practice across WMF.
- Only Spark: we would need to change how we wrap jobs and standardize completely on Spark.
- A combination: we could emit everything obvious from Airflow. The scheduler already has to know about inputs and outputs (that information informs the sensors and operators that we create), so we would just have to rework those a bit.
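To make the combination option concrete, a toy sketch of reusing scheduler-side declarations for lineage. The classes below are invented stand-ins, not real Airflow sensors or operators; they only show that the dataset names lineage needs are already present in what we declare:

```python
from typing import List

class WaitForDataset:
    """Stand-in for a sensor: the dataset it waits on is an input."""
    def __init__(self, dataset: str):
        self.dataset = dataset

class RunJob:
    """Stand-in for an operator: the datasets it writes are outputs."""
    def __init__(self, task_id: str, writes: List[str]):
        self.task_id = task_id
        self.writes = writes

def lineage_from_tasks(sensors: List[WaitForDataset],
                       operators: List[RunJob]) -> dict:
    """Derive a lineage record purely from existing task declarations."""
    return {
        "inputs": [s.dataset for s in sensors],
        "outputs": [o for op in operators for o in op.writes],
    }

record = lineage_from_tasks(
    sensors=[WaitForDataset("raw.webrequest")],
    operators=[RunJob("refine", writes=["wmf.webrequest"])],
)
print(record)
```

Anything the DAG cannot see (intermediate datasets created inside a complex Spark job) would still have to be emitted from Spark itself, which is where the combination comes in.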