
Spike: Investigate lineage from Airflow and Spark
Closed, Resolved · Public

Event Timeline

This was just a quick research task. My conclusion is that where we track lineage is partly style but mostly strategy.

In Airflow, we would emit lineage information directly from the DAG definitions. In some cases, the jobs we launch from Airflow are complex and create intermediate datasets; lineage for those could instead be emitted from Spark, or from a combination of Spark and Airflow. So the question is basically:
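As a rough illustration of what "emitting lineage from Airflow" means, here is a minimal sketch. The event shape and names below are invented for the example (a real deployment would likely use a standard such as OpenLineage and POST events to a lineage service):

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical event shape -- illustrative only, not a real
# Airflow or OpenLineage API.
@dataclass
class LineageEvent:
    job: str                                          # e.g. the Airflow task id
    inputs: List[str] = field(default_factory=list)   # upstream datasets
    outputs: List[str] = field(default_factory=list)  # datasets the job produces

def emit_lineage(event: LineageEvent) -> dict:
    """Serialize the event; a real emitter would send this payload
    to a lineage backend instead of returning it."""
    return {"job": event.job, "inputs": event.inputs, "outputs": event.outputs}

# The scheduler already knows these paths -- the same information
# that drives our sensors and operators.
event = LineageEvent(
    job="webrequest_load",
    inputs=["hdfs:///wmf/data/raw/webrequest"],
    outputs=["hdfs:///wmf/data/wmf/webrequest"],
)
print(emit_lineage(event))
```

The point is only that the payload is small and entirely derivable from what the DAG already declares.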

Do we emit from only Airflow, only Spark, or a combination?

- Only Airflow: easiest to make this a shared best practice across WMF.
- Only Spark: we would need to change how we wrap jobs and standardize completely on Spark.
- A combination: we could emit everything obvious from Airflow. The scheduler already has to know about inputs and outputs (that information informs the sensors and operators that we create), so we would just have to rework those a bit.
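To make the combination option concrete, a toy sketch of reusing scheduler-side declarations for lineage. The classes below are invented stand-ins, not real Airflow sensors or operators; they only show that the dataset names lineage needs are already present in what we declare:

```python
from typing import List

class WaitForDataset:
    """Stand-in for a sensor: the dataset it waits on is an input."""
    def __init__(self, dataset: str):
        self.dataset = dataset

class RunJob:
    """Stand-in for an operator: the datasets it writes are outputs."""
    def __init__(self, task_id: str, writes: List[str]):
        self.task_id = task_id
        self.writes = writes

def lineage_from_tasks(sensors: List[WaitForDataset],
                       operators: List[RunJob]) -> dict:
    """Derive a lineage record purely from existing task declarations."""
    return {
        "inputs": [s.dataset for s in sensors],
        "outputs": [o for op in operators for o in op.writes],
    }

record = lineage_from_tasks(
    sensors=[WaitForDataset("raw.webrequest")],
    operators=[RunJob("refine", writes=["wmf.webrequest"])],
)
print(record)
```

Anything the DAG cannot see (intermediate datasets created inside a complex Spark job) would still have to be emitted from Spark itself, which is where the combination comes in.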