Spike Goal
Determine what the user experience is like when integrating Spark with DataHub.
Key Questions:
- What do we get when we integrate Spark with DataHub? Is this something we want to support?
- Evaluate the creation of:
  - Pipelines
  - Tasks
  - Lineage between source and destination datasets
- Can this play a part in the broader data-platform strategy?
- Can we enable this for a single Airflow/Spark job and see what we can visualize in DataHub?
Deprecated: https://datahubproject.io/docs/metadata-integration/java/spark-lineage/
Beta version: https://datahubproject.io/docs/metadata-integration/java/spark-lineage-beta
Current version: https://datahubproject.io/docs/metadata-integration/java/acryl-spark-lineage
Spike results:
The spike was completed successfully in Q1, demonstrating the use of the DatahubSparkListener in practice and surfacing column-level lineage information in DataHub.
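For reference, the listener described above is attached through standard Spark configuration. The sketch below follows the acryl-spark-lineage documentation; the package version, GMS endpoint, and job script name are placeholders, not values from this spike.

```shell
# Sketch: attaching the DataHub listener to a Spark job via spark-submit.
# Version "0.2.16", the GMS URL, and my_job.py are illustrative placeholders.
spark-submit \
  --packages io.acryl:acryl-spark-lineage_2.12:0.2.16 \
  --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
  --conf "spark.datahub.rest.server=http://datahub-gms:8080" \
  my_job.py
```

Because this is plain Spark conf, the same keys can also be set in spark-defaults.conf or per job.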
Completion Requirements
- Get Spark lineage working in DataHub
- Make Spark-based lineage configurable
- Enable Spark-based lineage for a suitable test Spark job (Hive)
- Enable Spark-based lineage for all suitable Spark jobs in the analytics Airflow instance (Hive)
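One way to keep Spark-based lineage configurable per job, as the requirements above call for, is to hold the DataHub conf keys in a separate dict and merge them into each job's Spark conf only when lineage is enabled. This is a minimal sketch, assuming an Airflow DAG that submits jobs with a conf dict (e.g. via SparkSubmitOperator); the endpoint is a placeholder.

```python
# Sketch: toggling the DataHub listener per job by merging conf dicts.
# The conf keys follow the acryl-spark-lineage docs; the GMS endpoint
# below is a placeholder, not a value from this spike.
DATAHUB_LINEAGE_CONF = {
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",
    "spark.datahub.rest.server": "http://datahub-gms:8080",  # placeholder endpoint
}

def build_spark_conf(job_conf: dict, lineage_enabled: bool) -> dict:
    """Return the job's Spark conf, with DataHub lineage keys added if enabled."""
    if lineage_enabled:
        return {**job_conf, **DATAHUB_LINEAGE_CONF}
    return dict(job_conf)

# In the DAG this merged dict would be passed to the submit operator,
# e.g. SparkSubmitOperator(conf=build_spark_conf(job_conf, True), ...).
job_conf = {"spark.executor.memory": "4g"}
merged = build_spark_conf(job_conf, lineage_enabled=True)
```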
Follow-ups
- Upgrade Spark, Iceberg, and the connector so the connector supports Iceberg
- Enable Spark-based lineage for all remaining Spark jobs that use Iceberg tables
- Enable lineage for the other Airflow instances.
