
Integrate Spark with DataHub with lineage (Data-Engineering)
Closed, Resolved · Public · 13 Estimated Story Points

Description

Spike Goal
Determine what the user experience is when integrating DataHub with Spark
Key Questions:
  • What do we get when we integrate Spark with DataHub? Is this something we want to support?
  • Evaluate the creation of
    • Pipelines
    • Tasks
    • Lineage between source and destination datasets
  • Can this play a part in the broader Data-Platform strategy?
  • Can we just do this for one Airflow/Spark job and see what we can visualize in DataHub?

Deprecated: https://datahubproject.io/docs/metadata-integration/java/spark-lineage/
New version: https://datahubproject.io/docs/metadata-integration/java/spark-lineage-beta
Newest version: https://datahubproject.io/docs/metadata-integration/java/acryl-spark-lineage

Spike results:
The Spike was successfully completed in Q1, demonstrating the use of the DatahubSparkListener in practice and surfacing column-level lineage information in DataHub.

Completion Requirements
  • Get Spark lineage working in DataHub
  • Make Spark-based lineage configurable
  • Enable Spark-based lineage for a suitable test Spark job (Hive)
  • Enable Spark-based lineage for all suitable Spark jobs in the analytics airflow instance (Hive)
Follow-ups
  • Upgrade Spark, Iceberg, and the connector to support Iceberg
  • Enable Spark-based lineage for all remaining Spark jobs using Iceberg tables
  • Enable for other airflow instances.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Enable Spark Lineage on even more dags | repos/data-engineering/airflow-dags!1059 | tchin | enable-even-more-spark-lineage | main
Disable Spark Lineage on questionable Spark operations | repos/data-engineering/airflow-dags!991 | tchin | disable-spark-lineage | main
Enable Spark Lineage on a DAG that generates temp files | repos/data-engineering/airflow-dags!976 | tchin | temp-file-lineage | main
Enable spark lineage on more dags | repos/data-engineering/airflow-dags!935 | tchin | enable-more-lineage | main
Add Datahub Spark lineage support (no iceberg) | repos/data-engineering/airflow-dags!820 | tchin | support-spark-lineage | main

Event Timeline

lbowmaker removed the point value 2 for this task. Mar 26 2024, 12:15 AM

I ran a simple Spark SQL job on a stat box with:

sudo -u analytics-privatedata spark3-sql \
--jars ./acryl-spark-lineage-0.2.16.jar \
--conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
--master local[12] --driver-memory 8G \
--conf "spark.datahub.emitter=file" \
--conf "spark.datahub.file.filename=./il_lineage" \
-f il_test.hql

il_test.hql

INSERT OVERWRITE TABLE tchin.interlanguage_navigation  
    SELECT /*+ COALESCE(1) */  
        *
    FROM wmf_traffic.interlanguage_navigation
    WHERE day="2024-08-14"
;

Note that these are both Iceberg tables.
It emitted these URNs:

[
{
    "aspectName" : "dataFlowInfo",
    "entityUrn" : "urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default)",
    "entityType" : "dataFlow",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"customProperties\":{\"spark.master\":\"local[12]\",\"jobId\":\"0\",\"processingEngineVersion\":\"3.1.2\",\"appId\":\"local-1723723383169\",\"startedAt\":\"2024-08-15T12:03:02.435Z\",\"spark.app.name\":\"SparkSQL::10.64.21.17\",\"sparkUser\":\"analytics-privatedata\",\"jobDescription\":\"INSERT OVERWRITE TABLE tchin.interlanguage_navigation  \\n    SELECT /*+ COALESCE(1) */  \\n        *\\n    FROM wmf_traffic.interlanguage_navigation\\n    WHERE day=\\\"2024-08-14\\\"\\n\",\"processingEngine\":\"spark\",\"finishedAt\":\"2024-08-15T12:03:12.220Z\"},\"name\":\"SparkSQL::10.64.21.17\"}"
    }
},
{
    "aspectName" : "status",
    "entityUrn" : "urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default)",
    "entityType" : "dataFlow",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"removed\":false}"
    }
},
{
    "aspectName" : "dataJobInfo",
    "entityUrn" : "urn:li:dataJob:(urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default),SparkSQL::10.64.21.17)",
    "entityType" : "dataJob",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"customProperties\":{\"spark.master\":\"local[12]\",\"jobId\":\"0\",\"processingEngineVersion\":\"3.1.2\",\"spark.app.name\":\"SparkSQL::10.64.21.17\",\"jobDescription\":\"INSERT OVERWRITE TABLE tchin.interlanguage_navigation  \\n    SELECT /*+ COALESCE(1) */  \\n        *\\n    FROM wmf_traffic.interlanguage_navigation\\n    WHERE day=\\\"2024-08-14\\\"\\n\",\"processingEngine\":\"spark\"},\"created\":{\"time\":1723723382435},\"name\":\"SparkSQL::10.64.21.17\",\"flowUrn\":\"urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default)\",\"type\":{\"string\":\"spark\"}}"
    }
},
{
    "aspectName" : "status",
    "entityUrn" : "urn:li:dataJob:(urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default),SparkSQL::10.64.21.17)",
    "entityType" : "dataJob",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"removed\":false}"
    }
},
{
    "aspectName" : "dataJobInputOutput",
    "entityUrn" : "urn:li:dataJob:(urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default),SparkSQL::10.64.21.17)",
    "entityType" : "dataJob",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"inputDatajobs\":[],\"inputDatasetEdges\":[],\"outputDatasetEdges\":[],\"inputDatasets\":[],\"outputDatasets\":[]}"
    }
},
{
    "aspectName" : "dataProcessInstanceInput",
    "entityUrn" : "urn:li:dataProcessInstance:019155eb-8b42-79f7-a071-0d10165278b2",
    "entityType" : "dataProcessInstance",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"inputs\":[]}"
    }
},
{
    "aspectName" : "dataProcessInstanceOutput",
    "entityUrn" : "urn:li:dataProcessInstance:019155eb-8b42-79f7-a071-0d10165278b2",
    "entityType" : "dataProcessInstance",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"outputs\":[]}"
    }
},
{
    "aspectName" : "dataProcessInstanceProperties",
    "entityUrn" : "urn:li:dataProcessInstance:019155eb-8b42-79f7-a071-0d10165278b2",
    "entityType" : "dataProcessInstance",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"name\":\"019155eb-8b42-79f7-a071-0d10165278b2\",\"created\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1723723392200}}"
    }
},
{
    "aspectName" : "dataProcessInstanceRunEvent",
    "entityUrn" : "urn:li:dataProcessInstance:019155eb-8b42-79f7-a071-0d10165278b2",
    "entityType" : "dataProcessInstance",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"result\":{\"type\":\"SUCCESS\",\"nativeResultType\":\"COMPLETE\"},\"timestampMillis\":1723723392146,\"status\":\"COMPLETE\"}"
    }
},
{
    "aspectName" : "dataProcessInstanceRelationships",
    "entityUrn" : "urn:li:dataProcessInstance:019155eb-8b42-79f7-a071-0d10165278b2",
    "entityType" : "dataProcessInstance",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"parentTemplate\":\"urn:li:dataJob:(urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default),SparkSQL::10.64.21.17)\",\"upstreamInstances\":[]}"
    }
}
]

I also ran it inserting into DataHub so you can see what it looks like:
https://datahub.wikimedia.org/tasks/urn:li:dataJob:(urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default),SparkSQL::10.64.21.17)/Documentation?is_lineage_mode=false

For some reason, it seems like it's not linking the lineage of wmf_traffic.interlanguage_navigation, even though the documentation says it defaults to looking up the tables in the hive data platform for lineage.

When I look at the logs, some things jump out at me:

12:03:12.183 [spark-listener-group-shared] DEBUG io.openlineage.spark.agent.lifecycle.plan.StreamingDataSourceV2RelationVisitor - The result of checking whether org.apache.spark.sql.execution.datasources.v2.DataSourceV2ScanRelation is an instance of org.apache.spark.sql.execution.datasources.v2.StreamingDataSourceV2Relation is false
12:03:12.184 [spark-listener-group-shared] DEBUG io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder - Visiting query plan Optional[== Parsed Logical Plan ==
'InsertIntoStatement 'UnresolvedRelation [tchin, interlanguage_navigation], [], false, true, false
+- 'UnresolvedHint COALESCE, [1]
   +- 'Project [*]
      +- 'Filter ('day = 2024-08-14)
         +- 'UnresolvedRelation [wmf_traffic, interlanguage_navigation], [], false

== Analyzed Logical Plan ==

OverwriteByExpression RelationV2[project_family#5, current_project#6, previous_project#7, navigation_count#8L, day#9] spark_catalog.tchin.interlanguage_navigation, true, false
+- Repartition 1, false
   +- Project [project_family#0, current_project#1, previous_project#2, navigation_count#3L, day#4]
      +- Filter (day#4 = cast(2024-08-14 as date))
         +- SubqueryAlias spark_catalog.wmf_traffic.interlanguage_navigation
            +- RelationV2[project_family#0, current_project#1, previous_project#2, navigation_count#3L, day#4] spark_catalog.wmf_traffic.interlanguage_navigation

== Optimized Logical Plan ==
OverwriteByExpression RelationV2[project_family#5, current_project#6, previous_project#7, navigation_count#8L, day#9] spark_catalog.tchin.interlanguage_navigation, true, false
+- Repartition 1, false
   +- Filter (isnotnull(day#4) AND (day#4 = 19949))
      +- RelationV2[project_family#0, current_project#1, previous_project#2, navigation_count#3L, day#4] spark_catalog.wmf_traffic.interlanguage_navigation

== Physical Plan ==
OverwriteByExpression spark_catalog.tchin.interlanguage_navigation, [AlwaysTrue()], [], org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$1983/1139659498@7202009e
+- Coalesce 1
   +- *(1) Filter (isnotnull(day#4) AND (day#4 = 19949))
      +- *(1) ColumnarToRow
         +- BatchScan[project_family#0, current_project#1, previous_project#2, navigation_count#3L, day#4] spark_catalog.wmf_traffic.interlanguage_navigation [filters=day IS NOT NULL, day = 19949]
] with output dataset builders [<function1>, <function1>, <function1>, <function1>, <function1>, <function1>, <function1>, <function1>, <function1>]
12:03:12.187 [spark-listener-group-shared] DEBUG io.openlineage.spark.agent.lifecycle.plan.WriteToDataSourceV2Visitor - The supplied logical plan IS NOT an instance of org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2
12:03:12.188 [spark-listener-group-shared] INFO io.openlineage.spark.agent.util.PlanUtils - apply method failed with
java.lang.NoSuchMethodError: org.apache.iceberg.spark.SparkSessionCatalog.icebergCatalog()Lorg/apache/iceberg/catalog/Catalog;
	at io.openlineage.spark3.agent.lifecycle.plan.catalog.IcebergHandler.getIcebergTable(IcebergHandler.java:171)
	at io.openlineage.spark3.agent.lifecycle.plan.catalog.IcebergHandler.getDatasetVersion(IcebergHandler.java:156)
	at io.openlineage.spark3.agent.lifecycle.plan.catalog.CatalogUtils3.getDatasetVersion(CatalogUtils3.java:100)
	at io.openlineage.spark3.agent.utils.DatasetVersionDatasetFacetUtils.extractVersionFromDataSourceV2Relation(DatasetVersionDatasetFacetUtils.java:46)
	at io.openlineage.spark3.agent.utils.DatasetVersionDatasetFacetUtils.includeDatasetVersion(DatasetVersionDatasetFacetUtils.java:86)
	at io.openlineage.spark3.agent.lifecycle.plan.TableContentChangeDatasetBuilder.apply(TableContentChangeDatasetBuilder.java:83)
	at io.openlineage.spark3.agent.lifecycle.plan.TableContentChangeDatasetBuilder.apply(TableContentChangeDatasetBuilder.java:30)
	at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$1.apply(AbstractQueryPlanDatasetBuilder.java:97)
	at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder$1.apply(AbstractQueryPlanDatasetBuilder.java:86)
	at io.openlineage.spark.agent.util.PlanUtils.safeApply(PlanUtils.java:295)
	at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.lambda$apply$0(AbstractQueryPlanDatasetBuilder.java:76)
	at java.util.Optional.map(Optional.java:215)
	at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.apply(AbstractQueryPlanDatasetBuilder.java:68)
	at io.openlineage.spark.api.AbstractQueryPlanDatasetBuilder.apply(AbstractQueryPlanDatasetBuilder.java:40)
	at io.openlineage.spark.agent.util.PlanUtils.safeApply(PlanUtils.java:295)
	at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.lambda$null$29(OpenLineageRunEventBuilder.java:381)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
	at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:313)
	at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
	at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildOutputDatasets(OpenLineageRunEventBuilder.java:335)
	at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.buildRun(OpenLineageRunEventBuilder.java:196)
	at io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext.end(SparkSQLExecutionContext.java:139)
	at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$sparkSQLExecEnd$4(OpenLineageSparkListener.java:131)
	at io.openlineage.client.circuitBreaker.NoOpCircuitBreaker.run(NoOpCircuitBreaker.java:27)
	at io.openlineage.spark.agent.OpenLineageSparkListener.sparkSQLExecEnd(OpenLineageSparkListener.java:128)
	at io.openlineage.spark.agent.OpenLineageSparkListener.onOtherEvent(OpenLineageSparkListener.java:103)
	at datahub.spark.DatahubSparkListener.onOtherEvent(DatahubSparkListener.java:339)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:100)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
	at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
	at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
	at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
12:03:12.191 [spark-listener-group-shared] DEBUG io.openlineage.spark.agent.lifecycle.SparkSQLExecutionContext - Posting event for end 0: {"eventTime":"2024-08-15T12:03:12.146Z","producer":"https://github.com/OpenLineage/OpenLineage/tree/1.19.0/integration/spark","schemaURL":"https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent","eventType":"COMPLETE","run":{"runId":"019155eb-9955-7a97-8414-885778bc84e6","facets":{"parent":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/1.19.0/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-0-1/ParentRunFacet.json#/$defs/ParentRunFacet","run":{"runId":"019155eb-8b42-79f7-a071-0d10165278b2"},"job":{"namespace":"default","name":"spark_sql::10.64.21.17"}},"spark_properties":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/1.19.0/integration/spark","_schemaURL":"https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunFacet","properties":{"spark.master":"local[12]","spark.app.name":"SparkSQL::10.64.21.17"}},"processing_engine":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/1.19.0/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/1-1-1/ProcessingEngineRunFacet.json#/$defs/ProcessingEngineRunFacet","version":"3.1.2","name":"spark"},"environment-properties":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/1.19.0/integration/spark","_schemaURL":"https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunFacet","environment-properties":{}}}},"job":{"namespace":"default","name":"spark_sql::10.64.21.17.overwrite_by_expression.spark_catalog_tchin_interlanguage_navigation","facets":{"jobType":{"_producer":"https://github.com/OpenLineage/OpenLineage/tree/1.19.0/integration/spark","_schemaURL":"https://openlineage.io/spec/facets/2-0-3/JobTypeJobFacet.json#/$defs/JobTypeJobFacet","processingType":"BATCH","integration":"SPARK","jobType":"SQL_JOB"}}},"inputs":[],"outputs":[]}

Unsure if the Spark datasource class mismatches matter, but it seems like it can't find some method in the Iceberg catalog. It might be that we need a different version of Iceberg to get working lineage. I'll test with non-Iceberg tables and see if we get anything.

I can see that Iceberg for Spark 3.1 does not in fact have an icebergCatalog method, but the runtimes for Spark 3.3 and later do. Going to see if I can use the Spark 3.3 configs from the airflow-dags repo.
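
A quick way to verify which runtime jar ships that method (a sketch, assuming a JDK's javap is on the PATH and the candidate jar is in the current directory):

# does this Iceberg runtime expose SparkSessionCatalog.icebergCatalog(),
# the accessor that OpenLineage's IcebergHandler calls?
unzip -o iceberg-spark-runtime-3.3_2.12-1.2.1.jar 'org/apache/iceberg/spark/SparkSessionCatalog.class' -d /tmp/iceberg-probe
javap -p -cp /tmp/iceberg-probe org.apache.iceberg.spark.SparkSessionCatalog | grep icebergCatalog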

sudo -u analytics-privatedata spark3-sql --jars ./acryl-spark-lineage-0.2.16.jar...

I might be confused, but are you working with the correct jar here?

Maybe try this jar instead? https://central.sonatype.com/artifact/io.acryl/datahub-spark-lineage

I was reading here: https://datahubproject.io/docs/0.13.1/metadata-integration/java/spark-lineage/#configuration-instructions-standalone-java-applications

We're currently on version 0.13.3 of datahub, so I would perhaps try these: https://repo1.maven.org/maven2/io/acryl/datahub-spark-lineage/0.13.3-6/
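
For example, fetching that build onto a stat box (standard Maven repository layout; this is the same jar tried further below):

wget https://repo1.maven.org/maven2/io/acryl/datahub-spark-lineage/0.13.3-6/datahub-spark-lineage-0.13.3-6.jar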

Oh yes, I see. Sorry, I got that the wrong way around :-)

Update: Tried using Spark 3.3.2 with this:

sudo -u analytics-privatedata spark3-sql \
--master local[12] --driver-memory 8G \
--jars ./iceberg-spark-runtime-3.3_2.12-1.2.1.jar,./acryl-spark-lineage-0.2.16.jar \
--conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
--conf "spark.datahub.emitter=file" \
--conf "spark.datahub.file.filename=./il_lineage" \
--conf "spark.yarn.archive=hdfs:///user/spark/share/lib/spark-3.3.2-assembly.zip" \
--conf "spark.jars.ivySettings=/etc/maven/ivysettings.xml" \
-f il_test.hql

Got an error:

24/08/16 21:56:56 ERROR Utils: uncaught error in thread spark-listener-group-shared, stopping SparkContext
java.lang.IncompatibleClassChangeError: Implementing class
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at io.openlineage.spark.agent.util.ReflectionUtils.hasClass(ReflectionUtils.java:47)
	at io.openlineage.spark34.agent.lifecycle.plan.column.MergeIntoIceberg13ColumnLineageVisitor.hasClasses(MergeIntoIceberg13ColumnLineageVisitor.java:32)
	at io.openlineage.spark.agent.lifecycle.Spark3DatasetBuilderFactory.getColumnLevelLineageVisitors(Spark3DatasetBuilderFactory.java:104)
	at io.openlineage.spark.agent.lifecycle.InternalEventHandlerFactory.createColumnLevelLineageVisitors(InternalEventHandlerFactory.java:243)
	at io.openlineage.spark.agent.lifecycle.InternalEventHandlerFactory.createColumnLevelLineageVisitors(InternalEventHandlerFactory.java:55)
	at io.openlineage.spark.agent.lifecycle.OpenLineageRunEventBuilder.<init>(OpenLineageRunEventBuilder.java:175)
	at io.openlineage.spark.agent.lifecycle.ContextFactory.createSparkApplicationExecutionContext(ContextFactory.java:60)
	at io.openlineage.spark.agent.OpenLineageSparkListener.getSparkApplicationExecutionContext(OpenLineageSparkListener.java:231)
	at io.openlineage.spark.agent.OpenLineageSparkListener.lambda$onApplicationStart$20(OpenLineageSparkListener.java:300)
	at io.openlineage.client.circuitBreaker.NoOpCircuitBreaker.run(NoOpCircuitBreaker.java:27)
	at io.openlineage.spark.agent.OpenLineageSparkListener.onApplicationStart(OpenLineageSparkListener.java:298)
	at datahub.spark.DatahubSparkListener.onApplicationStart(DatahubSparkListener.java:94)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:55)
	at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
	at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:117)
	at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:101)
	at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
	at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
	at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1381)
	at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)

It seems like in versions >0.13.1 they introduced support for OpenLineage, and this line jumps out at me: io.openlineage.spark34.agent.lifecycle.plan.column.MergeIntoIceberg13ColumnLineageVisitor.hasClasses(MergeIntoIceberg13ColumnLineageVisitor.java:32). Maybe it thinks it's on Spark 3.4? I don't know where to get the Spark 3.4 assembly zip to test this theory.

Just for fun I also tried using the older connector:

sudo -u analytics-privatedata spark3-sql \
--master local[12] --driver-memory 8G \
--jars ./iceberg-spark-runtime-3.3_2.12-1.2.1.jar,./datahub-spark-lineage-0.13.3-6.jar \
--conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
--conf "spark.datahub.emitter=file" \
--conf "spark.datahub.file.filename=./il_lineage" \
--conf "spark.yarn.archive=hdfs:///user/spark/share/lib/spark-3.3.2-assembly.zip" \
--conf "spark.jars.ivySettings=/etc/maven/ivysettings.xml" \
-f il_test.hql

Exception in thread "main" java.lang.UnsupportedClassVersionError: datahub/spark/DatahubSparkListener has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
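
Class file version 55.0 is Java 11 bytecode and 52.0 is Java 8, so this build of the listener was compiled for Java 11 while spark3-sql here runs on a Java 8 JVM. The target bytecode level can be read out of the jar itself (a sketch, assuming xxd is available; the last two of the first eight bytes of a class file hold the major version):

# print the class file header; the final two bytes shown are the major version
# (0x0037 = 55 = Java 11, 0x0034 = 52 = Java 8)
unzip -p datahub-spark-lineage-0.13.3-6.jar datahub/spark/DatahubSparkListener.class | head -c 8 | xxd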

I don't know if it is useful, but @xcollazo did some work on using custom spark assemblies and there are some notes in T340861#9136895


Unfortunately, using Spark 3.3+ only works from a PySpark context, as we need a conda artifact that bundles the whole of Spark. See here for a production example, and take note of L149:

"SPARK_HOME": "venv/lib/python3.10/site-packages/pyspark",  # point to the packaged Spark

So we are telling Skein to run spark3-sql out of the conda environment (3.3) rather than the production one (3.1).

If we want to use Spark 3.3 generally in production, we have to upgrade the production Spark.
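
A minimal sketch of that wiring (paths follow the conda venv layout from the production example above):

# run the packaged Spark 3.3 out of the conda env instead of the production 3.1 install
export SPARK_HOME="venv/lib/python3.10/site-packages/pyspark"
"$SPARK_HOME/bin/spark-sql" -f il_test.hql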

I ran a job using our regular prod configs, just without Iceberg tables. It ran successfully and output this:

[
{
    "aspectName" : "dataFlowInfo",
    "entityUrn" : "urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default)",
    "entityType" : "dataFlow",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"customProperties\":{\"spark.master\":\"local[12]\",\"jobId\":\"0\",\"processingEngineVersion\":\"3.1.2\",\"appId\":\"local-1724123844455\",\"startedAt\":\"2024-08-20T03:17:23.782Z\",\"spark.app.name\":\"SparkSQL::10.64.21.17\",\"sparkUser\":\"analytics-privatedata\",\"jobDescription\":\"INSERT INTO tchin.lineage_test  \\n    SELECT\\n        *\\n    FROM wmf.interlanguage_navigation\\n    WHERE date=\\\"2024-08-14\\\"\\n\",\"processingEngine\":\"spark\",\"finishedAt\":\"2024-08-20T03:17:31.513Z\"},\"name\":\"SparkSQL::10.64.21.17\"}"
    }
},
{
    "aspectName" : "status",
    "entityUrn" : "urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default)",
    "entityType" : "dataFlow",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"removed\":false}"
    }
},
{
    "aspectName" : "dataJobInfo",
    "entityUrn" : "urn:li:dataJob:(urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default),SparkSQL::10.64.21.17)",
    "entityType" : "dataJob",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"customProperties\":{\"spark.master\":\"local[12]\",\"jobId\":\"0\",\"processingEngineVersion\":\"3.1.2\",\"spark.app.name\":\"SparkSQL::10.64.21.17\",\"jobDescription\":\"INSERT INTO tchin.lineage_test  \\n    SELECT\\n        *\\n    FROM wmf.interlanguage_navigation\\n    WHERE date=\\\"2024-08-14\\\"\\n\",\"processingEngine\":\"spark\"},\"created\":{\"time\":1724123843782},\"name\":\"SparkSQL::10.64.21.17\",\"flowUrn\":\"urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default)\",\"type\":{\"string\":\"spark\"}}"
    }
},
{
    "aspectName" : "status",
    "entityUrn" : "urn:li:dataJob:(urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default),SparkSQL::10.64.21.17)",
    "entityType" : "dataJob",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"removed\":false}"
    }
},
{
    "aspectName" : "upstreamLineage",
    "entityUrn" : "urn:li:dataset:(urn:li:dataPlatform:hive,tchin.lineage_test,PROD)",
    "entityType" : "dataset",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"fineGrainedLineages\":[{\"confidenceScore\":0.5,\"downstreams\":[\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,tchin.lineage_test,PROD),project_family)\"],\"upstreams\":[\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,wmf.interlanguage_navigation,PROD),date)\",\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,wmf.interlanguage_navigation,PROD),project_family)\"],\"downstreamType\":\"FIELD_SET\",\"upstreamType\":\"FIELD_SET\"},{\"confidenceScore\":0.5,\"downstreams\":[\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,tchin.lineage_test,PROD),current_project)\"],\"upstreams\":[\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,wmf.interlanguage_navigation,PROD),current_project)\",\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,wmf.interlanguage_navigation,PROD),date)\"],\"downstreamType\":\"FIELD_SET\",\"upstreamType\":\"FIELD_SET\"},{\"confidenceScore\":0.5,\"downstreams\":[\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,tchin.lineage_test,PROD),previous_project)\"],\"upstreams\":[\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,wmf.interlanguage_navigation,PROD),date)\",\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,wmf.interlanguage_navigation,PROD),previous_project)\"],\"downstreamType\":\"FIELD_SET\",\"upstreamType\":\"FIELD_SET\"},{\"confidenceScore\":0.5,\"downstreams\":[\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,tchin.lineage_test,PROD),navigation_count)\"],\"upstreams\":[\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,wmf.interlanguage_navigation,PROD),date)\",\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,wmf.interlanguage_navigation,PROD),navigation_count)\"],\"downstreamType\":\"FIELD_SET\",\"upstreamType\":\"FIELD_SET\"},{\"confidenceScore\":0.5,\"downstreams\":[\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,tchin.lineage_test,PROD),date)\"],\"upstreams\":[\"urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hive,wmf.interlanguage_navigation,PROD),date)\"],\"downstreamType\":\"FIELD_SET\",\"upstreamType\":\"FIELD_SET\"}],\"upstreams\":[{\"type\":\"TRANSFORMED\",\"dataset\":\"urn:li:dataset:(urn:li:dataPlatform:hive,wmf.interlanguage_navigation,PROD)\"}]}"
    }
},
{
    "aspectName" : "dataJobInputOutput",
    "entityUrn" : "urn:li:dataJob:(urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default),SparkSQL::10.64.21.17)",
    "entityType" : "dataJob",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"inputDatajobs\":[],\"inputDatasetEdges\":[{\"destinationUrn\":\"urn:li:dataset:(urn:li:dataPlatform:hive,wmf.interlanguage_navigation,PROD)\",\"lastModified\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1724123851500}}],\"outputDatasetEdges\":[{\"destinationUrn\":\"urn:li:dataset:(urn:li:dataPlatform:hive,tchin.lineage_test,PROD)\",\"lastModified\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1724123851500}}],\"inputDatasets\":[],\"outputDatasets\":[]}"
    }
},
{
    "aspectName" : "dataProcessInstanceInput",
    "entityUrn" : "urn:li:dataProcessInstance:01916dca-1750-7480-b365-a311ff5ad22d",
    "entityType" : "dataProcessInstance",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"inputs\":[\"urn:li:dataset:(urn:li:dataPlatform:hive,wmf.interlanguage_navigation,PROD)\"]}"
    }
},
{
    "aspectName" : "dataProcessInstanceOutput",
    "entityUrn" : "urn:li:dataProcessInstance:01916dca-1750-7480-b365-a311ff5ad22d",
    "entityType" : "dataProcessInstance",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"outputs\":[\"urn:li:dataset:(urn:li:dataPlatform:hive,tchin.lineage_test,PROD)\"]}"
    }
},
{
    "aspectName" : "dataProcessInstanceProperties",
    "entityUrn" : "urn:li:dataProcessInstance:01916dca-1750-7480-b365-a311ff5ad22d",
    "entityType" : "dataProcessInstance",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"name\":\"01916dca-1750-7480-b365-a311ff5ad22d\",\"created\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1724123851500}}"
    }
},
{
    "aspectName" : "dataProcessInstanceRunEvent",
    "entityUrn" : "urn:li:dataProcessInstance:01916dca-1750-7480-b365-a311ff5ad22d",
    "entityType" : "dataProcessInstance",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"result\":{\"type\":\"SUCCESS\",\"nativeResultType\":\"COMPLETE\"},\"timestampMillis\":1724123851448,\"status\":\"COMPLETE\"}"
    }
},
{
    "aspectName" : "dataProcessInstanceRelationships",
    "entityUrn" : "urn:li:dataProcessInstance:01916dca-1750-7480-b365-a311ff5ad22d",
    "entityType" : "dataProcessInstance",
    "changeType" : "UPSERT",
    "aspect" : {
        "contentType" : "application/json",
        "value" : "{\"parentTemplate\":\"urn:li:dataJob:(urn:li:dataFlow:(spark,SparkSQL::10.64.21.17,default),SparkSQL::10.64.21.17)\",\"upstreamInstances\":[]}"
    }
}
]

Note the addition of the upstreamLineage object. This means that lineage is being emitted, so the earlier failure is purely down to Iceberg.

You can see the working pipeline in DataHub now:

Screenshot 2024-08-19 at 11.50.47 PM.png (598×2 px, 95 KB)

It seems like right now, unless we upgrade to at least Spark 3.4 and Iceberg 1.4, we will not be able to use DataHub's Spark lineage connector on Iceberg tables.

Ottomata renamed this task from "Spike: Integrate Spark with DataHub" to "Spike: Integrate Spark with DataHub with lineage". Aug 20 2024, 2:25 PM

Yeah I think we should prioritize that.

I also tested joins and they work:

image.png (352×1 px, 39 KB)


That is pretty cool!

I don't know how/where Spark's appName is autogenerated, but for DAGs to use Spark lineage we should require them to also define a static appName; otherwise there will be a new pipeline + task(s) for every DAG run.

Screenshot 2024-09-12 at 8.16.44 AM.png (714×1 px, 87 KB)

This looks like it is happening because of the way the default SparkSubmitOperator name is set.

https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/wmf_airflow_common/operators/spark.py#L51

We should change that to be named after the dag+task name, but not necessarily the task instance.

We had set the name with date information on purpose, to facilitate identifying tasks in Yarn (there is a bug: hourly tasks are named with day data, and this should be corrected).
I understand it would make things better for lineage to name them without the date-moving part. But looking in the DataHub Spark doc, I found that we can set the flow name explicitly using spark.datahub.flow_name.

Whatever direction we prefer to take, we should make and log an official decision on this, probably involving Data Product.
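
For example, the spark.datahub.flow_name setting mentioned above would look like this (a sketch; the flow name value is hypothetical):

spark3-sql \
--jars ./acryl-spark-lineage-0.2.16.jar \
--conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
--conf "spark.datahub.flow_name=interlanguage_daily" \
-f il_test.hql

so every run maps onto the same DataHub flow regardless of the date in the Yarn app name.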

Nice, that sounds good! That is going to be a job-specific setting then, so hm. That will make the discussion around how to parameterize it more annoying.

tchin renamed this task from "Spike: Integrate Spark with DataHub with lineage" to "Integrate Spark with DataHub with lineage". Oct 21 2024, 2:26 PM
tchin updated the task description.

Change #1085449 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Add airflow connection conf for datahub

https://gerrit.wikimedia.org/r/1085449

Ahoelzl changed the point value for this task from 8 to 13.

Change #1085449 abandoned by Ottomata:

[operations/puppet@production] Add airflow connection conf for datahub

https://gerrit.wikimedia.org/r/1085449

I went through all the DAGs in the analytics instance for a surface-level evaluation of whether we can apply Spark lineage to each:

DAG | Uses easy_dag? | Uses Spark? | Uses Iceberg? | Other Notes | Can apply Spark Lineage?
AnomalyDetectionDAG | No | Yes | No | Special Custom DAG class | No
apis_metrics_to_graphite_hourly | No | Yes | No | | No
aqs_hourly | Yes | Yes | Yes, for one operation | Has manual lineage from DatahubLineageEmitterOperator | Done (manually)
browser_general_daily | Yes | Yes | Yes | | No
browser_metrics_weekly | Yes | Yes | Yes | Operator generates a temp file. Unsure how that appears in DataHub | No
canary_events | Yes | No | - | | No
load_commons_impact_metrics | Yes | Yes | Yes | Uses create_easy_cassandra_loading_dag | No
cassandra_load_editors_by_country_monthly | Yes | Yes | No | Uses create_easy_cassandra_loading_dag | Cassandra tables aren't in datahub, should probably not enable until they are
cassandra_load_mediarequest_per_file_daily | Yes | Yes | No | Uses create_easy_cassandra_loading_dag |
cassandra_load_mediarequest_per_referer_daily | Yes | Yes | No | Uses create_easy_cassandra_loading_dag |
cassandra_load_mediarequest_top_files_daily | Yes | Yes | No | Uses create_easy_cassandra_loading_dag |
cassandra_load_pageview_per_article_daily | Yes | Yes | No | Uses create_easy_cassandra_loading_dag |
cassandra_load_pageview_per_project_hourly | Yes | Yes | No | Uses create_easy_cassandra_loading_dag |
cassandra_load_pageview_top_articles_daily | Yes | Yes | No | Uses create_easy_cassandra_loading_dag |
cassandra_load_pageview_top_by_country_monthly | Yes | Yes | No | Uses create_easy_cassandra_loading_dag |
cassandra_load_pageview_top_per_country_daily | Yes | Yes | No | Uses create_easy_cassandra_loading_dag |
cassandra_load_unique_devices_daily | Yes | Yes | No | Uses create_easy_cassandra_loading_dag |
create_easy_cassandra_loading_dag | Yes | Yes | No | Uses create_easy_cassandra_loading_dag |
clickstream_monthly | Yes | Yes | No | SQL dynamically generated using a java class. It also creates files. | Maybe?
commons_impact_metrics_dumps_monthly | Yes | Yes | Yes | Operator generates a temp file. | No
commons_impact_metrics_monthly | Yes | Yes | Yes | | No
metadata_ingest_daily | No | No | - | Datahub ingestion dag | No
druid_load_banner_activity_minutely_aggregated_daily | Yes | Yes | No | Uses an intermediate temp table and creates files in a temp directory | Needs mechanism to handle temp files
druid_load_edit_hourly | No | No | - | | No
druid_load_editattemptstep | No | No | - | | No
druid_load_geoeditors_monthly | Yes | Yes | No | Uses an intermediate temp table and creates files in a temp directory | Needs mechanism to handle temp files
druid_load_navigationtiming | No | No | - | | No
druid_load_netflow | No | No | - | | No
druid_load_network_flows_internal | No | No | - | | No
druid_load_pageviews_daily_aggregated_monthly | Yes | Yes | No | Uses an intermediate temp table and creates files in a temp directory | Needs mechanism to handle temp files
druid_load_pageviews_hourly | Yes | Yes | No | Uses an intermediate temp table and creates files in a temp directory | Needs mechanism to handle temp files
druid_load_prefupdate | No | No | - | | No
druid_load_unique_devices_per_domain_daily | Yes | Yes | Yes | | No
druid_load_unique_devices_per_domain_monthly | Yes | Yes | Yes | | No
druid_load_unique_devices_per_project_family_daily | Yes | Yes | Yes | | No
druid_load_unique_devices_per_project_family_monthly | Yes | Yes | Yes | | No
druid_load_virtualpageview_daily | Yes | Yes | No | Uses an intermediate temp table and creates files in a temp directory | Needs mechanism to handle temp files
druid_load_webrequest_sampled_128_hourly | Yes | Yes | No | Uses an intermediate temp table and creates files in a temp directory | Needs mechanism to handle temp files
commons_structured_data_dump_to_hive_weekly | No | Yes | No(?) | Uses a java class to read from a file and put into a table. Unsure how it would be reflected in lineage | No
wikidata_dump_to_hive_weekly | No | Yes | No(?) | See above | No
dumps_merge_backfill_to_wikitext_raw | Yes | Yes | Yes | Dag still in development. Uses python script, creates and deletes intermediate tables and files | No
dumps_merge_events_to_wikitext_raw_daily | Yes | Yes | Yes | Dag still in development | Not yet
dumps_publish_wikitext_raw_to_xml | Yes | Yes | Yes | Dag still in development | Not yet
dumps_reconcile_wikitext_raw_daily | Yes | Yes | Yes | Dag still in development | Not yet
edit_hourly | No | Yes | No | | Potentially in the future
editors_daily_monthly | No | Yes | No | | Potentially in the future
geoeditors_edits_monthly | No | Yes | No | | Potentially in the future
geoeditors_monthly | No | Yes | No | | Potentially in the future
geoeditors_public_monthly | No | Yes | No | Creates/deletes files | Needs mechanism to handle temp files
geoeditors_yearly | No | Yes | No | Creates/deletes files | Needs mechanism to handle temp files
unique_editors_by_country_monthly | No | Yes | No | | Potentially in the future
hdfs_usage_weekly | Yes | Yes | No | Uses a java class | Potentially in the future
interlanguage_daily | Yes | Yes | Yes, for one operation | | Done
mediacounts_archive_daily | No | Yes | No | Generates temp files | Needs mechanism to handle temp files
mediacounts_load_hourly | No | Yes | No | | Potentially in the future
mediarequest_hourly | No | Yes | No | | Potentially in the future
mediawiki_history_check_denormalize | No | Yes | No | |
mediawiki_history_denormalize | No | Yes | No | Generates temp files | Needs mechanism to handle temp files
mediawiki_history_dumps | No | Yes | No | Generates files | Needs mechanism to handle temp files
mediawiki_history_load | No | Yes | No | Creating partitions over files |
mediawiki_history_metrics_monthly | Yes | Yes | Yes (metrics) | Outputs metrics into tables |
mediawiki_history_reduced | Yes | Yes | No | | Yes, partially
mediawiki_history_shapshot_config | Yes | Yes | No | Uses create_easy_cassandra_loading_dag | Potentially in the future
dump_day_of_hourly_pageviews | No | Yes | No | Generates temp files | Needs mechanism to handle temp files
dump_month_of_daily_pageviews | No | Yes | No | Generates temp files | Needs mechanism to handle temp files
pageview_actor_hourly | No | Yes | No | | Potentially in the future
pageview_allowlist_check | No | Yes | No | | Potentially in the future
pageview_hourly | No | Yes | No | Generates temp files | Potentially in the future
pingback_report_weekly_v2 | Yes | Yes | No | Generates temp files |
projectview_geo | No | Yes | No | Generates temp files |
projectview_hourly | No | Yes | No | Generates temp files |
referrer_daily | Yes | Yes | Yes | | No
refine_hourly | Yes | Yes | Unsure | |
refine_to_hive_hourly | Yes | Yes | Unsure | |
session_length_daily | Yes | Yes | Yes, for one operation | | Yes, partially
table_maintenance_iceberg_ | No | Yes | Yes | | No
unique_devices_ | Yes | Yes | Yes, for one operation | Generates temp files | Yes for one operation
virtualpageview_hourly | No | Yes | No | | Potentially in the future
webrequest_actor_label_hourly | No | Yes | No | | Potentially in the future
webrequest_actor_metrics_analyzer_hourly | Yes | Yes | No | | Yes
webrequest_actor_metrics_hourly | No | Yes | No | | Potentially in the future
webrequest_actor_metrics_rollup_hourly | No | Yes | No | | Potentially in the future
webrequest_analyzer | Yes | Yes | No | Think this dag is still in development |
wikidata_coeditors_metrics_to_graphite_monthly | No | Yes | No | Graphite is deprecated |
wikidata_metrics_to_graphite_daily | No | Yes | No | Graphite is deprecated |
wmcs_report_monthly | Yes | Yes | No | Generates temp files | Needs mechanism to handle temp files

Thomas and I just checked a few things to see how we should move forward. tl;dr:

  • Spark jobs that write to files (instead of tables) might end up with some incomplete lineage info, but nothing will break.

Screenshot 2024-12-16 at 12.17.01.png (262×886 px, 19 KB)

  • Iceberg jobs shouldn't break, but there will be error logs in the job, and maybe incomplete lineage info.

Next steps:

  • Enable lineage for a Spark DAG that uses both Hive tables and writes to output files, and verify that everything is okay
  • Enable lineage by default for all analytics-instance DAGs in dag_config, but manually disable lineage for any DAG (or operator?) that uses Iceberg.

How we verified file lineage doesn't break:

$ cat ./spark_datahub_otto_test2.hql

INSERT INTO otto.p1
select
    meta.domain as name,
    1 as val,
    null as s
from event.navigationtiming
limit 10
;

INSERT OVERWRITE DIRECTORY "./spark_datahub_otto_test1"
    USING csv
    OPTIONS ('compression' 'uncompressed', 'sep' ' ')
SELECT * FROM otto.p1
limit 10
;

$ spark3-sql \
--jars hdfs://analytics-hadoop/wmf/cache/artifacts/airflow/analytics/acryl-spark-lineage-0.2.16.jar \
--conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
--conf "spark.datahub.emitter=rest" \
--conf "spark.datahub.rest.server=https://datahub-gms.discovery.wmnet:30443" \
-f ./spark_datahub_otto_test2.hql

Mentioned in SAL (#wikimedia-operations) [2025-02-13T03:39:09Z] <tchin@deploy2002> Started deploy [airflow-dags/analytics@aaba3ff]: Deploying airflow for T306896

Mentioned in SAL (#wikimedia-operations) [2025-02-13T03:40:04Z] <tchin@deploy2002> Finished deploy [airflow-dags/analytics@aaba3ff]: Deploying airflow for T306896 (duration: 01m 07s)

Column-level data lineage of the newly deployed Spark jobs can be queried and navigated in DataHub:
https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,wmf.webrequest,PROD)/Lineage?end_time_millis&is_lineage_mode=true&separate_siblings=false&show_columns=false&start_time_millis

More enabled jobs will become visible as they execute and emit data on their upcoming runs.

Ahoelzl renamed this task from "Integrate Spark with DataHub with lineage" to "Integrate Spark with DataHub with lineage (Data-Engineering)". Feb 14 2025, 6:26 PM
Ahoelzl updated the task description.
Ahoelzl triaged this task as High priority.