
Spike: Integrate Spark with DataHub
Open, Needs Triage · Public · 8 Estimated Story Points

Description

Spike Goal
Determine what the user experience is when integrating DataHub with Spark

Key Questions:

  • What do we get when we integrate Spark in this way? Is this something we want to support?
  • Evaluate the creation of
    • Pipelines
    • Tasks
    • Lineage between source and destination datasets
  • Can this play a part in the broader Data-Platform strategy?

https://datahubproject.io/docs/metadata-integration/java/spark-lineage/

Related Objects

Event Timeline

Interesting, how would this contrast with lineage from Airflow that schedules Spark jobs?

Good question! I honestly don't know at the moment. They might be able to complement each other, or it might be confusing to use them both.
I think a bit of experimentation is in order.

Spark commands supported

Below is a list of the Spark commands that are currently parsed:

  • InsertIntoHadoopFsRelationCommand
  • SaveIntoDataSourceCommand (jdbc)
  • CreateHiveTableAsSelectCommand
  • InsertIntoHiveTable

Effectively, these support data sources/sinks corresponding to Hive, HDFS and JDBC.
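To make that list concrete, here is an illustrative summary in Python of which common write operations typically end in each parsed logical-plan command. The command names come from the list above; the example operations and the helper name are my own assumptions, not part of the DataHub API:

```python
# Logical-plan commands the DataHub Spark listener parses (per the docs list
# above), mapped to example write operations that typically produce them.
# The example operations are illustrative assumptions, not an official API.
PARSED_COMMANDS = {
    "InsertIntoHadoopFsRelationCommand": "df.write.parquet('hdfs://.../out')",
    "SaveIntoDataSourceCommand": "df.write.format('jdbc').options(...).save()",
    "CreateHiveTableAsSelectCommand": "spark.sql('CREATE TABLE t AS SELECT ...')",
    "InsertIntoHiveTable": "spark.sql('INSERT INTO TABLE t SELECT ...')",
}


def emits_lineage(plan_command: str) -> bool:
    """True if the listener parses this logical-plan command for lineage."""
    return plan_command in PARSED_COMMANDS
```

In other words, a job that only reads and displays data (as in the notebook test below) triggers none of these commands, so no dataset lineage is emitted.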

Configuring Spark/DataHub integration in Notebooks

I executed the following code in a notebook in order to test the Spark/DataHub integration.

import wmfdata
spark = wmfdata.spark.get_custom_session(
    spark_config={
        'spark.executor.memory': '4g',
        'spark.jars.packages': 'http://archiva.wikimedia.org/repository/mirror-maven-central/io.acryl:datahub-spark-lineage:0.8.32',
        'spark.extraListeners': 'datahub.spark.DatahubSparkListener',
        'spark.datahub.rest.server': 'https://datahub-gms.discovery.wmnet:30443'
    }
)
spark.sql(
    """
    SELECT meta.domain, count(*)
    FROM event.mediawiki_page_create
    WHERE year=2022 AND month=4 AND day=26 and hour=0
    GROUP BY meta.domain
    ORDER BY count(*) DESC
    LIMIT 10
    """
).show()

This resulted in the creation of a pipeline and the addition of the Spark platform. The pipeline was incomplete, because the query was read-only and so didn't trigger any of the supported Spark commands, but it shows that at least this element is working.
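The same listener settings should also apply to jobs launched outside a notebook. As a hedged sketch, here is the configuration from the session above expressed as a reusable dict, plus a hypothetical helper (the name is mine) that renders it into `--conf` arguments for a spark-submit invocation:

```python
# The DataHub listener settings from the notebook session above.
# Note: spark.jars.packages is given here as a bare Maven coordinate; the
# notebook pinned it via the archiva.wikimedia.org mirror URL instead.
DATAHUB_SPARK_CONF = {
    "spark.jars.packages": "io.acryl:datahub-spark-lineage:0.8.32",
    "spark.extraListeners": "datahub.spark.DatahubSparkListener",
    "spark.datahub.rest.server": "https://datahub-gms.discovery.wmnet:30443",
}


def to_submit_flags(conf: dict) -> list:
    """Render a config dict as spark-submit '--conf key=value' arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags
```

Passing `to_submit_flags(DATAHUB_SPARK_CONF)` to a spark-submit command line would (in principle) attach the listener to a batch job the same way `get_custom_session` does in the notebook; I have not verified this end to end.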

image.png (821×770 px, 77 KB)
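One quick sanity check before running a job is to confirm that the GMS endpoint configured above is reachable. This is a sketch under assumptions: the hostname comes from the session config above, GMS exposes a `GET /config` endpoint, and the helper names are mine. The network call can only succeed from inside the cluster:

```python
import urllib.request

# GMS endpoint taken from the spark.datahub.rest.server setting above.
GMS_SERVER = "https://datahub-gms.discovery.wmnet:30443"


def config_url(server: str) -> str:
    """Build the URL of the GMS /config endpoint (returns version info)."""
    return server.rstrip("/") + "/config"


def check_gms(server: str = GMS_SERVER, timeout: int = 5) -> bool:
    """Return True if GMS answers on /config; needs cluster network access."""
    try:
        with urllib.request.urlopen(config_url(server), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

If `check_gms()` returns False from the notebook host, the listener's lineage events would presumably be dropped silently, which is worth ruling out before debugging the Spark side.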

EChetty renamed this task from Integrate Spark with DataHub to Spike: Integrate Spark with DataHub. May 5 2022, 4:21 PM
EChetty updated the task description.
EChetty subscribed.

Making this into a spike for upcoming work.

EChetty set the point value for this task to 2. May 5 2022, 4:22 PM
EChetty moved this task from Backlog to Next Up on the Data-Catalog board.

I think that it might be valuable to spend a bit more time experimenting with this approach, now that we have:

  • Upgraded DataHub to version 0.10.4
  • Finished the spark2 -> spark3 migration

@lbowmaker - Tagging you for visibility, so that you can consider what to do with this ticket about getting Spark jobs to document their own lineage in DataHub.

This is a task that I originally created as a suggestion during the MVP stage of DataHub, but it never really got evaluated as to whether this approach offers potential value.
At present it is the only remaining open ticket under the epic T299910: Data Catalog MVP. I'm minded to resolve that epic anyway, and I'm keen that we don't lose this ticket in doing so.

lbowmaker removed the point value for this task. Mar 26 2024, 12:15 AM