Spike: Integrate Spark with DataHub
Spike Goal
Determine what the user experience is when integrating DataHub with Spark.

Key Questions:

  • What do we get when we integrate Spark with DataHub? Is this something we want to support?
  • Evaluate the creation of
    • Pipelines
    • Tasks
    • Lineage between source and destination datasets
  • Can this play a part in the broader Data-Platform strategy?

Interesting, how would this contrast with lineage from Airflow that schedules Spark jobs?

Good question! I honestly don't know at the moment. They might be able to complement each other, or it might be confusing to use them both.
I think a bit of experimentation is in order.

Spark commands supported

Below is a list of Spark commands that are parsed currently:

  • InsertIntoHadoopFsRelationCommand
  • SaveIntoDataSourceCommand (jdbc)
  • CreateHiveTableAsSelectCommand
  • InsertIntoHiveTable

Effectively, these support data sources/sinks corresponding to Hive, HDFS and JDBC.
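
As a rough illustration of the mapping described above, the sketch below pairs each parsed command with the kind of source or sink it seems to correspond to. The pairing of individual commands to platforms is inferred from the command names, and `platform_for_command` is a hypothetical helper, not part of DataHub.

```python
from typing import Optional

# Spark commands the listener currently parses, paired with the platform
# each one is assumed to correspond to (inferred from the command names).
SUPPORTED_COMMANDS = {
    "InsertIntoHadoopFsRelationCommand": "hdfs",
    "SaveIntoDataSourceCommand": "jdbc",
    "CreateHiveTableAsSelectCommand": "hive",
    "InsertIntoHiveTable": "hive",
}


def platform_for_command(command_name: str) -> Optional[str]:
    """Return the assumed platform for a command, or None if it is not parsed."""
    return SUPPORTED_COMMANDS.get(command_name)


if __name__ == "__main__":
    # One parsed command and one made-up name that is not in the list.
    for name in ("InsertIntoHiveTable", "SomeOtherCommand"):
        print(name, "->", platform_for_command(name))
```

A job whose plan contains none of these commands, such as a plain SELECT, would be expected to produce a pipeline without dataset lineage.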

Configuring Spark/DataHub integration in Notebooks

I executed the following code in a notebook in order to test the Spark/DataHub integration.

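import wmfdata

# Start a session with the DataHub listener attached.
spark = wmfdata.spark.get_custom_session(
    spark_config={
        'spark.executor.memory': '4g',
        # The package coordinates were lost from the original snippet.
        'spark.jars.packages': '',
        'spark.extraListeners': 'datahub.spark.DatahubSparkListener',
        # The config key for the DataHub GMS URL was lost from the original snippet.
        '': 'https://datahub-gms.discovery.wmnet:30443'
    }
)

# An action such as show() is needed for the query to actually run.
spark.sql("""
    SELECT meta.domain, count(*)
    FROM event.mediawiki_page_create
    WHERE year=2022 AND month=4 AND day=26 AND hour=0
    GROUP BY meta.domain
    ORDER BY count(*) DESC
    LIMIT 10
""").show()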

This resulted in the creation of a pipeline and the addition of the Spark platform. The pipeline was not complete, because I didn't include one of the supported Spark commands, but it shows that at least this element is working.
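
A possible next experiment would be to make the same query write its result somewhere, so that one of the supported commands shows up in the plan. The sketch below is untested against DataHub: `write_top_domains` and the `page_creates` alias are hypothetical names, and the assumption that a Parquet write surfaces as InsertIntoHadoopFsRelationCommand has not been verified here.

```python
# Same query as in the notebook, with an alias added so the count column
# has a name that is safe to use in a Parquet file.
QUERY = """
    SELECT meta.domain, count(*) AS page_creates
    FROM event.mediawiki_page_create
    WHERE year=2022 AND month=4 AND day=26 AND hour=0
    GROUP BY meta.domain
    ORDER BY page_creates DESC
    LIMIT 10
"""


def write_top_domains(spark, output_path):
    """Run the query on the given session and write the result as Parquet."""
    df = spark.sql(QUERY)
    # A file-based write is assumed to appear in the plan as
    # InsertIntoHadoopFsRelationCommand, which the listener parses.
    df.write.mode("overwrite").parquet(output_path)
    return df
```

Calling this with the session created above and a scratch output path of your choice should, if the assumption holds, add a lineage edge from event.mediawiki_page_create to the output dataset.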

image.png (821×770 px, 77 KB)

Making this into a spike for upcoming work.

I think that it might be valuable to spend a bit more time experimenting with this approach, now that we have:

  • Upgraded DataHub to version 0.10.4
  • Finished the spark2 -> spark3 migration

