
Ingest a test hive database into datahub
Closed, ResolvedPublic

Description

In T306896: Integrate Spark with DataHub with lineage (Data-Engineering), we are experimenting with the Spark Datahub integration to see if we can get automated lineage for Hive tables used by Spark jobs. Preliminary trials look good! But, we are testing by writing to test tables in @tchin's Hive database. Thomas' database is not ingested by our regular airflow scheduled datahub ingestion, so Datahub doesn't know anything about the columns in those tables.

We should configure a test Hive database to always be ingested. If we need a separate database for Iceberg, we should add that too.

Done is

Details

Related Changes in GitLab:
Title: Add sandbox databases to datahub ingestion
Reference: repos/data-engineering/airflow-dags!811
Author: tchin
Source Branch: add_hive_sandbox
Dest Branch: main

Event Timeline

What should the databases be called and where should it live? Should we have separate databases for hive and iceberg tables?
/wmf/data/wmf_test and /wmf/data/wmf_test_iceberg?

Should we have separate databases for hive and iceberg tables?

I think likely yes?

What should the databases be called?

wmf_test and wmf_test_iceberg sound fine to me. Actually I don't know if we need 'wmf' in these names. What about test_sandbox and test_iceberg_sandbox ? Just an idea. Let's bikeshed with folks who worked on T337562: Decide how to split wmf database into functional areas and document here (and elsewhere?)

cc @xcollazo

and where should it live?

Let's avoid the use of external tables for this. So just create database and tables without specifying location, and they'll end up in e.g. /user/hive/warehouse/wmf_test.db.
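As a sketch of what that means in practice (names and the warehouse path are illustrative): with no LOCATION clause, the metastore places managed databases and tables under the warehouse root automatically.

```sql
-- Managed (non-external) database: no LOCATION clause, so the metastore
-- puts it under the warehouse root, e.g. /user/hive/warehouse/wmf_test.db
CREATE DATABASE IF NOT EXISTS wmf_test;

-- Managed table inside it: its data files land under the database directory
CREATE TABLE IF NOT EXISTS wmf_test.example (
    id BIGINT COMMENT 'Illustrative column'
)
STORED AS PARQUET;
```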

Should we have separate databases for hive and iceberg tables?

I think likely yes?

Presto doesn't like shared databases between Hive & Iceberg. So if you intend to test with Presto as well, you should keep them separate.
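For illustration, an Iceberg table in its own database would look something like this in Spark SQL (database and table names here are hypothetical, and this assumes an Iceberg catalog is already configured for the cluster):

```sql
CREATE DATABASE IF NOT EXISTS sandbox_iceberg;

-- USING iceberg makes this an Iceberg table rather than a Hive/Parquet one.
-- Keeping it in its own database avoids mixing table formats in one
-- database, which Presto has trouble serving.
CREATE TABLE IF NOT EXISTS sandbox_iceberg.example (
    id BIGINT COMMENT 'Illustrative column'
)
USING iceberg;
```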

What should the databases be called?

wmf_test and wmf_test_iceberg sound fine to me. Actually I don't know if we need 'wmf' in these names. What about test_sandbox and test_iceberg_sandbox ? Just an idea. Let's bikeshed with folks who worked on T337562: Decide how to split wmf database into functional areas and document here (and elsewhere?)

cc @xcollazo

This is for testing purposes, yes? If so, please avoid using wmf as those are for 'official' data.

and where should it live?

Let's avoid the use of external tables for this. So just create database and tables without specifying location, and they'll end up in e.g. /user/hive/warehouse/wmf_test.db.

+1

What about test_sandbox and test_iceberg_sandbox ?

Maybe 'test' and 'sandbox' are redundant? How about just sandbox and sandbox_iceberg? I'm getting 'sandbox' from e.g. https://en.wikipedia.org/wiki/Wikipedia:Sandbox

'sandbox' might be nice, because it indicates that the data there is temporary and might be deleted at any time.

I created the databases sandbox and sandbox_iceberg and also created an interlanguage_navigation table in sandbox:

tchin@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics spark3-sql

spark-sql (default)> create database sandbox;
Response code
Time taken: 0.158 seconds
spark-sql (default)> create database sandbox_iceberg;
Response code
Time taken: 0.067 seconds
spark-sql (default)> use sandbox;
Response code
Time taken: 0.047 seconds
spark-sql (default)> CREATE TABLE IF NOT EXISTS `interlanguage_navigation` (
                   >     `project_family`    string  COMMENT 'The project family to aggregate on',
                   >     `current_project`   string  COMMENT 'The project (language) of this group of requests',
                   >     `previous_project`  string  COMMENT 'The project (language) found in the referers of this group of requests',
                   >     `navigation_count`  bigint  COMMENT 'The number of times a user navigated from the previous to the current project'
                   > )
                   > PARTITIONED BY (
                   >     `date`              string  COMMENT 'Date in YYYY-MM-DD format'
                   > )
                   > STORED AS PARQUET;
Response code
Time taken: 0.322 seconds

I then ran a job to backfill a day from wmf.interlanguage_navigation into sandbox.interlanguage_navigation:

INSERT OVERWRITE TABLE sandbox.interlanguage_navigation 
    SELECT /*+ COALESCE(1) */  
        *
    FROM wmf.interlanguage_navigation
    WHERE date="2024-08-14"
;
# Run the backfill with the DataHub Spark lineage listener attached
sudo -u analytics-privatedata spark3-sql \
    --master local[12] --driver-memory 8G \
    --jars ./acryl-spark-lineage-0.2.16.jar \
    --conf "spark.app.name=sandbox_lineage_test" \
    --conf "spark.extraListeners=datahub.spark.DatahubSparkListener" \
    --conf "spark.datahub.rest.server=https://datahub-gms.discovery.wmnet:30443" \
    -f sandbox_lineage_test.hql
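As a quick sanity check (a sketch, reusing the partition value from the backfill above), a count over the copied partition should match the source:

```sql
-- Verify the backfilled partition arrived in the sandbox table
SELECT count(*)
FROM sandbox.interlanguage_navigation
WHERE date = "2024-08-14";
```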

Which can now be seen in Datahub: https://datahub.wikimedia.org/tasks/urn:li:dataJob:(urn:li:dataFlow:(spark,sandbox_lineage_test,default),sandbox_lineage_test)/Lineage?end_time_millis&is_lineage_mode=true&separate_siblings=false&show_columns=true&start_time_millis

Once the airflow job to ingest the sandbox databases is in place, it should automatically show the columns for the sandbox.interlanguage_navigation table.
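The actual ingestion is driven from airflow-dags (see the linked MR); as a rough sketch only, a standalone DataHub recipe restricted to the sandbox databases would look something like this (host_port is a placeholder, and the database_pattern regexes are assumptions):

```yaml
source:
  type: hive
  config:
    host_port: "<hive-metastore-host:port>"  # placeholder
    database_pattern:
      allow:
        - "^sandbox$"
        - "^sandbox_iceberg$"
sink:
  type: datahub-rest
  config:
    server: https://datahub-gms.discovery.wmnet:30443
```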

Just passing by to say this is looking *really* cool.