In T306896: Integrate Spark with DataHub with lineage (Data-Engineering), we are experimenting with the Spark DataHub integration to see if we can get automated lineage for Hive tables used by Spark jobs. Preliminary trials look good! However, we are testing by writing to test tables in @tchin's Hive database. Thomas's database is not ingested by our regular Airflow-scheduled DataHub ingestion, so DataHub doesn't know anything about the columns in those tables.
We should configure a test Hive database to always be ingested. If we need a separate database for Iceberg, we should add that too.
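Conceptually, adding the test database to ingestion amounts to pointing a DataHub Hive source recipe at it. A minimal sketch of such a recipe is below; the host, server URL, and database name are placeholders, and the actual configuration would live in (or be templated by) the scheduled ingestion setup rather than a standalone file:

```yaml
# Hypothetical DataHub ingestion recipe for a test Hive database.
# host_port, server, and the database name are assumptions, not our real values.
source:
  type: hive
  config:
    host_port: "hive-server.example.org:10000"
    database: "test_db"        # the dedicated test database to always ingest
sink:
  type: datahub-rest
  config:
    server: "http://datahub-gms.example.org:8080"
```

If a separate Iceberg test database is added, it would get an analogous source entry (or be covered by a database pattern) in the same scheduled ingestion.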
Done is:
- datahub/ingestion DAG configured to ingest the test database(s)