
Datahub - ingest Hive discovery database
Closed, Resolved, Public

Description

The Hive discovery database contains a lot of interesting datasets, which look to be well organized. We should include these in the list of databases we ingest into Datahub.

This should just be adding a hive_discovery.yaml config file here

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Prefix discovery tables for Datahub ingestion | repos/data-engineering/airflow-dags!852 | tchin | datahub-discovery-prefix | main
analytics: Correct discovery datahub ingestion schema | repos/data-engineering/airflow-dags!845 | ebernhardson | work/ebernhardson/discovery-datahub | main
Add discovery database to datahub | repos/data-engineering/airflow-dags!833 | ebernhardson | work/ebernhardson/discovery-datahub | main

Event Timeline

Gehel triaged this task as Medium priority. Sep 9 2024, 3:31 PM
Gehel subscribed.

We'll need to decide what is relevant to expose. And check the permissions / access.

Once today's daily run fires (approx. Sept 24 00:00 UTC), we can verify that everything shows up as expected and complete this task.

Looking for tables that contain the column "source_text" only finds the table for the update pipeline event stream, but not the cirrus index dumps. Will need to look closer to see what might have gone wrong.

While the ingestion now claims to work successfully, I'm unable to find our tables in the datahub UI. It's possible I simply don't know how to use the thing, but searching for a column named source_text, which should turn up our dump, still only turns up the event streams.

The one thing we are doing differently from other ingest configs is providing regex patterns that the tables must match. We might need to simply ingest all the tables. It's not the end of the world, but there are some messy bits in there that I thought it might be better to not expose.
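One thing worth checking, offered as an assumption rather than a confirmed diagnosis: if the allow patterns are applied with prefix-anchored matching (as Python's `re.match` does) against the database-qualified name rather than the bare table name, a pattern written for the bare name will silently match nothing. The sketch below uses only the standard library; `is_allowed` is a hypothetical helper for illustration, not the ingestion code itself, and `discovery.cirrus_index` is the only name taken from this task.

```python
import re

def is_allowed(name, allow_patterns, deny_patterns=()):
    """Return True if `name` matches an allow pattern and no deny pattern.

    Uses re.match, which anchors at the START of the string only.
    """
    if any(re.match(p, name, re.IGNORECASE) for p in deny_patterns):
        return False
    return any(re.match(p, name, re.IGNORECASE) for p in allow_patterns)

# Assumption: the matcher sees the database-qualified name.
qualified_name = "discovery.cirrus_index"

bare_pattern = [r"cirrus_index$"]                 # written for the bare table name
prefixed_pattern = [r"discovery\.cirrus_index$"]  # written for the qualified name

print(is_allowed(qualified_name, bare_pattern))      # False
print(is_allowed(qualified_name, prefixed_pattern))  # True
```

If this is what is happening, prefixing each pattern with the database name would be a smaller change than ingesting every table.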

> We might need to simply ingest all the tables

I can probably take a look at why the table match isn’t working. The next thing we could try is providing a custom transform function as a filter.
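For discussion, here is a rough sketch of what such a filter could look like. It is plain Python and deliberately does not use DataHub's transformer interface; `ALLOWED_TABLES`, `keep_dataset` and `filter_urns` are hypothetical names, and only `cirrus_index` comes from this task. It assumes dataset URNs of the form `urn:li:dataset:(urn:li:dataPlatform:<platform>,<db>.<table>,<env>)`.

```python
import re

# Hypothetical allow-list; only cirrus_index is taken from this task.
ALLOWED_TABLES = {"cirrus_index"}

# Assumed dataset URN layout: urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
_DATASET_URN = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:(?P<platform>[^,]+),"
    r"(?P<name>[^,]+),(?P<env>[^)]+)\)$"
)

def keep_dataset(urn, database="discovery"):
    """Decide whether an entity URN should be kept by the filter."""
    match = _DATASET_URN.match(urn)
    if match is None:
        # Not a dataset URN (e.g. a container): pass it through untouched.
        return True
    db, _, table = match.group("name").partition(".")
    return db == database and table in ALLOWED_TABLES

def filter_urns(urns):
    """Lazily yield only the URNs that pass keep_dataset."""
    return (urn for urn in urns if keep_dataset(urn))
```

An explicit allow-list like this is stricter than regex patterns, which might suit the goal of not exposing the messier tables, at the cost of having to update the list whenever a new table should be published.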

I did a manual ingestion and was able to see the tables on datahub if I access them directly through a URL:
https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.cirrus_index,PROD)/Schema?is_lineage_mode=false&schemaFilter=

However, ironically, I can't seem to discover it through the normal search function, and the discovery database container is empty (although maybe I just need to wait a little bit for it to populate):
https://datahub.wikimedia.org/container/urn:li:container:c8a7057ba838b9eb73d969c9b36acd2b/Entities?is_lineage_mode=false
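For anyone who wants to check other tables the same way, the direct dataset link above can be rebuilt from the table name. This is a small sketch that only mirrors the URL shape shown in this comment; `dataset_urn` and `dataset_url` are hypothetical helpers, and the `/Schema` suffix is copied from that link rather than from any documented contract.

```python
def dataset_urn(platform, name, env="PROD"):
    """Build a dataset URN in the layout seen in the link above."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def dataset_url(base_url, urn):
    """Build a direct link to the dataset's Schema tab (shape copied from the link above)."""
    return f"{base_url.rstrip('/')}/dataset/{urn}/Schema"

urn = dataset_urn("hive", "discovery.cirrus_index")
print(dataset_url("https://datahub.wikimedia.org", urn))
```

Swapping in another `discovery.<table>` name gives a quick way to confirm whether an entity was ingested even while search does not return it.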

Unfortunately still not finding the tables / columns via search in datahub. If it is a caching issue it's lasting longer than I would have expected.

This sounds like a DataHub problem to me. Maybe there is an issue with the indexing. Shall we make a ticket to investigate that? If you tag it with Data-Platform-SRE we will investigate why these entities aren't appearing in the search results.