The Hive discovery database contains a lot of interesting datasets, which look to be well organized. We should include these in the list of databases we ingest into Datahub.
This should only require adding a hive_discovery.yaml config file here
| Title | Reference | Author | Source Branch | Dest Branch |
|---|---|---|---|---|
| Prefix discovery tables for Datahub ingestion | repos/data-engineering/airflow-dags!852 | tchin | datahub-discovery-prefix | main |
| analytics: Correct discovery datahub ingestion schema | repos/data-engineering/airflow-dags!845 | ebernhardson | work/ebernhardson/discovery-datahub | main |
| Add discovery database to datahub | repos/data-engineering/airflow-dags!833 | ebernhardson | work/ebernhardson/discovery-datahub | main |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T369756 [EPIC] Datahub Improvements |
| Resolved | | EBernhardson | T374118 Datahub - ingest Hive discovery database |
| Resolved | | BTullis | T376657 Unable to find ingested tables in datahub |
Kindly please prioritize this. Context: https://wikimedia.slack.com/archives/CSV483812/p1725654468976119?thread_ts=1725653350.510639&cid=CSV483812
ebernhardson opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/833
Add discovery database to datahub
milimetric merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/833
Add discovery database to datahub
Once today's daily run fires (approx. Sept 24 00:00 UTC), I can verify that everything shows up as expected and complete this task.
Looking for tables that contain the column "source_text" only finds the table for the update pipeline event stream, but not the cirrus index dumps. Will need to look closer to see what might have gone wrong.
ebernhardson opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/845
Add discovery database to datahub
otto merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/845
analytics: Correct discovery datahub ingestion schema
While the ingestion now claims to work successfully, I'm unable to find our tables in the datahub UI. It's possible I simply don't know how to use the thing, but searching for a column named source_text, which should turn up our dump, still only turns up the event streams.
The one thing we are doing differently from other ingest configs is providing regex patterns that the tables must match. We might need to simply ingest all the tables. It's not the end of the world, but there are some messy bits in there that I thought it might be better not to expose.
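One way an allow-list regex filter can silently exclude every table is sketched below. It assumes (this is not confirmed in the thread) that patterns are matched from the start of the database-qualified name, e.g. discovery.cirrus_index, rather than the bare table name. The helper `is_allowed` and both pattern lists are hypothetical and are not taken from the actual ingest config.

```python
import re

def is_allowed(qualified_name, allow_patterns):
    """Return True if any pattern matches from the start of the name.

    re.match anchors at the beginning of the string, so a pattern
    written for the bare table name will not match a qualified name.
    """
    return any(re.match(pattern, qualified_name) for pattern in allow_patterns)

# Hypothetical patterns: one written against the bare table name,
# one written against the database-qualified name.
unprefixed_patterns = [r"cirrus_index.*"]
prefixed_patterns = [r"discovery\.cirrus_index.*"]

# Assumption: the filter sees "<database>.<table>".
qualified_name = "discovery.cirrus_index"

unprefixed_result = is_allowed(qualified_name, unprefixed_patterns)
prefixed_result = is_allowed(qualified_name, prefixed_patterns)

print(unprefixed_result)  # False: the pattern never sees the "discovery." prefix
print(prefixed_result)    # True
```

If the filter behaves this way, the ingestion run would still report success while emitting none of the intended tables.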
> We might need to simply ingest all the tables
I can probably take a look at why the table match isn’t working. The next thing we could try is providing a custom transform function as a filter.
tchin opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/852
Prefix discovery tables for Datahub ingestion
I did a manual ingestion and was able to see the tables on datahub when I access them directly through a URL:
https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.cirrus_index,PROD)/Schema?is_lineage_mode=false&schemaFilter=
However, ironically, I can't seem to discover it through the normal search function, and the discovery database container is empty (although maybe I just need to wait a little bit for it to populate):
https://datahub.wikimedia.org/container/urn:li:container:c8a7057ba838b9eb73d969c9b36acd2b/Entities?is_lineage_mode=false
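For checking other tables the same way while search is not returning them, the direct dataset link above can be rebuilt from the table name. This is a minimal sketch that only mirrors the URN shape visible in that link; the helper names `dataset_urn` and `dataset_url` are hypothetical, and the query string from the original link is omitted.

```python
def dataset_urn(platform, name, env="PROD"):
    """Build a dataset URN in the shape seen in the link above."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def dataset_url(base_url, urn):
    """Build the direct link to a dataset's schema tab."""
    return f"{base_url}/dataset/{urn}/Schema"

urn = dataset_urn("hive", "discovery.cirrus_index")
url = dataset_url("https://datahub.wikimedia.org", urn)
print(url)
```

If a table opens through such a link but does not show up in search, that points at indexing rather than ingestion.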
tchin merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/852
Prefix discovery tables for Datahub ingestion
Unfortunately, I'm still not finding the tables / columns via search in datahub. If it is a caching issue, it's lasting longer than I would have expected.
This sounds like a DataHub problem to me. Maybe there is an issue with the indexing. Shall we make a ticket to investigate that? If you tag it with Data-Platform-SRE we will investigate why these entities aren't appearing in the search results.
@BTullis @EBernhardson can this be resolved now that T376657: Unable to find ingested tables in datahub is fixed?