In T374118, portions of the discovery Hive database were ingested into DataHub. The tables appear to be ingested; for example, the discovery.cirrus_index table is fetchable from the UI. However, none of the newly ingested tables can be found with the search functionality.
Description
Details
- Other Assignee
- EBernhardson
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T369756 [EPIC] Datahub Improvements |
| Resolved | | EBernhardson | T374118 Datahub - ingest Hive discovery database |
| Resolved | | BTullis | T376657 Unable to find ingested tables in datahub |
Event Timeline
I've also had the same experience with Airflow DAG/task objects.
- Some can be found just fine ([example](https://datahub.wikimedia.org/tasks/urn:li:dataJob:(urn:li:dataFlow:(airflow,test_s3_connection,prod),test_s3_connection)/Runs?is_lineage_mode=false))
- For some, we only find the DAG and no tasks
Actually, while writing this message, I realized that the missing Airflow data was caused by a discrepancy between the Datahub version and the datahub client library version, fixed in https://gitlab.wikimedia.org/repos/data-engineering/airflow/-/merge_requests/20
Tasks and DAGs seem to be searchable correctly now, FWIW.
When I do a search for discovery in DataHub I can see 17 hive tables returned.
{F57623689}
I exported the results as a CSV and the entity URLs are here:
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.webrequest_metrics,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.wikibase_rdf_subgraphs,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.cirrus_index,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.cirrus_index_without_content,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.search_satisfaction_metrics,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.wikibase_rdf,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.subgraph_pair_metrics,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.general_subgraph_metrics,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.popularity_score,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.query_clicks_daily,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.subgraph_pair_query_metrics,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.query_clicks_hourly,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.general_query_metrics,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.general_subgraph_query_metrics,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.per_subgraph_metrics,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.per_subgraph_query_metrics,PROD)
- https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.processed_external_sparql_query,PROD)
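To compare an exported result list like the one above against the ingestion config, it can help to reduce each entity URL to its `database.table` name. The following is a minimal sketch, assuming the URLs all follow the shape shown in this list; the function name `table_from_entity_url` is hypothetical and not part of any DataHub tooling.

```python
from urllib.parse import unquote

# Assumption: every exported URL starts with this prefix, as in the list above.
PREFIX = "https://datahub.wikimedia.org/dataset/"


def table_from_entity_url(url: str) -> str:
    """Return the 'database.table' part of a DataHub dataset entity URL."""
    urn = unquote(url.removeprefix(PREFIX))
    # A dataset URN looks like:
    #   urn:li:dataset:(urn:li:dataPlatform:hive,discovery.cirrus_index,PROD)
    inner = urn[urn.index("(") + 1 : urn.rindex(")")]
    # Assumption: the dataset name itself contains no commas.
    _platform_urn, name, _env = inner.split(",")
    return name


urls = [
    "https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.cirrus_index,PROD)",
    "https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,discovery.query_clicks_daily,PROD)",
]
found = sorted(table_from_entity_url(u) for u in urls)
print(found)
```

The resulting names can then be compared with whatever the table pattern in the ingestion config is expected to match.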
However, I see that the table pattern would allow for more than 17 possible matches.
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/analytics/dags/datahub/ingestion/configs/hive_discovery.yaml#L6-20
Before I go any further, does this search query match what you are seeing @EBernhardson?
If you don't see the same results, perhaps there is a problem with the permissions of the newly imported objects.
If you can see the same search results, are we able to identify any specific missing entities from the search results?
Curious. I can't say what changed, but today I'm getting search results that I wasn't getting a week ago. Searching for a few tables by column names I'm familiar with now finds them. In the general case this looks to be resolved.
There is still some odd behaviour though. If I search for cirrus_index, the index comes up. If I filter the search for Datasets > Hive, the index is no longer found, even though the UI shows cirrus_index as Table | Hive | discovery. I didn't test all tables, but this pattern repeated for the tables I did test.
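One way to check whether the filtering problem sits in the search index rather than in the UI would be to issue the same query with and without a platform filter through DataHub's GraphQL API. The sketch below only builds the request payloads and does not send them. It assumes the `searchAcrossEntities` query with an `orFilters` input, and that the UI's "Hive" filter corresponds to the `platform` field; the helper name `search_payload` is hypothetical, and the endpoint and authentication are left out because they are not described in this task.

```python
import json

# Assumption: this query shape matches the deployed DataHub GraphQL schema.
QUERY = """
query search($input: SearchAcrossEntitiesInput!) {
  searchAcrossEntities(input: $input) {
    total
    searchResults { entity { urn type } }
  }
}
"""


def search_payload(text, platform_urn=None):
    """Build a GraphQL request body, optionally filtered by platform."""
    search_input = {"types": ["DATASET"], "query": text, "start": 0, "count": 20}
    if platform_urn is not None:
        # Assumption: the UI's "Hive" facet maps to the `platform` field.
        search_input["orFilters"] = [
            {"and": [{"field": "platform", "values": [platform_urn]}]}
        ]
    return {"query": QUERY, "variables": {"input": search_input}}


unfiltered = search_payload("cirrus_index")
filtered = search_payload("cirrus_index", "urn:li:dataPlatform:hive")
print(json.dumps(filtered["variables"], indent=2))
```

If the unfiltered request returns the table and the filtered one does not, that would suggest the platform facet is missing from the indexed document.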
There are also a few tables I'm still not finding:
- all_subgraphs
- top_subgraph_items
- top_subgraph_triples
- wikibase_item
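Since the original report noted that tables could be fetched directly in the UI even when search did not return them, the missing tables can be checked the same way. This is a rough sketch that builds the direct entity URLs, assuming these four tables live in the discovery database and would be ingested under the same URN pattern as the tables listed earlier in this task; `dataset_url` is a hypothetical helper.

```python
BASE = "https://datahub.wikimedia.org/dataset/"

# Table names reported as missing from search in the comment above.
MISSING = [
    "all_subgraphs",
    "top_subgraph_items",
    "top_subgraph_triples",
    "wikibase_item",
]


def dataset_url(table, database="discovery", platform="hive", env="PROD"):
    """Build the direct entity URL for a table (assumed URN pattern)."""
    urn = f"urn:li:dataset:(urn:li:dataPlatform:{platform},{database}.{table},{env})"
    return BASE + urn


for table in MISSING:
    print(dataset_url(table))
```

If a URL loads an entity page, the table was ingested and only the search index is affected; if it does not, the table was probably never ingested.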
Change #1081935 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/deployment-charts@master] Datahub: Increase the RAM for the datahub restore-incides job
Change #1081935 merged by jenkins-bot:
[operations/deployment-charts@master] Datahub: Increase the RAM for the datahub restore-incides job
Oh dear, I seem to have caused some kind of problem with DataHub.
In light of the missing tables, I thought that I would try restoring the indices as per: https://datahubproject.io/docs/how/restore-indices/
Our local docs are here: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/DataHub/Administration#Restore_Indices
Unfortunately, although this job has run to completion, the indices appear to be empty at the moment.
The data is all still there, but now zero results are returned from any search.
Change #1081948 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/deployment-charts@master] Datahub: disable standalone consumers
Change #1081948 abandoned by Btullis:
[operations/deployment-charts@master] Datahub: disable standalone consumers
Reason:
Not going to try this approach.
@EBernhardson - I think that this issue is resolved now. I can see 26 entities while filtering for Hive/discovery. Does that match what you expect to see?


