
Unable to find ingested tables in datahub
Closed, ResolvedPublic

Assigned To
Authored By
EBernhardson
Oct 7 2024, 8:35 PM

Description

In T374118, portions of the discovery hive database were ingested to datahub. The tables appear to be ingested: for example, the discovery.cirrus_index table is fetchable from the UI. But none of the newly ingested tables can be found with the search functionality.

Event Timeline

BTullis triaged this task as Medium priority.
BTullis updated Other Assignee, added: EBernhardson.

I've also had the same experience with Airflow DAG/task objects.

Actually, while writing this message, I realized that the missing Airflow data was caused by a discrepancy between the Datahub version and the datahub client library version, fixed in https://gitlab.wikimedia.org/repos/data-engineering/airflow/-/merge_requests/20

Tasks and DAGs seem to be searchable correctly now, FWIW.

When I do a search for discovery in DataHub I can see 17 hive tables returned.
{F57623689}
I exported the results as a CSV and the entity URLs are here:

However, I see that the table pattern would allow for more than 17 possible matches.
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/analytics/dags/datahub/ingestion/configs/hive_discovery.yaml#L6-20
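To get a feel for how many tables a pattern could match, here is a rough sketch of allow/deny regex matching. The patterns, the `discovery.subgraph_tmp` name, and the `is_allowed` helper are all hypothetical stand-ins. They are not copied from hive_discovery.yaml, and this is not DataHub's own matching code.

```python
import re


def is_allowed(name, allow, deny=()):
    """Return True if name matches no deny pattern and at least one allow pattern."""
    if any(re.match(pattern, name) for pattern in deny):
        return False
    return any(re.match(pattern, name) for pattern in allow)


# Hypothetical patterns, for illustration only (not the real config).
ALLOW = [r"^discovery\.cirrus_index$", r"^discovery\..*subgraph.*$"]
DENY = [r"^discovery\..*_tmp$"]

# Table names taken from this task, plus one made-up temporary table.
candidates = [
    "discovery.cirrus_index",
    "discovery.all_subgraphs",
    "discovery.top_subgraph_items",
    "discovery.wikibase_item",
    "discovery.subgraph_tmp",
]

matched = [table for table in candidates if is_allowed(table, ALLOW, DENY)]
print(matched)
```

With these made-up patterns, `discovery.wikibase_item` is dropped because no allow pattern covers it, and `discovery.subgraph_tmp` is dropped by the deny pattern even though an allow pattern matches it.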

Before I go any further, does this search query match what you are seeing @EBernhardson?
If you don't see the same results, perhaps there is a problem with the permissions of the newly imported objects.

If you can see the same search results, are we able to identify any specific entities missing from them?

Curious, I can't say what changed, but today I'm getting search results that I wasn't getting a week ago. Searching for a few tables by column names I'm familiar with now finds them. In the general case this looks to be resolved.

There is still some odd behaviour though. If I search for cirrus_index the index comes up. If I filter the search for Datasets > Hive the index is no longer found. But the UI shows cirrus_index as Table | Hive | discovery. I didn't test all tables, but this pattern repeated for the tables I did test.

There are also a few tables I'm still not finding:

  • all_subgraphs
  • top_subgraph_items
  • top_subgraph_triples
  • wikibase_item
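One way to pin down which tables are missing is to diff a list of expected table names against a CSV export of the search results. This is a minimal sketch. The `name` column header and the sample export row are assumptions for illustration, not the real export from this task, and `missing_tables` is a hypothetical helper.

```python
import csv
import io


def missing_tables(expected, export_csv, name_column="name"):
    """Return the expected table names that do not appear in the CSV export."""
    reader = csv.DictReader(io.StringIO(export_csv))
    found = {row[name_column].strip() for row in reader}
    return sorted(set(expected) - found)


# Illustrative sample only; the column names are assumed, not taken from DataHub.
SAMPLE_EXPORT = """name,platform
cirrus_index,hive
"""

EXPECTED = [
    "cirrus_index",
    "all_subgraphs",
    "top_subgraph_items",
    "top_subgraph_triples",
    "wikibase_item",
]

print(missing_tables(EXPECTED, SAMPLE_EXPORT))
```

Against a real export, `name_column` would need to be set to whatever header the CSV actually uses for the table name.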

Change #1081935 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Datahub: Increase the RAM for the datahub restore-incides job

https://gerrit.wikimedia.org/r/1081935

Change #1081935 merged by jenkins-bot:

[operations/deployment-charts@master] Datahub: Increase the RAM for the datahub restore-incides job

https://gerrit.wikimedia.org/r/1081935

Oh dear, I seem to have caused some kind of problem with DataHub.

In light of the missing tables, I thought that I would try restoring the indices as per: https://datahubproject.io/docs/how/restore-indices/
Our local docs are here: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/DataHub/Administration#Restore_Indices

Unfortunately, although this job has run to completion, the indices appear to be empty at the moment.

image.png (575×1 px, 36 KB)

The data is all still there, but now zero results are returned from any search.

Change #1081948 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Datahub: disable standalone consumers

https://gerrit.wikimedia.org/r/1081948

Change #1081948 abandoned by Btullis:

[operations/deployment-charts@master] Datahub: disable standalone consumers

Reason:

Not going to try this approach.

https://gerrit.wikimedia.org/r/1081948

@EBernhardson - I think that this issue is resolved now. I can see 26 entities while filtering for Hive/discovery. Does that match what you expect to see?

image.png (946×767 px, 91 KB)

Actually, this daily DAG isn't found anywhere in datahub

I think that missing airflow DAG pipeline entity is also resolved.

image.png (639×824 px, 80 KB)