
DataHub search indices are empty - making it effectively useless
Closed, Resolved (Public)

Assigned To
Authored By
BTullis
Oct 21 2024, 4:37 PM
Referenced Files
F57633131: Screenshot 2024-10-22 at 11.17.54.png
Oct 22 2024, 9:18 AM
Restricted File
Oct 21 2024, 9:33 PM
F57631514: image.png
Oct 21 2024, 8:25 PM
F57630876: image.png
Oct 21 2024, 4:37 PM

Description

Whilst working on T376657 (Unable to find ingested tables in datahub), we initiated a job to rebuild the DataHub metadata indices from the contents of the MariaDB database, as per https://datahubproject.io/docs/how/restore-indices/

Although the job was successful (after modifying one mis-named entity), the search indices are empty at the end of the process.

image.png (1,919×575 px, 50 KB)

It seems that the MAE (Metadata Audit Event) messages published by the job never reach the kafka-jumbo cluster.

We can subscribe to the MetadataAuditEvent_v4 topic like this:

btullis@stat1008:~$ kafkacat -C -b kafka-jumbo1010.eqiad.wmnet:9092 -t MetadataAuditEvent_v4 -o 0
% Reached end of topic MetadataAuditEvent_v4 [3] at offset 80794
% Reached end of topic MetadataAuditEvent_v4 [1] at offset 81132
% Reached end of topic MetadataAuditEvent_v4 [4] at offset 84162
% Reached end of topic MetadataAuditEvent_v4 [0] at offset 71331
% Reached end of topic MetadataAuditEvent_v4 [2] at offset 80985

...but no messages are returned, only the end-of-partition markers.

Similarly, the MetadataChangeEvent_v4 topic remains empty.

btullis@stat1008:~$ kafkacat -C -b kafka-jumbo1010.eqiad.wmnet:9092 -t MetadataChangeEvent_v4 -o 0
% Reached end of topic MetadataChangeEvent_v4 [0] at offset 0
% Reached end of topic MetadataChangeEvent_v4 [3] at offset 0
% Reached end of topic MetadataChangeEvent_v4 [4] at offset 0
% Reached end of topic MetadataChangeEvent_v4 [2] at offset 0
% Reached end of topic MetadataChangeEvent_v4 [1] at offset 11110
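The end offsets in the two listings above can be totalled to separate a topic that has never received data (all offsets 0) from one that received data in the past but returns nothing now. A minimal sketch, assuming the `% Reached end of topic ...` status lines are piped into the function; the name `sum_end_offsets` is hypothetical and not part of kafkacat:

```shell
# Sum the per-partition end offsets reported by kafkacat.
# stdin:  kafkacat status lines such as
#         "% Reached end of topic MetadataAuditEvent_v4 [3] at offset 80794"
# stdout: "<topic> partitions=<n> total_end_offset=<sum>" (nothing if no status lines)
sum_end_offsets() {
  awk '/Reached end of topic/ {
         topic = $6      # topic name is the sixth whitespace-separated field
         sum += $NF      # the end offset is the last field on the line
         n++
       }
       END { if (n) printf "%s partitions=%d total_end_offset=%d\n", topic, n, sum }'
}
```

A possible invocation, assuming kafkacat writes its status lines to stderr (hence `2>&1`) and using `-e` so that it exits once every partition has reached its end:

kafkacat -C -e -b kafka-jumbo1010.eqiad.wmnet:9092 -t MetadataAuditEvent_v4 -o 0 2>&1 | sum_end_offsets

A non-zero total with no messages returned would suggest that messages were produced at some point and have since expired, rather than that the topic was never written to.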

Event Timeline

BTullis triaged this task as Unbreak Now! priority. Oct 21 2024, 4:38 PM
BTullis moved this task from Incoming to SRE on the Data-Platform board.

I have downgraded to version 0.12.1 and it looks like things are better.

image.png (1,917×1,012 px, 97 KB)

I will run the reindexing job now.

The reindexing job did not fare any better under this version. I think the UI shows more elements because the code in version 0.12.1 relies more on database queries, whereas version 0.13.3 relies more on the ES indices.

We think that the topic names used by the mae-consumer and mce-consumer processes have changed across versions, and that we have not kept up with these changes in our custom helm charts.
@brouberol noticed a large backlog of events in the kafka-jumbo cluster for the MetadataChangeLog_Versioned_v1 topic.

If we look at the values.yaml file for the upstream charts, we see a list of topics:

topics:
  metadata_change_event_name: "MetadataChangeEvent_v4"
  failed_metadata_change_event_name: "FailedMetadataChangeEvent_v4"
  metadata_audit_event_name: "MetadataAuditEvent_v4"
  datahub_usage_event_name: "DataHubUsageEvent_v1"
  metadata_change_proposal_topic_name: "MetadataChangeProposal_v1"
  failed_metadata_change_proposal_topic_name: "FailedMetadataChangeProposal_v1"
  metadata_change_log_versioned_topic_name: "MetadataChangeLog_Versioned_v1"
  metadata_change_log_timeseries_topic_name: "MetadataChangeLog_Timeseries_v1"
  platform_event_topic_name: "PlatformEvent_v1"
  datahub_upgrade_history_topic_name: "DataHubUpgradeHistory_v1"

These topic names are then referenced and assigned to environment variables in the upstream charts.
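The relationship between the values.yaml keys and the environment variable names can be sketched as follows. Judging by the six variables visible on the mce-consumer pod further down in this task, each variable name is simply the upper-cased key; that this holds for the remaining keys is an assumption. The helper name `topics_to_env` is hypothetical. This is not the upstream template, only a way to list the names one would expect to find on a pod:

```shell
# Print the env var name implied by each key of a values.yaml "topics:" block.
# stdin:  lines such as
#           metadata_change_event_name: "MetadataChangeEvent_v4"
# stdout: NAME=value pairs, ASSUMING env var name = upper-cased key.
# Lines without a quoted value (e.g. "topics:") are skipped.
topics_to_env() {
  sed -n 's/^[[:space:]]*\([a-z_]*\):[[:space:]]*"\(.*\)"[[:space:]]*$/\1 \2/p' |
  while read -r key value; do
    printf '%s=%s\n' "$(printf '%s' "$key" | tr '[:lower:]' '[:upper:]')" "$value"
  done
}
```

Feeding it the topics block above would, under that assumption, yield one line per topic that can be compared against the pod environment.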

We have the same list of topics in our chart, but the subcharts do not reference all of them, so the corresponding environment variables are not defined. This is, presumably, why the topics are not being published to or read from correctly.

The mce-consumer has an incomplete list, but the gms and mae-consumer have none.

btullis@deploy2002:~$ kubectl get pod datahub-mce-consumer-production-8b5b4b5b5-thbhw  -o json | jq .spec.containers[0].env|grep METADATA
    "name": "METADATA_CHANGE_EVENT_NAME",
    "name": "FAILED_METADATA_CHANGE_EVENT_NAME",
    "name": "METADATA_CHANGE_PROPOSAL_TOPIC_NAME",
    "name": "FAILED_METADATA_CHANGE_PROPOSAL_TOPIC_NAME",
    "name": "METADATA_CHANGE_LOG_VERSIONED_TOPIC_NAME",
    "name": "METADATA_CHANGE_LOG_TIMESERIES_TOPIC_NAME",
btullis@deploy2002:~$ kubectl get pod datahub-mae-consumer-production-c4547866b-zwckm -o json | jq .spec.containers[0].env|grep METADATA
btullis@deploy2002:~$ kubectl get pod datahub-gms-production-65ff76fd4b-7lnmg  -o json | jq .spec.containers[0].env|grep METADATA
btullis@deploy2002:~$
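The same check can be made explicit by naming the variables one expects and reporting the ones that are absent. A minimal sketch, assuming the pretty-printed env list produced by the jq filter used above; the function name `missing_env` is hypothetical, and which variables each component actually needs is an assumption to be confirmed against the upstream charts:

```shell
# Report which expected env var names are absent from a pod's env list.
# stdin: output of `kubectl get pod <pod> -o json | jq '.spec.containers[0].env'`
# args:  the env var names to look for
missing_env() {
  env_json=$(cat)
  for name in "$@"; do
    # Match the exact, quoted name so that METADATA_CHANGE_EVENT_NAME does not
    # match inside FAILED_METADATA_CHANGE_EVENT_NAME.
    printf '%s\n' "$env_json" | grep -qF "\"name\": \"$name\"" ||
      printf 'missing: %s\n' "$name"
  done
}
```

For example, for the mae-consumer pod:

kubectl get pod datahub-mae-consumer-production-c4547866b-zwckm -o json | jq '.spec.containers[0].env' | missing_env METADATA_CHANGE_LOG_VERSIONED_TOPIC_NAME METADATA_CHANGE_LOG_TIMESERIES_TOPIC_NAME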

Weird. I upgraded to version 0.13.3 again and now the UI is working. I can manage permissions again.
{F57631659,width=40%}

Adding to what @BTullis said, here's the list of configured topics for the mce-consumer:

brouberol@deploy2002:~$ kubectl get pod datahub-mce-consumer-production-576cb6bc58-79cxk  -o json | jq .spec.containers[0].env|grep  -C1 METADATA
  {
    "name": "METADATA_CHANGE_EVENT_NAME",
    "value": "MetadataChangeEvent_v4"
--
  {
    "name": "FAILED_METADATA_CHANGE_EVENT_NAME",
    "value": "FailedMetadataChangeEvent_v4"
--
  {
    "name": "METADATA_CHANGE_PROPOSAL_TOPIC_NAME",
    "value": "MetadataChangeProposal_v1"
--
  {
    "name": "FAILED_METADATA_CHANGE_PROPOSAL_TOPIC_NAME",
    "value": "FailedMetadataChangeProposal_v1"
--
  {
    "name": "METADATA_CHANGE_LOG_VERSIONED_TOPIC_NAME",
    "value": "MetadataChangeLog_Versioned_v1"
--
  {
    "name": "METADATA_CHANGE_LOG_TIMESERIES_TOPIC_NAME",
    "value": "MetadataChangeLog_Timeseries_v1"

We can see that the MetadataAuditEvent_v4 topic is nowhere to be found in that configuration. That seems logical, however, as the mce-consumer deals with MCE (Metadata Change Events), whereas the mae-consumer deals with MAE (Metadata Audit Events). I'll have a look at the upstream charts to gather the default values used in each case.

Change #1082151 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] datahub: add metadata topics env vars to the gms and mae-consumer

https://gerrit.wikimedia.org/r/1082151

From https://datahubproject.io/docs/metadata-jobs/mae-consumer-job/:

The Metadata Audit Event Consumer is a Spring job which can be deployed by itself, or as part of the Metadata Service.
Its main function is to listen to change log events emitted as a result of changes made to the Metadata Graph, converting changes in the metadata model into updates against secondary search & graph indexes (among other things)
Today the job consumes from two important Kafka topics:

  • MetadataChangeLog_Versioned_v1
  • MetadataChangeLog_Timeseries_v1

Where does the name Metadata Audit Event come from? Well, history. Previously, this job consumed a single MetadataAuditEvent topic which has been deprecated and removed from the critical path. Hence, the name!

So that would explain why we're not seeing any activity in MetadataAuditEvent_v4, I think.

Change #1082151 merged by Brouberol:

[operations/deployment-charts@master] datahub: add metadata topics env vars to the gms and mae-consumer

https://gerrit.wikimedia.org/r/1082151

Manually deploying https://gerrit.wikimedia.org/r/1082151 seems to have allowed the mae-consumer to consume MetadataChangeLog_Versioned_v1 and rebuild the indices!

Screenshot 2024-10-22 at 11.17.54.png (6,106×2,860 px, 1 MB)

I've redeployed datahub-next and datahub, and created a new datahub-restore-indices job. The MAE consumer kicked in and the UI started reflecting the updates. I think we can close the UBN.