Connect MVP to Hive metastore [Mile Stone 4]
Closed, Resolved · Public

Description

Goal: Connect MVP to a single data source.

Success Criteria:

  • Have Hive metastore connected to and visible in the MVP

Event Timeline

Milimetric renamed this task from Connect MVP to a Data Source [Mile Stone 4] to Connect MVP to Hive metastore [Mile Stone 4]. Feb 14 2022, 5:02 PM
Milimetric updated the task description.
Milimetric moved this task from Backlog to Next Up on the Data-Catalog board.

Note to self mostly: I have opened a few threads in the DataHub Slack about push-based ingestion. It looks like we have to write it ourselves, but I'm following up with a few people who seem to have done that, to see if they want to collaborate on an official solution. The pull-based ingestion seems simple enough, and we probably need it anyway (per Lambda Architecture principles). But it would be nice to have push; we would probably need to write our own emitter as a Hive hook (a rough sketch of that idea follows the links below).

https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/recipes/example_to_datahub_kafka.yml
https://towardsdatascience.com/apache-hive-hooks-and-metastore-listeners-a-tale-of-your-metadata-903b751ee99f
https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub/#datahub-kafka
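
(A rough, hypothetical sketch of that idea, using DataHub's Python REST emitter; a real implementation would be triggered from a Hive hook or metastore listener on the Hive side. The GMS URL matches the sink used later in this task; the function and table names are made up.)

# Hypothetical sketch: something a Hive metastore listener / hook could call
# on table changes to push metadata, using DataHub's Python REST emitter.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="https://datahub-gms.discovery.wmnet:30443")

def on_table_changed(database: str, table: str, description: str = "") -> None:
    """Push a minimal datasetProperties aspect for a changed Hive table."""
    urn = make_dataset_urn(platform="hive", name=f"{database}.{table}", env="PROD")
    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=urn,
        aspectName="datasetProperties",
        aspect=DatasetPropertiesClass(description=description),
    )
    emitter.emit_mcp(mcp)

# e.g. called from the hook when an ALTER TABLE event fires:
# on_table_changed("event_sanitized", "centralnoticebannerhistory")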

Ben got everything running! I'm starting Hive Ingestion inspired by T299703#7662929.

First shot: just event_sanitized, with profiling on, to see how bad it is:

source:
  type: 'hive'
  config:
    host_port: analytics-hive.eqiad.wmnet:10000
    database: event_sanitized
    profiling:
      enabled: true
      # combine profiling queries where possible, to cut down on round trips
      query_combiner_enabled: true
      # skip the expensive metrics (distinct counts, quantiles, etc.)
      turn_off_expensive_profiling_metrics: true
      # table-level stats only (e.g. row counts), no per-column profiles
      profile_table_level_only: true
    options:
      connect_args:
        auth: 'KERBEROS'
        kerberos_service_name: hive
sink:
  type: 'datahub-rest'
  config:
    server: 'https://datahub-gms.discovery.wmnet:30443'
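
(Aside, for context rather than something that was actually run here: the options.connect_args above are handed through SQLAlchemy to PyHive, so an equivalent standalone connectivity check, assuming a valid Kerberos ticket and PyHive's SASL/Kerberos extras installed, would look roughly like this.)

from pyhive import hive  # needs pyhive plus its sasl/thrift_sasl dependencies

conn = hive.Connection(
    host="analytics-hive.eqiad.wmnet",
    port=10000,
    database="event_sanitized",
    auth="KERBEROS",
    kerberos_service_name="hive",
)
cursor = conn.cursor()
cursor.execute("SHOW TABLES")  # quick sanity check that the session works
print(cursor.fetchall())
cursor.close()
conn.close()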

Errors like this:

[2022-04-14 21:26:11,826] ERROR    {datahub.ingestion.run.pipeline:93} - failed to write record with workunit event_sanitized.centralnoticebannerhistory with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed
    at com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:167)
    at sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source)
...
    at java.lang.Thread.run(Thread.java:750)\nCaused by: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed
...
    ... 82 more\n', 'message': 'com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed\n', 'status': 422}) and info {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed
    at com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:167)
    at sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source)
...
    at java.lang.Thread.run(Thread.java:750)\nCaused by: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed
    at com.linkedin.metadata.entity.ValidationUtils.lambda$validateOrThrow$0(ValidationUtils.java:19)
    at com.linkedin.metadata.entity.RecordTemplateValidator.validate(RecordTemplateValidator.java:37)
    at com.linkedin.metadata.entity.ValidationUtils.validateOrThrow(ValidationUtils.java:17)
    at com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:165)
    ... 82 more\n', 'message': 'com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed\n', 'status': 422}

Thanks Dan. I wonder if it's anything to do with karapace, as opposed to schema-registry.

We can look at all of the application logs in Logstash. Filter on namespace datahub.

We can also look at the karapace application logs, to see if there are any clues there.

Ultimately it would be good to run the same test again with schema-registry instead of karapace, but that'll take a little setting up.

You could also try another test with an output sink of 'console' instead of 'datahub-rest', and just eyeball the output at this point.
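
(A minimal sketch of that console-sink test, run programmatically through DataHub's Pipeline API instead of the datahub CLI; the connection settings simply mirror the recipe above.)

from datahub.ingestion.run.pipeline import Pipeline

# Same hive source as above, but metadata events go to stdout for eyeballing
# instead of being posted to GMS.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "analytics-hive.eqiad.wmnet:10000",
                "database": "event_sanitized",
                "options": {
                    "connect_args": {
                        "auth": "KERBEROS",
                        "kerberos_service_name": "hive",
                    }
                },
            },
        },
        "sink": {"type": "console"},
    }
)
pipeline.run()
pipeline.raise_from_status()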

I think it makes sense to look at the karapace logs. I tried it with 'console' as the sink and it worked fine, no failures. And I cleaned out my personal database, made a single, very simple two-column table, tried to ingest that, and got the same error. I'm also going to post this on their Slack to see if they have any ideas, but the karapace logs are what I'll look at tomorrow.

On their Slack they said this looked like mismatched client/server versions. So maybe 0.8.32 is not fully rolled out somehow? I'll try rolling back the client to 0.8.28, which I believe you said was the last version (I couldn't figure out how to check via REST yet; I'll look into that more).

Aha! I was wrong, the server must still be on 0.8.28; I rolled back the DataHub client to 0.8.28 and ingestion started working. All good then, I'll do one database at a time and poke around. Awesssoooomme :))

Fantastic! I didn't think of that. Yes, I failed to complete the server upgrade on Thursday, primarily because of our heavyweight build process, but it's almost ready to go.

I can do it early on Tuesday if it helps, or I could wait for your say-so.

It's all yours after today, so you can definitely upgrade on Tuesday. I'm going to leave some ingestion running at the end of the day, but that should finish in a few hours.

Just doing a few more kicks of the tires, seeing what happens when tables get changed, how profiling works and how slow it is, etc.

Notes from the Field, Ingestion edition
  • event_sanitized, no profiling: 75 minutes, each table takes about 15-20 seconds once it gets going, 132 tables in total. But times are super weird, look below:
  • wmf: 12 minutes, 54 tables
  • wmf_raw: 11 minutes, 45 tables
  • canonical_data: 1 second, 3 tables
  • event: 11 minutes, 258 tables
  • profiling failed because, in strict mode, queries against partitioned tables require a partition clause, and the profiler doesn't set hive.mapred.mode away from strict (a possible workaround is sketched after this list)
  • triple-checked that there is no way to preview data from DataHub right now. Even to ingest metadata we send it via REST; DataHub has no way of connecting to our data stores and running a query. It looks like there's a grayed-out tab in the UI that could allow this via some configuration, but I'm confident we don't have anything configured right now
  • I'll be slowly ingesting all of wmf, wmf_raw, event, event_sanitized, and canonical_data into our MVP
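
(An untested idea for the strict-mode profiling failure above: PyHive's Connection accepts a configuration dict of Hive session settings, and the hive source's options.connect_args are passed through to it, so relaxing strict mode for the profiling session might look like the sketch below. Whether the profiler actually honours it would need checking.)

# Hypothetical source config; only the added "configuration" key differs
# from the recipe used earlier in this task.
hive_source_config = {
    "host_port": "analytics-hive.eqiad.wmnet:10000",
    "database": "event_sanitized",
    "profiling": {"enabled": True, "profile_table_level_only": True},
    "options": {
        "connect_args": {
            "auth": "KERBEROS",
            "kerberos_service_name": "hive",
            # Hive session setting applied when the session is opened
            "configuration": {"hive.mapred.mode": "nonstrict"},
        }
    },
}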

Take a look, it's pretty neat (maybe we can do something about the somewhat clumsy CSS styles, I looked into that a bit too).

When I get back I'll write an Airflow job that does the ingestion on a regular basis. For now I'd just like @EChetty and @odimitrijevic to take a look and let me know their thoughts on the set of databases we chose to ingest (event, event_sanitized, wmf, wmf_raw, canonical_data), the frequency at which we want to run this, and anything else that comes to mind.
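
(A rough sketch of what that Airflow job could look like, assuming the acryl-datahub package is importable on the Airflow workers; the DAG id, schedule, and task layout are placeholders, and the database list is the set named above.)

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from datahub.ingestion.run.pipeline import Pipeline

DATABASES = ["event", "event_sanitized", "wmf", "wmf_raw", "canonical_data"]

def ingest_database(database: str) -> None:
    """Run one hive -> datahub-rest ingestion pipeline for a single database."""
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "analytics-hive.eqiad.wmnet:10000",
                    "database": database,
                    "options": {
                        "connect_args": {
                            "auth": "KERBEROS",
                            "kerberos_service_name": "hive",
                        }
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "https://datahub-gms.discovery.wmnet:30443"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()

with DAG(
    dag_id="datahub_hive_ingestion",
    start_date=datetime(2022, 4, 1),
    schedule_interval="@weekly",  # placeholder; actual frequency to be decided
    catchup=False,
) as dag:
    for db in DATABASES:
        PythonOperator(
            task_id=f"ingest_{db}",
            python_callable=ingest_database,
            op_kwargs={"database": db},
        )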

https://datahub.wikimedia.org/

(I'm so psyched Ben got this working, all the kudos!)