Connect MVP to Hive metastore [Mile Stone 4]
Closed, Resolved · Public

Description

Goal: Connect MVP to a single data source.

Success Criteria:

  • Have Hive metastore connected to and visible in the MVP

Event Timeline

Milimetric renamed this task from Connect MVP to a Data Source [Mile Stone 4] to Connect MVP to Hive metastore [Mile Stone 4]. Feb 14 2022, 5:02 PM
Milimetric updated the task description.
Milimetric moved this task from Backlog to Next Up on the Data-Catalog board.

Note to self mostly: I have opened a few threads in the DataHub Slack about push-based ingestion. It looks like we have to write it ourselves, but I'm following up with a few people who seem to have done that, to see if they want to collaborate on an official solution. The pull-based ingestion seems simple enough, and we probably need it anyway (per Lambda Architecture principles). But it would be nice to have push; we would probably need to write our own emitter as a Hive hook (a rough sketch of that idea follows the links below).

https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/recipes/example_to_datahub_kafka.yml
https://towardsdatascience.com/apache-hive-hooks-and-metastore-listeners-a-tale-of-your-metadata-903b751ee99f
https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub/#datahub-kafka
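
(A rough, hypothetical sketch of that idea, using DataHub's Python REST emitter; a real implementation would be triggered from a Hive hook or metastore listener on the Hive side. The GMS URL matches the sink used later in this task; the function and table names are made up.)

# Hypothetical sketch: something a Hive metastore listener / hook could call
# on table changes to push metadata, using DataHub's Python REST emitter.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="https://datahub-gms.discovery.wmnet:30443")

def on_table_changed(database: str, table: str, description: str = "") -> None:
    """Push a minimal datasetProperties aspect for a changed Hive table."""
    urn = make_dataset_urn(platform="hive", name=f"{database}.{table}", env="PROD")
    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=urn,
        aspectName="datasetProperties",
        aspect=DatasetPropertiesClass(description=description),
    )
    emitter.emit_mcp(mcp)

# e.g. called from the hook when an ALTER TABLE event fires:
# on_table_changed("event_sanitized", "centralnoticebannerhistory")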

Ben got everything running! I'm starting Hive Ingestion inspired by T299703#7662929.

First shot: just event_sanitized, with profiling on, to see how bad it is:

source:
  type: 'hive'
  config:
    host_port: analytics-hive.eqiad.wmnet:10000
    database: event_sanitized
    profiling:
      enabled: true
      # combine profiling queries where possible, to cut down on round trips
      query_combiner_enabled: true
      # skip the expensive metrics (distinct counts, quantiles, etc.)
      turn_off_expensive_profiling_metrics: true
      # table-level stats only (e.g. row counts), no per-column profiles
      profile_table_level_only: true
    options:
      connect_args:
        auth: 'KERBEROS'
        kerberos_service_name: hive
sink:
  type: 'datahub-rest'
  config:
    server: 'https://datahub-gms.discovery.wmnet:30443'
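
(Aside, for context rather than something that was actually run here: the options.connect_args above are handed through SQLAlchemy to PyHive, so an equivalent standalone connectivity check, assuming a valid Kerberos ticket and PyHive's SASL/Kerberos extras installed, would look roughly like this.)

from pyhive import hive  # needs pyhive plus its sasl/thrift_sasl dependencies

conn = hive.Connection(
    host="analytics-hive.eqiad.wmnet",
    port=10000,
    database="event_sanitized",
    auth="KERBEROS",
    kerberos_service_name="hive",
)
cursor = conn.cursor()
cursor.execute("SHOW TABLES")  # quick sanity check that the session works
print(cursor.fetchall())
cursor.close()
conn.close()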

Errors like this:

[2022-04-14 21:26:11,826] ERROR    {datahub.ingestion.run.pipeline:93} - failed to write record with workunit event_sanitized.centralnoticebannerhistory with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed
    at com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:167)
    at sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source)
...
    at java.lang.Thread.run(Thread.java:750)\nCaused by: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed
...
    ... 82 more\n', 'message': 'com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed\n', 'status': 422}) and info {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed
    at com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:167)
    at sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source)
...
    at java.lang.Thread.run(Thread.java:750)\nCaused by: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed
    at com.linkedin.metadata.entity.ValidationUtils.lambda$validateOrThrow$0(ValidationUtils.java:19)
    at com.linkedin.metadata.entity.RecordTemplateValidator.validate(RecordTemplateValidator.java:37)
    at com.linkedin.metadata.entity.ValidationUtils.validateOrThrow(ValidationUtils.java:17)
    at com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:165)
    ... 82 more\n', 'message': 'com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed\n', 'status': 422}

Thanks Dan. I wonder if it's anything to do with karapace, as opposed to schema-registry.

We can look at all of the application logs in Logstash. Filter on namespace datahub.

We can also look at the karapace application logs, to see if there are any clues there.

Ultimately it would be good to run the same test again with schema-registry instead of karapace, but that'll take a little setting up.

You could also try another test with an output sink of 'console' instead of 'datahub-rest', and just eyeball the output at this point.
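
(A minimal sketch of that console-sink test, run programmatically through DataHub's Pipeline API instead of the datahub CLI; the connection settings simply mirror the recipe above.)

from datahub.ingestion.run.pipeline import Pipeline

# Same hive source as above, but metadata events go to stdout for eyeballing
# instead of being posted to GMS.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "analytics-hive.eqiad.wmnet:10000",
                "database": "event_sanitized",
                "options": {
                    "connect_args": {
                        "auth": "KERBEROS",
                        "kerberos_service_name": "hive",
                    }
                },
            },
        },
        "sink": {"type": "console"},
    }
)
pipeline.run()
pipeline.raise_from_status()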

I think it makes sense to look at the karapace logs. I tried it with 'console' as the sink and it worked fine, no failures. And I cleaned out my personal database, made a single, very simple two-column table, tried to ingest that, and got the same error. I'm also going to post this on their Slack to see if they have any ideas, but the karapace logs are what I'll look at tomorrow.

On their Slack they said this looked like mismatched client/server versions. So maybe 0.8.32 is not fully rolled out somehow? I'll try rolling back the client to 0.8.28, which I believe you said was the last version (I couldn't figure out how to check via REST yet; I'll look into that more).

Aha! I was wrong, the server must still be on 0.8.28; I rolled back the DataHub client to 0.8.28 and ingestion started working. All good then, I'll do one database at a time and poke around. Awesssoooomme :))

Fantastic! I didn't think of that. Yes, I failed to complete the server upgrade on Thursday, primarily because of our heavyweight build process, but it's almost ready to go.

I can do it early on Tuesday if it helps, or I could wait for your say-so.

It's all yours after today, so you can definitely upgrade on Tuesday. I'm going to leave some ingestion running at the end of the day, but that should finish in a few hours.

Just doing a few more kicks of the tires, seeing what happens when tables get changed, how profiling works and how slow it is, etc.

Notes from the Field, Ingestion edition
  • event_sanitized, no profiling: 75 minutes, each table takes about 15-20 seconds once it gets going, 132 tables in total. But times are super weird, look below:
  • wmf: 12 minutes, 54 tables
  • wmf_raw: 11 minutes, 45 tables
  • canonical_data: 1 second, 3 tables
  • event: 11 minutes, 258 tables
  • profiling failed because, in strict mode, queries against partitioned tables require a partition clause, and the profiler doesn't set hive.mapred.mode away from strict (a possible workaround is sketched after this list)
  • triple-checked that there is no way to preview data from DataHub right now. Even to ingest metadata we send it via REST; DataHub has no way of connecting to our data stores and running a query. It looks like there's a grayed-out tab in the UI that could allow this via some configuration, but I'm confident we don't have anything configured right now
  • I'll be slowly ingesting all of wmf, wmf_raw, event, event_sanitized, and canonical_data into our MVP
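
(An untested idea for the strict-mode profiling failure above: PyHive's Connection accepts a configuration dict of Hive session settings, and the hive source's options.connect_args are passed through to it, so relaxing strict mode for the profiling session might look like the sketch below. Whether the profiler actually honours it would need checking.)

# Hypothetical source config; only the added "configuration" key differs
# from the recipe used earlier in this task.
hive_source_config = {
    "host_port": "analytics-hive.eqiad.wmnet:10000",
    "database": "event_sanitized",
    "profiling": {"enabled": True, "profile_table_level_only": True},
    "options": {
        "connect_args": {
            "auth": "KERBEROS",
            "kerberos_service_name": "hive",
            # Hive session setting applied when the session is opened
            "configuration": {"hive.mapred.mode": "nonstrict"},
        }
    },
}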

Take a look, it's pretty neat (maybe we can do something about the somewhat clumsy CSS styles, I looked into that a bit too).

When I get back I'll write an Airflow job that does the ingestion on a regular basis. For now I'd just like @EChetty and @odimitrijevic to take a look and let me know their thoughts on the set of databases we chose to ingest (event, event_sanitized, wmf, wmf_raw, canonical_data), the frequency at which we want to run this, and anything else that comes to mind.
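
(A rough sketch of what that Airflow job could look like, assuming the acryl-datahub package is importable on the Airflow workers; the DAG id, schedule, and task layout are placeholders, and the database list is the set named above.)

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from datahub.ingestion.run.pipeline import Pipeline

DATABASES = ["event", "event_sanitized", "wmf", "wmf_raw", "canonical_data"]

def ingest_database(database: str) -> None:
    """Run one hive -> datahub-rest ingestion pipeline for a single database."""
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "analytics-hive.eqiad.wmnet:10000",
                    "database": database,
                    "options": {
                        "connect_args": {
                            "auth": "KERBEROS",
                            "kerberos_service_name": "hive",
                        }
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "https://datahub-gms.discovery.wmnet:30443"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()

with DAG(
    dag_id="datahub_hive_ingestion",
    start_date=datetime(2022, 4, 1),
    schedule_interval="@weekly",  # placeholder; actual frequency to be decided
    catchup=False,
) as dag:
    for db in DATABASES:
        PythonOperator(
            task_id=f"ingest_{db}",
            python_callable=ingest_database,
            op_kwargs={"database": db},
        )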

https://datahub.wikimedia.org/

(I'm so psyched Ben got this working, all the kudos!)