Goal: Connect MVP to a single data source.
Success Criteria:
- Have Hive metastore connected to and visible in the MVP
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | BTullis | T299910 Data Catalog MVP |
| Resolved | | Milimetric | T299897 Connect MVP to Hive metastore [Mile Stone 4] |
Note to self mostly: I have opened a few threads in the DataHub Slack about push-based ingestion. It looks like we would have to write it ourselves, but I'm following up with a few people who seem to have done that, to see if they want to collaborate on an official solution. The pull-based ingestion seems simple enough, and we probably need it anyway (per Lambda Architecture principles), but it would be nice to have push too; we would probably need to write our own emitter as a Hive hook (see the sketch after the links below).
https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/recipes/example_to_datahub_kafka.yml
https://towardsdatascience.com/apache-hive-hooks-and-metastore-listeners-a-tale-of-your-metadata-903b751ee99f
https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub/#datahub-kafka
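For reference, here is roughly what the push side could look like from Python. This is a minimal sketch, assuming the `datahub` Python package's Kafka emitter and a version recent enough to accept MetadataChangeProposals; the broker and schema registry hostnames are placeholders, and the real hook would live in the metastore JVM as a listener that emits something equivalent.

```python
# Minimal sketch: push one metadata change to DataHub over Kafka.
# Hostnames below are placeholders, not our real endpoints.
from datahub.emitter.kafka_emitter import DatahubKafkaEmitter, KafkaEmitterConfig
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubKafkaEmitter(
    KafkaEmitterConfig.parse_obj(
        {
            "connection": {
                "bootstrap": "kafka-broker.example.wmnet:9092",
                "schema_registry_url": "https://schema-registry.example.wmnet:8081",
            }
        }
    )
)

# A metastore listener would build one of these per CREATE/ALTER event.
mcp = MetadataChangeProposalWrapper(
    entityUrn=make_dataset_urn(platform="hive", name="event_sanitized.example_table", env="PROD"),
    aspect=DatasetPropertiesClass(description="pushed from a hive hook (illustrative only)"),
)

emitter.emit(mcp, callback=lambda err, msg: print(err or "emitted"))
emitter.flush()
```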
Ben got everything running! I'm starting Hive Ingestion inspired by T299703#7662929.
First shot: just event_sanitized, with profiling on, to see how bad it is:
source:
  type: 'hive'
  config:
    host_port: analytics-hive.eqiad.wmnet:10000
    database: event_sanitized
    profiling:
      enabled: true
      query_combiner_enabled: true
      turn_off_expensive_profiling_metrics: true
      profile_table_level_only: true
    options:
      connect_args:
        auth: 'KERBEROS'
        kerberos_service_name: hive
sink:
  type: 'datahub-rest'
  config:
    server: 'https://datahub-gms.discovery.wmnet:30443'
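(For reference, a recipe like this is run with the DataHub CLI, e.g. `datahub ingest -c recipe.yml`, where the filename is just whatever the config above is saved as.)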
Errors like this:
[2022-04-14 21:26:11,826] ERROR {datahub.ingestion.run.pipeline:93} - failed to write record with workunit event_sanitized.centralnoticebannerhistory with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed at com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:167) at sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source) ... at java.lang.Thread.run(Thread.java:750)\nCaused by: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed ... ... 82 more\n', 'message': 'com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed\n', 'status': 422}) and info {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed at com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:167) at sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source) ... at java.lang.Thread.run(Thread.java:750)\nCaused by: com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed at com.linkedin.metadata.entity.ValidationUtils.lambda$validateOrThrow$0(ValidationUtils.java:19) at com.linkedin.metadata.entity.RecordTemplateValidator.validate(RecordTemplateValidator.java:37) at com.linkedin.metadata.entity.ValidationUtils.validateOrThrow(ValidationUtils.java:17) at com.linkedin.metadata.resources.entity.EntityResource.ingest(EntityResource.java:165) ... 82 more\n', 'message': 'com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/1/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed\n', 'status': 422}
Thanks Dan. I wonder if it's anything to do with karapace, as opposed to schema-registry.
We can look at all of the application logs in Logstash. Filter on namespace datahub.
We can also look at the karapace application logs, to see if there are any clues there.
Ultimately it would be good to run the same test again with schema-registry instead of karapace, but that'll take a little setting up.
You could also try another test with an output sink of 'console' instead of 'datahub-rest' and eyeball the output at that point.
I think it makes sense to look at the karapace logs. I tried it with 'console' as the sink and it worked fine, no failures. I also cleaned out my personal database, made a single very simple two-column table, tried to ingest that, and got the same error. I'm also going to post this on their Slack to see if they have any ideas, but the karapace logs are what I'll look at tomorrow.
On their Slack they said this looked like mismatched client/server versions. So maybe 0.8.32 is not fully rolled out somehow? I'll try rolling back the client to 0.8.28, which I believe you said was the last version (I couldn't figure out how to check the server version via REST yet; I'll look into that more).
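One untested idea for the REST check, assuming the GMS `/config` endpoint (which is what the datahub CLI itself hits when it compares client and server versions) is reachable without extra auth:

```python
# Untested sketch: ask GMS what version it reports.
# /config is the endpoint the datahub CLI uses for its own version check.
import requests

resp = requests.get("https://datahub-gms.discovery.wmnet:30443/config", timeout=10)
resp.raise_for_status()
config = resp.json()
# The response should include a "versions" section, e.g.
# config["versions"]["linkedin/datahub"]["version"]
print(config.get("versions"))
```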
Aha! I was wrong; the server must still be on 0.8.28. I rolled back the DataHub client to 0.8.28 and ingestion started working. All good then, I'll do one database at a time and poke around. Awesssoooomme :))
Fantastic! I didn't think of that. Yes, I failed to complete the server upgrade on Thursday, primarily because of our heavyweight build process, but it's almost ready to go.
I can do it early on Tuesday if it helps, or I could wait for your say-so.
It's all yours after today, so you can definitely upgrade on Tuesday. I'm going to leave some ingestion running at the end of the day, but that should finish in a few hours.
Just doing a few more kicks of the tires, seeing what happens when tables get changed, how profiling works and how slow it is, etc.
Take a look, it's pretty neat (maybe we can do something about the somewhat clumsy CSS styles; I looked into that a bit too).
When I get back I'll write an Airflow job that does the ingestion on a regular basis (a rough sketch of what that might look like is at the end of this comment). For now I'd just like @EChetty and @odimitrijevic to take a look and let me know their thoughts on the set of databases we chose to ingest (event, event_sanitized, wmf, wmf_raw, canonical_data), how frequently we think we want to run this, and anything else that comes to mind.
https://datahub.wikimedia.org/
(I'm so psyched Ben got this working, all the kudos!)
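Roughly, the Airflow job could be as simple as a scheduled task per database that shells out to the CLI. A minimal sketch, assuming Airflow 2's BashOperator, the datahub CLI and one recipe file per database available on the workers, and a weekly cadence as a placeholder until we settle on the actual frequency:

```python
# Illustrative sketch of a scheduled DataHub ingestion DAG (Airflow 2).
# DAG id, schedule, recipe paths, and database list are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DATABASES = ["event", "event_sanitized", "wmf", "wmf_raw", "canonical_data"]

with DAG(
    dag_id="datahub_hive_ingestion",
    start_date=datetime(2022, 5, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    for db in DATABASES:
        BashOperator(
            task_id=f"ingest_{db}",
            # One recipe per database, same shape as the event_sanitized one above.
            bash_command=f"datahub ingest -c /srv/datahub/recipes/hive_{db}.yml",
        )
```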