
Propagate field descriptions from event schemas to Hive event tables
Open, Medium · Public · 5 Estimated Story Points

Description

As a user of DataHub, I expect the columns of the tables in the event database to automatically have descriptions based on the corresponding event schemas. @Milimetric suggested that, since the Hive metastore is ingested into DataHub, what we actually need is to populate the metastore table metadata with field descriptions from the corresponding schemas.
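For illustration, a hypothetical schema fragment (in the YAML form the schema repositories use; the field and wording are borrowed from the example output further down, not copied from a real schema file), and the Hive column comment it should become:

wiki_id:
  type: string
  description: The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc.

which should surface in the metastore as:

wiki_id	string	The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc.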

Event Timeline

EChetty set the point value for this task to 2. May 5 2022, 4:15 PM
EChetty changed the point value for this task from 2 to 5. Aug 17 2022, 2:17 PM

Indeed! The JSONSchema -> Spark Schema code does use the descriptions, but I think we are losing them when we map to Hive DDL: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-spark/src/main/scala/org/wikimedia/analytics/refinery/spark/sql/HiveExtensions.scala#L124-L134.
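To make the loss concrete, a minimal sketch (not the refinery code): Spark keeps a field's description in its StructField metadata, but a DDL fragment built only from name and type drops it:

import org.apache.spark.sql.types._

val field = StructField("wiki_id", StringType)
  .withComment("The wiki ID, which is usually the same as the MediaWiki database name.")

field.getComment()           // Some("The wiki ID, ...")
field.dataType.catalogString // "string" -- the comment is not part of the type

// A naive `name type` rendering loses the description...
s"`${field.name}` ${field.dataType.catalogString}"
// => "`wiki_id` string"

// ...so keeping it means emitting COMMENT explicitly (quote escaping omitted):
field.getComment()
  .map(c => s"`${field.name}` ${field.dataType.catalogString} COMMENT '$c'")
  .getOrElse(s"`${field.name}` ${field.dataType.catalogString}")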

I think this would automatically just work if we could create/alter the tables through Spark directly, rather than through Hive.

Once we are fully on Spark 3, we should be able to do T209453: Refine: Use Spark SQL instead of Hive JDBC, which I think should address this problem.

E.g., there are enriched fields in the Hive tables, like geocoded_data, etc.

How would the DataHub description for geocoded_data (and any other fields that get added in refinement) get populated?

They'd have to be set when the field is added to the table for the first time. If a new event stream is declared, Refine will pick it up and CREATE the table with the geocoded_data field. Since Refine is adding the field, it needs to set the description there.

Not exactly sure how that would work, but probably somewhere in there we'd augment the workingDf.schema to set the description for the newly added geocoded_data field, e.g. as sketched below.
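A hypothetical sketch, assuming a SparkSession spark and a DataFrame workingDf whose schema already contains geocoded_data (the comment wording is illustrative):

import org.apache.spark.sql.types.StructType

val augmentedSchema = StructType(workingDf.schema.map {
  // Attach a description to the enrichment field that Refine itself adds:
  case f if f.name == "geocoded_data" =>
    f.withComment("Geocoded information derived from the client IP during refinement.")
  case f => f
})

// Rebuild the DataFrame against the commented schema:
val augmentedDf = spark.createDataFrame(workingDf.rdd, augmentedSchema)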

For existing Hive event tables, we'd have to run a big ALTER TABLE ... CHANGE COLUMN ... COMMENT kind of migration on all of them.
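Per column, that would be something along these lines (table and wording illustrative):

ALTER TABLE event.mediawiki_page_change_v1 CHANGE COLUMN wiki_id wiki_id string COMMENT 'The wiki ID, which is usually the same as the MediaWiki database name.';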

DataHub has an API we can use to import the schema. That schema should ideally be tied to the Kafka topics, as they are the true source.

This task is about the event tables in Hive, for which most fields are indeed created from the event schemas, but not all. If we properly propagated the field descriptions from the event schemas to the Hive tables, then the descriptions would just be imported into Datahub from Hive via the regular import process.

T318863: [Event Platform] Event Platform and DataHub Integration is about cataloging the Event Platform streams in Kafka with their event schemas.

@Ottomata makes sense. Thanks for posting the ticket

Ottomata renamed this task from "Propagate field descriptions from event schemas to metastore" to "Propagate field descriptions from event schemas to Hive event tables". May 30 2023, 5:02 PM
xcollazo subscribed.

Just passing by to say that it would be nice to see this ticket happen. CC @lbowmaker.

Change 987195 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] spark HiveExtensions now support column COMMENTs in DDL and merge helpers

https://gerrit.wikimedia.org/r/987195

I think this would automatically just work if we could create/alter the tables through Spark directly, rather than through Hive.

So, while I think this is true, T209453: Refine: Use Spark SQL instead of Hive JDBC is proving not easy in Spark 3 either. Looks like we'll need some changes in upstream Spark.

However, since I was poking around in the code, I think it is possible to solve this task without the Spark SQL ALTER DDL support mentioned above. I just pushed a patch to refinery-source HiveExtensions that I think should do what we need. It will also have the really nice effect of adding the field descriptions to all of the Refined event tables! ...will test this!

Wow it...kinda...works~

CREATE TABLE otto.mw_page_change0 LIKE event.mediawiki_page_change_v1;

Then I ran our EvolveHiveTable tool to alter my table. Here's the output of the tool with all the generated ALTER statements.

The result!

hive (otto)> describe mw_page_change0;
OK
col_name	data_type	comment
_schema             	string              	A URI identifying the JSONSchema for this event. This should match a schema's $id in a schema repository. E.g. /schema/title/1.0.0
changelog_kind      	string              	The kind of this event in a changelog. This is used to map the event to an action in a data store.
comment             	string              	The comment left by the user that performed this change. Same as revision.comment on edits.
created_redirect_page	struct<is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string,revision_count:bigint>	Page entity that was created at the old title during a page move. This is only set for page move events. Note that the created_redirect_page will also have its own associated page create event.
dt                  	string              	ISO-8601 formatted timestamp of when the event occurred/was generated, in UTC, AKA 'event time'. This is different than meta.dt, which is used as the time the system received this event.
meta                	struct<domain:string,dt:string,id:string,request_id:string,stream:string,uri:string>
page                	struct<is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string,revision_count:bigint,redirect_page_link:struct<interwiki_prefix:string,is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string>>	Fields for MediaWiki page entity.
page_change_kind    	string              	The origin kind of the change to this page as viewed by MediaWiki.
performer           	struct<edit_count:bigint,groups:array<string>,is_bot:boolean,is_system:boolean,is_temp:boolean,registration_dt:string,user_id:bigint,user_text:string>	Represents the MediaWiki actor that made this change. If this change is an edit, this will be the same as revision.editor.
prior_state         	struct<page:struct<is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string,revision_count:bigint>,revision:struct<comment:string,content_slots:map<string,struct<content_body:string,content_format:string,content_model:string,content_sha1:string,content_size:bigint,origin_rev_id:bigint,slot_role:string>>,editor:struct<edit_count:bigint,groups:array<string>,is_bot:boolean,is_system:boolean,is_temp:boolean,registration_dt:string,user_id:bigint,user_text:string>,is_comment_visible:boolean,is_content_visible:boolean,is_editor_visible:boolean,is_minor_edit:boolean,rev_dt:string,rev_id:bigint,rev_parent_id:bigint,rev_sha1:string,rev_size:bigint>>	Prior state of this page before this event. Fields are only present if their values have changed.
revision            	struct<comment:string,content_slots:map<string,struct<content_body:string,content_format:string,content_model:string,content_sha1:string,content_size:bigint,origin_rev_id:bigint,slot_role:string>>,editor:struct<edit_count:bigint,groups:array<string>,is_bot:boolean,is_system:boolean,is_temp:boolean,registration_dt:string,user_id:bigint,user_text:string>,is_comment_visible:boolean,is_content_visible:boolean,is_editor_visible:boolean,is_minor_edit:boolean,rev_dt:string,rev_id:bigint,rev_parent_id:bigint,rev_sha1:string,rev_size:bigint>	Fields for MediaWiki revision entity.
wiki_id             	string              	The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc.
is_wmf_domain       	boolean
normalized_host     	struct<project_class:string,project:string,qualifiers:array<string>,tld:string,project_family:string>
datacenter          	string
year                	bigint
month               	bigint
day                 	bigint
hour                	bigint

# Partition Information
# col_name            	data_type           	comment

datacenter          	string
year                	bigint
month               	bigint
day                 	bigint
hour                	bigint
Time taken: 0.313 seconds, Fetched: 28 row(s)
hive (otto)> describe mediawiki_page_change_v1;
FAILED: SemanticException [Error 10001]: Table not found mediawiki_page_change_v1
hive (otto)> describe event.mediawiki_page_change_v1;
OK
col_name	data_type	comment
_schema             	string
changelog_kind      	string
comment             	string
created_redirect_page	struct<is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string,revision_count:bigint>
dt                  	string
meta                	struct<domain:string,dt:string,id:string,request_id:string,stream:string,uri:string>
page                	struct<is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string,revision_count:bigint,redirect_page_link:struct<interwiki_prefix:string,is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string>>
page_change_kind    	string
performer           	struct<edit_count:bigint,groups:array<string>,is_bot:boolean,is_system:boolean,is_temp:boolean,registration_dt:string,user_id:bigint,user_text:string>
prior_state         	struct<page:struct<is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string,revision_count:bigint>,revision:struct<comment:string,content_slots:map<string,struct<content_body:string,content_format:string,content_model:string,content_sha1:string,content_size:bigint,origin_rev_id:bigint,slot_role:string>>,editor:struct<edit_count:bigint,groups:array<string>,is_bot:boolean,is_system:boolean,is_temp:boolean,registration_dt:string,user_id:bigint,user_text:string>,is_comment_visible:boolean,is_content_visible:boolean,is_editor_visible:boolean,is_minor_edit:boolean,rev_dt:string,rev_id:bigint,rev_parent_id:bigint,rev_sha1:string,rev_size:bigint>>
revision            	struct<comment:string,content_slots:map<string,struct<content_body:string,content_format:string,content_model:string,content_sha1:string,content_size:bigint,origin_rev_id:bigint,slot_role:string>>,editor:struct<edit_count:bigint,groups:array<string>,is_bot:boolean,is_system:boolean,is_temp:boolean,registration_dt:string,user_id:bigint,user_text:string>,is_comment_visible:boolean,is_content_visible:boolean,is_editor_visible:boolean,is_minor_edit:boolean,rev_dt:string,rev_id:bigint,rev_parent_id:bigint,rev_sha1:string,rev_size:bigint>
wiki_id             	string
is_wmf_domain       	boolean
normalized_host     	struct<project_class:string,project:string,qualifiers:array<string>,tld:string,project_family:string>

...

It looks like the comments were only added for top-level fields, even though the ALTER TABLE ... CHANGE COLUMN statements included them?

I just did a test in pure Hive, and indeed, sub struct field comments don't seem to be retained.

CREATE TABLE otto.nested1 (c1 struct<f1:int COMMENT "f1 comment"> COMMENT "c1 comment") STORED AS PARQUET;

hive (otto)> describe otto.nested1;
OK
col_name	data_type	comment
c1                  	struct<f1:int>      	c1 comment

As you can see, there is no comment for sub field c1.f1.
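For what it's worth, the descriptions do exist on the Spark side; a minimal sketch showing that nested comments live in StructField metadata, while the catalog type string used when rendering Hive DDL has no place for them:

import org.apache.spark.sql.types._

val inner = StructType(Seq(
  StructField("f1", IntegerType).withComment("f1 comment")))
val c1 = StructField("c1", inner).withComment("c1 comment")

c1.getComment()            // Some("c1 comment") -- top level; Hive keeps this
inner("f1").getComment()   // Some("f1 comment") -- nested; Hive drops this
c1.dataType.catalogString  // "struct<f1:int>"   -- nested comments aren't representable here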

DataHub allows you to add descriptions at the sub-field level. We should at some point reach consensus about where we want all of this description stuff to live. We talked about:

  • table-creation code (direct HQL, or indirect via schemas) being the initial source, but DataHub being the authoritative source of truth once it's in there. The two interplay nicely as long as changes are backwards compatible; manual intervention is needed in backwards-incompatible cases.
  • driving everything from the create script/schema and syncing the stuff people put in DataHub back to the code (I have doubts about this approach).

But we should decide sometime soon.

we should decide sometime soon

Aye, prob a different ticket.

syncing the stuff people put in Datahub back to the code

Many of the field descriptions come from common/shared fragment schemas. If people want changes in the documentation of a field, they should definitely update it in the source schemas. New schemas will be created using the fragments, so the 'old' description will be used for the same field in a new schema (see the example below).
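For example (illustrative, in the YAML form the schema repositories use; the fragment version and field name are hypothetical), a concrete schema pulls the shared fields, descriptions included, in via a fragment reference:

allOf:
  - $ref: /fragment/common/2.0.0#
  - properties:
      my_field:
        type: string
        description: A field specific to this schema.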

More generally, I think things should be synced back anyway. DataHub is not the only way people get info about e.g. Hive tables; I don't think DataHub will ever stop people from doing describe table.

Change #1016808 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] Add CLI to create or update Iceberg tables

https://gerrit.wikimedia.org/r/1016808