Propagate field descriptions from event schemas to Hive event tables
Open, MediumPublic5 Estimated Story Points
Actions

Assigned To

None

Authored By

	mpopov
	Apr 27 2022, 9:10 PM

Description

As a user of DataHub I expect the columns of the tables in the event database to automatically have descriptions based on the corresponding schemas. @Milimetric suggested that since the hive metastore in ingested into DataHub what we actually need is to populate the metastore table metadata with field descriptions from the corresponding schemas

For Event Platform tables created out of configured streams such as mediawiki.skin_diff I would expect event.mediawiki_skin_diff to use analytics/pref_diff/latest.json
Likewise for migrated EventLogging schemas (e.g. eventlogging_SearchSatisfaction stream, event.searchsatisfaction table, analytics/legacy/searchsatisfaction schema latest.json)
Since the intention is to eventually shut down legacy EventLogging, I don't think we need to update metastore with details from not-yet-migrated legacy EL schemas on Meta wiki

Details

	Subject	Repo	Branch	Lines +/-
	Add CLI to create or update Iceberg tables	analytics/refinery/source	master	+1 K -557
	spark HiveExtensions now support column COMMENTs in DDL and merge helpers	analytics/refinery/source	master	+97 -32

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Open	None	T291464 Upgrade analytics-hadoop to Spark 3 + scala 2.12
Duplicate	None	T291465 Analytics-test-hadoop Spark3 package upgrade
Duplicate	None	T291466 Analytics-hadoop Spark3 package upgrade (production)
Resolved	JAllemandou	T306955 Spark3 migration - Currently existing airflow jobs
Open	None	T291386 Upgrade Refinery Jobs to Spark 3
Open	None	T209453 Refine: Use Spark SQL instead of Hive JDBC
Open	None	T307040 Propagate field descriptions from event schemas to Hive event tables
Open	None	T255818 Refine drops $schema field values
Open	JAllemandou	T259924 HiveExtensions.convertToSchema does not properly convert arrays of structs

Event Timeline

mpopov created this task.Apr 27 2022, 9:10 PM

mpopov moved this task from Triage to Tracking on the Product-Analytics board.Apr 28 2022, 1:29 PM

BTullis subscribed.Apr 29 2022, 9:34 AM

• EChetty moved this task from Backlog to Next Up on the Data-Catalog board.May 5 2022, 3:58 PM

• EChetty set the point value for this task to 2.May 5 2022, 4:15 PM

• EChetty moved this task from Next Up to Backlog on the Data-Catalog board.May 10 2022, 6:02 PM

JArguello-WMF moved this task from Incoming (new tickets) to Security & Governance on the Data-Engineering board.May 11 2022, 6:35 PM

• EChetty changed the point value for this task from 2 to 5.Aug 17 2022, 2:17 PM

mpopov mentioned this in T318863: [Event Platform] Event Platform and DataHub Integration.Oct 20 2022, 2:30 PM

Indeed! The JSONSchema -> Spark Schema code does use the descriptions, but I think we are losing them when we map to Hive DDL: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-spark/src/main/scala/org/wikimedia/analytics/refinery/spark/sql/HiveExtensions.scala#L124-L134.

I think this would automatically just work if we could create/alter the tables through Spark directly, rather than through Hive.

Once we are fully on Spark 3, we should be able to do T209453: Refine: Use Spark SQL instead of Hive JDBC , which I think should address this problem.

In T318863#8333367, @Ottomata wrote:

e.g. there are enriched fields in the Hive tables, like geocoded_data, etc.

How would the DataHub description for geocoded_data (and any other fields that get added in refinement) get populated?

They'd have to be set when the field is added to the table the first time. If a new event stream is declared, Refine will pick it up CREATE the table with the geocoded_data field. Since Refine is adding the field, it needs to set the description there.

Not exactly sure how that would work, but probably somewhere in here, augment the workingDf.schema to set the description for the newly added geocoded_data field.

For existent Hive event tables, we'd have to do a big ALTER TABLE set comment kind on all of them.

DataHub has an API and we can use to import the schema. That schema should ideally be tied to the Kafka topics as this is the true source.

This task is about the event tables in Hive, for which most fields are indeed created from the event schemas, but not all. If we properly propagated the field descriptions from the event schemas to the Hive tables, then the descriptions would just be imported into Datahub from Hive via the regular import process.

T318863: [Event Platform] Event Platform and DataHub Integration is about cataloging the Event Platform streams in Kafka with their event schemas.

@Ottomata makes sense. Thanks for posting the ticket

Ottomata renamed this task from Propagate field descriptions from event schemas to metastore to Propagate field descriptions from event schemas to Hive event tables.May 30 2023, 5:02 PM

Just passing by to say that it would be nice to see this ticket happen. CC @lbowmaker.

Ottomata added a parent task: T209453: Refine: Use Spark SQL instead of Hive JDBC .Jun 21 2023, 4:00 PM

JArguello-WMF removed a project: Data-Catalog.Jun 29 2023, 3:42 PM

JArguello-WMF triaged this task as Medium priority.Jun 29 2023, 9:05 PM

JArguello-WMF moved this task from Security & Governance to Incoming (new tickets) on the Data-Engineering board.Jun 29 2023, 11:44 PM

JArguello-WMF moved this task from Incoming (new tickets) to Radar (External Teams) on the Data-Engineering board.Jun 29 2023, 11:51 PM

lbowmaker moved this task from Radar (External Teams) to Icebox (not considered in current quarter) on the Data-Engineering board.Nov 10 2023, 1:23 PM

Change 987195 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] spark HiveExtensions now support column COMMENTs in DDL and merge helpers

https://gerrit.wikimedia.org/r/987195

gerritbot added a project: Patch-For-Review.Jan 2 2024, 9:17 PM

I think this would automatically just work if we could create/alter the tables through Spark directly, rather than through Hive.

So, while I think this is true, T209453: Refine: Use Spark SQL instead of Hive JDBC is proving not easy in Spark 3 either. Looks like we'll need some changes in upstream Spark.

However, since I was poking around in the code, I think it is possible to solve this task without the Spark SQL ALTER DDL support we need. I just pushed a patch to refinery-source HiveExtensions that I think should do what we need. It will also have the really nice effect of adding the field descriptions to all of the Refined event tables! ...will test this!

Wow it...kinda...works~

CREATE TABLE otto.mw_page_change0 LIKE event.mediawiki_page_change_v1;

Then I ran our EvolveHiveTable tool to alter my table. Here's the output of the tool with all the generated ALTER statements.

The result!

hive (otto)> describe mw_page_change0;
OK
col_name	data_type	comment
_schema             	string              	A URI identifying the JSONSchema for this event. This should match an schema's $id in a schema repository. E.g. /schema/title/1.0.0
changelog_kind      	string              	The kind of this event in a changelog. This is used to map the event to an action in a data store.
comment             	string              	The comment left by the user that performed this change. Same as revision.comment on edits.
created_redirect_page	struct<is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string,revision_count:bigint>	Page entity that was created at the old title during a page move. This is only set for page move events. Note that the created_redirect_page will also have its own associated page create event.
dt                  	string              	ISO-8601 formatted timestamp of when the event occurred/was generated in UTC), AKA 'event time'. This is different than meta.dt, which is used as the time the system received this event.
meta                	struct<domain:string,dt:string,id:string,request_id:string,stream:string,uri:string>
page                	struct<is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string,revision_count:bigint,redirect_page_link:struct<interwiki_prefix:string,is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string>>	Fields for MediaWiki page entity.
page_change_kind    	string              	The origin kind of the change to this page as viewed by MediaWiki.
performer           	struct<edit_count:bigint,groups:array<string>,is_bot:boolean,is_system:boolean,is_temp:boolean,registration_dt:string,user_id:bigint,user_text:string>	Represents the MediaWiki actor that made this change. If this change is an edit, this will be the same as revision.editor.
prior_state         	struct<page:struct<is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string,revision_count:bigint>,revision:struct<comment:string,content_slots:map<string,struct<content_body:string,content_format:string,content_model:string,content_sha1:string,content_size:bigint,origin_rev_id:bigint,slot_role:string>>,editor:struct<edit_count:bigint,groups:array<string>,is_bot:boolean,is_system:boolean,is_temp:boolean,registration_dt:string,user_id:bigint,user_text:string>,is_comment_visible:boolean,is_content_visible:boolean,is_editor_visible:boolean,is_minor_edit:boolean,rev_dt:string,rev_id:bigint,rev_parent_id:bigint,rev_sha1:string,rev_size:bigint>>	Prior state of this page before this event. Fields are only present if their values have changed.
revision            	struct<comment:string,content_slots:map<string,struct<content_body:string,content_format:string,content_model:string,content_sha1:string,content_size:bigint,origin_rev_id:bigint,slot_role:string>>,editor:struct<edit_count:bigint,groups:array<string>,is_bot:boolean,is_system:boolean,is_temp:boolean,registration_dt:string,user_id:bigint,user_text:string>,is_comment_visible:boolean,is_content_visible:boolean,is_editor_visible:boolean,is_minor_edit:boolean,rev_dt:string,rev_id:bigint,rev_parent_id:bigint,rev_sha1:string,rev_size:bigint>	Fields for MediaWiki revision entity.
wiki_id             	string              	The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc.
is_wmf_domain       	boolean
normalized_host     	struct<project_class:string,project:string,qualifiers:array<string>,tld:string,project_family:string>
datacenter          	string
year                	bigint
month               	bigint
day                 	bigint
hour                	bigint

# Partition Information
# col_name            	data_type           	comment

datacenter          	string
year                	bigint
month               	bigint
day                 	bigint
hour                	bigint
Time taken: 0.313 seconds, Fetched: 28 row(s)
hive (otto)> describe mediawiki_page_change_v1;
FAILED: SemanticException [Error 10001]: Table not found mediawiki_page_change_v1
hive (otto)> describe event.mediawiki_page_change_v1;
OK
col_name	data_type	comment
_schema             	string
changelog_kind      	string
comment             	string
created_redirect_page	struct<is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string,revision_count:bigint>
dt                  	string
meta                	struct<domain:string,dt:string,id:string,request_id:string,stream:string,uri:string>
page                	struct<is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string,revision_count:bigint,redirect_page_link:struct<interwiki_prefix:string,is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string>>
page_change_kind    	string
performer           	struct<edit_count:bigint,groups:array<string>,is_bot:boolean,is_system:boolean,is_temp:boolean,registration_dt:string,user_id:bigint,user_text:string>
prior_state         	struct<page:struct<is_redirect:boolean,namespace_id:bigint,page_id:bigint,page_title:string,revision_count:bigint>,revision:struct<comment:string,content_slots:map<string,struct<content_body:string,content_format:string,content_model:string,content_sha1:string,content_size:bigint,origin_rev_id:bigint,slot_role:string>>,editor:struct<edit_count:bigint,groups:array<string>,is_bot:boolean,is_system:boolean,is_temp:boolean,registration_dt:string,user_id:bigint,user_text:string>,is_comment_visible:boolean,is_content_visible:boolean,is_editor_visible:boolean,is_minor_edit:boolean,rev_dt:string,rev_id:bigint,rev_parent_id:bigint,rev_sha1:string,rev_size:bigint>>
revision            	struct<comment:string,content_slots:map<string,struct<content_body:string,content_format:string,content_model:string,content_sha1:string,content_size:bigint,origin_rev_id:bigint,slot_role:string>>,editor:struct<edit_count:bigint,groups:array<string>,is_bot:boolean,is_system:boolean,is_temp:boolean,registration_dt:string,user_id:bigint,user_text:string>,is_comment_visible:boolean,is_content_visible:boolean,is_editor_visible:boolean,is_minor_edit:boolean,rev_dt:string,rev_id:bigint,rev_parent_id:bigint,rev_sha1:string,rev_size:bigint>
wiki_id             	string
is_wmf_domain       	boolean
normalized_host     	struct<project_class:string,project:string,qualifiers:array<string>,tld:string,project_family:string>

...

It looks like the comments were only added for top level fields, even though the alter table change column included them?

I just did a test in pure Hive, and indeed, sub struct field comments don't seem to be retained.

CREATE TABLE otto.nested1 (c1 struct<f1:int COMMENT "f1 comment"> COMMENT "c1 comment") STORED AS PARQUET;

hive (otto)> describe otto.nested1;
OK
col_name	data_type	comment
c1                  	struct<f1:int>      	c1 comment

As you can see, there is no comment for sub field c1.f1.

Datahub allows you to add descriptions at sub-field level. We should at some point get to consensus about where we want all this description stuff to live. We talked about:

table-creation code (direct hql or indirect via schemas) being the primer, but once it's in Datahub that being the authoritative source of truth. The two interplay nicely as long as changes are backwards compatible, and manual intervention is needed in backwards-incompatible cases.
driving everything from the create script/schema and syncing the stuff people put in Datahub back to the code (I have doubts about this approach)

But we should decide sometime soon

we should decide sometime soon

Aye, prob a different ticket.

syncing the stuff people put in Datahub back to the code

Many of the field descriptions come from come from many common/shared fragment schemas. If people want changes in the documentation of a field, they should definitely update it in the source schemas. New schemas will be created using the fragments, causing the 'old' description to be used for the same field in a new schema.

More generally I think things should be synced back anyway. Datahub is not the only way people get info about e.g. Hive tables. E.g. I don't think Datahub will stop people from ever doing describe table.

Change #1016808 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] Add CLI to create or update Iceberg tables

https://gerrit.wikimedia.org/r/1016808

Propagate field descriptions from event schemas to Hive event tablesOpen, MediumPublic5 Estimated Story PointsActions

Description

Details

Related ObjectsSearch...

Event Timeline

Propagate field descriptions from event schemas to Hive event tables
Open, MediumPublic5 Estimated Story Points
Actions

Related Objects
Search...