
Write Spark schema differ / Hive DDL generator
Closed, Resolved · Public · 5 Estimated Story Points

Description

@JAllemandou and I just chatted for a while, and realized that this partition / revision schema change stuff is difficult. In order to continue with our experiments in automating Hive querying of arbitrary JSON data, we'll need to be able to ALTER existing Hive tables based on new fields in JSON data. We need a class that works something like the following:

// Class name TBD
val schemaDiffer = new SparkHiveSchemaDiffer(originalSchema: StructType, otherSchemas: Seq[StructType])

// Get a StructType that contains the union of all fields in the schemas.
val unionSchema: StructType = schemaDiffer.getUnion()

// Return a Hive CREATE TABLE (external table if location is given?) statement that can be used
// to create a Hive table with all of the fields in all of the given schemas.
val createHiveTableDDL: String = schemaDiffer.getHiveCreateDDL(partitions?, location)

// Return a Hive ALTER TABLE statement that will add fields to the originalSchema.
val alterHiveTableDDL: String = schemaDiffer.getHiveAlterDDL()

Joseph said he'd take a crack at making this class. (The class name, method names, and parameter names above are all TBD — don't use them just because I wrote them!)
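To make the idea concrete, here is a minimal sketch of the differ logic described above. It uses a simplified `Field(name, hiveType)` model in place of Spark's `StructType` (so it stands alone without a Spark dependency), and all class, method, and table names are illustrative, not the actual refinery-source implementation:

```scala
// Hypothetical sketch of the schema differ. Fields are modeled as simple
// (name, Hive type) pairs rather than Spark StructFields; nested types and
// type-conflict resolution are deliberately left out.
case class Field(name: String, hiveType: String)

class SchemaDiffer(original: Seq[Field], others: Seq[Seq[Field]]) {

  // Union of all fields across all schemas, in first-seen order.
  // On a name collision, the first occurrence (i.e. the original) wins.
  def getUnion: Seq[Field] =
    (original ++ others.flatten).foldLeft(Seq.empty[Field]) { (acc, f) =>
      if (acc.exists(_.name == f.name)) acc else acc :+ f
    }

  // Fields present in the union but missing from the original schema.
  def newFields: Seq[Field] =
    getUnion.filterNot(f => original.exists(_.name == f.name))

  // Hive CREATE TABLE DDL for the union schema.
  def getHiveCreateDDL(table: String): String = {
    val cols = getUnion.map(f => s"  `${f.name}` ${f.hiveType}").mkString(",\n")
    s"CREATE TABLE `$table` (\n$cols\n)"
  }

  // Hive ALTER TABLE ... ADD COLUMNS DDL for the fields missing from original.
  def getHiveAlterDDL(table: String): String = {
    val cols = newFields.map(f => s"`${f.name}` ${f.hiveType}").mkString(", ")
    s"ALTER TABLE `$table` ADD COLUMNS ($cols)"
  }
}
```

For example, diffing an original `(id, name)` schema against a newer schema that adds `ts` would yield `ALTER TABLE `events` ADD COLUMNS (`ts` timestamp)`. A real implementation would walk Spark `StructType`s recursively and map Spark types to Hive types, but the shape of the API would be the same.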

Event Timeline

Change 346291 had a related patch set uploaded (by Joal):
[analytics/refinery/source@master] [WIP] Add Spark schema handler to refinery-core

https://gerrit.wikimedia.org/r/346291

Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.

I moved this to done, since Joseph wrote a good prototype for this. This helped me with T153328. We'll make a new ticket to track implementation, which will include Joseph's prototype code.

Change 346291 merged by Ottomata:
[analytics/refinery/source@master] JsonRefine: refine arbitrary JSON datasets into Parquet backed hive tables

https://gerrit.wikimedia.org/r/346291