
Write Spark schema differ / Hive DDL generator
Closed, Resolved · Public · 5 Estimated Story Points

Description

@JAllemandou and I just chatted for a while, and realized that this partition / revision schema change stuff is difficult. In order to continue with our experiments in automating Hive querying of arbitrary JSON data, we'll need to be able to ALTER existing Hive tables based on new fields in JSON data. We need a class that works something like the following:

// Class name TBD
val schemaDiffer = new SparkHiveSchemaDiffer(originalSchema: StructType, otherSchemas: Seq[StructType])

// Get a StructType that contains the union of all fields in the schemas.
val unionSchema: StructType = schemaDiffer.getUnion()

// Return a Hive CREATE TABLE (external table if location is given?) statement that can be used
// to create a Hive table with all of the fields in all of the given schemas.
val createHiveTableDDL: String = schemaDiffer.getHiveCreateDDL(partitions?, location)

// Return a Hive ALTER TABLE statement that will add fields to the originalSchema.
val alterHiveTableDDL: String = schemaDiffer.getHiveAlterDDL()

Joseph said he'd take a crack at making this class. (The class name, method names, and parameter names above are all TBD — don't use them just because I wrote them!)
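To make the idea concrete, here is a minimal sketch of the differ logic described above. It uses a simplified `Field(name, hiveType)` model in place of Spark's `StructType` (so it stands alone without a Spark dependency), and all class, method, and table names are illustrative, not the actual refinery-source implementation:

```scala
// Hypothetical sketch of the schema differ. Fields are modeled as simple
// (name, Hive type) pairs rather than Spark StructFields; nested types and
// type-conflict resolution are deliberately left out.
case class Field(name: String, hiveType: String)

class SchemaDiffer(original: Seq[Field], others: Seq[Seq[Field]]) {

  // Union of all fields across all schemas, in first-seen order.
  // On a name collision, the first occurrence (i.e. the original) wins.
  def getUnion: Seq[Field] =
    (original ++ others.flatten).foldLeft(Seq.empty[Field]) { (acc, f) =>
      if (acc.exists(_.name == f.name)) acc else acc :+ f
    }

  // Fields present in the union but missing from the original schema.
  def newFields: Seq[Field] =
    getUnion.filterNot(f => original.exists(_.name == f.name))

  // Hive CREATE TABLE DDL for the union schema.
  def getHiveCreateDDL(table: String): String = {
    val cols = getUnion.map(f => s"  `${f.name}` ${f.hiveType}").mkString(",\n")
    s"CREATE TABLE `$table` (\n$cols\n)"
  }

  // Hive ALTER TABLE ... ADD COLUMNS DDL for the fields missing from original.
  def getHiveAlterDDL(table: String): String = {
    val cols = newFields.map(f => s"`${f.name}` ${f.hiveType}").mkString(", ")
    s"ALTER TABLE `$table` ADD COLUMNS ($cols)"
  }
}
```

For example, diffing an original `(id, name)` schema against a newer schema that adds `ts` would yield `ALTER TABLE `events` ADD COLUMNS (`ts` timestamp)`. A real implementation would walk Spark `StructType`s recursively and map Spark types to Hive types, but the shape of the API would be the same.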

Event Timeline

Change 346291 had a related patch set uploaded (by Joal):
[analytics/refinery/source@master] [WIP] Add Spark schema handler to refinery-core

https://gerrit.wikimedia.org/r/346291

Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.

I moved this to done, since Joseph wrote a good prototype for this. This helped me with T153328. We'll make a new ticket to track implementation, which will include Joseph's prototype code.

Change 346291 merged by Ottomata:
[analytics/refinery/source@master] JsonRefine: refine arbitrary JSON datasets into Parquet backed hive tables

https://gerrit.wikimedia.org/r/346291