@JAllemandou and I just chatted for a while, and realized that this partition / revision schema change stuff is difficult. In order to continue with our experiments in automating Hive querying of arbitrary JSON data, we'll need to be able to ALTER existing Hive tables when new fields appear in the JSON data. We need a class that works something like the following:
# Class name TBD
val schemaDiffer = new SparkHiveSchemaDiffer(originalSchema: StructType, otherSchemas: Seq[StructType])

# Get a StructType that contains the union of all fields in the schemas.
val unionSchema: StructType = schemaDiffer.getUnion()

# Return a Hive CREATE TABLE (external table if location is given?) statement that can be used
# to create a Hive table with all of the fields in all of the given schemas.
val createHiveTableDDL: String = schemaDiffer.getHiveCreateDDL(partitions?, location)

# Return a Hive ALTER TABLE statement that will add fields to the originalSchema.
val alterHiveTableDDL: String = schemaDiffer.getHiveAlterDDL()
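To make the getUnion() idea concrete, here's a rough sketch of one way the field union could work. Everything here is a placeholder (the function name, the conflict handling), and it assumes Spark's StructType / StructField from org.apache.spark.sql.types:

```scala
import org.apache.spark.sql.types.{StructField, StructType}
import scala.collection.mutable.LinkedHashMap

// Illustrative sketch only: union the fields of several schemas,
// keeping the first occurrence of each field name in encounter order.
// Real type-conflict resolution (e.g. int vs. string for the same
// field name) is TBD and would need actual design work.
def unionOf(original: StructType, others: Seq[StructType]): StructType = {
  val seen = LinkedHashMap.empty[String, StructField]
  (original +: others).flatMap(_.fields).foreach { field =>
    if (!seen.contains(field.name)) {
      seen += (field.name -> field)
    }
  }
  StructType(seen.values.toSeq)
}
```

Fields from originalSchema come first, so an ALTER statement derived from the union minus the original fields would only ever append columns, which is the cheap case in Hive.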
Joseph said he'd take a crack at making this class. (The class name, method names, and parameter names above are all TBD; don't use them just because I wrote them!)
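For concreteness, the kind of DDL strings I'd expect these methods to return might look like the following. The table name, columns, and partition layout are entirely made up for illustration:

```
-- Hypothetical output of getHiveCreateDDL(partitions, location)
CREATE EXTERNAL TABLE `my_json_table` (
  `id` bigint,
  `meta` struct<dt:string, domain:string>
)
PARTITIONED BY (`year` int, `month` int)
LOCATION 'hdfs:///path/to/data';

-- Hypothetical output of getHiveAlterDDL(), adding a field that shows up
-- in newer JSON data but is missing from the original schema
ALTER TABLE `my_json_table` ADD COLUMNS (`new_field` string);
```

ALTER TABLE ... ADD COLUMNS only appends columns at the end (before partition columns), which is why the union logic should keep the original schema's field order intact.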