JsonRefine: refine arbitrary JSON datasets into Parquet-backed Hive tables

Authored by Ottomata on Apr 4 2017, 1:49 PM.

Description

JsonRefine: refine arbitrary JSON datasets into Parquet-backed Hive tables

Given a number of config parameters, this looks for JSON datasets matching partition
patterns and date-time formats, and determines which of the existing input
JSON partition directories need to be refined. Partitions that don't yet exist
in the configured output tables, and those whose input data has been
modified since the previous successful refinement, are slated
for refinement.
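
As a rough sketch of that selection logic (the done-flag convention and names
here are illustrative, not necessarily the exact implementation), a partition
is slated for refinement when no record of a previous refinement exists, or
when its input files are newer than that record:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // A partition needs refinement if it has never been refined into the output
    // table (no done-flag), or if any of its input files are newer than the
    // previous done-flag. The done-flag convention here is illustrative.
    def needsRefinement(fs: FileSystem, inputPartition: Path, doneFlag: Path): Boolean = {
      if (!fs.exists(doneFlag)) {
        true  // never refined before
      } else {
        val lastRefinedAt = fs.getFileStatus(doneFlag).getModificationTime
        fs.listStatus(inputPartition).exists(_.getModificationTime > lastRefinedAt)
      }
    }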

This uses SparkJsonToHive to merge existing Hive table schemas with
those inferred by Spark from the JSON data.
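
Conceptually the merge looks something like the following sketch (Spark 1.x
style to match the hiveContext usage mentioned later in this note; hiveContext,
inputPath and outputTable are assumed to be in scope, and the field-by-field
merge shown here only handles new top-level fields, which is a simplification
of what SparkJsonToHive actually does):

    import org.apache.spark.sql.types.StructType

    // Illustrative only: keep the existing Hive schema and append any new
    // top-level fields that Spark inferred from the JSON. A real merge must also
    // recurse into nested structs and check for incompatible types.
    def mergeSchemas(hiveSchema: StructType, jsonSchema: StructType): StructType = {
      val existing  = hiveSchema.fieldNames.toSet
      val newFields = jsonSchema.fields.filterNot(f => existing.contains(f.name))
      StructType(hiveSchema.fields ++ newFields)
    }

    val jsonDf     = hiveContext.read.json(inputPath)         // schema inferred from the JSON data
    val hiveSchema = hiveContext.table(outputTable).schema    // existing Hive table schema
    val merged     = mergeSchemas(hiveSchema, jsonDf.schema)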

If schemas cannot be merged, that refinement will fail, but this will not
cause the entire JsonRefine job to fail. Reports about what has
succeeded and what has failed are output and optionally emailed.
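
For illustration, the per-partition failure isolation amounts to something
like this (refinePartition, the example partition and the report format are
hypothetical stand-ins):

    import scala.util.{Failure, Try}

    // Hypothetical stand-in for refining a single input partition.
    def refinePartition(partition: String): Unit = { /* ... */ }

    // e.g. the partitions selected by the logic described above
    val partitionsToRefine = Seq("Edit/year=2017/month=05/day=04/hour=12")

    // Refine each partition independently; a failure is captured, not rethrown.
    val results = partitionsToRefine.map { p => p -> Try(refinePartition(p)) }
    val (succeeded, failed) = results.partition { case (_, result) => result.isSuccess }

    // Summary of successes and failures, to be logged and optionally emailed.
    val report =
      s"Refined ${succeeded.size} partition(s); ${failed.size} failed:\n" +
      failed.collect { case (p, Failure(e)) => s"  $p -> ${e.getMessage}" }.mkString("\n")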

Note:

Edit schema is a bust.  The CREATE TABLE statement generated by this logic works fine directly in Hive, but
the refinement fails when done via Spark:

Failure(java.lang.Exception: Failed refinement of EventLogging Edit (year=2017,month=05,day=04,hour=12) -> otto.Edit (/user/otto/external/eventlogging2/Edit). Original exception: org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: struct<action:string,action.abort.mechanism:string,action.abort.timing:bigint,action.abort.type:string,action.init.mechanism:string,action.init.timing:bigint,action.init.type:string,action.ready.timing:bigint,action.saveAttempt.timing:bigint,action.saveFailure.message:string,action.saveFailure.timing:bigint,action.saveFailure.type:string,action.saveIntent.timing:bigint,action.saveSuccess.timing:bigint,editingSessionId:string,editor:string,integration:string,mediawiki.version:string,page.id:bigint,page.ns:bigint,page.revid:bigint,page.title:string,platform:string,user.class:string,user.editCount:bigint,user.id:bigint,version:bigint>. If you have a struct and a field name of it has any special characters, please use backticks (`) to quote that field name, e.g. `x+y`. Please note that backtick itself is not supported in a field name.)

This error comes from Spark reading the schema directly out of Hive when calling hiveContext.table(tableName),
where the table has struct fields with dots in their names.  Hive itself doesn't seem to mind.
I don't think we can fix this, and as such, the EventLogging Analytics Edit table will not be refineable by Spark.
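
For reference, the failure is triggered just by reading the table's schema
through Spark; the table name here matches the example from the error above:

    // Querying this table directly in Hive works, but reading it through Spark
    // throws org.apache.spark.sql.catalyst.util.DataTypeException, because the
    // struct field names (e.g. action.abort.mechanism) contain dots.
    val editDf = hiveContext.table("otto.Edit")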

mediawiki_page_properties_change also does not seem to work, due to its variable object property types.

Otherwise, this is working great both for EventLogging Analytics tables and for EventBus-style data.

Bug: T161924
Change-Id: Ieb2c3a99501623d71fa58ac7dfb6734cb809096f
