
[Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities
Closed, Resolved, Public

Description

In this patch as part of T310302, we created an abstracted interface for iterating over a JSONSchema and converting it into a different (Java based) type system. We used this to convert from JSONSchema to Flink's DataStream Row and Table API schema type systems.

This was based on our implementation of the same code in analytics/refinery/source in the Spark JsonSchemaConverter.

We should remove the Spark-specific JsonSchemaConverter, implement the data type conversions interface for Spark instead, and put that implementation into wikimedia-event-utilities. analytics/refinery/source can then use the JsonSchemaConverter code from wikimedia-event-utilities.
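For illustration only, the general shape of such an abstraction can be sketched as follows. This is a minimal Python sketch, not the Java code in wikimedia-event-utilities: the names `SchemaConversions`, `convert`, and `SparkDdlConversions` are hypothetical, and the mapping of `integer` to `bigint` and `number` to `double` is an assumption made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Generic, List, Tuple, TypeVar

T = TypeVar("T")


class SchemaConversions(ABC, Generic[T]):
    """Hypothetical hooks that each target type system implements."""

    @abstractmethod
    def primitive(self, json_type: str) -> T: ...

    @abstractmethod
    def array(self, element: T) -> T: ...

    @abstractmethod
    def row(self, fields: List[Tuple[str, T]]) -> T: ...


def convert(schema: Dict[str, Any], conversions: SchemaConversions[T]) -> T:
    """Walk a JSONSchema node and delegate type construction to `conversions`."""
    json_type = schema.get("type")
    if json_type == "object":
        fields = [
            (name, convert(sub, conversions))
            for name, sub in schema.get("properties", {}).items()
        ]
        return conversions.row(fields)
    if json_type == "array":
        return conversions.array(convert(schema["items"], conversions))
    return conversions.primitive(json_type)


class SparkDdlConversions(SchemaConversions[str]):
    """Toy target: Spark-style DDL type strings (type mapping is an assumption)."""

    _PRIMITIVES = {
        "string": "string",
        "integer": "bigint",
        "number": "double",
        "boolean": "boolean",
    }

    def primitive(self, json_type: str) -> str:
        return self._PRIMITIVES[json_type]

    def array(self, element: str) -> str:
        return f"array<{element}>"

    def row(self, fields: List[Tuple[str, str]]) -> str:
        return "struct<" + ",".join(f"{n}:{t}" for n, t in fields) + ">"


example_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "tags": {"type": "array", "items": {"type": "string"}},
        "meta": {"type": "object", "properties": {"dt": {"type": "string"}}},
    },
}
ddl = convert(example_schema, SparkDdlConversions())
```

The traversal lives in one place (`convert`), so supporting another type system only means writing another `SchemaConversions` implementation.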

Event Timeline

Along the way, we could consider implementing T278467: Use Hive/Spark timestamps in Refined event data.

Using the migrated converter would then require some manual Hive table migrations.

Patch away my friend! The code will be the easy/fun part. T278467: Use Hive/Spark timestamps in Refined event data would be really nice, but it changes the way Hive tables are auto-ingested, so it requires a more careful migration. We could do that separately.

Change 933620 had a related patch set uploaded (by Ottomata; author: Ottomata):

[wikimedia-event-utilities@master] [WIP] Add eventutilities-spark module and implement schema conversions

https://gerrit.wikimedia.org/r/933620

WIP patch for doing this ^.

This does not implement conversion to the TimestampType, although perhaps it should? I realized that we won't really be able to use this until we are 100% off of old metawiki EventLogging schemas. Old schemas are JSONSchema Draft-3, which marks a required field with a boolean `required: true` on the field itself rather than with a `required` array of field names on the parent object. refinery-spark JsonSchemaConverter supports both formats. eventutilities-core JsonSchemaConverter only supports Draft-7 onward.

Alternatively we could implement Draft-3 required field support in eventutilities-core JsonSchemaConverter and do this soon.
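To make the difference concrete, here is a minimal Python sketch of collecting required property names from an object schema in either format. The function name `required_property_names` is hypothetical and this is not the eventutilities-core implementation; it only illustrates the two `required` conventions.

```python
from typing import Any, Dict, Set


def required_property_names(object_schema: Dict[str, Any]) -> Set[str]:
    """Collect required property names, accepting both `required` conventions."""
    names: Set[str] = set()

    # Draft-4 and later: the parent object lists required property names.
    declared = object_schema.get("required")
    if isinstance(declared, list):
        names.update(declared)

    # Draft-3: each property flags itself with a boolean `required: true`.
    for name, sub in object_schema.get("properties", {}).items():
        if sub.get("required") is True:
            names.add(name)

    return names


draft3_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer", "required": True},
        "note": {"type": "string"},
    },
}

draft7_schema = {
    "type": "object",
    "required": ["id"],
    "properties": {
        "id": {"type": "integer"},
        "note": {"type": "string"},
    },
}
```

Both example schemas describe the same constraint, so the function returns the same set of names for each.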

> could implement Draft-3 required field support in eventutilities-core JsonSchemaConverter

Done in patch. This will allow us to use this class in refinery-source, and delete the one there.

> This does not implement conversion to the TimestampType, although perhaps it should?

Also done in patch, but in a parameterized way. When we are ready for T278467, we'll be able to test Refine using TimestampTypes via a parameter instead of having to upgrade eventutilities in refinery-source.
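As a rough illustration of what a parameterized conversion could look like, here is a small Python sketch. The function `string_type_for` and the flag `use_timestamp_type` are hypothetical names, and keying the behavior on `format: date-time` is an assumption; the actual parameter in the patch may look different.

```python
from typing import Any, Dict


def string_type_for(schema: Dict[str, Any], use_timestamp_type: bool = False) -> str:
    """Pick a target type name for a JSONSchema string node.

    With the flag off (the default), every string stays a string, so existing
    tables are unaffected. With the flag on, `date-time` formatted strings are
    mapped to a timestamp type instead.
    """
    if use_timestamp_type and schema.get("format") == "date-time":
        return "timestamp"
    return "string"


dt_field = {"type": "string", "format": "date-time"}
plain_field = {"type": "string"}
```

Keeping the default at the old behavior is what lets the new mapping be tried out through configuration alone.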

Change 936293 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] Use eventutilities-spark JsonSchemaSparkConverter

https://gerrit.wikimedia.org/r/936293

Ahoelzl renamed this task from Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities to [Event Platform] Move Spark JsonSchemaConverter out of analytics/refinery/source and into wikimedia-event-utilities. Oct 20 2023, 5:29 PM
Ottomata triaged this task as Medium priority. Oct 23 2023, 3:10 PM

Change 936293 merged by jenkins-bot:

[analytics/refinery/source@master] Use eventutilities-spark JsonSchemaSparkConverter in Refine and elsewhere

https://gerrit.wikimedia.org/r/936293

Change 972894 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine - update refinery jar version for analytics test cluster refine job

https://gerrit.wikimedia.org/r/972894

Change 972894 merged by Ottomata:

[operations/puppet@production] test/refine - update refinery jar version for analytics test cluster refine job

https://gerrit.wikimedia.org/r/972894

This is deployed for Refine in the analytics-test hadoop cluster. I ran a refine job there and it worked. I also ran EvolveHiveTable to manually create a test table, and that worked just fine!

I'll deploy this for production Refine on Monday.

Change 973837 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine - bump to refinery version 0.2.25 to pick up JsonSchemaConverter change

https://gerrit.wikimedia.org/r/973837

Mentioned in SAL (#wikimedia-operations) [2023-11-13T16:57:58Z] <otto@deploy2002> Started deploy [analytics/refinery@25ef91f]: deploying refinery with refinery-source 0.2.25 jars for T321854 [analytics/refinery@25ef91f2]

Mentioned in SAL (#wikimedia-operations) [2023-11-13T17:04:34Z] <otto@deploy2002> Finished deploy [analytics/refinery@25ef91f]: deploying refinery with refinery-source 0.2.25 jars for T321854 [analytics/refinery@25ef91f2] (duration: 06m 36s)

Mentioned in SAL (#wikimedia-analytics) [2023-11-13T17:07:55Z] <ottomata> deploying refinery with refinery source 0.2.25 jars and using 0.2.25 for refine job - T321854

Change 973837 merged by Ottomata:

[operations/puppet@production] refine - bump to refinery version 0.2.25 to pick up JsonSchemaConverter change

https://gerrit.wikimedia.org/r/973837

Change 973868 had a related patch set uploaded (by Ottomata; author: Ottomata):

[wikimedia-event-utilities@master] Fix JsonSchemaConverter Draft 3 required bug

https://gerrit.wikimedia.org/r/973868

Reverting deployment for production refine jobs. There was an edge case bug for old EventLogging Draft-3 JSONSchemas where an object node that marked itself as required fails conversion.
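The patch itself is not reproduced here. As a hedged Python sketch of how this kind of edge case can arise, consider code that assumes an object node's `required` keyword is always a list of names: a Draft-3 object node carrying a boolean `required: true` breaks that assumption. The function names below are hypothetical and this is not the eventutilities code.

```python
from typing import Any, Dict, Set


def required_names_naive(node: Dict[str, Any]) -> Set[str]:
    # Assumes Draft-4+ semantics: `required` is always a list of names.
    return set(node.get("required", []))


def required_names_tolerant(node: Dict[str, Any]) -> Set[str]:
    # Only treat `required` as a name list when it really is a list; a Draft-3
    # boolean on the node describes the node itself, not its children.
    declared = node.get("required")
    return set(declared) if isinstance(declared, list) else set()


# A Draft-3 style object node that marks itself as required.
self_required_object = {
    "type": "object",
    "required": True,
    "properties": {"a": {"type": "string"}},
}

try:
    required_names_naive(self_required_object)
    naive_failed = False
except TypeError:
    # A bool is not iterable, so building a set from it raises TypeError.
    naive_failed = True
```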

Fix for bug (with tests) here: https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/973868

Will need to release eventutilities and refinery-source, and deploy to apply bug fix.

Change 973868 merged by Ottomata:

[wikimedia-event-utilities@master] Fix JsonSchemaConverter Draft 3 required bug

https://gerrit.wikimedia.org/r/973868

Change 974246 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] Bump eventutilities version to 1.3.2

https://gerrit.wikimedia.org/r/974246

Change 974246 merged by jenkins-bot:

[analytics/refinery/source@master] Bump eventutilities version to 1.3.2

https://gerrit.wikimedia.org/r/974246

Change 974545 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Refinery job - bump jar versions for refine, test refine, and producecanaryevents

https://gerrit.wikimedia.org/r/974545

Change 974545 merged by Ottomata:

[operations/puppet@production] Refinery job - bump jar versions for refine and test refine

https://gerrit.wikimedia.org/r/974545

Mentioned in SAL (#wikimedia-analytics) [2023-11-15T14:51:12Z] <ottomata> deployed refine using refinery-job 0.2.26 JsonSchemaConverter from wikimedia-event-utilities - https://phabricator.wikimedia.org/T321854