Page MenuHomePhabricator

Fix active EventLogging schemas that added backwards incompatable required fields.
Closed, DeclinedPublic

Description

We're trying to use the EventLogging schemas to create Hive tables. In order to do this, the latest schema revisions must be backwards compatible with previous ones. Backwards compatibility basically means that you are only ever allowed to add optional fields. I just updated https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines to make this restriction more visible.

I've ran into several cases where people have added required fields to schemas. This is preventing the Hive importer from using the latest schema to read old data.

If there are any producers of event data that still use the old schemas, we won't be able to import this data into Hive. To do so, we need to make sure that the latest schema for each of these is compatible with all schema revisions that are currently being used to produce events. This means that we'll need to re-add removed fields, and change any added required fields to non-required.

(This list might not be comprehensive. I just looked at the recent history of schemas that I encountered errors with and found these changes.)

@Gilles:

@mpopov:
Renaming fields is not allowed. You are removing required fields, and then adding new required fields. This is not backwards compatible!

@chelsyx :

@Dbrant
added required fields, ts, appInstallId, synced etc.:

@EBernhardson

Event Timeline

I might suggest adding a validation step to saving schema's on mw.org that asserts these limitations? It would likely need some sort of "I know what I'm doing" override button when adjusting perhaps a new schema that has never been produced to during initial development.

On mw.org will be hard, but for Modern Event Platfrom schemaa repositories, for sure!
https://phabricator.wikimedia.org/T206889

Change 508655 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Refine - Make all fields not required when reading data using JSONSchema

https://gerrit.wikimedia.org/r/508655

So, I was able to find a way to make the Refine work without requiring that we alter all of these schemas. For the offenses that are just adding new required fields, let's start from this point forward and not do it anymore! :)

https://meta.wikimedia.org/w/index.php?title=Schema%3AMobileWikiAppToCInteraction&type=revision&diff=19044853&oldid=18389174 looks pretty bad though. I think 3AMobileWikiAppToCInteraction will be fine as long as there are no events being currently produced that have e.g. num_peeks as string instead of an integer. ping @Dbrant.

The most egregious offense! Type changes are definitely not allowed!

Mea culpa! But I'm not sure I see where the type change occurred. All I'm seeing is that the action parameter went away, and was replaced by num_peeks.

as long as there are no events being currently produced that have e.g. num_peeks as string instead of an integer.

This should be correct. AFAIK the fields that are currently in that schema receive data that always corresponds to their type.

Ah sorry! You are right, I read the diff hastily (sorry was reading lots of diffs yesterday). This is just another slightly less egregious offense then, and I think my workaround can handle it.

Thanks!

Change 508655 merged by Ottomata:
[analytics/refinery/source@master] Refine - Make all fields not required when reading data using JSONSchema

https://gerrit.wikimedia.org/r/508655

fdans subscribed.

Given the amount of work required to fix this and with eventgate this won't be an issue, let's not fix it.