When we first set up the revision-score event, we could not support map types in JSON data, because it is not possible to distinguish between a 'struct' and a 'map' from JSON data alone: maps look identical to structs. To work around this, we used an array of score model objects, each of which holds an array of probabilities expressed as objects with a score name and a probability value.
We can now support map types by declaring a field to be an object with arbitrary (string) keys and an explicitly declared value type. Querying array-based data in Hive is difficult, so we'd like to use map types instead.
Currently, the scores data field is an array of objects, like:
[ { "model_name": "awesomeness", "model_version": "1.0.1", "prediction": ["yes", "mostly"], "probability": [ {"name": "yes", "value": 0.99}, {"name": "mostly", "value": 0.90}, {"name": "hardly", "value": 0.01} ] }, { "model_name": "other_model_name", ... }, ... ]
I believe (we should check with @JAllemandou and @Pchelolo) that we'd like to change the schema so that the scores field will be an object like:
{ "awesomeness": { "model_name": "awesomeness", "model_version": "1.0.1", "prediction": ["yes", "mostly"], "probability": { "yes": 0.99, "mostly": 0.90, "hardly": 0.01 } }, "other_model_name": { ... }, ... }
The change here is that instead of an array of scores, scores will be a map (object) of model name to score object. The score object will be mostly the same, except that we want to change the probability field to also be a map, from prediction name to probability (decimal) value.
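To make the difference between the two shapes concrete, here is a minimal Python sketch that converts the current array form into the proposed map form. The function name `scores_array_to_map` is hypothetical and only for illustration; the real producer code lives in change-propagation and may do this differently.

```python
def scores_array_to_map(scores):
    """Convert the current array-of-objects scores into the proposed map form."""
    scores_map = {}
    for score in scores:
        # Shallow copy so model_name, model_version and prediction are kept as-is.
        converted = dict(score)
        # Turn [{"name": ..., "value": ...}, ...] into {name: value, ...}.
        converted["probability"] = {
            p["name"]: p["value"] for p in score["probability"]
        }
        # Key the top-level map by model name.
        scores_map[score["model_name"]] = converted
    return scores_map


old_scores = [
    {
        "model_name": "awesomeness",
        "model_version": "1.0.1",
        "prediction": ["yes", "mostly"],
        "probability": [
            {"name": "yes", "value": 0.99},
            {"name": "mostly", "value": 0.90},
            {"name": "hardly", "value": 0.01},
        ],
    }
]

new_scores = scores_array_to_map(old_scores)
# A probability is now a direct key lookup instead of a scan over two arrays.
print(new_scores["awesomeness"]["probability"]["yes"])  # 0.99
```

With the map form, a consumer can address a single probability by model name and prediction name, which is the access pattern that is awkward with nested arrays.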
In JSONSchema, we define a 'map' type to be an object with additionalProperties given a particular type, like:
"map_field": { "type": "object", "additionalProperties": { "type": "string" // or whatever type your values are. } }
See the [[ https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/api/request/current.yaml#L46-L52 | api/request schema params ]] field for an example. Note that the map value "type" does not have to be a simple primitive like a "string"; it can also be an "object", as long as the object schema is explicitly declared. This will be the case for the top level scores field. It should have additionalProperties: { type: object, properties: { ... }}.
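As a rough illustration of the nesting described above, the sketch below writes the scores field schema as a Python dict that mirrors the JSONSchema structure, plus a small hand-rolled structural check. The property types (strings, an array of strings, numeric probabilities) are assumptions inferred from the example data in this task, not the agreed schema, and `looks_like_scores_map` is a hypothetical helper, not a real JSONSchema validator.

```python
import json

# Hypothetical sketch: "scores" is a map of model name -> score object,
# and "probability" inside each score object is a map of name -> number.
SCORES_FIELD_SCHEMA = {
    "type": "object",
    "additionalProperties": {
        "type": "object",
        "properties": {
            "model_name": {"type": "string"},
            "model_version": {"type": "string"},
            "prediction": {"type": "array", "items": {"type": "string"}},
            "probability": {
                "type": "object",
                "additionalProperties": {"type": "number"},
            },
        },
    },
}


def is_number_map(value):
    """True if value is a dict of string keys to (non-boolean) numbers."""
    return isinstance(value, dict) and all(
        isinstance(k, str)
        and isinstance(v, (int, float))
        and not isinstance(v, bool)
        for k, v in value.items()
    )


def looks_like_scores_map(scores):
    """Structural check for the map shape only; it does not validate other fields."""
    if not isinstance(scores, dict):
        return False
    for score in scores.values():
        if not isinstance(score, dict):
            return False
        if "probability" in score and not is_number_map(score["probability"]):
            return False
    return True


print(json.dumps(SCORES_FIELD_SCHEMA, indent=2))
```

The point of the sketch is only that both levels use additionalProperties with an explicitly declared value type: an object schema at the top level and a number for the probabilities.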
So, we need to change the JSONSchema for mediawiki/revision/score to accept the new map type data. (We'll also likely want to change the errors field to be a similar map keyed by model name.) Please read https://github.com/wikimedia/mediawiki-event-schemas/blob/master/README.md for some information about how to make schema changes.
The change-propagation service emits the revision-score event, so we'll need to modify the producer code there to send the new format. @Pchelolo to provide instructions here.
Note that this will be a backwards-incompatible change! No one is using this data at the moment (we never released it publicly in EventStreams), so making a backwards-incompatible change is OK. Once the schema change and the new data are live, we will have to do some work on the analytics cluster side, most likely moving the old incompatible data out of the way and deleting it. @Ottomata will handle that part.
See also:
T167180: Emit revision-score event to EventBus and expose in EventStreams
T197000: Modify revision-score schema so that model probabilities won't conflict