When we first set up the revision-score event, we could not support map types in JSON data. This is because it is not possible to distinguish between a 'struct' and a 'map' from JSON data alone: both are plain JSON objects, so maps look identical to structs. To work around this, we used [[ https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/score/current.yaml#L11-L27 | arrays of score model objects ]] that themselves had [[ https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/score/current.yaml#L39-L50 | arrays of score name and probability value pairs ]].
We can now support map types by declaring a field to be an object with arbitrary (string) keys and an explicitly declared value type. Querying array-based data in Hive is difficult, so we'd like to use map types instead.
Currently, the `scores` data field is an array of objects, like:
```lang=json
[
  {
    "model_name": "awesomeness",
    "model_version": "1.0.1",
    "prediction": ["yes", "mostly"],
    "probability": [
      {"name": "yes", "value": 0.99},
      {"name": "mostly", "value": 0.90},
      {"name": "hardly", "value": 0.01}
    ]
  },
  {
    "model_name": "other_model_name",
    ...
  },
  ...
]
```
I believe (we should check with @JAllemandou and @Pchelolo) that we'd like to change the schema so that the scores field will be an object like:
```lang=json
{
  "awesomeness": {
    "model_name": "awesomeness",
    "model_version": "1.0.1",
    "prediction": ["yes", "mostly"],
    "probability": {
      "yes": 0.99,
      "mostly": 0.90,
      "hardly": 0.01
    }
  },
  "other_model_name": { ... },
  ...
}
```
The change here is that instead of an array of scores, `scores` will be a map (object) of model name to score object. The score object will be mostly the same, except we want to change the `probability` field to also be a map, of prediction name to probability (decimal) value.
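To make the mapping between the two formats concrete, here is a minimal Python sketch of the conversion described above. It is an illustration only and not the producer code; the function name `scores_array_to_map` is made up for this example, and it assumes every score object has `model_name` and `probability` fields as in the sample data.

```lang=python
def scores_array_to_map(scores):
    """Convert the current array-based `scores` value to the proposed map form."""
    scores_map = {}
    for score in scores:
        # Shallow copy so the input is not modified; model_name,
        # model_version and prediction are carried over unchanged.
        converted = dict(score)
        # Array of {"name": ..., "value": ...} objects becomes a
        # map of prediction name to probability value.
        converted["probability"] = {
            p["name"]: p["value"] for p in score["probability"]
        }
        # The model name becomes the key of the top-level map.
        scores_map[score["model_name"]] = converted
    return scores_map


old_scores = [
    {
        "model_name": "awesomeness",
        "model_version": "1.0.1",
        "prediction": ["yes", "mostly"],
        "probability": [
            {"name": "yes", "value": 0.99},
            {"name": "mostly", "value": 0.90},
            {"name": "hardly", "value": 0.01},
        ],
    }
]
new_scores = scores_array_to_map(old_scores)
```

With the map form, a consumer can look up a single value directly, e.g. `new_scores["awesomeness"]["probability"]["yes"]`, instead of scanning two nested arrays.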
In JSONSchema, we define a 'map' type to be an object with `additionalProperties` given a particular type, like:
```lang=json
"map_field": {
"type": "object",
"additionalProperties": {
"type": "string" // or whatever type your values are.
}
}
```
See the [[ https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/api/request/current.yaml#L46-L52 | api/request schema `params` ]] field for an example. Note that the map value "type" does not have to be a simple primitive like a "string"; it can also be an "object", as long as the object schema is explicitly declared. This will be the case for the top level `scores` field. It should have `additionalProperties: { type: object, properties: { ... }}`.
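As a rough sketch of what the nested map declaration for `scores` could look like, here is the schema fragment written as a Python dict, together with a deliberately tiny checker. The value types of the inner fields (`string`, array of `string`, `number`) are assumptions based on the sample data and should be checked against the current schema. The `conforms` helper is hypothetical and only understands the few keywords used here; real validation should use a full JSONSchema validator.

```lang=python
# Assumed shape of the new `scores` field: a map of model name to score
# object, where `probability` is itself a map of prediction name to number.
SCORES_SCHEMA = {
    "type": "object",
    "additionalProperties": {
        "type": "object",
        "properties": {
            "model_name": {"type": "string"},
            "model_version": {"type": "string"},
            "prediction": {"type": "array", "items": {"type": "string"}},
            "probability": {
                "type": "object",
                "additionalProperties": {"type": "number"},
            },
        },
    },
}

PY_TYPES = {"object": dict, "array": list, "string": str, "number": (int, float)}


def conforms(value, schema):
    """Check `value` against the small schema subset used above."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, PY_TYPES[schema["type"]]):
        return False
    if schema["type"] == "array" and "items" in schema:
        return all(conforms(item, schema["items"]) for item in value)
    if schema["type"] == "object":
        properties = schema.get("properties", {})
        additional = schema.get("additionalProperties")
        for key, item in value.items():
            if key in properties:
                if not conforms(item, properties[key]):
                    return False
            elif isinstance(additional, dict) and not conforms(item, additional):
                # Keys not declared in `properties` are the map entries.
                return False
    return True


example_scores = {
    "awesomeness": {
        "model_name": "awesomeness",
        "model_version": "1.0.1",
        "prediction": ["yes", "mostly"],
        "probability": {"yes": 0.99, "mostly": 0.90, "hardly": 0.01},
    }
}
is_valid = conforms(example_scores, SCORES_SCHEMA)
```

The point of the sketch is the two levels of `additionalProperties`: the outer one gives every model name key an explicitly declared object schema, and the inner one gives every prediction name key a numeric value.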
So, we need to change the JSONSchema for [[ https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/score/current.yaml | mediawiki/revision/score ]] to accept the new map type data. Please read https://github.com/wikimedia/mediawiki-event-schemas/blob/master/README.md for some information about how to make schema changes.
[[ https://github.com/wikimedia/change-propagation | change-propagation ]] emits the revision-score event, so we'll need to modify the producer code there to send the new format. @Pchelolo to provide instructions here.
Note that this will be a backwards-incompatible change! No one is using this data at the moment (we never released it publicly in [[ https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams | EventStreams ]]), so making a backwards-incompatible change is ok. Once the schema change is live and data in the new format is being produced, we will have to do some work on the analytics cluster side, most likely moving the old incompatible data out of the way and deleting it. @ottomata will handle that part.