Page MenuHomePhabricator

mediawiki/recentchange event should not use fields with polymorphic types
Closed, DeclinedPublic

Description

In the new generation of MediaWiki event schemas we want to restric the schemas to have monomorphic types everywhere, thus require that a certain property could only have a single possible type. The only place where it's not true right now is in the recentchange event - we allow the properties to be null. This is not consistent with the rest of the schemas, where null properties are just absent from the event. We need to make recentchange events conform to this bt clearing out the null properties.

The emitted events have already been modified to not include fields if their values are null, so we should be able to just remove the [..., null] polymorphic types in a new major schema version.

recentchange 1.0.0 also has the log_params field which has type: [array, object, string]. We should convert this to a map type like:

log_params:
  type: object
  additionalProperties:
    type: string

To do this, we need to modify EventBusRCFeedFormatter in the EventBus extension to ensure that log_params is always an object with string keys and string values. This can get a little tricky in PHP since there isn't a different between integer indexed arrays and string indexed 'objects'.

EventBusRCFeedFormatter should do the following if $attrs['log_params'] is set:
If $attrs['log_params'] is defined and...

  • is_string: serialize as single element object with key of string "0": {"0": "value"}
  • is integer indexed PHP array: serialize as string-integer indexed object: {"0": "value1", "1": "value2", ...}
  • is associative indexed PHP array: serialize as string indexed object: {"key1": "value1", "key2", "value2", ...}

All values need to be properly converted into strings. I think strval will mostly work, but in order to convert booleans properly (e.g. true => "true", not "1"), we could perhaps just use FormatJson::encode($value, false, FormatJson::ALL_OK)?

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Milimetric lowered the priority of this task from High to Medium.
Milimetric moved this task from Incoming to Event Platform on the Analytics board.

Change 492047 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Remove properties with null values from the recentchange event.

https://gerrit.wikimedia.org/r/492047

Change 492047 merged by Ppchelko:
[mediawiki/extensions/EventBus@master] Remove properties with null values from the recentchange event.

https://gerrit.wikimedia.org/r/492047

Change 495760 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Correctly delete nulls from recentchange event and add test.

https://gerrit.wikimedia.org/r/495760

Change 495760 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Correctly delete nulls from recentchange event and add test.

https://gerrit.wikimedia.org/r/495760

After the patch was deployed we do not have nulls in recent change schema anymore, however we still can not declare victory and get rid of all of the polymorphic types in the schema. The log_params can be either an object or an array and, judging by the code, it can actually be a non-empty array in rare cases. Not sure what to do about that.

Since we'll be able to support map types with new schemas (once we get that stuff running in Hadoop), we might be able to just always convert log_params to a map in eventbus extension. Can we just always somehow make this a simple string: string map?

Ottomata renamed this task from EventBusRCFeedFormatter should clean up events from nulls to mediawiki/recentchange event should not use fields with polymorphic types.Sep 17 2019, 8:49 PM
Ottomata removed Pchelolo as the assignee of this task.
Ottomata updated the task description. (Show Details)
Ottomata added a subscriber: Clarakosi.

I modified the task description here to convert log_params into a map type. Alternatively we could just always serialize log_params as a JSON string...but I think that would be much less useful. Usually params are key->val pairs.

Ottomata updated the task description. (Show Details)

Change 537549 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Use schema repository tests from jsonschema-tools

https://gerrit.wikimedia.org/r/537549

Change 537549 merged by Ppchelko:
[mediawiki/event-schemas@master] Use schema repository tests from jsonschema-tools

https://gerrit.wikimedia.org/r/537549

Hm, I'm not so sure if we can do this. recentchange is pretty heavily used externally, and changing log_params type will likely break consumer code.

I guess we can just decline this and keep ignoring the recentchange schema in our tests, and not refining it into Hive, etc.