Page MenuHomePhabricator

mediawiki/page/properties-change schema should use map type for added and removed page properties
Open, MediumPublic

Description

Currently, the mediawiki/page/properties-change schema uses an object with undefined properties for added_properties and removed_properties. This makes it difficult to use in typed systems, and keeps us from importing the data into the Hive based data lake.

I believe this was done before we added support for map types. We should modify those fields to be string -> string map types, and make sure that EventBus only emits string values for the changed properties.

Even now, the emitted types are inconsistent for the same page property values. Here are some recent values from an event:

"added_properties": {
  "wb-claims": 41,
  "wb-identifiers": 18
},
"removed_properties": {
  "wb-claims": "40",
  "wb-identifiers": "17"
}

The added values are ints, but the removed values are strings.

Event Timeline

Change 683620 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Add comment in allowlist about mediawiki_page_properties_change

https://gerrit.wikimedia.org/r/683620

Change 683620 merged by Ottomata:

[analytics/refinery@master] Add comment in allowlist about mediawiki_page_properties_change

https://gerrit.wikimedia.org/r/683620

Change 683622 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Add comment in refine about mediawiki_page_properties_change

https://gerrit.wikimedia.org/r/683622

Change 683622 merged by Ottomata:

[operations/puppet@production] Add comment in refine about mediawiki_page_properties_change

https://gerrit.wikimedia.org/r/683622

Ottomata triaged this task as Medium priority.Jun 11 2021, 6:18 PM

Adding perspective on this: Using map instead of structured and defined schema allows for more flexibility and versatility, but prevents deeper validation. The decision of using one versus the other should not be made lightly :)

Right now, this schema doesn't have any defined properties, and is not a map type either. So, JSONSChema object with additionalProperties: true. So, we don't import it into Hive at all.

We need a map type here because we have no control over what the keys are.