Page MenuHomePhabricator

[Task] Use compact representation of diffs in EntityChange.
Closed, ResolvedPublic

Description

EntityChange (resp DiffChange) currently holds a full Diff representing the change (except for modifications to statements, which get suppressed for performance reasons, see commit 3160ff902fa). Such diffs can be large, and are rather slow to serialize and unserialize. We should use a compact representation instead, containing just the information we actually need on the client in order to process the change. A nested JSON-esque array structure could be used, providing the following infomration:

  • Sitelinks: siteid -> oldvalue + newvalue; (also: +/- badge?)
  • Labels (+descriptions): list of affected language codes
  • Statements: list of affected properties

This should be a lot more compact, faster to unserialize, use less memory, and be easier to handle on the client.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
daniel raised the priority of this task from to High.Sep 23 2015, 12:01 PM
daniel updated the task description. (Show Details)
daniel added a subscriber: daniel.
daniel renamed this task from Use compact representation of diffs in EntityChange. to [Task] Use compact representation of diffs in EntityChange..Sep 23 2015, 12:05 PM
daniel set Security to None.

it's not just the diffs that was an issue but also generating the diff by serializing two versions of the entity before putting in wb_changes.

just an idea... if all we need to know are what 'aspects' a change entails, maybe these can be 'registered' by the changeops (e.g. it's a StatementChangeOp, then statements changed + possibly there were additional change ops) and then recorded in wb_changes when a save happens. maybe we can get the additional info then also that Daniel suggests.

@aude yes, something like that could work. Only for sitelinks we really need to know the actual change, so we can track the (un-)linking of pages on each client.

Change 270580 had a related patch set uploaded (by Hoo man):
Don't include Fingerprint diffs in wb_changes

https://gerrit.wikimedia.org/r/270580

No patch for this, yet… I've misattributed a change.

I suggest replacing the Diff in DiffChange with a new EntityDiffChangedAspects object. Initially both should be present, but only one should be in the serialized form (EntityChange::getSerializedInfo).

I would introduce a new boolean feature flag which changes which of the two are included in the serialization. Once all Client code works with the EntityDiffChangedAspects object, we can change the flag and remove it + the legacy client code N days after.

The new class would probably look something like this:

class EntityDiffChangedAspects {
	private int $formatVersion = 123;
	private array[] $siteLinkChanges = [ 'dewiki' => [ /* means sitelink changed or removed or added */ ], 'enwiki' => [ 'badges' ] ];
	private string[] $labelChanges = [ 'de', 'en' ];
	private string[] $descriptionChanges = [ 'de', 'es' ];
	private string[] $changedStatements = [ 'P31', 'P1337' ];
}

The proposed structure looks good to me. Documentation should make clear that the fields in the structure correspond to usage aspects, and are thus independent of Entity structure.

Note that we have to make sure that we have enough information on the client side to not only determine affected pages, but to also generate the appropriate edit summary (yes, in some cases, we look at the diff to generate a summary on the client side).

Also note that the client will currently re-calculate the diff when coalescing consecutive changes. With the new structure, it should be a lot easier to simply merge the diffs. That should be a significant performance gain for the case of coalesced changes. Not sure how much impact that has overall.

It's not clear to me why both representations of the diff should be present in the object if we only include one in the serialization. How would that even work with deserialization? The deserialized version would then have only one diff representation anyway. It seems to me we want to only include the kind of diff that we later want to actually use. Am I missing something?

The proposed structure looks good to me. Documentation should make clear that the fields in the structure correspond to usage aspects, and are thus independent of Entity structure.

They aren't equivalent to client usage aspects (they can contain more information, also EntityUsage::TITLE_USAGE is not wiki agnostic), but it should be easy to lay out the relationship.

Note that we have to make sure that we have enough information on the client side to not only determine affected pages, but to also generate the appropriate edit summary (yes, in some cases, we look at the diff to generate a summary on the client side).

This currently only uses ItemChange::getSiteLinkDiff (passed to SiteLinkCommentCreator::getEditComment). I'll probably see what is needed to create this diff from the new structure instead of trying to invent something new.
Depending on their size it might just be ok to use the actual sitelink Diff objects indexed by site id instead of coming up with our own structure, I'll check that.

Also note that the client will currently re-calculate the diff when coalescing consecutive changes. With the new structure, it should be a lot easier to simply merge the diffs. That should be a significant performance gain for the case of coalesced changes. Not sure how much impact that has overall.

This currently loads the full revisions on the client… I'll see what data is needed to do this w/o loading, but this is probably for another time.

It's not clear to me why both representations of the diff should be present in the object if we only include one in the serialization. How would that even work with deserialization? The deserialized version would then have only one diff representation anyway. It seems to me we want to only include the kind of diff that we later want to actually use. Am I missing something?

This was meant for the migration period only, after that is done, only the aspect/ flat diff should be included.

Change 384298 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Add EntityDiffChangedAspects Factory

https://gerrit.wikimedia.org/r/384298

Lucas_Werkmeister_WMDE lowered the priority of this task from High to Medium.Nov 7 2017, 1:30 PM

Change 384298 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add EntityDiffChangedAspects Factory

https://gerrit.wikimedia.org/r/384298

Change 391900 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@master] Change handling of sitelinks in EntityDiffChangedAspects

https://gerrit.wikimedia.org/r/391900

Change 391900 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Change handling of sitelinks in EntityDiffChangedAspects

https://gerrit.wikimedia.org/r/391900

Change 392053 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@master] [very WIP][DNM][I don't know what I'm doing] Use EntityDiffChangedAspects

https://gerrit.wikimedia.org/r/392053

Change 392830 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@master] Make AffectedPagesFinder use EntityDiffChangedAspects

https://gerrit.wikimedia.org/r/392830

Change 392830 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Make AffectedPagesFinder use EntityDiffChangedAspects

https://gerrit.wikimedia.org/r/392830

Change 393286 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@master] Transmit compact diff instead of suppressed diff

https://gerrit.wikimedia.org/r/393286

Change 393780 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@master] All the backward compatibility needed for compact diff representation

https://gerrit.wikimedia.org/r/393780

Change 393791 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@master] Transmit compact diff instead of suppressed diff

https://gerrit.wikimedia.org/r/393791

Change 393286 abandoned by Ladsgroup:
Transmit compact diff instead of suppressed diff

Reason:
Split now

https://gerrit.wikimedia.org/r/393286

Change 394044 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Avoid using EntityDiffChangedAspectsFactory like a static constructor

https://gerrit.wikimedia.org/r/394044

Opportunities that open up with compact diffs (as requested by @Lydia_Pintscher during PM/engineering time on 2017-11-28):

  • T106287: Client notifications related to description usages only work if we stop suppressing descriptions, which is what https://gerrit.wikimedia.org/r/390301 (T176417) already did the "old" way. Memory and database footprints grew because of this. Compact diffs will reduce them again.
  • T151717: Tracking statement usages per property only works if we stop suppressing statements, which we are not going to do without having compact diffs first.
  • T179923: Even if compact diffs don't technically block fine-grained tracking in Lua, it can only unleash it's full potential if statement tracking is not suppressed (which it currently is, see https://gerrit.wikimedia.org/r/355101).
  • The patch https://gerrit.wikimedia.org/r/392041 is blocked because it doesn't support statement tracking being suppressed. It misses the necessary fallback to an "other" or "all" usage. Solving this is entirely independent from having compact diffs. It probably should be solved no matter what, so we can suppress statement tracking any time.
  • Compact diffs are build in a way that allows much richer summary lines, if needed.

@hoo, @Ladsgroup: Did I forgot something?

Change 393780 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] All the backward compatibility needed for compact diff representation

https://gerrit.wikimedia.org/r/393780

Change 393791 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Transmit compact diff instead of suppressed diff

https://gerrit.wikimedia.org/r/393791

Change 394999 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@wmf/1.31.0-wmf.10] All the backward compatibility needed for compact diff representation

https://gerrit.wikimedia.org/r/394999

Change 394999 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.31.0-wmf.10] All the backward compatibility needed for compact diff representation

https://gerrit.wikimedia.org/r/394999

Mentioned in SAL (#wikimedia-operations) [2017-12-04T14:50:32Z] <Amir1> deployed backward compatibility of entity compact diff transmit T113468

Change 392053 abandoned by Ladsgroup:
[very WIP][DNM][I don't know what I'm doing] Use EntityDiffChangedAspects

Reason:
Not needed anymore

https://gerrit.wikimedia.org/r/392053

MariaDB [testwikidatawiki_p]> select * from wb_changes order by change_id desc limit 1;
+-----------+----------------------+----------------+------------------+--------------------+----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| change_id | change_type          | change_time    | change_object_id | change_revision_id | change_user_id | change_info                                                                                                                                                                                                                                                                                                                                |
+-----------+----------------------+----------------+------------------+--------------------+----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|    279782 | wikibase-item~update | 20171205231810 | Q107488          |             272359 |            774 | {"compactDiff":"{\"arrayFormatVersion\":1,\"labelChanges\":[\"fa\"],\"descriptionChanges\":[],\"statementChanges\":[],\"siteLinkChanges\":[],\"otherChanges\":false}","metadata":{"page_id":160772,"parent_id":272358,"comment":"\/* wbsetlabel-add:1|fa *\/ FFF","rev_id":272359,"user_text":"Ladsgroup","central_user_id":3349,"bot":0}} |
+-----------+----------------------+----------------+------------------+--------------------+----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

\o/

Change 394044 merged by Ladsgroup:
[mediawiki/extensions/Wikibase@master] Avoid using EntityDiffChangedAspectsFactory like a static constructor

https://gerrit.wikimedia.org/r/394044

Change 398815 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: Thiemo Kreuz (WMDE)):
[mediawiki/extensions/Wikibase@master] Avoid using EntityDiffChangedAspectsFactory like a static constructor

https://gerrit.wikimedia.org/r/398815

Change 398815 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Avoid using EntityDiffChangedAspectsFactory like a static constructor

https://gerrit.wikimedia.org/r/398815