Page MenuHomePhabricator

Better Diffs: Wikidiff2 revise algorithm
Closed, ResolvedPublic

Description

Feature summary (what you would like to be able to do and where):
The Wikidiff2 PHP extension needs to be able to capture changes in a paragraph split. The algorithm does not capture a text change in some cases within a paragraph split within the diff view.

Acceptance Criteria

User Story:
As a visitor to the wikis, when I compare two versions of a page in Wikitext in-line and two-column diffs, I want to clearly identify changes to words within the paragraphs that were split up, so that I can understand what changed

  • In this instance the word 'at' was changed to 'ass' and page patrollers would not be able to easily identify this change.

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):

Benefits (why should this be implemented?):

Details for QA:

Designs(optional):

Impact:

Contact person/team(optional):

Related Conversations(feel free to copy and paste):

Timing:

Event Timeline

tstarling added subscribers: WMDE-Fisch, thiemowmde.

Current idea:

Split DiffEngine::diff() after the shift_boundaries() calls. The loop labelled "Compute the edit operations" would move to a separate class, along with detectDissimilarChanges().

Line split detection is quite similar to detectDissimilarChanges(). It needs the same kind of data and concepts. So split detection will be integrated with detectDissimilarChanges() or will at least be close to it.

The output from the detection phase will be like DiffOp, but with a finer granularity. Most change ops will be from one line to one other line, but when a split is detected, the change op will be from one line to multiple lines. Line diff formatters don't need the op aggregation provided by DiffEngine anyway, they're all just doing a loop over all lines in the DiffOp. Word diff formatters slightly benefit from it, by helping them to merge adjacent insertions or deletions into a single <ins> or <del> element. But that could be done during iteration, there's no obvious reason why the result of aggregation needs to be put in a data structure.

Algorithm for line split detection integrated with detectDissimilarChanges():

  • Compare the current LHS line with the RHS line. Compute the similarity metric charSimilarity.
  • Concatenate the next RHS line with the current line. Compute the similarity metric again.
  • Continue concatenating RHS lines until the similarity metric declines.
  • If the optimal similarity metric discovered by this process was less than the threshold, output delete + add.
  • Otherwise, output a change operation in which the LHS line is changed to one or more RHS lines.

In this proposal, lines which are both moved and split would not be detected. To do that, a second round of split detection would be needed, integrated with move detection.

To format a line split, do a word diff with special line-break words in the RHS. How will this be presented to the formatter subclasses? To be done.

Investigating the perf profile from benchmarks/bench.php shows that the most obvious performance problem with the current system is redundant calls to TextUtil::explodeWords() and DiffEngine<Word>, accounting for 30% and 37% of runtime respectively. There could be shared caches for these operations.

Input from @thiemowmde and @WMDE-Fisch would be greatly appreciated.

I think updating the JSON format is out of scope for me. My patch will have backwards-compatible JSON output, and the mobile apps team team (@Tsevener) can take it from there.

I'm thinking that there will be a new virtual method to format a line split. The JSON subclass can initially do it by simulating the old behaviour. Later, we can have a feature flag passed down from PHP.

Change 878242 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/php/wikidiff2@master] Split formatting into a separate class hierarchy

https://gerrit.wikimedia.org/r/878242

  • Continue concatenating RHS lines until the similarity metric declines.

One problem with this is that it can detect a move as a split. Split detection competes with move detection. Say if you compare revision X with revision Y. Revision X has lines X1, X2, ..., and Y has Y1, Y2, ... The similarity metric is "~".

Line X1 ~ Y1 is a poor match. X1 ~ [Y1, Y2] is slightly better. X1 ~ [Y1, Y2, Y3] is much better. Is it a split, or has X1 moved to Y3, with unhelpful text in Y1 and Y2?

It seems more likely to be a split if concatenation also improves the metric when it's done in reverse. Originally we noted that (X1 ~ Y1) < (X1 ~ [Y1, Y2]) < (X1 ~ [Y1, Y2, Y3]). The metric increases as you concatenate more lines. Do the same algorithm in reverse, prefixing the lines: (X1 ~ Y3) < (X1 ~ [Y2, Y3]) < (X1 ~ [Y1, Y2, Y3]). If this condition holds then it looks more like a split.

Change 880430 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/php/wikidiff2@master] Split out LineDiffProcessor and add WordDiffCache

https://gerrit.wikimedia.org/r/880430

Change 880431 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/php/wikidiff2@master] Line split detection

https://gerrit.wikimedia.org/r/880431

Line X1 ~ Y1 is a poor match. X1 ~ [Y1, Y2] is slightly better. X1 ~ [Y1, Y2, Y3] is much better. Is it a split, or has X1 moved to Y3, with unhelpful text in Y1 and Y2?

In PS4 this is partly addressed by having requiring that at least 10% of the LHS line has to be in the first RHS line. X1 ~ Y1 has to be at least 10%.

A new problem I've noticed now that the patch is working: if two line breaks are inserted in the middle of a line, split detection fails, because concatenating an empty line decreases the similarity metric and breaks out of the loop. We'll need a heuristic for that.

@Lena_WMDE Do the folks in the Technical Wishes who last touched this codebase have any capacity for reviewing this code? Thanks so much!

My patch will have backwards-compatible JSON output, and the mobile apps team team (@Tsevener) can take it from there.
I'm thinking that there will be a new virtual method to format a line split. The JSON subclass can initially do it by simulating the old behaviour. Later, we can have a feature flag passed down from PHP.

@tstarling Sounds good, as long as the JSON output will remain unchanged (for now) for the apps. Let us know if you need us to unblock anything for web deployment. Thanks!

CCing @JTannerWMF as a heads up here. I can spin off a ticket for improving paragraph split appearance for the apps. (T327518)

Current idea:

Split DiffEngine::diff() after the shift_boundaries() calls. The loop labelled "Compute the edit operations" would move to a separate class, along with detectDissimilarChanges().

Line split detection is quite similar to detectDissimilarChanges(). It needs the same kind of data and concepts. So split detection will be integrated with detectDissimilarChanges() or will at least be close to it.
...

I can't really comment on your suggestions objectively at the moment but I wanted to leave a note so you know that I've started looking into the repo. Sorry for the late reply but I wanted to let you know that we are not ignoring your patches.

Change 884150 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/php/wikidiff2@master] [WIP] Enable split detection for table diffs

https://gerrit.wikimedia.org/r/884150

Change 884150 abandoned by Tim Starling:

[mediawiki/php/wikidiff2@master] [WIP] Enable split detection for table diffs

Reason:

https://gerrit.wikimedia.org/r/884150

Change 878242 merged by jenkins-bot:

[mediawiki/php/wikidiff2@master] Split formatting into a separate class hierarchy

https://gerrit.wikimedia.org/r/878242

Change 880430 merged by jenkins-bot:

[mediawiki/php/wikidiff2@master] Split out LineDiffProcessor and add WordDiffCache

https://gerrit.wikimedia.org/r/880430

Change 880431 merged by jenkins-bot:

[mediawiki/php/wikidiff2@master] Line split detection: infrastructure and initial algorithm

https://gerrit.wikimedia.org/r/880431

I think its safe to say we can mark this as resolved. @tstarling any objections?