
Investigate what we can do to solve Wish #2 - Show text changes when moving text blocks
Closed, Resolved · Public · 5 Estimated Story Points


Context: Wish number 2 of the German-speaking community wishlist 2015 is to improve the diff algorithm, specifically for moved blocks of text.

Problem: Currently, some changes are masked by other changes. For example, when a chunk of text is moved, readers cannot see whether anything changed within the chunk or whether it stayed the same.
Another problem is newly added lines, which sometimes confuse the algorithm so that it cannot match the first and second parts of the chunk to their counterparts. This again makes it hard for users to find the places where a change has actually happened.
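To illustrate the masking problem, here is a minimal sketch using Python's standard `difflib` module as a stand-in for a plain line-based diff. It is not wikidiff2, and the sample paragraphs are made up for illustration. A paragraph that is both moved and edited is reported as one deletion plus one unrelated insertion, so the small edit inside it is not visible.

```python
import difflib

# Hypothetical sample revisions: the "cats" paragraph is moved below
# the "dogs" paragraph AND slightly edited ("small" is added).
old = [
    "Intro paragraph.",
    "Paragraph about cats.",
    "Paragraph about dogs.",
    "Closing paragraph.",
]
new = [
    "Intro paragraph.",
    "Paragraph about dogs.",
    "Paragraph about small cats.",
    "Closing paragraph.",
]


def line_opcodes(a, b):
    """Return the non-equal opcodes of a plain line-based diff."""
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]


for tag, i1, i2, j1, j2 in line_opcodes(old, new):
    # Each opcode covers old[i1:i2] and new[j1:j2].
    print(tag, old[i1:i2], new[j1:j2])
```

The line diff only knows "delete" and "insert" here; nothing links the deleted paragraph to the inserted one, which is the gap the wish asks to close.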

We have already estimated that fulfilling the wish completely is probably not doable. Which parts could we improve?
Check at what cost, risk, and benefit we could:

  • Fulfill the wish completely, e.g. by reusing WikEdDiff or any of the other tools @Jan_Dittrich mentioned?
  • Highlight text blocks that have moved, but do not contain any changes within the text
  • Keep the diff algorithm from being confused by newly introduced blank lines
  • Address the other challenges presented here:

For context, please also have a look at
T121469: Improve diff compare screen
T15462: Enhance line matching in diffs with the screenshot T15462#2175978

Event Timeline

There is a wiki extension by User:Cacycle that can recognize moved text. It is based on the algorithm described in Paul Heckel: "A technique for isolating differences between files", Communications of the ACM 21(4):264 (1978). (Demo)
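As a rough idea of how such move detection can work, here is a simplified sketch of the unique-line anchoring step that Heckel's paper starts from: lines occurring exactly once in both versions are paired up. Treating the anchors outside a longest increasing subsequence as "moved" is a simplification assumed here for illustration; it is neither the full algorithm from the paper nor the extension's actual code, and all function names are hypothetical.

```python
from collections import Counter


def unique_line_anchors(old, new):
    """Pair lines that occur exactly once in both versions."""
    old_counts, new_counts = Counter(old), Counter(new)
    new_index = {line: j for j, line in enumerate(new) if new_counts[line] == 1}
    return [
        (i, new_index[line])
        for i, line in enumerate(old)
        if old_counts[line] == 1 and line in new_index
    ]


def longest_increasing_positions(values):
    """Return positions of one longest strictly increasing subsequence."""
    n = len(values)
    if n == 0:
        return []
    best_len = [1] * n   # length of the best subsequence ending at k
    prev = [-1] * n      # predecessor position in that subsequence
    for k in range(n):
        for p in range(k):
            if values[p] < values[k] and best_len[p] + 1 > best_len[k]:
                best_len[k] = best_len[p] + 1
                prev[k] = p
    end = max(range(n), key=best_len.__getitem__)
    positions = []
    while end != -1:
        positions.append(end)
        end = prev[end]
    return positions[::-1]


def moved_lines(old, new):
    """Return (old_index, new_index) pairs for anchors that changed order."""
    anchors = unique_line_anchors(old, new)
    in_place = set(longest_increasing_positions([j for _, j in anchors]))
    return [pair for pos, pair in enumerate(anchors) if pos not in in_place]


example_old = ["A", "B", "C", "D", "E"]
example_new = ["A", "D", "B", "C", "E"]
print(moved_lines(example_old, example_new))  # line "D" moved from index 3 to 1
```

The quadratic subsequence search keeps the sketch short; a real engine would need something faster and would also have to handle repeated lines.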

related: T121469 »Improve diff compare screen«

related concepts:

  • WikEdDiff shows moved text using inline arrows to indicate that something moved away; on hover, the move destination is highlighted. (see WikEdDiff Demo)
  • @Jdlrobson created this design, in which arrows show the origin and destination of a move:
    Screen Shot 2016-04-03 at 4.14.53 PM.png
  • IBM's old History Flow, which uses some sort of stacked chart to show changes and their flow. It also shows moves.

@Jan_Dittrich thanks for gathering all the different sources! Could you suggest how you would want to solve the issue of moved paragraphs UI-wise?

For this particular task I think that the arrow solution of Jdlrobson is a good design (however, I would show the text only once). But since the arrows take up a lot of space and would e.g. interfere with line numbers (or whatever else one might want to display on the left side), the WikEdDiff solution might play better with other elements. How does the team evaluate the issue of clashes with other elements or functions? (@WMDE-Fisch, @WMDE-leszek)

@Lea_WMDE I assume that we would do the diff based on Wikimarkup, not on rendered HTML? (which means, it is copy-paste-able to the editor)

Also, another user wrote to me that there is an automerge script. It basically adds

  • a "take mine" button (and warns of using it carelessly) as well as
  • an "automerge" button which tries an automatic merge.
WMDE-Fisch set the point value for this task to 21. (Jul 14 2016, 2:20 PM)
WMDE-Fisch moved this task from Proposed to Backlog on the TCB-Team-Sprint-2016-07-14 board.
WMDE-leszek changed the point value for this task from 21 to 5. (Aug 2 2016, 11:11 AM)

Here's a summary of my findings while looking at diff improvement possibilities. Let me admit that while investigating this I tried NOT to focus only on the scope of German Wishlist item #2, i.e. showing changes in moved text blocks.

I attempted to merge the solution proposed in [1]. It is a significantly expanded version of the wikidiff2 C++ diff engine, written some years ago. The source code included in [1] is not entirely in sync with the current state of wikidiff2 (which does not change often, but it does change), but I managed to merge it in and compile it. [2] At this point I also have to admit that the version I managed to cook up does not really work: for some reason the diff engine recognizes all changes as additions. Despite spending some time debugging, I have not found out whether this is due to mistakes I made while merging the two pieces of code, or because WikidiffLX needs some refreshing to work. Nevertheless, I would argue that this failure of mine does not prevent me from making some observations.

WikidiffLX changes rather thoroughly how the engine works; the most significant change is the way line breaks are treated. This adds a significant amount of processing and comparisons to the engine (understandably: to consider more cases you need to check more things), which in turn affects performance. Keeping in mind that my "version" of wikidiff2 with the WikidiffLX code is broken, its performance does not seem acceptable (it is far too slow). Even assuming we could get a working version of WikidiffLX using the code its original author proposed, I believe there would be performance issues to solve. The author of WikidiffLX admits he has not extensively tested the performance:

More sophisticated examples need to be designed now in order to detect obstacles. A statement on performance can't be made at this stage.

When thinking about improving, expanding, or changing the wikidiff2 engine, it seems to me that it makes sense to take smaller steps. For instance, I'd rather take a single idea from WikidiffLX (or any other alternative diff engine) and try to apply it to wikidiff2, including extensive testing and performance optimization. Trying to merge a complex solution into wikidiff2 simply seems scary to me. It would require a significant amount of work without any guarantee that the results are worth it. My quick attempt at moving over parts of WikidiffLX might be proof of this.

I have similar concerns about possibly merging wikidiff2 with WikEdDiff, which is currently a PHP extension [3]. I haven't looked at the source code of WikEdDiff in detail for a simple reason: these two engines are separate beings, and parts of WikEdDiff can't simply be moved over to wikidiff2. What might be worth considering is reusing some good ideas of WikEdDiff in wikidiff2 (without needing to completely rewrite the diff engine). This also seems to be the way to go that came up in some IRC discussions [4] I found in T121469.

As for the EditConflictAutoMerge script [5] that has also been mentioned in this ticket, it seems to me to be beyond the scope of this ticket, although it might be something to consider when thinking about improving how edit conflict resolution works.

To finally come to some conclusions, I see the following options for improving the current diff screen and/or algorithm. They apply to showing moved blocks of text, but I believe also to other possible improvements to the diff, including but not limited to those listed in [6]

  1. Look in WikEdDiff, WikidiffLX (and any other tool that deals successfully with the problem in question) for ideas on improving wikidiff2 in some particular regard (for instance, marking moved text blocks as a change type of their own), and incrementally expand wikidiff2, keeping performance in mind.
  2. Work on or support attempts to let users select alternative diff engines. For instance, there could be a setting in user preferences (or somewhere else) for switching between the wikidiff2 and WikEdDiff engines. Currently WikEdDiff uses a hook to override the global diff engine settings, which might work for extensions but also limits the possibilities of using different engines.
  3. Similar to 2, but regarding "diff presenters", e.g. allowing users to switch between two-column and inline diff (or any other diff formatter). I am personally very excited about the ongoing attempt at this [7].
  4. (dewiki-specific) Discuss with the German Wikipedia community the possibility of enabling the WikEdDiff gadget [8] [9] on dewiki. This gadget is (to my understanding) a JavaScript implementation of WikEdDiff and it is available on English Wikipedia. The gadget inserts a tiny icon above the standard diff, which expands an inline diff on demand. Like the PHP extension [3], it highlights blocks of text that have been moved.

According to [10], the gadget is used by a few thousand users on enwiki and by some users on a few other wikis.

It seems to me the list above is sorted by descending amount of time and work required and, at the same time, by descending level of complexity. Options 2 and 3 are close to each other and might overlap; it might even be the case that they should be considered and implemented together. Option 4 should definitely be considered a temporary solution, although given the number of votes this German Wishlist item received, it might be something that the German-speaking community would consider valuable.

@jkroll: Do you have some more findings to share on this topic? As I haven't come to any specific outcome I'd be very interested if you had something to share!

[2] - this is a very ugly patch made to just try out what happens. IMO it does not make sense to make it pop up on wikidiff2 gerrit dashboard and confuse people. That's why I keep the patch a draft. Feel free to ping if you feel like having a look. But it is really nasty. You have been warned!

Hi Leszek,

looking at your patch, reading about the difficulties applying all the stuff in one go, and the performance issues, I would agree that implementing single changes from scratch, while taking hints from the existing alternatives, is the best way to go. As a next step, I would suggest I create a new VM, put MediaWiki and the necessary stuff on there, give admin access to anybody who will be potentially working on it, and create a webproxy. This way, we can either work on this collaboratively, or at least show the current status to other people without everybody having to patch and build things.

One thing I'm not clear about is the performance issue. I'm the first one to say optimization is a good thing, but only if there's something to be gained by it. In this case, I'm not sure why we need a separate C++ implementation. After all, the diff is just for single pages, correct? Apparently there's even a diff implementation in JavaScript which works, so why do we need an optimized C++ implementation in addition to the PHP one?

Lea_WMDE claimed this task.

Having discussed this in the WMDE-TechWish, these are our conclusions:

  • It would probably make most sense to go with Leszek's option 1: Work on incrementally improving the diff algorithm of wikidiff2
  • For performance reasons we will always have to write this in C++ and PHP
  • First ideas (not very definite yet) for the tasks to begin with:
    • moved text blocks, where nothing has changed in between
    • Considering line breaks instead of paragraphs as the entity that changed
    • Use something similar to Google's diff-match-patch ("MatchPatch") to find out if small changes have been made to moved text blocks
    • Be able to handle add one / remove two section situations
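A minimal sketch of how the first and third ideas could fit together, assuming Python's standard `difflib` as a stand-in for a diff-match-patch-style fuzzy matcher. This is not wikidiff2 code; the function names, the result labels, and the 0.6 similarity threshold are hypothetical choices for illustration. Each removed paragraph is paired with the most similar added paragraph, the pair is labelled as moved with or without changes, and a word-level diff then shows what changed inside the moved block.

```python
import difflib


def word_ratio(old_text, new_text):
    """Similarity of two paragraphs, compared word by word (0.0 to 1.0)."""
    matcher = difflib.SequenceMatcher(
        a=old_text.split(), b=new_text.split(), autojunk=False
    )
    return matcher.ratio()


def classify_blocks(removed, added, threshold=0.6):
    """Pair each removed paragraph with its most similar added paragraph."""
    results = []
    unused = list(added)
    for old_text in removed:
        best, best_ratio = None, 0.0
        for candidate in unused:
            ratio = word_ratio(old_text, candidate)
            if ratio > best_ratio:
                best, best_ratio = candidate, ratio
        if best is not None and best_ratio >= threshold:
            unused.remove(best)
            kind = "moved" if best == old_text else "moved+changed"
            results.append((kind, old_text, best))
        else:
            results.append(("removed", old_text, None))
    # Whatever was not claimed by a removed paragraph is a real addition.
    results.extend(("added", None, text) for text in unused)
    return results


def word_changes(old_text, new_text):
    """List word-level changes inside a moved paragraph."""
    a, b = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

The pairwise comparison grows with the product of removed and added paragraphs, so a production version would need limits or a cheaper pre-filter, in line with the performance concerns raised above.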