
Diffs of small changes can be misleading
Closed, DeclinedPublic


Author: bil0725

When viewing the difference between two versions of an article, small changes are often displayed as large ones.

For example, when a newline is added inside a paragraph, the diff viewer gets confused and the edit looks like a big change.

Another example: a small change is made in a paragraph and a new paragraph, such as a headline, is added above it. The slightly changed paragraph is then not recognised, and the edit looks like a big change. One example

Bug 5072 (3 years old) is related.

Version: unspecified
Severity: normal



Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 10:42 PM
bzimport set Reference to bz19092.
bzimport added a subscriber: Unknown Object (MLST).

tweaked the summary to be more descriptive

The diff is not that big. A single paragraph, formatted as a single line, is split into two separate lines, and it is normal that these two lines are flagged in the diff. This is the normal behavior of unified diffs, which compare full lines. The actual result in fact displays more granular differences with coloring; isn't that enough?
You seem to want MediaWiki to automatically split lines into several parts to show the differences and similarities between fragments of lines. I don't think that would be very useful (and in many cases it would just add many more differences, on very small fragments).
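
To make the line-based behavior described above concrete, here is a minimal sketch using Python's standard `difflib` module. It is not MediaWiki's diff code, only a stand-in for any comparison that works on whole lines; the helper names `changed_lines` and `changed_words` and the sample text are made up for illustration.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return (removed, added) lines as reported by a line-based diff."""
    old_lines = old_text.split("\n")
    new_lines = new_text.split("\n")
    removed, added = [], []
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
        # Lines starting with "  " (unchanged) or "? " (hints) are ignored.
    return removed, added


def changed_words(old_text, new_text):
    """Return (removed, added) words when comparing word by word."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_words[j1:j2])
    return removed, added


# Hypothetical paragraph: the only edit is a newline inserted after "fox".
old = "The quick brown fox jumps over the lazy dog."
new = "The quick brown fox\njumps over the lazy dog."

line_removed, line_added = changed_lines(old, new)
word_removed, word_added = changed_words(old, new)

print("line-based:", line_removed, line_added)
print("word-based:", word_removed, word_added)
```

The line-based comparison reports the whole original line as removed and both new lines as added, while the word-based comparison finds no changed words at all, because the only difference is whitespace.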

ayg wrote:

*** Bug 21953 has been marked as a duplicate of this bug. ***

wiki0007 wrote:

I disagree. Being able to see the differences between revisions is important, as it keeps people accountable for their changes. To flag a whole paragraph as being "deleted", then "added" (which is what it looks like in the revision history) because someone inserts a blank line before that paragraph is very misleading.

The JavaScript gadget Cacycle/WikEdDiff seems to be able to distinguish these kinds of changes. Perhaps someone could look at the code there and adapt it for the code that displays the revision history.

conrad.irwin wrote:

There is no "best" diff output; it depends on personal taste. The beauty of a diff's output depends as much on the postprocessing as on the algorithm used or the parameters with which it is run.

wikEdDiff uses a different theoretical approach (based on Heckel 1978) from MediaWiki (based on Myers 1986), which may or may not give better results overall. From memory, Heckel is generally better at spotting which strings came from where in the source, but it can fail quite nastily if the source contains few unique words. In such cases it may generate very large diffs for very small changes. These failures are unintuitive, unlike the current failures, in which it's easy for a human to see why the computer has been misled (though I think they would be rarer in typical use, so maybe it would be an even match). The Heckel algorithm is, to my mind, quite beautiful in a language with inbuilt hashing, so a quick proof-of-concept should not be hard to whip up, though optimising it well enough to actually be used in MediaWiki might be a bit of a slog. (For further incentive, I have a primitive 3-way-merge tool based on Heckel which is wonderfully fun: you can safely fix a typo in the middle of a sentence while someone else moves the sentence to a new place. If we were to change the diff algorithm, we'd likely want the merge function to follow suit, though it's not necessary.)
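
As a rough illustration of the anchor-based idea attributed to Heckel above, here is a minimal Python sketch. It is a simplification written for this thread, not wikEdDiff's code and not a complete rendering of the published algorithm: tokens that occur exactly once in both versions are used as anchors, and matches are then extended to identical neighbours. The function name `heckel_match` and the sample sentences are assumptions for illustration.

```python
from collections import Counter


def heckel_match(old, new):
    """Map each index in `new` to a matching index in `old`, or None.

    Simplified sketch: unique tokens anchor the match, then matches are
    grown forwards and backwards over identical neighbouring tokens.
    """
    old_count = Counter(old)
    new_count = Counter(new)
    # Position lookup; only consulted for tokens that occur once in `old`.
    old_pos = {tok: i for i, tok in enumerate(old)}

    new_to_old = [None] * len(new)
    old_to_new = [None] * len(old)

    # Anchor pass: link tokens that are unique in both sequences.
    for j, tok in enumerate(new):
        if new_count[tok] == 1 and old_count[tok] == 1:
            i = old_pos[tok]
            new_to_old[j] = i
            old_to_new[i] = j

    # Forward pass: extend each match to the following token if identical.
    for j in range(len(new) - 1):
        i = new_to_old[j]
        if i is None or i + 1 >= len(old):
            continue
        if (new_to_old[j + 1] is None and old_to_new[i + 1] is None
                and new[j + 1] == old[i + 1]):
            new_to_old[j + 1] = i + 1
            old_to_new[i + 1] = j + 1

    # Backward pass: extend each match to the preceding token if identical.
    for j in range(len(new) - 1, 0, -1):
        i = new_to_old[j]
        if i is None or i == 0:
            continue
        if (new_to_old[j - 1] is None and old_to_new[i - 1] is None
                and new[j - 1] == old[i - 1]):
            new_to_old[j - 1] = i - 1
            old_to_new[i - 1] = j - 1

    return new_to_old


# A moved block: every token is unique, so the move is tracked exactly.
moved = heckel_match(["alpha", "beta", "gamma", "delta"],
                     ["gamma", "delta", "alpha", "beta"])

# One inserted word: repeated "the" is matched via its unique neighbours.
inserted = heckel_match("the cat sat on the mat".split(),
                        "the cat sat on the big mat".split())

# No unique tokens at all: this sketch finds no anchors and matches nothing.
repeated = heckel_match(["the"] * 3, ["the"] * 4)

print(moved)
print(inserted)
print(repeated)
```

The last case shows the failure mode mentioned above: with few or no unique words there is nothing to anchor on, so a small change can come out as a very large diff. The dictionary and `Counter` lookups are where a language with built-in hashing makes this short to write.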

EN.WP.ST47 wrote:

Conrad Irwin explained why our diffs may not be excellent. There are a few different algorithms for producing diffs, and every one of them has a few cases where it won't be quite right. Presumably the one we're using was selected because the devs thought it would perform best. If there's a suggestion to use a different algorithm, with justification as to why it is better, that can be opened as a new bug. Unfortunately, it's not likely that anyone has the time or skill to design a perfect diff algorithm. Closing wontfix.