
Enhance line matching in diffs
Open, NormalPublic

Description

Author: random832

Description:
In the example URL, the lines beginning "Sed elit" on each side differ by only one character. These should be considered the "equivalent lines" to each other, show up in the same row of the table, and get word-by-word highlighting. This is a general feature request for more intelligence in deciding what lines are "equivalent".
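
One way to picture the request is a similarity score between lines, with a line paired to its closest counterpart when the score is high enough. The following is only a minimal sketch using Python's standard `difflib`, not what MediaWiki's diff engine does; the helper name `best_match` and the 0.8 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_match(line, candidates, threshold=0.8):
    """Return the index of the candidate most similar to `line`.

    Returns None when no candidate reaches `threshold` (a hypothetical
    cut-off chosen for this sketch, not a value taken from any diff engine).
    """
    best_index, best_ratio = None, 0.0
    for index, candidate in enumerate(candidates):
        # ratio() is 2 * matches / total length, so 1.0 means identical.
        ratio = SequenceMatcher(None, line, candidate).ratio()
        if ratio > best_ratio:
            best_index, best_ratio = index, ratio
    return best_index if best_ratio >= threshold else None


old_line = "Sed elit lorem, dapibus vitae."
new_lines = [
    "Lorem ipsum dolor sit amet.",
    "Sed elit lorem, dapibus vitaa.",
    "Totally new paragraph here.",
]
print(best_match(old_line, new_lines))
```

A real implementation would also need to keep the pairing in document order and avoid comparing every line with every other line, which this sketch does not attempt.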


Version: unspecified
Severity: normal
URL: http://en.wikipedia.org/w/index.php?title=User:Random832/difftest&diff=199745398&oldid=199745250

Details

Reference
bz13462

Event Timeline

bzimport raised the priority of this task from to Normal.Nov 21 2014, 10:07 PM
bzimport set Reference to bz13462.
bzimport added a subscriber: Unknown Object (MLST).
bzimport created this task.Mar 21 2008, 3:08 AM
Huji added a comment.Mar 21 2008, 8:50 AM

The solution might affect bug 13466 somehow.

IAlex added a comment.Aug 1 2010, 8:53 AM
*** Bug 24618 has been marked as a duplicate of this bug. ***

sumanah wrote:

'This is a general feature request for more intelligence in deciding what lines are "equivalent".' I agree with Random832.

An acquaintance of mine gives this example:

http://dbclass.saintjoe.edu/wiki/index.php/Demo_Context

The diff: http://dbclass.saintjoe.edu/wiki/index.php?title=Demo_Context&diff=2773&oldid=2772

This person teaches English composition. He uses MediaWiki to do it. His students type their essays into MediaWiki, he improves them via an edit, and then they look at the diff together to understand what he changed and why. He finds that the diff calculation in MediaWiki is not robust enough and fails to sensibly show linebreak changes in some instances, and that this makes it much harder to use the diffs as a teaching tool.

"There were very minimal changes made to the article between the first and second revisions; however, I did add a number of paragraph breaks, and coalesced a couple of paragraphs.

"You can see that the paragraph breaks caused the diff "discernment function" to identify whole paragraphs as changes, when in fact all that happened was the addition of a simple line break."

Adding the "design" keyword to ping a designer to consider what we should really be doing regarding various diff generation and diff-viewing edge cases.

FT2.wiki wrote:

Agree, visiting here to report a similar request with this example:
http://en.wikipedia.org/w/index.php?diff=470024993&oldid=469833887

Examples of issues that should have been noticed by the diff engine/formatter:

  • Line starting "{{cquote|It is surprising" -- same or virtually same line appears left and right, diff engine fails to match them with no obvious reason why that should be. So they appear as a deletion + insertion, rather than shown adjacent. Common problem.
  • Same occurs lower down with line starting ":* Fibres from"
  • Under heading "=== Subsequent events ===" -- a paragraph has been added starting "An inquest into..." Surrounding text is unchanged. Instead of recognizing this as a simple one-paragraph addition, it's treating it as a removal of one paragraph and change to all text in all following paragraphs (ie believes each para has changed when they have merely moved down one para simultaneously due to the insertion). The last 2 paras in the section are then treated as new insertions which they aren't.
  • Line starting ": "I don't" edited to add a {{cquote| template. Instead of recognizing the few extra characters diff treated it as a completely substituted new paragraph.

*** Bug 349 has been marked as a duplicate of this bug. ***

Created attachment 9885
dwdiff

Histories are full of completely useless diffs like this https://www.mediawiki.org/w/index.php?title=Help%3AExtension%3ATranslate&action=historysubmit&diff=489225&oldid=487083 (just a random example, things can get much worse).

Word-level diff gives better results in such cases, see screenshot of a simple dwdiff -c (1.9; I see there are further improvements in later releases).
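
To illustrate why a word-level comparison copes better with moved line breaks, here is a rough sketch in Python using the standard `difflib`. It is not how dwdiff or wikidiff2 work internally; `word_diff` is a hypothetical helper, and splitting on whitespace is a simplifying assumption.

```python
from difflib import SequenceMatcher


def word_diff(old_text, new_text):
    """Return the non-equal opcodes of a whitespace-tokenised diff.

    Line breaks are treated like any other whitespace, so re-wrapping a
    paragraph or adding a paragraph break produces no word changes.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, old_words[i1:i2], new_words[j1:j2]))
    return changes


# Only the line break moved: no word-level changes are reported.
print(word_diff("one two\nthree four", "one two three\nfour"))
# One word replaced: a single small change is reported.
print(word_diff("the quick fox", "the slow fox"))
```

The trade-off is that the line break changes themselves become invisible, so a viewer built this way would still have to highlight them separately.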


(In reply to comment #6)

Word-level diff gives better results in such cases, see screenshot of a simple
dwdiff -c (1.9; I see there are further improvements in later releases).

According to docs (which are outdated) wikidiff2 «performs word-level (space-delimited) diffs» (now they're [always?] character-level), so it probably should be able to handle whitespace in a more sensible way, but I don't know how the different features can be merged/balanced. Moving under wikidiff2 anyway.

The bad matching of paragraphs is definitely harming my productivity. Raising this to a bug to give it the credit it should have.

*** Bug 23704 has been marked as a duplicate of this bug. ***

mr.heat wrote:

Here is a fresh example where the diff algorithm fails:

http://de.wikipedia.org/w/index.php?title=Holland-America_Line&diff=prev&oldid=103985082

sumanah wrote:

(In reply to comment #10)

Here is a fresh example where the diff algorithm fails:
http://de.wikipedia.org/w/index.php?title=Holland-America_Line&diff=prev&oldid=103985082

That example is still kind of annoying, yeah.

mr.heat wrote:

(In reply to comment #11)

(In reply to comment #10)

http://de.wikipedia.org/w/?diff=103985082

That example is still kind of annoying, yeah.

As announced in bug #33331 I improved my user script a lot in the past months.

http://de.wikipedia.org/wiki/Benutzer_Diskussion:TMg/cleanDiff.js

Besides other features (it shrinks the word-level highlighting to character-level and improves the highlighting for single characters) it also fixes bad line matching like in the example above.

One of the reasons for bad line matching is spaces in otherwise empty lines. If a space is added to or removed from an empty line, the diff algorithm gets confused. It tries to find another empty line with the same number of spaces. It will find one. But in almost all cases these empty lines don't belong together.

My proposed fix is to simply ignore all trailing whitespace when matching lines. Trailing whitespace never has a meaning in the wiki syntax. It's good to highlight it in the diff. But it should be ignored in the first step, when the algorithm tries to match lines.
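
The proposal above could be sketched roughly as follows, in Python with the standard `difflib`. This is an illustration under assumptions, not the user script's or wikidiff2's actual code; `match_lines` and `whitespace_only_changes` are hypothetical names.

```python
from difflib import SequenceMatcher


def match_lines(old_lines, new_lines):
    """Pair up lines, ignoring trailing whitespace during matching only."""
    # Matching keys: the lines with trailing whitespace stripped.
    old_keys = [line.rstrip() for line in old_lines]
    new_keys = [line.rstrip() for line in new_lines]
    matcher = SequenceMatcher(None, old_keys, new_keys, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            pairs.append((block.a + offset, block.b + offset))
    return pairs


def whitespace_only_changes(old_lines, new_lines, pairs):
    """Matched pairs whose raw text still differs, to be highlighted."""
    return [(i, j) for i, j in pairs if old_lines[i] != new_lines[j]]


old = ["Intro", "", "Body text", " ", "Outro"]
new = ["Intro", " ", "Body text", "", "Outro"]
pairs = match_lines(old, new)
print(pairs)
print(whitespace_only_changes(old, new, pairs))
```

The stripped keys are used only to decide which lines belong together; the original lines are kept, so the added or removed spaces can still be highlighted in the second step.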

gryllida wrote:

Only ignoring trailing whitespace is not enough:

https://test.wikipedia.org/w/index.php?diff=199552&oldid=199551

sumanah wrote:

I'm removing myself from cc as I prepare to leave Wikimedia Foundation, but I will leave my 2 cents here: improving the line matching in diffs seems, to me, a cool project that could go in https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects .

(In reply to Nemo from comment #15)

Cc bawolff due to
https://lists.wikimedia.org/pipermail/wikitech-l/2014-November/079427.html

It should be noted that displaying diffs and doing edit merges/edit conflicts use two different code paths, probably with different algorithms. Better line matching would be nice in both cases.

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 15 2015, 12:42 AM
Ricordisamoa added a subscriber: Ricordisamoa.

Change 281285 had a related patch set uploaded (by Jdlrobson):
Show moved lines in diff view using JavaScript/CSS

https://gerrit.wikimedia.org/r/281285

You can move stuff - yay. Crude design but hopefully gives idea.

Jay8g added a subscriber: Jay8g.Apr 9 2016, 6:42 AM
Quoth added a subscriber: Quoth.Apr 12 2016, 4:46 PM

Change 281285 abandoned by Jdlrobson:
Show moved lines in diff view using JavaScript/CSS

Reason:
Preserved forever in https://phabricator.wikimedia.org/T135454

https://gerrit.wikimedia.org/r/281285

With the resolution of T195375, the example diff is rather ... different than it was before, showing the paragraph having been moved to a point before the lines deleted in the same diff. I don't know whether to consider that an improvement on that diff, but it might make solving this task somewhat easier.

Nemo_bis added a comment.EditedJun 21 2018, 7:36 AM

Indeed. In fact I had expected that development to fix most diffs here, but I see that's not the case: most of the examples above are still valid. I'm adding a couple screenshots for reference because it's very hard to reconstruct how diffs looked in the past:



FT2's example is still very relevant, although the paragraphs which superficially looked identical actually do have some minor difference in the middle, in addition to some markup change of a character or two at the beginning.