Added or removed lines in between changes messes with diff alignment
Open, Needs Triage, Public

Description

When lines are added or removed in between changed paragraphs, the diff algorithm fails to align them correctly and does not detect the changes in the right places. In the past, this behavior led to really strange diffs where it seemed that paragraphs were completely rewritten.

Now at least the moved paragraph detection kicks in to make sense of what's happening, but the diff could still look much simpler. See the example diff and the comparison below [1].

Before the introduction of paragraph detection:

Honeyguide - before.png (843×1 px, 123 KB)

After the introduction of paragraph detection:

Honeyguide - after.png (994×1 px, 117 KB)

Ideally, the alignment in the diff would be fixed so that the changed paragraphs from the left and the right side are shown next to each other. A simple approach to do that in cases where empty lines following paragraphs are altered produced too many false positives, so I just mention it here: T184531: Treat empty lines as part of the previous paragraph
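For illustration only, here is a minimal Python sketch of what "treat empty lines as part of the previous paragraph" could mean as a preprocessing step before diffing. It is a literal reading of the T184531 title, not the code that was tried in wikidiff2, and the function name `merge_trailing_blank_lines` is hypothetical.

```python
from typing import List


def merge_trailing_blank_lines(lines: List[str]) -> List[str]:
    """Fold each empty line into the paragraph that precedes it.

    Hypothetical helper: a leading empty line has no previous
    paragraph, so it is kept as its own entry.
    """
    merged: List[str] = []
    for line in lines:
        if line.strip() == "" and merged:
            # Attach the empty line to the previous paragraph.
            merged[-1] += "\n" + line
        else:
            merged.append(line)
    return merged
```

Note that in this sketch a paragraph that gains or loses a trailing empty line no longer compares equal to its otherwise unchanged counterpart, so the comparison step that follows would have to tolerate that difference.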

This was originally reported on wiki by Chiswick Chap (as well as the IP who made the edit) at [2].

[1] https://en.wikipedia.org/w/index.php?title=Honeyguide&diff=prev&oldid=779422017
[2] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_155#What.27s_up_with_the_diff_display.3F

Event Timeline

Aklapper renamed this task from Diff regression? to Specific diff shows separate sections while only some stuff within the sections was changed .Aug 10 2019, 5:56 PM
Aklapper renamed this task from Specific diff shows separate sections while only some stuff within the sections was changed to Specific diff shows separate paragraphs, while only some text within the paragraphs was changed .
Aklapper updated the task description. (Show Details)

@Nirmos: I think this is not a problem anymore?

Screenshot from 2019-08-12 17-41-52.png (863×1 px, 142 KB)

Can this task be resolved?

Things are still broken. For a minimal test case, see for example https://sv.wikipedia.org/w/index.php?diff=46200983

The .diff-deletedline is further down than .diff-addedline, despite the article only containing one line, and that's the line being changed. It didn't use to be like this, but it's been like this for quite some time now.

This is a clear, simple and reproducible issue, which is why I put it into a separate ticket: T230432: Changes in lines with only one word are not detected as change. It seems to be a regression from the wikidiff2 algorithm changes.

The original issue in this ticket is a different problem. The example diff [1] above looks a bit confusing because a newline was removed while the paragraphs were also edited. When lines are removed or added, the diff algorithm does not align these changes very well in some cases.

That problem has existed for some time now and has nothing to do with the changes made to wikidiff2 in the last three years. The ability to detect changed and moved paragraphs even improves the situation slightly. We tried to fix the alignment while working on wikidiff2, but the problem turned out to be quite complex; see T184531: Treat empty lines as part of the previous paragraph (we even used the above example for that).

[1] https://en.wikipedia.org/w/index.php?title=Honeyguide&diff=prev&oldid=779422017

WMDE-Fisch renamed this task from Specific diff shows separate paragraphs, while only some text within the paragraphs was changed to Specific diff shows separate paragraphs, while only some text within the paragraphs was changed.Aug 13 2019, 5:20 PM
WMDE-Fisch edited projects, added wikidiff2; removed Regression.

@Nirmos if you're fine with it, I would change the ticket description to reflect the bug I'm describing above. I have not found a ticket for that so far, so it might be good to use this one.

Absolutely, I'm very happy that you're trying to sort things out! 👍

WMDE-Fisch renamed this task from Specific diff shows separate paragraphs, while only some text within the paragraphs was changed to Added or removed lines in between changes messes with diff alignment.Aug 14 2019, 7:12 AM
WMDE-Fisch updated the task description. (Show Details)
WMDE-Fisch updated the task description. (Show Details)

See also the examples given in T182300 where the alignment should be improved rather than showing a moved paragraph.

tstarling added subscribers: Zazpot, tstarling.

Example reported at T7072

As discussed here, another example of this bug is at https://en.wikipedia.org/w/index.php?title=MalwareTech&diff=782456060&oldid=782449642 . In that diff, both the left and right columns have hunks that begin "Following his work on the WannaCry ransomware attack in 2017" and that are almost identical (edit distance: 3) but that have been aligned with other hunks instead of with each other, making it very hard to spot what has changed between them. (To spare you searching, it is "he's" to "he has".)

I expect the solution to this bug will involve matching paragraphs according to minimum edit distance, with a fallback algorithm in case two or more paragraphs are equal edit distances away.
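To make that idea concrete, here is a rough Python sketch of pairing paragraphs by minimum edit distance. It is not how wikidiff2 works; the names `edit_distance` and `match_paragraphs`, the `max_ratio` threshold, and the tie-break (prefer the pair whose positions are closest) are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance, two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert into a
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


def match_paragraphs(old: List[str], new: List[str],
                     max_ratio: float = 0.5) -> List[Tuple[int, int, int]]:
    """Greedily pair old and new paragraphs, smallest distance first.

    Returns (old_index, new_index, distance) triples. Pairs whose
    distance exceeds max_ratio of the longer paragraph are never
    matched. Assumed fallback for equal distances: the pair with the
    smaller positional offset wins, then the lower indices.
    """
    candidates = []
    for i, o in enumerate(old):
        for j, n in enumerate(new):
            d = edit_distance(o, n)
            longest = max(len(o), len(n), 1)
            if d / longest <= max_ratio:
                candidates.append((d, abs(i - j), i, j))
    candidates.sort()

    used_old, used_new = set(), set()
    pairs = []
    for d, _, i, j in candidates:
        if i in used_old or j in used_new:
            continue
        used_old.add(i)
        used_new.add(j)
        pairs.append((i, j, d))
    return sorted(pairs)
```

With `old = ["Intro text.", "Following his work, he's famous.", "Closing."]` and `new = ["Intro text.", "", "Following his work, he has famous.", "Closing."]`, the inserted empty line stays unmatched and the two "Following his work" paragraphs are paired with each other at distance 3. A greedy pass like this can produce crossing pairs and compares every old paragraph with every new one, so a real implementation would need ordering constraints and a cheaper candidate search.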