Investigate possible regression due to wikidiff2 1.5 change detection
Open, Needs Triage, Public

Description

On de.WP there is currently a discussion about whether the latest wikidiff2 changes (T177891) have made diff views worse. [1]

[1] https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Diff_erkennt_ge.C3.A4nderte_Zeilen_als_neu (in German)


I imported an example from the thread into my local wiki and opened the diff with both an older version of wikidiff2 and the current version. Results:

https://de.wikipedia.org/w/index.php?title=Bahnhof_Fl%C3%BCelen&curid=6196964&diff=170846535&oldid=170503518

Version 26738b5 2016-06-28 (tag: 1.4.1)

Version 5e153da 2017-10-12 (tag: 1.5.1)

MarcoAurelio renamed this task from "Investigaet possible regression due to wikidiff2 1.5 change detection" to "Investigate possible regression due to wikidiff2 1.5 change detection". Nov 10 2017, 5:14 PM

Next example from the post: https://de.wikipedia.org/wiki/Spezial:Diff/170818650

(I zoomed a bit here so you get the whole picture, but I think you can see that the two are identical.)

Version 26738b5 2016-06-28 (tag: 1.4.1) (local)

Version 5e153da 2017-10-12 (tag: 1.5.1) (online)

Next example: https://de.wikipedia.org/w/index.php?title=Benutzer:PDD/markAdmins.js&curid=3238133&diff=96378622&oldid=95771630

Version 26738b5 2016-06-28 (tag: 1.4.1) (local)

Version 5e153da 2017-10-12 (tag: 1.5.1) (online)

Next diff: https://de.wikipedia.org/w/index.php?diff=170728571

Version 26738b5 2016-06-28 (tag: 1.4.1) (local)

Version 5e153da 2017-10-12 (tag: 1.5.1) (online)

Actually, in most of the examples above, both the old and the new behavior are bad: either the wrong paragraphs are connected (old version of wikidiff2) or no paragraphs are connected at all (new version of wikidiff2). This makes me wonder whether this really was always the case or whether there is yet another regression that went unnoticed. The example from comment T180259#3751693 was linked in a discussion about T35331, which suggests that a previous version linked the paragraphs as expected.

Anyway, here are two more examples where paragraphs aren't compared, though they previously were (at least I'm quite sure of it) and, in my opinion, still should be. Since I don't have the old version of wikidiff2 here, I can't provide screenshots, though.

https://de.wikipedia.org/w/index.php?title=Internationale_Mathematik-Olympiade&diff=167457198&oldid=164451757: Of course, by adding the town and unlinking the country I changed a lot in one line, but it would still help to have the old and new content next to each other.

https://de.wikipedia.org/w/index.php?title=Internationale_Mathematik-Olympiade&diff=167483670&oldid=167457198: The new paragraph ("Die Aufgabe, bei der bisher (Stand 2017) ...") consists of two parts. Part 1 is the first old paragraph with a few changes, part 2 the last old paragraph, again with some changes. Of course, it can't be linked to both, but linking it to either is better than linking it to none.
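To make the "link it to either one" idea concrete, here is a minimal sketch in Python. It is not how wikidiff2 works internally; the function names `similarity` and `best_match` are hypothetical, and the word-level ratio from Python's `difflib` is only a stand-in for whatever measure the real algorithm uses. The new paragraph is simply linked to whichever old paragraph it resembles most.

```python
from difflib import SequenceMatcher


def similarity(old_text, new_text):
    """Word-level similarity between 0.0 and 1.0 (stand-in measure, an assumption)."""
    return SequenceMatcher(None, old_text.split(), new_text.split(),
                           autojunk=False).ratio()


def best_match(new_paragraph, old_paragraphs):
    """Return (index, score) of the most similar old paragraph.

    Returns (None, 0.0) if no old paragraph shares any word with the new one.
    """
    best_index, best_score = None, 0.0
    for index, old in enumerate(old_paragraphs):
        score = similarity(old, new_paragraph)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


# A new paragraph built from the first and the last old paragraph.
old_paragraphs = ["alpha beta gamma delta", "unrelated middle text", "one two three"]
new_paragraph = "alpha beta gamma delta epsilon one two"
index, score = best_match(new_paragraph, old_paragraphs)
```

In this toy input the merged paragraph is linked to the first old paragraph, because it shares more words with it than with the last one. The last old paragraph would then be shown as removed.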

One example where I do have a screenshot of the old version (https://commons.wikimedia.org/wiki/File:Wiki_labels_screenshot_(zoomed_--_diff).png ): https://en.wikipedia.org/w/index.php?title=Marbella_Cup&diff=646838927&oldid=645151368
While the old version clearly shows that previously empty table cells were filled in, the new version is almost twice as long and doesn't show this clearly. So even though the new version here works as advertised (no diff shown for lines that changed very much), it makes things worse.

Searching through old screenshots of diff views will turn up many more examples. One interesting case from it.wiki: https://commons.wikimedia.org/wiki/File:Confronto_diff_monobook.png / https://it.wikipedia.org/w/index.php?title=Colle_Vento&diff=prev&oldid=7897666
Sure, the second paragraph changed a lot. But the new version hides the fact that a lot also stayed the same. There are more similar changes later on (though not in the screenshot on Commons, so I currently can't compare them to the previous version). I'm quite sure that a diff was previously shown for those too, while wikidiff2 now refuses to show one, hiding the fact that they are clearly based on the previous version for a large part.

I am adding screenshots for the last examples. I think we get the idea now, and I will write something about that in the next post.

https://de.wikipedia.org/w/index.php?title=Internationale_Mathematik-Olympiade&diff=167457198&oldid=164451757

Version 26738b5 2016-06-28 (tag: 1.4.1) (local)

Version 5e153da 2017-10-12 (tag: 1.5.1) (online)

https://de.wikipedia.org/w/index.php?title=Internationale_Mathematik-Olympiade&diff=167483670&oldid=167457198

Version 26738b5 2016-06-28 (tag: 1.4.1) (local)

Version 5e153da 2017-10-12 (tag: 1.5.1) (online)

https://en.wikipedia.org/w/index.php?title=Marbella_Cup&diff=646838927&oldid=645151368

Image on commons

Version 26738b5 2016-06-28 (tag: 1.4.1) (local)

Version 5e153da 2017-10-12 (tag: 1.5.1) (online)

https://it.wikipedia.org/w/index.php?title=Colle_Vento&diff=prev&oldid=7897666

Image on commons

Version 26738b5 2016-06-28 (tag: 1.4.1) (local)

Version 5e153da 2017-10-12 (tag: 1.5.1) (online)

Thanks for all the example diffs. It's really important to see a range of diff variants in order to further improve the algorithm.

The newest version of wikidiff2 tries to improve the diff for cases where the former version would treat two completely different paragraphs as a mere change, even though they only share some parts that are not really connected. It seems the way this is done can still be improved, and we need to consider some more edge cases. We will look into it and keep you updated on the progress here.
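As a rough illustration of the trade-off described here, the following Python sketch decides whether two lines are shown as one changed line or as an unrelated removal plus addition. This is not wikidiff2's implementation: `word_similarity`, `classify` and the `threshold` value of 0.5 are assumptions made up for this example, and Python's `difflib` ratio stands in for the real similarity measure.

```python
from difflib import SequenceMatcher


def word_similarity(old_line, new_line):
    """Share of matching words between two lines, from 0.0 to 1.0."""
    return SequenceMatcher(None, old_line.split(), new_line.split(),
                           autojunk=False).ratio()


def classify(old_line, new_line, threshold=0.5):
    """Return 'changed' if the lines are similar enough to be paired, else 'replaced'.

    The threshold is a hypothetical value chosen for illustration only.
    """
    if word_similarity(old_line, new_line) >= threshold:
        return "changed"
    return "replaced"


old = "the station opened in 1882"
appended = "the station opened in 1882 and was rebuilt in 1944"
unrelated = "a completely different sentence about the lake"

appended_result = classify(old, appended)    # words were only appended
unrelated_result = classify(old, unrelated)  # only "the" is shared
```

With a threshold like this, a line that only had words appended stays paired with its old version, while two lines that merely share a common word are shown separately. Set the threshold too high and real edits stop being paired, which is the kind of case reported in this task.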

Thanks TheDJ for bringing this here. Yes, I have been noticing more and more of these poor diff views. Thankfully, WikEd formats them properly. It would still be nice if the main diff viewer could handle these better.

TheDJ added a comment. Dec 6 2017, 12:16 AM

Oh, I noticed that my screenshot uses the old styling for diffs, by the way, because I still had that gadget enabled by accident. Sorry about that. The gist is the same, however.

Thanks @TheDJ and @Doc_James, we will look into it :)

Another very simple example: https://de.wikipedia.org/w/index.php?diff=173785263&oldid=167747951 The edit only appended some words at the end of a line, yet the diff displays them as completely changed.

Thanks @Schnark, we expect to fix this class of regression with the upcoming fixes we're currently working on. They just need a bit more testing and adjusting so that we get the best results for most cases.