Page MenuHomePhabricator

Paragraph indicated as moved, though it wasn't
Closed, DuplicatePublicBUG REPORT

Description

https://de.wikipedia.org/w/index.php?title=National_Register_of_Historic_Places_in_Arizona&curid=9675366&diff=171746522&oldid=171745883 has a paragraph which is indicated as moved, but the paragraph wasn't moved, since all paragraphs between the "old" and the "new" place have been deleted in the edit.

Event Timeline

And the other way round, too: https://de.wikipedia.org/w/index.php?title=Maria_Canins&curid=5138507&diff=171746679&oldid=171745939
Most of the "moved" paragraphs weren't moved, the diff totally obscures the fact that only a) headlines were inserted, and b) paragraphs have been split.
The only real move in this example is the block with the references.

I had a look at the provided examples and compared what the table at https://wmde-wikidiff2-unpatched.wmflabs.org/core/index.php/Main_Page#Test_Pages provides in the "old" and "new" columns:

  • The diff for "National Register of Historic Places in Arizona" looks better than before. Instead of deleting |Navigationsleiste Listen der National Historic Places der Vereinigten Staaten and inserting {{Navigationsleiste Listen der National Historic Places der Vereinigten Staaten}}, the algorithm successfully detects that these two lines are almost identical (except for the special characters), and links them.
  • I'm afraid I don't understand what's wrong with the "Maria Canins" diff. When I compare old and new algorithm, the new seems to be so much more helpful. Large paragraphs are linked, instead of just appearing deleted and inserted. The appearance of the headlines is not very different between the two, so I really can't tell what is "obscured" there.
  • The diff for "Frankenhardt" is just … a bit different. The amount of information I can see in the two diffs is pretty much identical. The old algorithm linked paragraphs purely based on their position, which was ok for this particular diff. The new algorithm links paragraphs when they are significantly similar. This results in a diff that's a little more spread out, but not worse, from what I can see.

I'm afraid there is nothing we can do here at the moment. Can you help us understand the problem better?

Old diff (1):

1.png (180×1 px, 14 KB)

Current diff (2):

2.png (216×1 px, 14 KB)

Best diff (3):

3.png (165×1 px, 11 KB)

Of course, (2) is better than (1), but that is no reason why it shouldn't be (3), which is even better than (2), and that's what this task is about.

thiemowmde removed a project: Regression.
thiemowmde added a subscriber: WMDE-Fisch.

Ah, I see. Thanks for the clarification! The problem here is the way two separate algorithms work together: The first line-matching algorithm does it's best to put identical lines side by side. Only then the second moved paragraph algorithm (this is the one we introduced recently) tries to re-match disjoint lines and links them together. Having these two algorithms separated is required for situations where lines have been moved.

What we see in screenshot (2) is an edge-case where the second algorithm is a bit more clever in finding matches than the first. This is by design. The line-matcher is intentionally trivial and does not try to match "almost identical" lines, but only lines that are either truly identical, or at the same position. Neither is the case in the provided example. Only then the second algorithm kicks in, finds a match, but does not move anything. This is also by design.

Yes, it's technically possible to introduce special-case handling for situations like this. I can imaging two approaches:

  • Try to make the line-matcher "softer" so it can find vague matches the same way the moved paragraph algorithm does. The biggest problem here is: this is heavily CPU intensive and probably nothing we can do.
  • Apply a third "clean-up" algorithm after everything else was done. This "clean-up" could look for situations like shown in the screenshot where a move was detected, and there is an empty gap in the diff that allows us to actually move the paragraph.

I need to point out this is not a Regression but a new feature. We will see if we find the time to work on this.

awight changed the subtype of this task from "Task" to "Bug Report".Oct 30 2020, 1:54 PM

This bug seems to describe a more general issue with the visualization of diffs that is already covered in another ticket. I'll merge it accordingly.