Page MenuHomePhabricator

Wikidiff2 considers each space to be a separate word
Open, Needs TriagePublic

Description

The Word class in Wikidiff2 has a "body" part, which is supposed to contain the letters of the word, and the "suffix" part, which is supposed to contain the spaces following the word. However, a bug in TextUtil::explodeWords() from 2010 causes each space to always be a separate word. The suffix part is only populated when the body part is empty.

Fixing the bug improves performance on English text by 27% according to bench.php. However, it will have user-visible impacts, which I would like to seek input on.

Space insertion currently:

before.png (143×1 px, 29 KB)

Space insertion with spaces joined to preceding words:

after.png (145×1 px, 32 KB)

So the question is:

  • What should space insertion look like?
  • Does it matter enough to warrant a 27% slowdown?

If the answer is that it should continue to work like it has since 2010, then we should get rid of the "suffix" concept in the Word class and clean up TextUtil::explodeWords() accordingly.

Event Timeline

Change 879173 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/php/wikidiff2@master] [WIP] Include spaces as the word suffix

https://gerrit.wikimedia.org/r/879173

Change 879174 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/php/wikidiff2@master] Remove word suffix concept

https://gerrit.wikimedia.org/r/879174

There was a comment indicating that suffix was only supposed to contain a single space, not all spaces following the word. That would make more sense and would correctly highlight the added space in the screenshot above. However, insertion of a single space between letters and punctuation would not be highlighted correctly. And with the current formatting code, changing only letters would lead to the following space being highlighted even when it is unchanged.

Removing the suffix concept is probably the best option at this point.

Either way, it's not ideal. With the current wikidiff2, if you match a paragraph of text with another paragraph that uses none of the same words, the LCS is just the spaces. You can even match a text with a string that contains only spaces, and each space will be in the LCS. It's a reminder that displaying diffs for humans is not just a matter of finding the shortest edit script.

Maybe a better idea would be to do a diff with the spaces joined to words, and then do a space-splitting pass which will make local adjustments to the edit boundaries.

Change 879173 abandoned by Tim Starling:

[mediawiki/php/wikidiff2@master] [WIP] Include spaces as the word suffix

Reason:

https://gerrit.wikimedia.org/r/879173

In T117279: [EPIC] Core should provide inline diffs as well as side by side (Move InlineDifferenceEngine into core / remove MobileDiff), I posted an example of how bad this can get in MobileDiff:

Screenshot_20231016_143926_Firefox.jpg (2×1 px, 695 KB)

In addition to joining together words and spaces, another thing that could help improve the output in situations like this would be joining together substitutions that have too small of a separator between them, as measured by word count, word frequency, and/or character count. This would probably negate any time improvements though.