Show word-level diff for terms and sitelinks in Wikibase diffs
Closed, Resolved | Public | 13 Estimated Story Points | Feature

Description

User story:
As a Wikidata editor patrolling some edits,
I want to spot the actual changes more easily
in order to assess them more efficiently.

Problem:
Currently, Wikibase highlights the whole fields in diffs for items and properties. I’d like to have word-level diffs, just like how it works on non-entity pages, for any textual fields, including terms (labels, descriptions, aliases), sitelink targets and all textual properties’ values (strings, monolingual strings, Commons files etc.).

Example:
For example, do you see the difference in this diff? (Tip: in the fourth word, е was changed to є. I found it using NavPopups (which does word-level diff on the JSON, so it shows \u0435 and \u0454) and then looking very closely at it.)
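To illustrate why such a change is nearly invisible in a whole-field highlight, here is a minimal Python sketch that reports the code points at which two strings differ. The function names and the sample word are made up for illustration, and the position-by-position comparison only suits same-length substitutions like the one described above.

```python
import unicodedata
from itertools import zip_longest


def describe(ch):
    """Return a readable code point description, or None for a missing char."""
    if ch is None:
        return None
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"


def differing_chars(old, new):
    """Return (index, old_char, new_char) for each position that differs."""
    return [
        (i, describe(a), describe(b))
        for i, (a, b) in enumerate(zip_longest(old, new))
        if a != b
    ]


# Hypothetical sample word, written with escapes so the two look-alike
# letters (U+0435 and U+0454) cannot be confused in the source.
old = "\u0442\u0435\u0441\u0442"
new = "\u0442\u0454\u0441\u0442"

for index, before, after in differing_chars(old, new):
    print(index, before, "->", after)
```

A word-level diff narrows the highlight to the affected word; a check like this then pins down the exact character.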

Solution:
Do word-level diff on textual fields in Wikibase diffs

Mockups:

How it currently looks: wikibase-word-level-diff-before.png (292×1 px, 27 KB)
How it should look: wikibase-word-level-diff-after.png (293×1 px, 26 KB)

Acceptance criteria:

  • Diff shows word-level differences for terms and sitelinks in Wikibase.

Notes:

Open questions:

Event Timeline

(After spending at least ten minutes handcrafting the screenshots with Firefox’s developer tools, I noticed that I had left off the Property / Commons category heading on the left side. 🙁 I won’t redo them; I hope you get the point anyway.)

Thank you! The illustration is very helpful. I agree that this would be a very useful improvement, especially for patrollers.
@Addshore @ItamarWMDE Any idea how big an endeavor this could be?

I'd assess this endeavor to be of medium effort. From what I can gather, there are a couple of places where we need to change the way diffs are rendered: Firstly, we might want to use or extend the WordLevelDiff class from core's Diff namespace inside our own data-model-services sub-package's EntityDiff and / or ItemDiff classes. Then, we might want to update the way terms and claim diffs are rendered within BasicDiffView and ItemDiffView in the Wikibase/Repo/Diff namespace, to highlight only the changed words rather than the whole changed sentence.
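As a rough illustration of "highlight only the changed words rather than the whole changed sentence", here is a minimal Python sketch built on the standard library's `difflib`. It is not the core `WordLevelDiff` class and does not use any Wikibase view class. The tokenization, the function names and the `<del>`/`<ins>` markup are assumptions chosen for the example.

```python
import difflib
import html
import re


def tokenize(text):
    """Split text into alternating runs of whitespace and non-whitespace."""
    return re.findall(r"\s+|\S+", text)


def word_level_diff(old, new):
    """Return (old_html, new_html) with only the changed tokens wrapped."""
    a, b = tokenize(old), tokenize(new)
    old_out, new_out = [], []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_part = html.escape("".join(a[i1:i2]))
        new_part = html.escape("".join(b[j1:j2]))
        if tag == "equal":
            # Unchanged tokens are copied to both sides without markup.
            old_out.append(old_part)
            new_out.append(new_part)
        else:
            # Deleted or replaced tokens go to the old side only,
            # inserted or replacing tokens to the new side only.
            if old_part:
                old_out.append(f"<del>{old_part}</del>")
            if new_part:
                new_out.append(f"<ins>{new_part}</ins>")
    return "".join(old_out), "".join(new_out)


old_html, new_html = word_level_diff("This is a example", "This is an example")
print(old_html)  # This is <del>a</del> example
print(new_html)  # This is <ins>an</ins> example
```

The two return values correspond to the left and right cells of a side-by-side diff row. The underlying diff representation stays untouched, since only the rendering of a single changed field is affected.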

I'd still be happy to hear what @Addshore thinks of it, though, just as a sanity check.

I'd still be happy to hear what @Addshore thinks of it, though, just as a sanity check.

Nothing really to add that I can think of from my side.
Sounds good to me!

Manuel renamed this task from Do word-level diff on textual fields in Wikibase diffs to Show word-level diff on textual fields in Wikibase diffs. (May 4 2022, 3:53 PM)
Manuel updated the task description.
karapayneWMDE set the point value for this task to 13. (Jun 14 2022, 10:09 AM)

Task Breakdown Notes

  • If any subtasks are created, they should be created in the currently running sprint board, to be picked up there.
  • The classes might be involved in detecting merge conflicts, so we might need to tread carefully.
  • We should probably try to work our way from BasicDiffView and ItemDiffView and see what we need to change along the way, but avoid touching the merge and conflict resolution functionality; i.e., the way the diff is represented programmatically should probably not be touched.
  • We can use WordLevelDiff to try and achieve the acceptance criteria, but in case we need to extend or modify it, we should consider the MediaWiki Stable Interface Policy.
  • Make sure to consider whether the core class WordLevelDiff is marked as "newable", meaning that we are able to instantiate it outside of core

Understanding the topic better: @noarave will create a separate task from these notes.

  • Figure out where we use WordLevelDiff, but also consider alternative solutions to understand our way forward.
  • The investigation should focus on WordLevelDiff and our options for using it or, if need be, modifying it so that we can use it.
  • It appears that the Tech Wishes team is using it, so asking Adam or Svantje might be a good option.

Make sure to consider whether the core class WordLevelDiff is marked as "newable", meaning that we are able to instantiate it outside of core

WordLevelDiff was last substantially changed in 2016, so it predates the stable interface policy (created beginning of 2017). To me it seems plausible that marking it newable was simply not done yet, but there wouldn’t be any reasons not to do it either. (Alternatively, TableDiffFormatter is already newable and uses WordLevelDiff; we might be able to use TableDiffFormatter.)

@Lucas_Werkmeister_WMDE I linked to this comment in the subtask so it doesn't escape us.

Change 810017 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Rename item diff related classes for clarity

https://gerrit.wikimedia.org/r/810017

Change 810017 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Rename item diff related classes for clarity

https://gerrit.wikimedia.org/r/810017

Change 810302 had a related patch set uploaded (by Noa wmde; author: Noa wmde):

[mediawiki/extensions/Wikibase@master] Prepare switching to WordLevelDiff in BasicDiffView

https://gerrit.wikimedia.org/r/810302

Change 810315 had a related patch set uploaded (by Noa wmde; author: Noa wmde):

[mediawiki/extensions/Wikibase@master] Use WordLevelDiff for Labels/Description/Aliases

https://gerrit.wikimedia.org/r/810315

This implements word-level diffs for the contents of labels, descriptions, and aliases. I think diffing the titles of sitelink changes would also be fairly doable; diffing statements feels like more work to me, both to define (e.g. should we diff the labels of item values word by word, or just treat the whole thing as one change, because after all the whole item value was changed to a different ID?) and to implement. Should either of those be part of this task as well?

Good thinking! Yes, let's please diff the titles of sitelink changes as well! I don't want to increase complexity, let's not do the word-by-word diffs for statements for now. If the community wants this at some point, I will create another task for it.

Change 810302 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Prepare switching to WordLevelDiff in BasicDiffView

https://gerrit.wikimedia.org/r/810302

Change 810361 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Prepare switching to WordLevelDiff in SiteLinkDiffView

https://gerrit.wikimedia.org/r/810361

Change 810362 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/extensions/Wikibase@master] Use WordLevelDiff for site link titles

https://gerrit.wikimedia.org/r/810362

[…] diffing statements feels like more work to me, both to define (e.g. should we diff the labels of item values word by word, or just treat the whole thing as one change, because after all the whole item value was changed to a different ID?) and to implement.

I don't want to increase complexity, let's not do the word-by-word diffs for statements for now.

Word-by-word diffs for entity-typed statements (or sitelink badges) make little sense, but I’d like to have them for unstructured-string-valued statements as well—see the Commons category in the screenshot. Of course, if it’s way more difficult, I’d be happy if word-by-word diffs for terms and sitelinks rolled out for now, but I’d like this ticket not to be resolved until textual statements have word-by-word diffs as well (or create a new task and close this one if that better fits your workflow).

Change 810315 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Use WordLevelDiff for Labels/Description/Aliases

https://gerrit.wikimedia.org/r/810315

Change 810361 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Prepare switching to WordLevelDiff in SiteLinkDiffView

https://gerrit.wikimedia.org/r/810361

Change 810362 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Use WordLevelDiff for site link titles

https://gerrit.wikimedia.org/r/810362

Should be verifiable on Beta soon; will only be deployed to production around July 20 (no deployment train next week).

Yay, looking great! :)

One thing that I noticed: unchanged spaces after a changed word are highlighted in the example diffs for some reason. This is not an issue for me, but I am mentioning it just in case, as it's different from how diffs work on Wikipedia.

Hm, on my local wiki I get this behavior for wikitext diffs as well:

image.png (388×1 px, 56 KB)

But not on Wikipedia:
image.png (706×1 px, 137 KB)

Not sure what’s going on here…

Hm, but on Beta, the space isn’t included in a wikitext word-level diff either.

image.png (539×1 px, 86 KB)

Ohhh, of course, wikitext diffs can use wikidiff2 or an external diff executable, not necessarily WordLevelDiff

I don’t think we should use wikidiff2, though. The inline diff functions don’t seem useful for us, since we want to show a side-by-side diff:

>>> wikidiff2_inline_diff( 'This is a example', 'This is an example', 2 )
=> """
   <div class="mw-diff-inline-header"><!-- LINES 1,1 --></div>\n
   <div class="mw-diff-inline-changed">This is <del>a</del><ins>an</ins> example</div>\n
   """
>>> wikidiff2_inline_json_diff( 'This is a example', 'This is an example', 2 )
=> "{"diff": [{"type": 3, "lineNumber": 1, "text": "This is aan example", "offset": {"from": 0,"to": 0}, "highlightRanges": [{"start": 8, "length": 1, "type": 1 },{"start": 9, "length": 2, "type": 0 }]}]}"

And the regular diff contains too much extra HTML that we’d have to strip out again (the <!--LINE--> numbers get localized later, but make no sense in our context):

>>> wikidiff2_do_diff( 'This is a example', 'This is an example', 2 )
=> """
   <tr>\n
     <td colspan="2" class="diff-lineno"><!--LINE 1--></td>\n
     <td colspan="2" class="diff-lineno"><!--LINE 1--></td>\n
   </tr>\n
   <tr>\n
     <td class="diff-marker" data-marker="−"></td>\n
     <td class="diff-deletedline diff-side-deleted"><div>This is <del class="diffchange diffchange-inline">a</del> example</div></td>\n
     <td class="diff-marker" data-marker="+"></td>\n
     <td class="diff-addedline diff-side-added"><div>This is <ins class="diffchange diffchange-inline">an</ins> example</div></td>\n
   </tr>\n
   """

Let’s just stick to WordLevelDiff for now.
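For readers unfamiliar with the inline JSON output quoted above, here is a small Python sketch of how its `highlightRanges` could be turned back into markup. It assumes that `type` 1 marks deleted text and `type` 0 marks inserted text, which is what the matching inline HTML output suggests. It also assumes that `start` and `length` can be used as plain string indices, which holds for this ASCII example but was not checked for non-ASCII text. The function name is made up.

```python
def render_inline(text, highlight_ranges):
    """Wrap each highlighted range of `text` in <del> or <ins> tags."""
    out, pos = [], 0
    for r in sorted(highlight_ranges, key=lambda r: r["start"]):
        start = r["start"]
        end = start + r["length"]
        # Assumption: type 1 = deleted, anything else = inserted.
        tag = "del" if r["type"] == 1 else "ins"
        out.append(text[pos:start])
        out.append(f"<{tag}>{text[start:end]}</{tag}>")
        pos = end
    out.append(text[pos:])
    return "".join(out)


# Values copied from the wikidiff2_inline_json_diff output in the comment.
text = "This is aan example"
ranges = [
    {"start": 8, "length": 1, "type": 1},
    {"start": 9, "length": 2, "type": 0},
]
print(render_inline(text, ranges))
# This is <del>a</del><ins>an</ins> example
```

The result is a single merged line, which shows why this format suits an inline view rather than the side-by-side layout wanted here.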

Sounds good, thx for checking this out!

It would still be important to be consistent within a wiki, including whitespace handling. These minor differences can be very hard to understand, much harder than the big differences like the current status quo of no word-level diffs (which doesn’t mean the word-level diffs are not worth it!). Also, it seems to not only include the unchanged whitespace after a word, but also exclude the changed whitespace before a word, e.g. here it ignored the space before the parenthesis.
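Both symptoms described here (an unchanged trailing space being highlighted, and a changed space before a word being ignored) would follow from a tokenizer that attaches one trailing whitespace character to each token but compares tokens without it. The Python sketch below demonstrates that hypothesis only. It has not been checked against the `WordLevelDiff` source, and the regex, function names and `[-…-]` / `{+…+}` markers are invented for the example.

```python
import difflib
import re

# Hypothetical tokenizer: a whitespace run, a word, or any single character,
# optionally followed by one space or tab that rides along with the token.
TOKEN = re.compile(r"([^\S\n]+|\w+|.)([^\S\n]?)", re.S)


def tokens(text):
    """Return (key, full) pairs: `key` is compared, `full` is displayed."""
    return [(m.group(1), m.group(0)) for m in TOKEN.finditer(text)]


def highlight(old, new):
    """Return (old_marked, new_marked) using the tokenizer above."""
    a, b = tokens(old), tokens(new)
    matcher = difflib.SequenceMatcher(
        None, [key for key, _ in a], [key for key, _ in b], autojunk=False
    )
    old_out, new_out = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_full = "".join(full for _, full in a[i1:i2])
        new_full = "".join(full for _, full in b[j1:j2])
        if tag == "equal":
            old_out.append(old_full)
            new_out.append(new_full)
        else:
            if old_full:
                old_out.append(f"[-{old_full}-]")
            if new_full:
                new_out.append(f"{{+{new_full}+}}")
    return "".join(old_out), "".join(new_out)


# The unchanged space after the changed word ends up inside the highlight.
print(highlight("This is a example", "This is an example"))
# ('This is [-a -]example', 'This is {+an +}example')

# A space added before the parenthesis is attached to "foo", whose key is
# unchanged, so nothing is highlighted at all.
print(highlight("foo(bar)", "foo (bar)"))
# ('foo(bar)', 'foo (bar)')
```

If the real class works along these lines, aligning its whitespace handling with wikidiff2 would mean changing how tokens are split and compared, not just how they are rendered.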

Thank you @Tacsipacsi, this is indeed weird!

@Lucas_Werkmeister_WMDE why would WordLevelDiff give results like that? This might not be intended. Is there something we could do about it?

Is there something we could do about it?

Well, we could spend some time digging into the WordLevelDiff class, understanding how it works, trying to improve its behavior without breaking anything, and end up with something that will definitely still behave differently from wikidiff2 in some details. I don’t think that would be worth it. The current WordLevelDiff is considered good enough for wikitext diffs on a standard MediaWiki install, after all (since the wikidiff2 extension isn’t usually installed, I assume).

I can confirm this behavior of WordLevelDiff on my local wiki as well. I never noticed!
Digging into the class would undoubtedly be beyond the reasonable scope for this.

Time to celebrate that we will soon have word-level diffs on Wikidata!! \o/

Sorry, I somehow missed that this ticket had been closed. The acceptance criterion was not met: the criterion mentions “textual fields”, not “terms and sitelinks”; textual statements are still missing.

Thank you @Tacsipacsi! Could you specify the cases where diffs for unstructured-string-valued statements would be particularly useful (you mentioned Commons categories before)? Based on your input I will then discuss with @Lucas_Werkmeister_WMDE to better understand what complexity this would add and how to best operationalize it in our process.

I don’t have concrete examples now, but categories that come to my mind where it’s common to have longer texts:

  • Monolingual texts (work titles, quotations, addresses, property usage instructions etc.)
  • Source text (scores, formulæ, regexes etc.)

But it can be particularly useful for Commons category names, file names, author names etc. as well if only a tiny typo was fixed in a longer name.


I realized that the “textual fields” criterion is not very specific; I think the specific criterion could be terms, sitelinks and snaks with value types string and monolingualtext (of which terms and sitelinks are already done). Looking at the list of data types, all string- and monolingualtext-valued data types could more or less benefit from word-level diffs (likely formulæ more and external identifiers less, but it may happen that external IDs are multi-word, and it’s probably easier anyway to implement it for all string-valued data types than only for some of them).

Manuel renamed this task from Show word-level diff on textual fields in Wikibase diffs to Show word-level diff for terms and sitelinks in Wikibase diffs. (Dec 13 2022, 11:36 AM)
Manuel updated the task description.

I have now clarified the scope of this task and split your request out into T325054: Show word-level diff for snaks with value types string and monolingual text in Wikibase diffs. Let's continue the discussions there. Thanks again!