
Consider reordering of words in the calculations for the machine translation limit system
Closed, Resolved | Public | 4 Estimated Story Points

Description

The current limit system in Content Translation is intended to encourage users to edit the initial machine translation. However, modifications such as correcting the word order are currently not counted as changes. This ticket proposes counting them as user modifications too.

For example, if the original translation is "The fast is rabbit", changing it to "The rabbit is fast" should count as a modification of the original content.

Event Timeline

Pginer-WMF triaged this task as Medium priority.
Pginer-WMF added a subscriber: santhosh.

The current calculation:

Checks whether each token in bigSet (the larger token array) exists anywhere in smallSet (the smaller token array), regardless of its position, so the order of tokens doesn't matter. For example, "hello world" and "world hello" are reported as 100% unchanged because both tokens ("hello" and "world") exist in both strings, even though their positions differ.
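As a rough illustration of the check described above, here is a minimal Python sketch. Whitespace tokenization and the function name are assumptions made for this example; the actual Content Translation code may tokenize and normalize text differently.

```python
def unmodified_ratio_unordered(original: str, edited: str) -> float:
    """Share of tokens in the larger token list found anywhere in the smaller one.

    Hypothetical helper for illustration; tokenization by whitespace is an assumption.
    """
    a, b = original.split(), edited.split()
    # "bigSet" is the longer token array, "smallSet" the shorter one.
    big, small = (a, b) if len(a) >= len(b) else (b, a)
    if not big:
        return 1.0
    small_set = set(small)
    # Position is ignored: a token counts as unchanged if it exists at all.
    return sum(1 for tok in big if tok in small_set) / len(big)


print(unmodified_ratio_unordered("hello world", "world hello"))  # 1.0
print(unmodified_ratio_unordered("The fast is rabbit", "The rabbit is fast"))  # 1.0
```

Both calls report 100% unchanged, which is the blind spot this task is about: pure reordering is invisible to a membership-only check.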


The calculation that considers order (basic version only):
DEMO: https://codepen.io/Huei-Tan/pen/YPPPzeb

  • A B -> B A, this is 100% modified content
  • A B -> A B C, this is 33% modified content
  • A B -> C A B, this is 100% modified
  • A B C D -> A B X C D, this is 60% modified content
  • A B C D E -> A X C X E, this is 40% modified content
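One reading that reproduces the five numbers listed above is a position-by-position comparison, where a token counts as unmodified only if the same token sits at the same index in both arrays. This is an assumption about the demo rather than something taken from its source; the function name and whitespace tokenization are hypothetical.

```python
def modified_ratio_positional(original: str, edited: str) -> float:
    """Fraction of positions (out of the longer token list) whose tokens differ.

    Assumed interpretation of the "basic" order-aware calculation.
    """
    a, b = original.split(), edited.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    # zip() stops at the shorter list, so extra trailing tokens count as modified.
    same = sum(1 for x, y in zip(a, b) if x == y)
    return (longest - same) / longest


examples = [
    ("A B", "B A"),
    ("A B", "A B C"),
    ("A B", "C A B"),
    ("A B C D", "A B X C D"),
    ("A B C D E", "A X C X E"),
]
for before, after in examples:
    print(f"{before} -> {after}: {modified_ratio_positional(before, after):.0%} modified")
```

Under this reading, inserting one token shifts every later token to a new index, which is why "A B -> C A B" comes out as 100% modified.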

Consider order with some meaningful content:
DEMO: https://codepen.io/Huei-Tan/pen/GggNYMz

  • This is a cute cat -> This is a very cute cat
    • only the "very" was added, so is this
      • 3/6 modified (because "very cute cat" are not in order); or
      • 1/6 modified (consider "cute cat" still in same order)?

If we use only the very basic order calculation, a user can bypass the unmodified-content limit just by inserting a small piece of text, after which every following token counts as out of order.

fyi @Pginer-WMF

abi_ set the point value for this task to 4. Apr 23 2025, 2:53 AM
abi_ subscribed.

I've set this to 4 SP. Please update as required

  • This is a cute cat -> This is a very cute cat
    • only the "very" was added, so is this
      • 3/6 modified (because "very cute cat" are not in order); or
      • 1/6 modified (consider "cute cat" still in same order)?

I'd consider it as 1/6 modified.
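The 1/6 reading matches what a token-level longest common subsequence (LCS) would give: tokens that keep their relative order count as unmodified, and everything else counts as modified. A minimal sketch, assuming whitespace tokenization and dividing by the longer token list; both function names are hypothetical and this is not necessarily how the demo is written.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def modified_ratio_lcs(original: str, edited: str) -> float:
    """Fraction of the longer token list that is not part of the LCS."""
    a, b = original.split(), edited.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return (longest - lcs_length(a, b)) / longest


print(modified_ratio_lcs("This is a cute cat", "This is a very cute cat"))  # 1/6
print(modified_ratio_lcs("The fast is rabbit", "The rabbit is fast"))  # 0.5
```

With this approach a single insertion costs one token instead of shifting everything after it, while a reordering such as the rabbit example is still detected as a change.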

Intuitively, I think of the percentage of modification as if we take a diff and blur the result into a progress bar:

Artboard.png (209×364 px, 10 KB)

For reordering, diff processing seems to be able to detect the word that has been moved (I used this tool):

www.diffchecker.com_text-compare_.png (1×800 px, 86 KB)

I don't have the technical details on how the usual diff algorithms work, or how they differ from the ones used in Content Translation. @santhosh may have more details about these approaches and the benefits/limitations of each one.

I coded it using the diff checker as the reference.

Left is diff checker / Right is our tool demo (https://codepen.io/Huei-Tan/pen/GggNYMz)

4/6 unmodified:

image.png (1×2 px, 217 KB)

4/5 unmodified:

image.png (1×2 px, 198 KB)

I have been trying the demo tool with some examples comparing it with Content Translation and a diff tool as reference. Overall, I have the impression that the new algorithm works better than the current one. It is more sensitive to changes, thus it would account for more of the user modifications and avoid cases where the user has modified the translation and the tool prevents them from publishing. In any case, we may want to keep an eye on deletion rates if the change is applied.

Sharing the individual tests below (screenshots consist of the test tool for the new algorithm on the top left with the diff below it, and the current algorithm in Content Translation on the right):

Test 1: few edits

Unmodified contents: 93% (new) vs 97% (current). This level of modification would be allowed to be published with the new algorithm, while it would be prevented with the current one.

Screenshot 2025-04-24 at 17.10.56.png (2×5 px, 1 MB)

Test 2: reordering

Unmodified contents: 89% (new) vs 100% (current). Moving part of a sentence to a different position is not detected as a change by the current algorithm. The new algorithm considers that 11% of the contents have been modified.

Screenshot 2025-04-25 at 13.27.07.png (1×2 px, 713 KB)

Test 3: heavier reordering

Unmodified contents: 53% (new) vs 100% (current). Moving about half of the paragraph to a different position (top part to bottom) is not detected as a change by the current algorithm. The new algorithm considers that 47% of the contents have been modified.

Screenshot 2025-04-25 at 13.31.17.png (2×5 px, 1 MB)

Test 4: Medium-low editing intensity on longer paragraph

Unmodified contents: 87% (new) vs 91% (current).

Screenshot 2025-04-25 at 13.44.42.png (2×5 px, 1 MB)

Test 5: Non-latin script (target)

Unmodified contents: 87% (new) vs 96% (current). This level of modification would be allowed to be published with the new algorithm, while it would be prevented with the current one.

Screenshot 2025-04-25 at 14.27.15.png (2×5 px, 1 MB)

Test 6: Non-latin (both)

Unmodified contents: 93% (new) vs 97% (current). This level of modification would be allowed to be published with the new algorithm, while it would be prevented with the current one.

Screenshot 2025-04-25 at 14.42.20.png (2×5 px, 2 MB)

There are two algorithms to compare; we need to benchmark them on longer paragraphs and see the results:

  1. LCS https://codepen.io/Huei-Tan/pen/GggNYMz
  2. Levenshtein https://codepen.io/Huei-Tan/pen/XJJVqwp
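
For comparison, a token-level Levenshtein distance can be sketched as follows. Whether the linked demo works on tokens or characters is not stated here, so token-level operation, whitespace tokenization, normalizing by the longer token list, and the function names are all assumptions for illustration.

```python
def levenshtein(a: list, b: list) -> int:
    """Minimum number of token insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def modified_ratio_levenshtein(original: str, edited: str) -> float:
    """Edit distance divided by the length of the longer token list."""
    a, b = original.split(), edited.split()
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(modified_ratio_levenshtein("This is a cute cat", "This is a very cute cat"))  # 1/6
print(modified_ratio_levenshtein("A B", "B A"))  # 1.0
```

Under these assumptions, a single insertion is scored the same way an LCS-based ratio would score it (1/6), but a two-token swap is scored as fully modified, because each misplaced token is charged as a substitution. Both algorithms fill a table of size len(a) × len(b), so the cost on long paragraphs is worth measuring before choosing one.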
hueitan changed the task status from Open to In Progress. May 1 2025, 8:47 AM

Change #1142493 had a related patch set uploaded (by Huei Tan; author: Huei Tan):

[mediawiki/extensions/ContentTranslation@master] WIP: CX: Update the calculation of the machine translation limit system

https://gerrit.wikimedia.org/r/1142493

Change #1142493 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] CX: Update the calculation of the machine translation limit system

https://gerrit.wikimedia.org/r/1142493

Change #1152798 had a related patch set uploaded (by Sbisson; author: Sbisson):

[mediawiki/extensions/ContentTranslation@master] CX3 Build 1.0.0+20250602

https://gerrit.wikimedia.org/r/1152798

Change #1152798 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] CX3 Build 1.0.0+20250602

https://gerrit.wikimedia.org/r/1152798