Reevaluate algorithm that measures the percentage of unmodified contents for languages without spaces
Open, MediumPublic

Description

Content translation defines a set of limits that encourage users to review the initial automatic translation, and can even prevent publishing, when the user has not edited the contents enough.

As part of the efforts to adjust this system (T251887), we want to review how it works for Chinese and other languages that don't use spaces to separate words. The case of Chinese is particularly relevant since the limits were made stricter at the community's request (T246383).

In particular, this ticket focuses on the way the percentage of modifications made by the user is calculated. For example, we want to verify that after changing one third of the contents of a Chinese text, the algorithm correctly reports 33% of modifications.

Examples will be documented in this ticket so that we can share with the community and get input from native speakers.

In addition to Chinese, other languages that don't use spaces will be tested. Identifying them and sharing a list in this ticket is part of the work for this task.

Event Timeline

Pginer-WMF triaged this task as Medium priority. May 5 2020, 1:38 PM
Pginer-WMF created this task.

@Pginer-WMF We may need to revert T246383, as the current MT limit system does not perform Chinese word segmentation before assigning the score, which prevents all content created with CX from being published. Instead, it may be best to set ContentTranslationPublishRequirements to autoconfirmed users on Zh wiki. Do you want me to create a new ticket for this change?

If it is causing problems, let's revert T246383 until the current ticket is resolved. I created a ticket for it: T252371: Revert limit adjustment for Chinese translations with Content translation.
I would not recommend adding limits based on user permissions or number of edits, since that evaluates the user rather than the content. In the past that has been problematic (e.g., preventing users who are experienced in one wiki from publishing good content in another wiki because of their low activity there).

@Pginer-WMF Good idea. I agree with you about limits based on user permissions, but I am worried about the consequences of not fulfilling the community's request. It will be hard to explain this revert to the community, since they wanted non-autoconfirmed users to be unable to use CX at all.

I am afraid that evaluating the user rather than the content is unavoidable, unless CX usage is completely disabled for all people in zhwiki. That's already the minimal impact on the community if the latter cannot be performed.

By adjusting the limits we are trying to find a balance where it is possible to publish as many valid contributions as possible while preventing the low-quality ones. Adjusting this is an iterative process: we make adjustments, hear the community's impressions, look at the data, and make more adjustments (also discovering other issues to fix along the way).
For this particular case, the next steps we want to follow are: make the limits 10% more strict (T252786), explore how to improve the algorithm that measures the user modifications (T251893), and consider further adjustments based on the results from both and the input from editors.

As a reference for future changes, the 2020 data shows that in the January-April period on Chinese Wikipedia the deletion ratio is around 5% both for articles created with Content translation and for those created from scratch.

The current logic in CX for the CJK group of languages (including Chinese) is as follows. The tokens are characters instead of words, so 人口 has 2 tokens, and changing one of the two characters in a text leaves 50% unmodified:

> mw.cx.TranslationTracker.static.calculateUnmodifiedContent("人七", "人久", 'zh')
0.5
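
To make the "characters as tokens" idea concrete, here is a minimal sketch. `tokeniseByCharacter` is a hypothetical helper written for this illustration, not a function taken from CX, and it assumes that one token means one Unicode code point.

```javascript
// Hypothetical helper (not part of CX): one token per Unicode code point,
// matching the description above that CJK tokens are characters.
function tokeniseByCharacter( text ) {
	// Array.from iterates a string by code point, so characters outside
	// the Basic Multilingual Plane are not split into surrogate halves.
	return Array.from( text );
}

console.log( tokeniseByCharacter( '人口' ).length ); // 2
```

A word-based tokeniser would instead need a segmentation step for Chinese, since there are no spaces to split on.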

I was trying some additional examples and found inconsistent behaviour between the edits below and the percentage of unmodified contents detected:

  • 人七 → 人七久久久久久 (29% unmodified). Makes sense since 2 out of a total of 7 characters remain the same.
  • 人七 → 人七人人人人人 (100% unmodified). Does not make sense since the content is different after the edit.

In the JavaScript console:

> mw.cx.TranslationTracker.static.calculateUnmodifiedContent("人七", "人七久久久久久", 'zh')
0.2857142857142857

> mw.cx.TranslationTracker.static.calculateUnmodifiedContent("人七", "人七人人人人人", 'zh')
1
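
One computation that would produce all three results above is a membership test that ignores how often a character occurs. The sketch below is a guess at that behaviour, not the actual CX code. The function names are hypothetical, and `unmodifiedByCount` is only one possible alternative, which lets each character of the shorter text match at most once.

```javascript
// Hypothetical reconstruction, NOT the actual CX implementation.
// Returns [ longer, shorter ] token lists, one token per code point.
function orderedTokens( a, b ) {
	const ta = Array.from( a ), tb = Array.from( b );
	return tb.length > ta.length ? [ tb, ta ] : [ ta, tb ];
}

// Assumed current behaviour: a token of the longer text counts as
// unmodified if it appears anywhere in the shorter text.
function unmodifiedByMembership( source, edited ) {
	const [ big, small ] = orderedTokens( source, edited );
	if ( big.length === 0 ) {
		return 1; // two empty texts: nothing was modified
	}
	const seen = new Set( small );
	return big.filter( ( t ) => seen.has( t ) ).length / big.length;
}

// Alternative sketch: each occurrence in the shorter text can be
// matched only once (multiset intersection).
function unmodifiedByCount( source, edited ) {
	const [ big, small ] = orderedTokens( source, edited );
	if ( big.length === 0 ) {
		return 1;
	}
	const remaining = new Map();
	for ( const t of small ) {
		remaining.set( t, ( remaining.get( t ) || 0 ) + 1 );
	}
	let matched = 0;
	for ( const t of big ) {
		const left = remaining.get( t ) || 0;
		if ( left > 0 ) {
			remaining.set( t, left - 1 );
			matched++;
		}
	}
	return matched / big.length;
}

console.log( unmodifiedByMembership( '人七', '人七人人人人人' ) ); // 1
console.log( unmodifiedByCount( '人七', '人七人人人人人' ) ); // 2/7
```

With counting, both edits above give 2 out of 7. The counting variant still ignores character order, so it would treat a reordered text as unmodified.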