Reevaluate algorithm that measures the percentage of unmodified contents for languages without spaces
Open, Medium, Public

Assigned To

None

Authored By

	Pginer-WMF
	May 5 2020, 1:38 PM

Description

Content translation defines a set of limits that encourage users to review the initial automatic translation, and that can even prevent publishing when the user has not edited the contents enough.

As part of the efforts to adjust this system (T251887), we want to review how it works for Chinese and other languages that don't use spaces to separate words. The case of Chinese is particularly relevant since the limits were made stricter at the community's request (T246383).

In particular, this ticket focuses on how the percentage of modifications made by the user is calculated. For example, we want to verify that after changing one third of the contents of a Chinese text, the algorithm correctly reports about 33% of the content as modified.

Examples will be documented in this ticket so that we can share them with the community and get input from native speakers.

In addition to Chinese, other languages that don't use spaces will be tested. Identifying those languages and sharing a list in the ticket will be part of the work for this task.

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T251887 Review the MT limits system and how it is presented to users
		Open		None	T251893 Reevaluate algorithm that measures the percentage of unmodified contents for languages without spaces

Event Timeline

Pginer-WMF triaged this task as Medium priority. May 5 2020, 1:38 PM

Pginer-WMF created this task.

Pginer-WMF mentioned this in T246383: Adjust the threshold for Chinese Wikipedia to prevent publishing when overall unmodified content is higher than 70%. May 5 2020, 1:41 PM

Pginer-WMF mentioned this in T251887: Review the MT limits system and how it is presented to users. May 6 2020, 7:29 AM

Pginer-WMF moved this task from Needs Triage to MT on the ContentTranslation board. May 8 2020, 5:42 PM

@Pginer-WMF We may need to revert T246383, as the current MT limit system does not perform Chinese word segmentation before assigning the score, which prevents all content created with CX from being published. Instead, it may be best to set ContentTranslationPublishRequirements to autoconfirmed users on Zh wiki. Do you want me to create a new ticket for this change?

Pginer-WMF mentioned this in T252371: Revert limit adjustment for Chinese translations with Content translation. May 11 2020, 7:59 AM

In T251893#6123337, @VulpesVulpes825 wrote:

@Pginer-WMF We may need to revert T246383, as the current MT limit system does not perform Chinese word segmentation before assigning the score, which prevents all content created with CX from being published. Instead, it may be best to set ContentTranslationPublishRequirements to autoconfirmed users on Zh wiki. Do you want me to create a new ticket for this change?

If it is causing problems, let's revert T246383 until the current ticket is resolved. I created a ticket for it: T252371: Revert limit adjustment for Chinese translations with Content translation.
I would not recommend adding limits based on user permissions or number of edits, since that evaluates the user rather than the content. In the past that has been problematic (e.g., preventing users who are experienced in one wiki from publishing good content in another wiki because of their low activity there).

@Pginer-WMF Good idea. I agree with you about limits based on user permissions, but I am worried about the consequences of not fulfilling the community's request. It will be hard to communicate this revert to the community, since they wanted non-autoconfirmed users to be unable to use CX at all.

Xiplus subscribed. May 11 2020, 9:11 AM

SCP-2000 subscribed. May 11 2020, 10:01 AM

I am afraid that evaluating the user rather than the content is unavoidable, unless CX usage is completely disabled for all people in zhwiki. That's already the minimal impact on the community if the latter cannot be performed.

Pginer-WMF mentioned this in T252786: Make the threshold for Chinese Wikipedia to prevent publishing 5% more strict. May 14 2020, 3:26 PM

In T251893#6136411, @Sanmosa wrote:

I am afraid that evaluating the user rather than the content is unavoidable, unless CX usage is completely disabled for all people in zhwiki. That's already the minimal impact on the community if the latter cannot be performed.

By adjusting the limits we are trying to find a balance where it is possible to publish as many valid contributions as possible while preventing the low-quality ones. Adjusting this is an iterative process: we make adjustments, hear the community's impressions, look at the data, and make more adjustments (also discovering other issues to fix along the way).
For this particular case, the next steps we want to follow are: make the limits 10% more strict (T252786), explore how to improve the algorithm that measures the user modifications (T251893), and consider further adjustments based on the results from both and the input from editors.

Looking at the 2020 data for a reference for future changes, in the January-April period in Chinese Wikipedia the deletion ratio for articles created with content translation and those created from scratch is around 5%.

Pginer-WMF edited projects, added Language-Team (Language-2020-July-September); removed Language-Team (Language-2020-Focus-Sprint). Aug 3 2020, 9:29 AM

Pginer-WMF mentioned this in T305049: Automatic evaluation of the limit system algorithm. Mar 30 2022, 3:00 PM

Pginer-WMF edited projects, added Language-Team (Language-2023-October-December); removed Language-Team (Language-2020-July-September). Nov 30 2023, 8:45 AM

The current logic in CX for the CJK group of languages (including Chinese) is as follows. The tokens are characters instead of words, so 人口 has 2 tokens.

and

> mw.cx.TranslationTracker.static.calculateUnmodifiedContent("人七", "人久", 'zh')
0.5
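
To illustrate why character tokens matter here, the following is a minimal sketch and not the CX source. `tokenise` and its `useCharacters` flag are hypothetical names introduced for this example. It contrasts whitespace splitting, which sees a Chinese phrase as a single token, with per-character splitting.

```javascript
// Hypothetical helper for illustration only; this is not the CX implementation.
function tokenise(text, useCharacters) {
	if (useCharacters) {
		// Array.from iterates by Unicode code point, so each CJK character
		// becomes its own token. Whitespace characters are dropped.
		return Array.from(text).filter((ch) => ch.trim() !== '');
	}
	// Whitespace-based splitting, as would suit languages that separate words with spaces.
	return text.split(/\s+/).filter((token) => token !== '');
}

console.log(tokenise('人口', true).length); // 2
console.log(tokenise('人口', false).length); // 1
```

With whitespace splitting, any edit inside a sentence without spaces would change its only token, so per-character tokens give a finer-grained measure.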

In T251893#9369982, @santhosh wrote:
The current logic in CX for the CJK group of languages (including Chinese) is as follows. The tokens are characters instead of words, so 人口 has 2 tokens.

and
> mw.cx.TranslationTracker.static.calculateUnmodifiedContent("人七", "人久", 'zh')
0.5

I was trying some additional examples and found inconsistent behaviour between the edits below and the percentage of unmodified contents detected:

人七 → 人七久久久久久 (29% unmodified). Makes sense since 2 out of a total of 7 characters remain the same.
人七 → 人七人人人人人 (100% unmodified). Does not make sense since the content is different after the edit.

In the JavaScript console:

> mw.cx.TranslationTracker.static.calculateUnmodifiedContent("人七", "人七久久久久久", 'zh')
0.2857142857142857

> mw.cx.TranslationTracker.static.calculateUnmodifiedContent("人七", "人七人人人人人", 'zh')
1
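
One reading that is consistent with the three console results in this thread is that a token counts as unmodified whenever it appears anywhere in the other text, regardless of how many times. The sketch below is an assumption made for discussion, not the CX source, and `unmodifiedByMembership` and `unmodifiedByCount` are hypothetical names. The second function is a possible count-aware alternative in which each token of the shorter text can be matched at most once.

```javascript
// Hypothetical helpers for discussion; these are not taken from the CX code base.
const chars = (text) => Array.from(text);

// Orders the two token lists so the longer one comes first.
function orderBySize(a, b) {
	const first = chars(a);
	const second = chars(b);
	return second.length > first.length ? [second, first] : [first, second];
}

// Membership-based: a token of the longer text counts as unmodified
// if it occurs anywhere in the shorter text, however many times it repeats.
function unmodifiedByMembership(a, b) {
	if (a === b) {
		return 1;
	}
	const [big, small] = orderBySize(a, b);
	const kept = big.filter((token) => small.includes(token));
	return kept.length / big.length;
}

// Count-aware: each occurrence in the shorter text can match only one
// occurrence in the longer text, so repeated characters are not all "kept".
function unmodifiedByCount(a, b) {
	if (a === b) {
		return 1;
	}
	const [big, small] = orderBySize(a, b);
	const remaining = new Map();
	for (const token of small) {
		remaining.set(token, (remaining.get(token) || 0) + 1);
	}
	let kept = 0;
	for (const token of big) {
		const left = remaining.get(token) || 0;
		if (left > 0) {
			kept += 1;
			remaining.set(token, left - 1);
		}
	}
	return kept / big.length;
}

console.log(unmodifiedByMembership('人七', '人七人人人人人')); // 1
console.log(unmodifiedByCount('人七', '人七人人人人人')); // 0.2857142857142857
```

Under these assumptions the membership version returns 0.5, 0.2857142857142857 and 1 for the three examples in this thread, while the count-aware version returns 0.2857142857142857 for both of the longer edits. Neither version takes token order into account.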
