
Adjust the system of limits to make it more predictable
Closed, Declined (Public)

Description

In Content Translation, a system of limits encourages users to review the initial translations.

The current system makes a decision (prevent publishing, warn or add to a tracking category) based on several factors including the total percentage of unmodified contents in the translation, the number of problematic paragraphs, whether those were marked as reviewed and whether the user had translations deleted in the previous month.

Making the decisions based on the number of paragraphs seems to introduce some problems. For example, longer articles are more likely to include content that can generate false positives such as math formulas (T245827).

This ticket proposes to simplify the system of limits so that the decision is based on the overall percentage of unmodified content, with the other factors making that global limit more or less strict.

Proposed approach

Prevent publishing

Translations can be published only when the percentage of unmodified content for the whole translation is not higher than a given limit.

This ticket proposes to calculate that limit in the following way:

  • Use the default global limit as a starting point.
  • If the user has previously deleted translations, make the limit more strict.
  • If there are problematic paragraphs, make the limit more strict.
    • If some of the problematic paragraphs have been marked as resolved, make the limit less strict. This adjustment should be proportional to the number of paragraphs marked as resolved, but even when all problematic paragraphs are marked as resolved, the limit should remain stricter than the default.

This is defined more precisely in the formula below:

limit = i - has_deleted * (i/3) - has_problematic * (i/6) + has_problematic * (r/p) * (i/12)

Where:
  • Initial limit (i). This will be the default limit set for all wikis (99%) or a specific limit set for the current wiki (e.g., 70% for Assamese T245509).
  • Has deleted translations (has_deleted). A boolean indicator representing whether any translation created by the user in the last month has been deleted (has_deleted = 1) or not (has_deleted = 0).
  • Has problematic paragraphs (has_problematic). A boolean indicator representing whether the translation has 2 or more problematic paragraphs (has_problematic = 1) or not (has_problematic = 0). A paragraph is considered problematic when its percentage of unmodified content exceeds the initial limit (i) minus 5%.
  • Number of problematic paragraphs (p). The number of problematic paragraphs in the translation.
  • Number of problematic paragraphs marked as reviewed (r). Problematic paragraphs that the user marked as reviewed.
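The formula and the variable definitions above can be sketched in Python as follows. This is only an illustration, not the Content Translation implementation: the function name `publishing_limit` and its parameter names are hypothetical, and the guard against `r` exceeding `p` is an added assumption.

```python
def publishing_limit(initial_limit, has_deleted, problematic, reviewed):
    """Return the maximum allowed percentage of unmodified content.

    initial_limit -- i, the default or per-wiki limit in percent (e.g. 99 or 70)
    has_deleted   -- True if a translation by the user was deleted in the last month
    problematic   -- p, the number of problematic paragraphs
    reviewed      -- r, the number of problematic paragraphs marked as reviewed
    """
    # Assumption: r can never exceed p; the ticket does not state this explicitly.
    if not 0 <= reviewed <= problematic:
        raise ValueError("reviewed must be between 0 and problematic")

    limit = float(initial_limit)

    # Penalty of i/3 for recently deleted translations.
    if has_deleted:
        limit -= initial_limit / 3

    # has_problematic is 1 only when there are 2 or more problematic paragraphs.
    # This check also avoids dividing by zero when p is 0.
    if problematic >= 2:
        limit -= initial_limit / 6
        # Partial compensation, proportional to the share of reviewed paragraphs.
        limit += (reviewed / problematic) * initial_limit / 12

    return limit
```

For instance, `publishing_limit(99, False, 5, 5)` evaluates to 90.75: the 1/6 reduction is applied and half of it is recovered because every problematic paragraph was marked as reviewed.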

Adding to a tracking category
To simplify how the limits are applied and how we measure translations that are likely to still contain too much unmodified content, we want to make the following changes:

  • Make the decision based on the total percentage of unmodified content. Once the limit has been determined, translations in the range [limit - 10%, limit] will be added to the tracking category. That is, if the computed limit for publishing is 70%, translations with 60-70% of unmodified content will be added to the tracking category.
  • Add an edit tag. The published page will include an "unreviewed-translation" edit tag. This will allow tracking and measuring the survival of those translations that were published close to the limit.
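The decision described above can be sketched as a small Python function. The function name `classify_translation` and the outcome labels are hypothetical; the sketch also assumes that a translation exactly at the limit can still be published, since publishing is only prevented when the percentage is higher than the limit.

```python
def classify_translation(unmodified_pct, limit, margin=10.0):
    """Classify a translation by its overall percentage of unmodified content.

    Returns one of three hypothetical labels:
      "blocked"   -- above the limit, publishing is prevented
      "tracked"   -- within [limit - margin, limit]; published, added to the
                     tracking category and tagged "unreviewed-translation"
      "published" -- below the tracking range, published without tracking
    """
    if unmodified_pct > limit:
        return "blocked"
    if unmodified_pct >= limit - margin:
        return "tracked"
    return "published"
```

With a computed limit of 70%, a translation with 65% of unmodified content would be classified as "tracked", while one with 50% would be classified as "published".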

Examples of application

For a translation to Catalan with 1 paragraph of unedited MT, a limit of 99% will be applied. This is the default limit for all wikis: Catalan Wikipedia has not defined a specific limit, and the other modifications do not apply.

For a translation to Catalan with 5 paragraphs of unedited MT, a limit of 82.5% will be applied. 99% is the default limit, but a reduction of 1/6 of the initial limit is applied when there are problematic paragraphs not marked as resolved.

For a translation to Catalan with 5 paragraphs of unedited MT where the user marked all of them as resolved, a limit of 90.75% will be applied. A reduction of 1/6 is applied for having problematic paragraphs, but it is partly compensated by a 1/12 increase for having all paragraphs marked as resolved.

For a translation to Indonesian (where the default limit is 70%) with 5 paragraphs of unedited MT where the user marked 3 of them as resolved, a limit of 61.8% will be applied. The Indonesian limit of 70% gets a 1/6 reduction and only 3/5 of the compensation, since 3 out of 5 paragraphs were marked as resolved.

For a translation to Indonesian with 5 paragraphs of unedited MT, all of them marked as resolved, where the user had a translation deleted in the last week, a limit of 40.8% will be applied, since an additional 1/3 penalty is applied because of the previously deleted translation.

These examples can be used as test cases to verify the approach was correctly implemented.
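As a cross-check, the arithmetic behind the five examples can be written out using the symbols from the formula (i, p, r). Values are percentages, and the two Indonesian results are rounded to two decimals:

```latex
\begin{align*}
\text{Catalan, } p = 1:\quad & 99 \\
\text{Catalan, } p = 5,\ r = 0:\quad & 99 - \tfrac{99}{6} = 99 - 16.5 = 82.5 \\
\text{Catalan, } p = 5,\ r = 5:\quad & 99 - \tfrac{99}{6} + \tfrac{5}{5}\cdot\tfrac{99}{12} = 99 - 16.5 + 8.25 = 90.75 \\
\text{Indonesian, } p = 5,\ r = 3:\quad & 70 - \tfrac{70}{6} + \tfrac{3}{5}\cdot\tfrac{70}{12} \approx 70 - 11.67 + 3.50 = 61.83 \\
\text{Indonesian, } p = 5,\ r = 5,\ \text{deleted}:\quad & 70 - \tfrac{70}{3} - \tfrac{70}{6} + \tfrac{5}{5}\cdot\tfrac{70}{12} \approx 70 - 23.33 - 11.67 + 5.83 = 40.83
\end{align*}
```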

Event Timeline

Pginer-WMF triaged this task as Medium priority. Feb 21 2020, 4:01 PM
Pginer-WMF moved this task from Needs Triage to Enhancements on the ContentTranslation board.

After reading the proposed calculation method, I would recommend removing "more predictable" from the title. This ticket proposes to make the translation limit a variable that is calculated at run time, with penalties for previous deletions by the translator and for problematic paragraphs in the course of the translation. So as a translator I may have the limit varying between 99% and roughly 40%.

The data required for calculating this at run time is available (or can be made available).

A few concerns I have:

  1. The UI does not tell translators what their current limit is. We show the unmodified percentage in "issue cards", but we do not show how much is acceptable. This becomes more problematic if the limit fluctuates. Even in the publish error message, we do not tell the user this limit.
  2. Previous deletions are already part of the limit, but they are used at the time of publishing. We need clarification on whether we are changing that logic in favor of the proposed one.
  3. A single deleted translation becomes a huge penalty for the translator. A limit of 99% gets reduced to 66% by a single deletion. In the real world, deletions can happen for various reasons. The number of deletions should not be taken as an absolute number but relative to the total number of published articles. If a translator published 100 articles in the last month and 1 was deleted, then for the rest of the month that translator has to stay within ~66% MT usage. Achieving a 66% limit is quite hard when the MT is reasonably good. Since we do not communicate this reasoning or our limit, the user will experience it as a puzzle: "How much more do I need to change? Until yesterday I was publishing articles without this much change."
  4. The proposed solution does not consider current implementation details such as "High MT sections", where we inspect MT misuse within sections. This approach was a mitigation to avoid spreading the "badness" uniformly across sections when only one section misuses MT. It was also introduced to account for the varying size of sections in an article. The proposed formula treats every section as equal.
  5. With i=70% and a single deleted translation, you get 47% as the limit. This would mean a translator has to change half of the initial MT content. I don't think this will help.

In general, the basic principle of the MT abuse limit was to prevent MT abuse. It had no connection with translation deletions. But here we are making a strong assumption that if an article is deleted after being published through translation, it is most likely due to MT misuse. Subsequently, we make the editor work hard to get the translation within MT usage limits that we don't communicate at all. I think we should not conflate MT quality and deletion ratio like this. A deletion can happen, say, a week later, after the translator has published 1000 articles.

There are MT engines with varying degrees of quality, so we may need to test this against best-case and worst-case initial MT.

From the implementation perspective, one of the biggest challenges we have with the MT system is the lack of real-world testing with all these scenarios combined. It is nearly impossible to automate these tests, and it is hard to obtain the various run-time variables needed to test the scenarios. As a developer, a system that I cannot validate and prove works as expected is a nightmare. If a translator complains that they are not able to publish a translation, can we debug and pinpoint what caused that behavior? It is high time we built some tooling to handle such situations, similar to the translation debugger. I don't want to say "we made it more predictable, but we don't know what happened in your case".

Thanks for the input @santhosh. Based on that, I think we need to discuss and refine the proposal a bit more.
I agree on several of the points you raised:

  • We need to better communicate the limits to users.
  • A single deleted translation should not be a huge penalty for the translator. Note that the proposal was to look only at the last month, to prevent the stricter limit from lasting forever. How much stricter the limit becomes and the time window can be adjusted (and probably should be, based on the examples you provided). In any case, I still think that we need to account for deletions in some way: when we get feedback from communities about the deletion of translations, quality and lack of editing are often surfaced, as well as spikes during contests. Not all deletions are about this, but I think it is useful to encourage users to review their content more if it was deleted recently. For context, deleted translations are 4% in general based on the previous year's data, but on specific wikis and periods we have seen deletions reach 96%, as was the case for Javanese.

There are a couple of other points that I don't completely agree with (or maybe I'm not understanding them well):

  • Regarding "not consider the current implementation details such as 'High MT sections'", I think this is one of the problematic parts of the current approach. In our current approach the overall limit for the document is 99%, and translations are only prevented from being published when they have 50 or more problematic paragraphs. This means that the following problematic cases can occur:
    • A user can translate and publish an article with 40 paragraphs even if all paragraphs have 95% of unmodified content.
    • A user will be blocked from publishing a very long article if it contains 50 items that are not expected to be translated (original text quote, book title, etc.) even if they are a small part of the article.
  • Regarding it being "nearly impossible to automate these tests", I was thinking that we could create two groups of samples: (a) articles that are expected to be allowed to publish, and (b) those expected to be prevented from being published. Those samples can have the same structure used in the parallel corpora, and be used to evaluate a given approach based on how many of the samples are properly classified. When communities report issues of low-quality content or good content prevented from publishing, those can be incorporated into the catalog of samples.
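The sample-based evaluation suggested above could look roughly like the sketch below. Everything in it is hypothetical: the sample structure (a dict with an `unmodified_pct` field), the rule `simple_rule`, and the sample values are made up for illustration and are not taken from the parallel corpora format.

```python
def classification_accuracy(decide, samples):
    """Return the share of samples that a rule classifies as expected.

    decide  -- function taking a translation and returning True if it
               would be allowed to publish
    samples -- iterable of (translation, expected_publishable) pairs
    """
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample is required")
    correct = sum(1 for translation, expected in samples
                  if decide(translation) == expected)
    return correct / len(samples)


def simple_rule(translation):
    """Hypothetical rule: allow publishing at or below 70% unmodified content."""
    return translation["unmodified_pct"] <= 70.0


# Made-up catalog: group (a) is expected to publish, group (b) is not.
SAMPLES = [
    ({"unmodified_pct": 35.0}, True),
    ({"unmodified_pct": 68.0}, True),
    ({"unmodified_pct": 92.0}, False),
    # Good content wrongly blocked by this rule, e.g. an article with many quotes.
    ({"unmodified_pct": 75.0}, True),
]

accuracy = classification_accuracy(simple_rule, SAMPLES)
```

Cases reported by communities could be appended to such a catalog, so that competing approaches can be compared on the same set of samples.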

Based on the feedback and learnings after analyzing some problematic cases of low quality spikes, I proposed a new approach in T347272: Simplify the system of limits to make it more predictable

I'll close the current ticket to avoid confusion.