In Content Translation, a system of limits encourages users to review the initial translations.
The current system makes a decision (prevent publishing, warn, or add to a tracking category) based on several factors, including the total percentage of unmodified content in the translation, the number of problematic paragraphs, whether those were marked as reviewed, and whether the user had translations deleted in the previous month.
Basing the decision on the number of paragraphs seems to introduce some problems. For example, longer articles are more likely to include content that generates false positives, such as math formulas (T245827).
This ticket proposes to simplify the system of limits so that the decision is based on the overall percentage of unmodified content, while the other factors make that global limit more or less strict.
Translations can be published only when the percentage of unmodified content for the whole translation is not higher than a given limit.
This ticket proposes to calculate this limit in the following way:
- Use the default global limit as a starting point.
- If the user has previously had translations deleted, make the limit more strict.
- If there are problematic paragraphs, make the limit more strict.
- If some of the problematic paragraphs have been marked as resolved, make the limit less strict. This adjustment should be proportional to the share of problematic paragraphs marked as resolved, but even when all of them are marked as resolved, the limit should still be stricter than the default.
This is defined more precisely in the formula below:
limit = i - has_deleted * (i/3) - has_problematic * (i/6) + has_problematic * ((r/p) * (i/12))
- Initial limit (i). This will be the default limit set for all wikis (99%) or a specific limit set for the current wiki (e.g., 70% for Assamese, T245509).
- Has deleted translations (has_deleted). A boolean indicator representing whether any translation created by the user has been deleted in the last month (has_deleted = 1) or not (has_deleted = 0).
- Has problematic paragraphs (has_problematic). A boolean indicator representing whether the translation has 2 or more problematic paragraphs (has_problematic = 1) or not (has_problematic = 0). A paragraph is considered problematic when its percentage of unmodified content exceeds the initial limit (i) minus 5%.
- Number of problematic paragraphs (p). The number of problematic paragraphs in the translation.
- Number of problematic paragraphs marked as reviewed (r). Problematic paragraphs that the user marked as reviewed.
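As a minimal sketch, the formula and its variables can be expressed in Python as follows. The function name `publishing_limit` and its parameter names are hypothetical and only illustrate the arithmetic; they are not taken from the Content Translation codebase. The check that r is not larger than p is an added assumption.

```python
def publishing_limit(initial_limit, has_deleted, problematic, reviewed):
    """Return the maximum allowed percentage of unmodified content.

    initial_limit: i, in percent (for example 99 or 70).
    has_deleted:   True if a translation by the user was deleted in the last month.
    problematic:   p, the number of problematic paragraphs.
    reviewed:      r, the number of problematic paragraphs marked as reviewed.
    """
    # Assumption: r can never exceed p.
    if not 0 <= reviewed <= problematic:
        raise ValueError("reviewed must be between 0 and problematic")

    i = initial_limit
    limit = i
    if has_deleted:
        limit -= i / 3
    # has_problematic is 1 only with 2 or more problematic paragraphs,
    # so the division by p below can never be a division by zero.
    if problematic >= 2:
        limit -= i / 6
        limit += (reviewed / problematic) * i / 12
    return limit
```

For instance, `publishing_limit(99, False, 5, 0)` returns 82.5: the initial limit of 99 minus one sixth of it.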
Adding to a tracking category
To simplify identifying and measuring translations that are likely to still contain too much unmodified content, we want to make the following changes:
- Make the decision based on the total percentage of unmodified content. Once the limit has been determined, translations in the range [limit - 10%, limit] will be added to the tracking category. That is, if the computed limit for publishing is 70%, translations with 60-70% unmodified content will be added to the tracking category.
- Add an edit tag. The published page will include an "unreviewed-translation" edit tag. This will allow tracking and measuring the survival of those translations that were published closer to the limit.
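The range check described above could look roughly like the Python sketch below. The function name and the returned labels ("prevent", "publish_and_track", "publish") are hypothetical. The 10% margin is read as 10 percentage points, which matches the 60-70% example, and both ends of the range are assumed to be inclusive.

```python
TRACKING_MARGIN = 10  # percentage points below the limit (assumed reading of "10%")


def publishing_decision(unmodified_percent, limit, margin=TRACKING_MARGIN):
    """Classify a translation by its overall percentage of unmodified content."""
    if unmodified_percent > limit:
        # Above the limit: publishing is not allowed.
        return "prevent"
    if unmodified_percent >= limit - margin:
        # Within [limit - margin, limit]: publishable, but tracked.
        return "publish_and_track"
    return "publish"
```

With a computed limit of 70, a translation with 65% unmodified content is classified as "publish_and_track", one with 71% as "prevent", and one with 50% as "publish".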
Examples of application
For a translation to Catalan with 1 paragraph of unedited MT, a limit of 99% will be applied: it is the default limit for all wikis, Catalan Wikipedia has not defined a specific limit, and none of the other adjustments apply (a single problematic paragraph does not trigger the reduction).
For a translation to Catalan with 5 paragraphs of unedited MT, a limit of 82.5% will be applied. 99% is the default limit, but a reduction of 1/6 of the initial limit (16.5) is applied because there are problematic paragraphs and none of them are marked as resolved.
For a translation to Catalan with 5 paragraphs of unedited MT where the user marked all of them as resolved, a limit of 90.75% will be applied: the 1/6 reduction for having problematic paragraphs (16.5) is partly compensated by a 1/12 increase (8.25) for having all of them marked as resolved.
For a translation to Indonesian (where the initial limit is 70%) with 5 paragraphs of unedited MT where the user marked 3 of them as resolved, a limit of 61.8% will be applied: the Indonesian limit of 70% gets a 1/6 reduction (about 11.7) and only 3/5 of the 1/12 compensation (3.5), since 3 out of 5 paragraphs were marked as resolved.
For a translation to Indonesian with 5 paragraphs of unedited MT, all of them marked as resolved, by a user who had a translation deleted in the last week, a limit of 40.8% will be applied, since an additional 1/3 penalty (about 23.3) is applied because of the previously deleted translation.
These examples can be used as test cases to verify the approach was correctly implemented.
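One possible way to turn the examples into automated checks is sketched below in Python. A compact restatement of the formula is included so the snippet runs on its own. The names `limit`, `CASES` and `check_cases` are hypothetical, each "paragraph of unedited MT" is assumed to count as a problematic paragraph, and the tolerance of 0.05 is an assumption that accounts for the examples being rounded to one decimal.

```python
def limit(i, has_deleted, p, r):
    # Compact restatement of the formula; has_problematic requires p >= 2.
    has_problematic = 1 if p >= 2 else 0
    ratio = r / p if has_problematic else 0
    return (i
            - has_deleted * i / 3
            - has_problematic * i / 6
            + has_problematic * ratio * i / 12)


# (description, i, has_deleted, p, r, expected limit from the examples)
CASES = [
    ("Catalan, 1 paragraph", 99, 0, 1, 0, 99.0),
    ("Catalan, 5 paragraphs, none resolved", 99, 0, 5, 0, 82.5),
    ("Catalan, 5 paragraphs, all resolved", 99, 0, 5, 5, 90.75),
    ("Indonesian, 5 paragraphs, 3 resolved", 70, 0, 5, 3, 61.8),
    ("Indonesian, 5 resolved, prior deletion", 70, 1, 5, 5, 40.8),
]


def check_cases(cases=CASES, tolerance=0.05):
    """Return the descriptions of the cases whose computed limit is off."""
    return [name for name, i, d, p, r, expected in cases
            if abs(limit(i, d, p, r) - expected) > tolerance]
```

An empty list from `check_cases()` means every example matches the formula within the tolerance.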