CX2: Tracking categories have frequent false positives
Closed, Resolved · Public

Description

Translations published with paragraphs containing too much unmodified content (80% of machine translation or 60% of source content) are added to a tracking category for the community to review (T211763, T190798).
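
The rule above can be sketched roughly as follows. This is an illustrative Python sketch, not the ContentTranslation code: the names `THRESHOLDS`, `Paragraph` and `needs_tracking_category` are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
from dataclasses import dataclass

# Thresholds quoted in the task description; the dictionary itself is hypothetical.
THRESHOLDS = {"mt": 0.80, "source": 0.60}


@dataclass
class Paragraph:
    origin: str              # "mt" or "source" (copied original content)
    unmodified_ratio: float  # share of the initial content left unchanged, 0..1


def needs_tracking_category(paragraphs):
    """True if any paragraph reaches its threshold (inclusive comparison is assumed)."""
    return any(p.unmodified_ratio >= THRESHOLDS[p.origin] for p in paragraphs)
```

Under these assumptions, a single paragraph at or above its threshold is enough to put the whole translation in the category, which is why one miscounted paragraph produces a false positive.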

Tracking categories for Italian and German Wikipedias include several articles. Since there is no machine translation for these languages, one would expect these articles to be published with some content in the source language. However, the initial versions of those articles don't seem to have visible unmodified content (example).

We need to investigate why these false positives are happening (is user-typed content counted as unmodified? are some wikitext elements counted as unmodified?) and fix the issue to make sure that the content in those tracking categories is relevant.

Details

Related Gerrit Patches:
mediawiki/extensions/ContentTranslation : master : Exclude references list from MT abuse checking
mediawiki/extensions/ContentTranslation : master : Use source section as unmodified MT while restoring sections
mediawiki/extensions/ContentTranslation : master : Skip issues marked as resolved when searching for MT abuse issues

Event Timeline

Restricted Application added a subscriber: Aklapper. · Feb 8 2019, 9:18 AM
Pginer-WMF triaged this task as High priority. · Feb 8 2019, 9:19 AM
Pginer-WMF moved this task from Needs Triage to CX2 on the ContentTranslation board.
Elitre added a subscriber: Elitre. · Feb 8 2019, 11:28 AM

The tracking category on French Wikipedia also seems to accumulate articles that were apparently published without issues.

We need to investigate why these false positives are happening (is user typed content counted as unmodified?

At some point, content typed or copied and pasted by a user was not recognized as modified content. That issue was addressed and publishing succeeded in such cases, but we did not check whether the category "Pages with unreviewed machine translation" was applied or not.

Change 490533 had a related patch set uploaded (by Petar.petkovic; owner: Petar.petkovic):
[mediawiki/extensions/ContentTranslation@master] Use source section as unmodified MT while restoring sections

https://gerrit.wikimedia.org/r/490533

Since there is no machine translation for these languages, "Copy original content" is used as the default option.
Determining how much content has changed works the same way for "Copy original content" as for any other MT option, but the problem lies in how we store unmodified MT in parallel corpora and later restore drafts. When Apertium is used, adding a paragraph creates two parallel corpora entries: one for the unmodified Apertium translation and one for the source paragraph. When the user makes any change, a third entry is created for the user's translation. The same applies to other MT engines. When "Start with an empty paragraph" is used, three records are saved as well: source, user, and scratch (unmodified MT).
However, when "Copy original content" is used, only the source paragraph and the user translation are saved in parallel corpora. This is not a problem in the initial translation session, because content adapted from the source is treated as unmodified MT during that session. But when the translation is restored, the paragraph saved as the user translation is treated as unmodified MT. The restored section is then compared against itself, so it is always counted as 100% MT.
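
A simplified sketch of the restore problem described above, in Python for illustration only. The function names, the `"copy-original"` option string and the dictionary layout are hypothetical and do not reflect the real parallel corpora schema. The source fallback mirrors the title of change 490533, and how that patch actually implements it is an assumption.

```python
def stored_entries(option, source, initial, user_text):
    """Entries saved for one paragraph, keyed by origin (simplified)."""
    entries = {"source": source, "user": user_text}
    if option != "copy-original":
        # MT engines and "Start with an empty paragraph" also store the
        # unmodified starting text.
        entries["mt"] = initial
    return entries


def baseline_on_restore(entries, use_source_fallback):
    """Pick the text a restored section is compared against."""
    if "mt" in entries:
        return entries["mt"]
    # Without the fallback, the user's own text is taken as unmodified MT,
    # so the comparison always reports 100% unmodified content.
    return entries["source"] if use_source_fallback else entries["user"]
```

For a paragraph created with "Copy original content", `baseline_on_restore(entries, False)` returns the user's translation itself, while `baseline_on_restore(entries, True)` returns the source paragraph, which is the comparison one would expect.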

For this reason, I think the majority of these issues were caused by the improper treatment of data when a draft is loaded again. The user is likely prompted with a warning (T190036) and has the option to mark all the warnings as resolved (T198188), but this needs to be fixed.

After this particular problem is fixed, further investigation may be needed to see whether there are other causes of false positives.

Change 494206 had a related patch set uploaded (by Petar.petkovic; owner: Petar.petkovic):
[mediawiki/extensions/ContentTranslation@master] Skip issues marked as resolved when searching for MT abuse issues

https://gerrit.wikimedia.org/r/494206

Thanks to @Nikerabbit's comments, I found another very likely reason that tracking categories include articles that show no signs of unmodified MT text. MT abuse warnings that are not resolved by editing, but only marked as resolved, are still included when searching for sections that have an MT abuse warning. Since we only need one MT abuse warning in any section to publish the article under the tracking category, this is another likely cause of the false positives.
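
To illustrate the behaviour described in this comment, here is a hedged Python sketch. The data layout (a mapping of section ids to warning dictionaries with `type` and `resolved` keys) and the function name are hypothetical. It is not the code from change 494206.

```python
def sections_with_mt_abuse(sections, skip_resolved=False):
    """Return ids of sections that have at least one counted MT abuse warning."""
    flagged = []
    for section_id, warnings in sections.items():
        for warning in warnings:
            if warning["type"] != "mt-abuse":
                continue
            if skip_resolved and warning["resolved"]:
                # Warnings the user only marked as resolved are not counted.
                continue
            flagged.append(section_id)
            break
    return flagged
```

With `skip_resolved=False`, a section whose only warning was marked as resolved is still returned, and one such section is enough to place the article in the tracking category.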

I'm not sure I understand the above. The proposed criterion for including a published translation in the unreviewed category is that it has at least one paragraph whose content has not been modified beyond the given threshold. This is not affected by whether the user marked the associated warning as resolved or not.

This is one aspect among others that we can change or adjust in our criteria if we find it problematic, but we would need to discuss the implications first.

I thought the option to "Mark as resolved" would help users get rid of warnings that stick even after they have made enough changes, because the threshold cannot be perfect for all cases. If we ignore the user's decision to "Mark as resolved", being added to the tracking category might come as a surprise. In most cases, though, users might not realize their article was added to the tracking category, as it is meant more for patrollers who keep the quality high.
Also, allowing users to mark warnings as resolved and publish the article without any tracking category gives them the opportunity to suppress all the warnings, so their article ends up with lots of MT and outside the tracking category.
The question is whether we want to respect the user's choice because the threshold isn't perfect, at the cost of possible abuse.

Thanks, @Petar.petkovic. That's a good analysis and a relevant question. We want the tracking category to be useful: not so broad that it includes too many false positives, and not so narrow, or with such holes, that problematic articles are skipped.

My preference is to first try to adjust the thresholds and the algorithms for checking the content automatically so that flagged articles are very likely to be problematic. We can consider increasing the number of unreviewed sections required for adding the page to the category (currently 1), adjusting the algorithm (if we think that user changes are not measured precisely enough), or adjusting the thresholds. Do you think there is any room for improvement in that regard based on your analysis?

Next, we can consider factoring in user feedback. But we need to be careful. If reviewers find an article that was heavily unreviewed and is not in the category (because the user blindly marked everything as reviewed to get rid of the warnings), they will perceive the category as useless. My proposal would be not to blindly trust the user marking the issue as resolved, but to take it into account when choosing the threshold to apply. For example, if the user marked a section's warning as resolved, consider the section unreviewed only if it still contains 95% or more unmodified content (instead of the current 80% threshold). I think that could improve the results, while still preventing the most problematic cases. Does it sound like a good compromise?

The question is whether we want to respect the user's choice because the threshold isn't perfect, at the cost of possible abuse.

My preference is to first try to adjust the thresholds and the algorithms for checking the content automatically so that flagged articles are very likely to be problematic. We can consider increasing the number of unreviewed sections required for adding the page to the category (currently 1), adjusting the algorithm (if we think that user changes are not measured precisely enough), or adjusting the thresholds. Do you think there is any room for improvement in that regard based on your analysis?

The algorithm can be considered good enough. It compares the unchanged tokens (words) to measure how much of the content is unmodified. We already adjusted the thresholds, and we may need a deeper analysis of how well they perform before we decide to change them again. The proposal to increase the required number of sections with an MT abuse warning sounds good. But the proposal you had below is the best.
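
As a rough illustration of a token-based comparison like the one mentioned here, consider the following Python sketch. It is not the ContentTranslation implementation. Whitespace tokenization, multiset matching with `collections.Counter`, and the function name `unmodified_ratio` are all assumptions.

```python
from collections import Counter


def unmodified_ratio(baseline, current):
    """Share of baseline tokens (words) that still appear in the current text."""
    base = Counter(baseline.split())
    cur = Counter(current.split())
    total = sum(base.values())
    if total == 0:
        # Nothing to compare against, e.g. an empty starting paragraph.
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    kept = sum((base & cur).values())
    return kept / total
```

A measure of this kind ignores word order, so reordering a sentence without rewording it would still count as unmodified. That is one reason a single fixed threshold cannot fit every case.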

Next, we can consider factoring in user feedback. But we need to be careful. If reviewers find an article that was heavily unreviewed and is not in the category (because the user blindly marked everything as reviewed to get rid of the warnings), they will perceive the category as useless. My proposal would be not to blindly trust the user marking the issue as resolved, but to take it into account when choosing the threshold to apply. For example, if the user marked a section's warning as resolved, consider the section unreviewed only if it still contains 95% or more unmodified content (instead of the current 80% threshold). I think that could improve the results, while still preventing the most problematic cases. Does it sound like a good compromise?

I think the proposal you have in this paragraph is better than the three you had in the first paragraph. We can use the 95% mark as the threshold for sections whose warning was marked as resolved.
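
The compromise discussed here could look something like the Python sketch below. The function names and parameter defaults are hypothetical. The 80% and 95% values come from the comments above, and treating the 80% threshold as inclusive is an assumption.

```python
def effective_threshold(marked_resolved, default=0.80, resolved=0.95):
    """Use the stricter 95% threshold when the user marked the warning as resolved."""
    return resolved if marked_resolved else default


def is_unreviewed(ratio, marked_resolved):
    """True if the section still counts as unreviewed for the tracking category."""
    return ratio >= effective_threshold(marked_resolved)
```

A section at 85% unmodified content would then be flagged by default, but not once the user has marked its warning as resolved. A section at 97% would be flagged either way.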

Ok. This ticket was mainly focused on making sure that modifications of source content are calculated properly. That is, for languages lacking MT, verifying that the articles added to the tracking category as "unreviewed" have sections where content in the source language is visible. I would avoid expanding the scope for now*, and propose to verify whether there has been an improvement in that scope with the current code changes. Is there a way to verify this? Can we re-check whether some of the articles in the Italian or German tracking categories that do not contain any content in the source language in the paragraphs CX flagged would be properly classified now?

* I also created a separate ticket to improve the general approach based on the above comments: T217653: CX2: Improve the approach used for adding translations to the "unreviewed content" tracking category

Is there a way to verify this? Can we re-check if some of the articles in the Italian or German tracking categories that do not contain any content in the source language for the paragraphs CX flagged, would be properly classified now?

Notice that no patch is merged for this ticket yet. Both are in review, but 494206 will be abandoned as per above discussions.

When 490533 is merged, we can recheck Italian or German tracking categories, or rely on (lack of) user reports. We can also test many use-cases while doing the QA, but what happens in production wikis will be the only reliable source of truth.

Change 494206 abandoned by Petar.petkovic:
Skip issues marked as resolved when searching for MT abuse issues

https://gerrit.wikimedia.org/r/494206

Change 490533 merged by Petar.petkovic:
[mediawiki/extensions/ContentTranslation@master] Use source section as unmodified MT while restoring sections

https://gerrit.wikimedia.org/r/490533

Change 499676 had a related patch set uploaded (by Petar.petkovic; owner: Petar.petkovic):
[mediawiki/extensions/ContentTranslation@master] Exclude references list from MT abuse checking

https://gerrit.wikimedia.org/r/499676

The references list is also checked for MT abuse. If the user corrected all the content sections so that they no longer show an MT abuse warning, and the references list is the only section with this warning, the article will still be published under the tracking category, since one warning is enough.

That case could be considered as a false positive.
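
A minimal Python sketch of excluding the references list from the check, for illustration only. The section dictionaries and the `is_references_list` and `mt_abuse` keys are hypothetical, and change 499676 may implement the exclusion differently.

```python
def sections_to_check(sections):
    """Drop the references list before looking for MT abuse warnings."""
    return [s for s in sections if not s.get("is_references_list", False)]


def flag_article(sections):
    """True if any remaining section carries an MT abuse warning."""
    return any(s["mt_abuse"] for s in sections_to_check(sections))
```

With this filter, an article whose only remaining warning sits on the references list is no longer flagged.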

Good finding. Thanks, @Petar.petkovic.
I think it makes sense to ignore the references section (and references in general) in this regard.

Change 499676 merged by Nikerabbit:
[mediawiki/extensions/ContentTranslation@master] Exclude references list from MT abuse checking

https://gerrit.wikimedia.org/r/499676

For this part, I added a comment on T219468#5096970 to make sure this is verified as part of the QA for that ticket.

If specific parts of this ticket are implemented, it may be better to create sub-tickets so that they can move forward independently rather than move the whole task to QA. Otherwise it is hard to identify and pick the remaining parts to make other improvements in the area. Feel free to create sub-tickets of this task when needed.

@Petar.petkovic are you planning to continue working on this beyond the work already done on the exclusion of references list from consideration as unmodified content?

No, this ticket can be considered done or go to QA (not sure why you moved it to the In-Progress column). If something else is affecting tracking category false positives, separate tickets can be created, like T217653.

Pginer-WMF closed this task as Resolved. · Apr 11 2019, 4:43 PM

You are right. I acted on this thinking it was T217653. Moving to done since the QA checks can be done as part of T219468, and the follow-up work as part of T217653. Sorry for the confusion!