Vietnamese editors have reported a high volume of low-quality translations that were created by copying content out of Content Translation and publishing it as new articles. Although the tool prevented these translations from being published directly, users were able to copy their contents and paste them into a new article. This seems to happen especially around the time of contests, but is not limited to them.
This issue led the community to propose either limiting access to the Content Translation tool to experienced users (T299636) or blocking the copy functionality of the tool (T298366). Even if more drastic measures are taken by this particular community, it may be worth exploring a more general approach to help any community experiencing this kind of issue.
The proposal
This ticket proposes exploring an approach to identify content pasted from Content Translation and prevent it from being published through other editing tools. This could help with the above issue without negatively affecting those making good use of the tool (e.g., a less experienced user making a good translation or copying content inside the tool in a meaningful way).
The proposal is the following:
- Adjust the copy mechanism in Content Translation to insert tags/attributes in the copied content that indicate the origin of such content.
- Adjust the publish process on editing tools such as Visual Editor to remove such tags when publishing the content and include a specific edit tag that identifies an edit as containing data copied from Content Translation.
- The above allows communities to create an edit filter to block the publication of such content (or add the article to a special category, etc.).
In this way, a user copying content from Content Translation and pasting it into Visual Editor won't be able to publish it. Even if a fraction of users find out about this mechanism, reviewing and removing the tags may slow them down enough that the workaround is not worth the effort.
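The three steps of the proposal can be sketched as follows. This is only an illustration under stated assumptions: the attribute name `data-cx-origin`, the edit tag `contenttranslation-copied`, and all function names are hypothetical and do not exist in Content Translation or Visual Editor today. A real implementation would work on the DOM in the editor rather than on strings with regular expressions.

```python
import re

# Hypothetical names, chosen for illustration only.
ORIGIN_ATTR = "data-cx-origin"
EDIT_TAG = "contenttranslation-copied"

# Matches the start of an opening tag such as "<p" or "<b" (not "</p").
_OPEN_TAG_START = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)")
# Matches the origin attribute together with its leading whitespace.
_MARKER = re.compile(r"\s" + re.escape(ORIGIN_ATTR) + r'="[^"]*"')


def mark_copied(html: str, origin: str = "cx") -> str:
    """Copy step: add the origin attribute to every opening tag."""
    return _OPEN_TAG_START.sub(
        lambda m: f'<{m.group(1)} {ORIGIN_ATTR}="{origin}"', html
    )


def strip_markers(html: str) -> tuple[str, list[str]]:
    """Publish step: remove the markers and report which edit tags apply."""
    clean, count = _MARKER.subn("", html)
    return clean, ([EDIT_TAG] if count else [])


def filter_allows(edit_tags: list[str], blocked: set[str]) -> bool:
    """Edit filter step: a community may choose to block tagged edits."""
    return not any(tag in blocked for tag in edit_tags)


fragment = "<p>Hello <b>world</b></p>"
copied = mark_copied(fragment)
published, tags = strip_markers(copied)
```

In this sketch the published markup is identical to the original fragment, so readers see no extra attributes, while the edit itself carries a tag that a community-defined filter can act on.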
Additional considerations
- Origin of the problematic translations. The problematic translations were not tagged as being created with Content Translation since they were created as new articles. However, they had traces of characteristic bugs of the tool such as duplicate category prefixes (T264490) or extra spaces added by Google (T220864). So it seems reasonable to think that the original translations were obtained by using Content Translation and not other tools or scripts.
- Instrumentation. As an initial step, we may want to instrument how often content is copied in Content Translation. This could help identify unusually high copying activity in some communities.
- Avoiding side-effects. We need to make sure that this does not introduce unexpected markup when publishing (T111155) or when using copy and paste inside Content Translation itself.
- Granularity for tagging. Content copied out of the Content Translation tool can be marked differently depending on how much the initial machine translation was modified. This could allow communities to block the publication of content that is almost unmodified machine translation, while content the user edited more could be published and added to a category for further review. This is probably not something to support in the first iteration; it is mentioned to illustrate the flexibility of the approach and possible future improvements.
- Potential workarounds. Given that this is an open source tool, any measure we propose may have a workaround. The approach does not need to be perfect: by making it take more effort to bypass the system than to fix the translation, we expect to encourage those users to either make a positive contribution or give up.
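The granularity idea from the list above could look roughly like the sketch below. It assumes a simple character-level similarity between the initial machine translation and the copied text, computed with Python's standard `difflib`. The thresholds and tag names are hypothetical placeholders, not values used by Content Translation.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds and tag names, for illustration only.
UNMODIFIED_THRESHOLD = 0.95
LIGHTLY_EDITED_THRESHOLD = 0.70


def origin_tag(machine_translation: str, copied_text: str) -> str:
    """Pick an edit tag based on how close the copied text is to the raw MT."""
    # ratio() is 1.0 for identical strings and 0.0 when nothing matches.
    similarity = SequenceMatcher(None, machine_translation, copied_text).ratio()
    if similarity >= UNMODIFIED_THRESHOLD:
        return "cx-copied-unmodified"
    if similarity >= LIGHTLY_EDITED_THRESHOLD:
        return "cx-copied-lightly-edited"
    return "cx-copied-heavily-edited"
```

A community could then block edits tagged `cx-copied-unmodified` while sending the other two tags to a review category.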