Limit publication of translations copied outside Content Translation
Open, Medium, Public

Description

Vietnamese editors have reported a high volume of low-quality translations that were created by copying content out of Content Translation and publishing it as new articles. Although the tool prevented these translations from being published directly, users were able to copy their contents and paste them into a new article. This seems to happen especially around the time of contests, but is not limited to them.

This issue led the community to propose limiting access to the Content Translation tool to experienced users only (T299636) or blocking the copy functionality of the tool (T298366). Even if this particular community takes more drastic measures, it may be worth exploring a more general approach to help any community experiencing this kind of issue.

The proposal

This ticket proposes exploring an approach to identify content pasted from Content Translation so that it can be prevented from being published through other editing tools. This could help with the above issue without negatively affecting those making good use of the tool (e.g., a less experienced user making a good translation, or copying content inside the tool in a meaningful way).

The proposal is the following:

  • Adjust the copy mechanism in Content Translation to insert tags/attributes in the copied content that indicate the origin of such content.
  • Adjust the publish process on editing tools such as Visual Editor to remove such tags when publishing the content and include a specific edit tag that identifies an edit as containing data copied from Content Translation.
  • The above allows communities to create an edit filter to block the publication of such content (or add the article to a special category, etc.).

In this way, on a wiki with such a filter, a user copying content from Content Translation and pasting it into Visual Editor won't be able to publish it. Even if a fraction of users find out about this mechanism, reviewing and removing the tags may slow them down enough that it is not worth the effort.
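The copy-side and publish-side steps of the proposal can be sketched roughly as follows. This is only an illustration in Python, not how Content Translation or Visual Editor are implemented: the attribute name `data-cx-origin`, the change tag `contenttranslation-paste`, and both function names are hypothetical, and a real implementation would work on the DOM rather than with regular expressions.

```python
import re

# Hypothetical names: neither this attribute nor this change tag exists today.
CX_ORIGIN_ATTR = "data-cx-origin"
CX_PASTE_TAG = "contenttranslation-paste"

# Opening (or self-closing) tags; closing tags start with "</" and are skipped.
_OPEN_TAG_RE = re.compile(r"<([a-zA-Z][\w-]*)([^<>]*?)(/?)>")
# The marker attribute together with the whitespace that precedes it.
_MARKER_RE = re.compile(r"\s+" + re.escape(CX_ORIGIN_ATTR) + r'="[^"]*"')


def mark_copied_html(html: str, source_title: str) -> str:
    """Copy step: add the origin attribute to every opening tag of the fragment."""
    safe = source_title.replace("&", "&amp;").replace('"', "&quot;")

    def add_marker(match: re.Match) -> str:
        name, attrs, slash = match.groups()
        return f'<{name}{attrs} {CX_ORIGIN_ATTR}="{safe}"{slash}>'

    return _OPEN_TAG_RE.sub(add_marker, html)


def prepare_publish(html: str) -> tuple[str, list[str]]:
    """Publish step: strip the marker and report which change tags to apply."""
    cleaned, hits = _MARKER_RE.subn("", html)
    tags = [CX_PASTE_TAG] if hits else []
    return cleaned, tags
```

With something like this in place, the saved content stays free of the marker, while the returned change tag is what a community's edit filter would match on to block or categorize the edit.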

Additional considerations

  • Origin of the problematic translations. The problematic translations were not tagged as being created with Content Translation since they were created as new articles. However, they had traces of characteristic bugs of the tool such as duplicate category prefixes (T264490) or extra spaces added by Google (T220864). So it seems reasonable to think that the original translations were obtained by using Content Translation and not other tools or scripts.
  • Instrumentation. As an initial step we may want to instrument how often content is copied in Content Translation. This could help to identify unusually high copying activity in some communities.
  • Avoiding side-effects. We need to make sure that this does not introduce unexpected markup when publishing (T111155) or when using copy and paste inside Content Translation itself.
  • Granularity for tagging. Content copied out of the Content Translation tool can be marked differently depending on how much the initial machine translation was modified. This could allow communities to block the publication of almost unmodified machine translation, while content the user edited more could be published and added to a category for further review. This probably does not need to be supported in the first iteration, but it illustrates the flexibility of the approach and possible future improvements.
  • Potential workarounds. Given that this is an open source tool, any measure we propose may have a workaround. So while the approach is not 100% perfect, we expect that by making it more effort to bypass the system than to fix the translation, we encourage those users to either make a positive contribution or give up.
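To make the "granularity for tagging" idea above more concrete, here is a minimal sketch of how copied content could be labelled by how much the machine translation was changed. It is an assumption-laden illustration: the thresholds, the label names, and the word-level similarity measure (Python's `difflib.SequenceMatcher`) are all chosen for this example and are not how Content Translation computes its own machine translation percentage.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds and labels, chosen only for illustration.
UNMODIFIED_THRESHOLD = 0.95
LIGHTLY_EDITED_THRESHOLD = 0.70


def mt_similarity(machine_translation: str, user_text: str) -> float:
    """Word-level similarity in [0, 1]; 1.0 means the texts are identical."""
    matcher = SequenceMatcher(
        None, machine_translation.split(), user_text.split(), autojunk=False
    )
    return matcher.ratio()


def granularity_label(machine_translation: str, user_text: str) -> str:
    """Pick a label that the copied content could carry with its marker."""
    ratio = mt_similarity(machine_translation, user_text)
    if ratio >= UNMODIFIED_THRESHOLD:
        return "cx-paste-unmodified-mt"
    if ratio >= LIGHTLY_EDITED_THRESHOLD:
        return "cx-paste-lightly-edited"
    return "cx-paste-heavily-edited"
```

A community could then, for example, block edits carrying the first label and only add the second to a review category.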

Event Timeline

Pginer-WMF triaged this task as Medium priority. Feb 9 2022, 1:12 PM
Pginer-WMF created this task.
santhosh added a subscriber: santhosh.

The content copied from the CX editor is better than content copied from Google Translate or the Chrome browser's web page translation, because CX content will have proper links and references, and VE will retain them. That is the big incentive for using CX as a source for copying. No matter how strict we make our filtering and tooling, copying from Google Translate and similar places will continue to be possible.

Technically, implementing this kind of CX signature in clipboard content is possible. In fact, VE already adds a few attributes to clipboard content and does some sophisticated processing before content is pasted into a VE edit surface. Detecting a CX signature in it and adding a new change tag is also possible. These changes are not trivial, and we would want to keep the VE team in the loop.

The extra tags or attributes we add are not visible to users and cannot be removed by them by any means. They are only present in the content in the clipboard, so they do not add any extra effort for translators. Translators will either see the edit reverted later or be blocked immediately when publishing articles (depending on how the filters are designed). The filters should also prevent publishing in every namespace.

I am skeptical about the effort required and its result, though. Currently vi.wikipedia.org prevents most editors from accessing the CX dashboard and redirects them to its discussion page about CX. They have also asked to disable CX for non-extended users, and are unlikely to wait for this kind of tooling to be developed and to experiment with it. So we are talking about some future wiki community, other than the Vietnamese one, that might face this kind of CX abuse and community reaction. Effectively, this solution does not prevent any copy-pasting; it only tags such suspicious edits, which allows community members to filter them using their own rules. Note that viwiki was not interested in the extra work of sorting good edits from bad ones by going through the tags, since it adds more work for them.

It is not clear whether we want to enable this tagging on all wikis or only on certain wikis. If we don't enable it on all possible wikis, there will be wikis like testwiki where one can translate and copy content without the tags. By the same logic, one can always publish to a wiki without this abuse filter (to a user page or sandbox) and then take that content to the actual wiki. One characteristic of this kind of abuse is that software solutions will always be behind the creative ways an abuser can try. The effort spent on building such software keeps growing and often takes time and resources away from building other kinds of features.

To me it looks like a social problem. Why would one knowingly publish low-quality content to a wiki by bypassing all quality checks? Because there is an incentive: campaigns and similar competitions. If that incentive process were explicit about the quality of articles, wouldn't that reduce the issue? Any competition that does not take abuse into account is problematic too. CX provides APIs that campaign organizers can use to check the MT percentage and usage, if such a sophisticated evaluation of contributions is required (we could do a better job by providing tooling, but that is a different topic).

If a major portion of the contributions to a wiki comes from this kind of abuse, it is better to disable the tool itself on that wiki than to add more sophistication. It may be a sad decision for the people behind the tool, but it is not possible to have a tool that works well with all languages and their community practices.

@santhosh First of all, why is Vietnamese Wikipedia's request being ignored? Each wiki should decide on a solution for itself depending on how severe the CT problem is.

Adding a tag is useless. There is always a way to cheat. For example: first, copy from CT and paste into VE. Second, click "edit source". Third, copy the whole source. Fourth, open a new tab and start creating the article again without using CT or VE. Fifth, paste the source in and publish it.

Yes, the competition organizer creates problematic competitions that leave room for abuse (whoever creates the most articles wins; they do not check for quality). Many of the winners are cheaters who were caught on Vi Wikipedia. There is nothing we can do to stop them from organizing the competition. [https://baoquocte.vn/dai-su-quan-ba-lan-tai-viet-nam-cong-bo-cuoc-thi-wikipedia-lan-thu-ba-141320.html&mobile=yes&amp=1 Read more about it here]. The Polish embassy has organized this competition three years in a row since 2019 as a way to promote their culture and country. There are many abusers because the prize money is large compared to the median income in Vietnam. The fourth one is coming this year.

CT abuse is not limited to competitions. It happens all year round, with or without a competition. It is a very serious issue that has been running rampant for years on Vi Wikipedia. Therefore, the community has decided to take a measure that has been proven to be effective.