Expose Machine Translation services supporting Chinese to closer languages/variants
Closed, Resolved (Public)

Description

Machine translation services supporting Chinese (zh) seem to provide content in "Mandarin Chinese". Similar non-Mandarin Chinese languages (NMCL) such as Cantonese (zh-yue) and Wu (wuu), for which there is no machine translation support, may benefit from having access to the Mandarin Chinese machine translation.

Languages that can benefit from surfacing "Mandarin Chinese" MT:

code     Eng. name   local name
zh-yue   Cantonese   粵語
wuu      Wu          吴语
gan      Gan         贛語

For yuewp and ganwp, in addition to exposing the Mandarin Chinese machine translation, it would be useful to convert the Simplified Chinese characters from the machine translation into Traditional Chinese characters.

This change is similar to the way English translations were exposed for Simple English (T196354), but we may need to have some additional considerations:

  • Identify the list of languages that can benefit from having Chinese.
  • Determine whether there is a risk of unreviewed machine translation propagating into the published translation in these cases. Some of the affected wikis may have filters to block Mandarin content. We may need to check how those errors are communicated in Content Translation.
  • Determine how to communicate the machine translation support to users. Users may be surprised to see a translation into a different language. Even if it is close to the expected one, this can be perceived as an error of the tool. We can show the language name next to the service used as a clarification.
  • Remove support when translation services provide specific support for the exact language in the future.

This was based on the feedback provided by a translator.

Event Timeline

Pginer-WMF moved this task from Needs Triage to MT on the ContentTranslation board.
Liuxinyu970226 added a subscriber: Liuxinyu970226.

By reading that topic, I believe that there are zhwiki benefits affected

Thank you very much for creating this ticket, @Pginer-WMF .
I have just created two community discussions on yuewp and wuuwp, still waiting for response. Activity on ganwp is virtually down to zero.

  • languages that can benefit
code     Eng. name   local name
zh-yue   Cantonese   粵語
wuu      Wu          吴语
gan      Gan         贛語
  • languages that may not benefit: the first three are written in the Latin alphabet rather than the Chinese script, and Classical Chinese deviates too far from modern Mandarin.
cdo            Min Dong
zh-min-nan     Min Nan
hak            Hakka
zh-classical   Classical Chinese

(Side note: Several others are still in the incubator.)

  • potential risk....

I am confused by this part. Does this refer to using CT to spam the target wikis? There are active sysops and other users patrolling yuewp and wuuwp.
YUEWP. We have filters to block Mandarin contents. So far not that many people have actually used CT to translate stuff to yuewp. And spammers could actually post untranslated contents right now. Also, policies are such that pages left unfinished in Mandarin would be deleted just like pages written in other languages. As such, I don't think spamming with Mandarin contents would be a big problem on yuewp.

  • Notice/Warning

this is tricky.... I can think of two solutions.

  1. Maybe CT can have some sort of banners/highlights, on the sidebar or anywhere, to warn the users of 'default: Mandarin in place, please work on MT-contents to make sure they conform to styles and standards on the target wiki'
  2. We could put up permanent banners on yuewp and wuuwp, to provide guidelines to users.
  • End of this provisional measure

When support for yue, wuu and others becomes available, and the quality of MT content is satisfactory, then CT should switch to using the new support. (I hope it would not involve too much work, but there might be other concerns. See below.)
I suppose Cantonese is the most likely to be supported next, but that is still quite distant. Currently, Bing.com and Baidu.com support yue (though their mechanism is just translating stuff into Mandarin and then doing some naive word-for-word conversion to yue). Google announced a few years ago that it was working on yue.

Other concerns

  1. yuewp has another long-standing problem with the language code, see T10217. wuu and gan don't have this problem. (I guess a solution to this is even more distant than translation support for yue.)
  2. yuewp and ganwp pages must be written in Traditional Chinese characters. wuuwp doesn't specify which to use, but most contents are written in Simplified Chinese characters. So, if possible, please consider including support for this minor difference. This is not a big issue as it can be easily resolved by users using browser extensions, JS or whatever.
  3. If the source is zhwp, would this (translating chinese to chinese) be a potential bug and break the server? If so, CT should just copy the original content in this case.

Thanks for all the details provided, @Roy17. Some comments below:

  • potential risk....

I am confused by this part. Does this refer to using CT to spam the target wikis? There are active sysops and other users patrolling yuewp and wuuwp.

I was not thinking of intentional spam, but more of increasing the chances of some content accidentally going unreviewed. In any case, we are improving the system to measure how much content is reviewed and warn users accordingly. What I'd propose is that you monitor the articles created and provide us with feedback, so that the thresholds for the warning mechanisms can be adjusted if needed.

If there are filters to block Mandarin content, we need to check how the errors are surfaced in Content Translation, to make sure we communicate clearly to the user what needs to be reviewed.

  • Notice/Warning

this is tricky.... I can think of two solutions.

  1. Maybe CT can have some sort of banners/highlights, on the sidebar or anywhere, to warn the users of 'default: Mandarin in place, please work on MT-contents to make sure they conform to styles and standards on the target wiki'
  2. We could put up permanent banners on yuewp and wuuwp, to provide guidelines to users.

I was thinking along the lines of 1, that is, how to communicate it inside Content Translation. I think that we can convey the information in the "Automatic translation" card, for example by showing the language name next to the service used (e.g., showing "Using Yandex (Chinese)"). This does not seem a complex change, but my point is that we need to discuss design options and implement it as part of this work (unlike the case of Simple English which was, well, simpler).

  1. yuewp and ganwp pages must be written in Traditional Chinese characters. wuuwp doesn't specify which to use, but most contents are written in Simplified Chinese characters. So, if possible, please consider including support for this minor difference. This is not a big issue as it can be easily resolved by users using browser extensions, JS or whatever.

So ideally, for yuewp and ganwp, in addition to exposing the Mandarin Chinese machine translation, it would be useful to convert the Simplified Chinese characters from the machine translation into Traditional Chinese characters. I'll add this to the ticket description.
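
To make the idea concrete, here is a minimal Python sketch of character-level Simplified-to-Traditional conversion. The four-character table and the name `to_traditional` are illustrative assumptions only, not how Content Translation or cxserver does it. Real conversion is not one-to-one (for example, 发 can map to 發 or 髮 depending on the word), so an actual implementation would need a dictionary-based converter such as OpenCC.

```python
# Tiny sample table for illustration only; a real table has thousands of
# entries and needs word-level context to resolve one-to-many mappings.
S2T_SAMPLE = str.maketrans({"吴": "吳", "语": "語", "赣": "贛", "粤": "粵"})


def to_traditional(text):
    """Replace the Simplified characters in the sample table with Traditional ones."""
    return text.translate(S2T_SAMPLE)


if __name__ == "__main__":
    print(to_traditional("吴语"))
```

Characters that are not in the table are passed through unchanged, which is the behaviour one would want for text that is already in Traditional script.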

  1. If the source is zhwp, would this (translating chinese to chinese) be a potential bug and break the server? If so, CT should just copy the original content in this case.

Good point. I don't think it would break, but it is a good case to check, and we should definitely avoid unnecessary use of resources.
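
A rough sketch of that check, in Python for illustration. The set of codes and the function name are assumptions made for this example; they are not taken from the Content Translation code base.

```python
# Codes treated as written Mandarin for this sketch (an assumed list).
MANDARIN_CODES = {"zh", "zh-CN", "zh-TW"}


def needs_machine_translation(source_lang, mt_lang):
    """Return False when both sides are Mandarin, so the source can simply be copied."""
    return not (source_lang in MANDARIN_CODES and mt_lang in MANDARIN_CODES)


if __name__ == "__main__":
    print(needs_machine_translation("zh", "zh"))  # zhwp source, Mandarin MT target
    print(needs_machine_translation("en", "zh"))  # enwp source, Mandarin MT target
```

Note that copying the source does not cover script differences: a zhwp source copied for yuewp or ganwp may still need Simplified-to-Traditional conversion.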

In the past few months I have used CT to convert zhwp articles around 100 times. I remember there were occasions when CT showed a banner in the top right-hand corner, warning me that the translation would trigger a certain filter on YUEWP.

The two most important filters we set up on YUEWP to detect Mandarin content are filters 4 and 5. They basically identify Mandarin function words, such as conjunctions and prepositions, that differ fundamentally from their Cantonese counterparts.

I am not a linguist but I'll try summarising the differences between Mandarin and Cantonese. The two share a large set of vocabulary, especially proper nouns and concepts that come from the West. Not 100% identical but maybe 70%. The major differences lie in the grammar and the words used for grammatical structures, e.g. be 是 係, at 在 喺, he 他 佢. In each case the first character is Mandarin and the other is Cantonese. You can see these in the filters.
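
The kind of check described here can be illustrated with a small Python sketch. It uses only the three word pairs mentioned above; the function name and logic are assumptions for illustration, and the actual rules of filters 4 and 5 are not reproduced here.

```python
# Mandarin function words and their Cantonese counterparts,
# limited to the three examples given above.
MANDARIN_TO_CANTONESE = {"是": "係", "在": "喺", "他": "佢"}


def mandarin_markers(text):
    """Return the Mandarin function words found in text, in order of first appearance."""
    found = []
    for ch in text:
        if ch in MANDARIN_TO_CANTONESE and ch not in found:
            found.append(ch)
    return found


if __name__ == "__main__":
    print(mandarin_markers("他是學生"))  # Mandarin-style sentence
    print(mandarin_markers("佢係學生"))  # Cantonese-style sentence
```

Matching single characters like this will produce false positives, for instance in compounds such as 存在 where 在 is not the preposition, so a real filter would need more context than this sketch uses.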

That's why I brought forward this proposal, in the hope that users could save the time spent translating the large bulk of nouns, verbs, adjectives, adverbs, etc., and focus only on arranging content, grammar and style. Additionally, we prefer translating original sources from European languages, instead of the Mandarin translation we might find on zhwp, because zhwp content is often outdated and written in a not so encyclopaedic style.

I agree with the comments above that until a separate machine translation engine is available for Cantonese, it can be useful to display a machine translation from a non-Sinosphere language to Modern Standard Written Chinese (MSWC) when the target language is Cantonese. This is because Cantonese and Mandarin share the vast majority of technical vocabulary; the lexical similarity rises when one considers MSWC which actually incorporates significant amounts of Cantonese vocabulary. In addition, the fact that most existing machine translation engines don't distinguish between the different kinds of Chinese languages means that the machine translations themselves are somewhat influenced by Cantonese anyway (being treated as a subset of Chinese rather than a standalone language).

As discussed above, unedited machine translation is generally awful, and the Cantonese Wikipedia already has edit filters to block unedited Mandarin-grammar content, so there is already some safeguard against CT being used to flood the Cantonese Wikipedia with machine-translated Mandarin content.

Change 618050 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[mediawiki/services/cxserver@master] Enable MT based on closely-related languages (wuu)

https://gerrit.wikimedia.org/r/618050

Change 618050 merged by KartikMistry:
[mediawiki/services/cxserver@master] Enable MT based on closely-related languages (wuu)

https://gerrit.wikimedia.org/r/618050

Change 618525 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[operations/deployment-charts@master] Update cxserver to 2020-08-05-070016-production

https://gerrit.wikimedia.org/r/618525

Change 618525 merged by jenkins-bot:
[operations/deployment-charts@master] Update cxserver to 2020-08-05-070016-production

https://gerrit.wikimedia.org/r/618525

As part of T258919, we supported the following that is relevant to this ticket:

  • On Wu (wuu) Wikipedia enable Chinese (zh) MT with Google, Yandex, LingoCloud and Youdao (T199523)
  • On Cantonese (zh-yue) Wikipedia enable the Traditional script variant of Chinese MT which is only available with Google (using code zh-TW).
  • On Gan (gan) Wikipedia enable the Traditional script variant of Chinese MT which is only available with Google (using code zh-TW).
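
For readers who want to see the mapping at a glance, the three bullets above can be sketched as a lookup table in Python. The language codes and provider names come from the bullets; the table structure and the names `MT_FALLBACKS` and `resolve_mt` are hypothetical and do not reflect cxserver's actual configuration format.

```python
# Hypothetical table: target wiki language -> (MT language code, providers).
# Values are taken from the bullets above; the structure is an assumption.
MT_FALLBACKS = {
    "wuu": ("zh", ["Google", "Yandex", "LingoCloud", "Youdao"]),
    "zh-yue": ("zh-TW", ["Google"]),
    "gan": ("zh-TW", ["Google"]),
}


def resolve_mt(target_lang):
    """Return (mt_code, providers); unknown languages map to themselves
    with an empty provider list, meaning this table knows nothing about them."""
    return MT_FALLBACKS.get(target_lang, (target_lang, []))


if __name__ == "__main__":
    for lang in ("wuu", "zh-yue", "gan"):
        print(lang, resolve_mt(lang))
```

Keeping the fallback in a single table also makes the last point of the task description easy to follow: when a service adds direct support for one of these languages, its entry can simply be removed.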

This is based on the input that was provided in this ticket, so feel free to give it a try and let me know if further adjustments are needed.