Enable MT based on closely-related languages based on community input
Closed, ResolvedPublic

Description

Some languages that lack Machine Translation (MT) support may benefit from access to the MT available for a closely related language. Translators may prefer correcting the differences between the two languages to starting from scratch.

Based on input from their communities, this ticket proposes to enable MT for the following cases:

  • On Serbo-Croatian (sh) Wikipedia, enable Bosnian (bs) MT with Google and Yandex (details)
  • On Wu (wuu) Wikipedia, enable Chinese (zh) MT with Google, Yandex, LingoCloud and Youdao (T199523)
  • On Cantonese (zh-yue) Wikipedia, enable the Traditional script variant of Chinese MT, which is only available with Google (using code zh-TW).
  • On Gan (gan) Wikipedia, enable the Traditional script variant of Chinese MT, which is only available with Google (using code zh-TW).
  • On Belarusian Taraškievica (be-tarask) Wikipedia, enable Belarusian (be) MT with Google and Yandex.

These changes are similar to the way English MT was exposed on Simple English Wikipedia (T196354).
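
The cases listed above amount to a lookup from a wiki's language code to the language code requested from each MT provider. A minimal sketch of that lookup follows. The table name `RELATED_LANGUAGE_MT` and the function `resolve_mt_target` are hypothetical, and this is not cxserver's actual configuration format; the codes and providers are copied from the list in this description.

```python
# Hypothetical table transcribed from the task description.
# Key: wiki language code. Value: the MT language code to request,
# and the providers for which that substitution was proposed.
RELATED_LANGUAGE_MT = {
    "sh": {"mt_code": "bs", "providers": ["Google", "Yandex"]},
    "wuu": {"mt_code": "zh", "providers": ["Google", "Yandex", "LingoCloud", "Youdao"]},
    "zh-yue": {"mt_code": "zh-TW", "providers": ["Google"]},
    "gan": {"mt_code": "zh-TW", "providers": ["Google"]},
    "be-tarask": {"mt_code": "be", "providers": ["Google", "Yandex"]},
}


def resolve_mt_target(wiki_lang, provider):
    """Return the language code to request from `provider` for `wiki_lang`.

    The wiki language is returned unchanged when no substitution is listed
    for this wiki/provider pair (assumption: the provider is then asked for
    the wiki language directly, if it supports it at all).
    """
    entry = RELATED_LANGUAGE_MT.get(wiki_lang)
    if entry is not None and provider in entry["providers"]:
        return entry["mt_code"]
    return wiki_lang


if __name__ == "__main__":
    for lang in RELATED_LANGUAGE_MT:
        print(lang, "->", resolve_mt_target(lang, "Google"))
```

Keeping the provider list next to each substitution reflects that the mapping is per provider: for example, the zh-TW substitution is only proposed for Google.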

Event Timeline

Pginer-WMF triaged this task as Medium priority. Jul 27 2020, 9:15 AM
Pginer-WMF updated the task description.

Change 618009 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[mediawiki/services/cxserver@master] Enable MT based on closely-related languages (sh)

https://gerrit.wikimedia.org/r/618009

Change 618050 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[mediawiki/services/cxserver@master] Enable MT based on closely-related languages (wuu)

https://gerrit.wikimedia.org/r/618050

Change 618176 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[mediawiki/services/cxserver@master] Enable MT based on closely-related languages (gan and zh-yue)

https://gerrit.wikimedia.org/r/618176

Change 618228 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[mediawiki/services/cxserver@master] Enable MT based on closely-related languages based on community input (be-tarask)

https://gerrit.wikimedia.org/r/618228

Change 618009 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Enable MT based on closely-related languages (sh)

https://gerrit.wikimedia.org/r/618009

Change 618050 merged by KartikMistry:
[mediawiki/services/cxserver@master] Enable MT based on closely-related languages (wuu)

https://gerrit.wikimedia.org/r/618050

Change 618176 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Enable MT based on closely-related languages (gan and zh-yue)

https://gerrit.wikimedia.org/r/618176

Change 618228 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Enable MT based on closely-related languages (be-tarask)

https://gerrit.wikimedia.org/r/618228

Change 618525 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[operations/deployment-charts@master] Update cxserver to 2020-08-05-070016-production

https://gerrit.wikimedia.org/r/618525

Change 618525 merged by jenkins-bot:
[operations/deployment-charts@master] Update cxserver to 2020-08-05-070016-production

https://gerrit.wikimedia.org/r/618525

Mentioned in SAL (#wikimedia-operations) [2020-08-06T12:06:33Z] <kart_> Updated cxserver to 2020-08-05-070016-production (T258919, T199523, T257943, T256194)

Jpita added a subscriber: Jpita.

For Serbo-Croatian, Apertium is available to use for MT, but the translation ends up like this:

image.png (579×1 px, 186 KB)

Maybe we should remove Apertium?

Also, I can't confirm that this change is working as intended, because I don't speak the languages involved and can't tell which language the MT output is in.

> For Serbo-Croatian, Apertium is available to use for MT, but the translation ends up like this
> Maybe we should remove Apertium?

We can keep it. No need to remove it unless specifically asked by the community.

> Also, I can't confirm that this change is working as intended, because I don't speak the languages involved and can't tell which language the MT output is in.

We deployed this for Google. Did that work well? For now, it is sufficient to check that MT is working.

> Also, I can't confirm that this change is working as intended, because I don't speak the languages involved and can't tell which language the MT output is in.

Some ways that may be useful for checking this:

  • Compare with the alternative language in Content Translation. For example, you can translate an article from English to Serbo-Croatian, then translate the same article from English to Bosnian. Since Bosnian MT is expected to be used in both cases, the MT provided should be the same.
  • Use the MT service website. For example, you can translate an article from English to Serbo-Croatian using Google, then go to the Google Translate website and paste the English text to get the Bosnian translation, which should be the same as the one you got in Content Translation.

The above comparisons can be made by looking at the texts side by side, but for scripts you are unfamiliar with, an online diff tool may be useful.
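
The comparison can also be scripted instead of using an online diff tool. The sketch below uses Python's standard `difflib` module and assumes the two MT outputs have been copied by hand into strings. The function name `compare_mt_outputs` and the sample strings are hypothetical placeholders, not real MT output.

```python
import difflib


def compare_mt_outputs(expected, actual):
    """Compare two MT outputs line by line.

    Returns (identical, diff_lines). `diff_lines` is a unified diff,
    which is empty when both texts match after trimming outer whitespace.
    """
    expected_lines = expected.strip().splitlines()
    actual_lines = actual.strip().splitlines()
    diff = list(
        difflib.unified_diff(
            expected_lines,
            actual_lines,
            fromfile="related-language MT",
            tofile="target-wiki MT",
            lineterm="",
        )
    )
    return expected_lines == actual_lines, diff


if __name__ == "__main__":
    # Placeholder text: paste the two MT outputs here.
    bosnian_output = "paragraph one\nparagraph two"
    serbo_croatian_output = "paragraph one\nparagraph two"
    identical, diff = compare_mt_outputs(bosnian_output, serbo_croatian_output)
    print("identical" if identical else "\n".join(diff))
```

An empty diff suggests the related-language MT is the one being served; any differing lines show up prefixed with `-` and `+`, which works even for scripts the reviewer cannot read.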

Excellent advice, thank you!