MinT translates to English when Hindi-Santali or any other language-Santali is selected
Closed, Resolved | Public | BUG REPORT

Description

It was reported that when you start a translation from Hindi or Bengali to Santali, the machine translation is produced in English instead of Santali.

Steps to replicate the issue (include links if applicable):

  • Select an article in any language other than English and translate it to Santali (ᱥᱟᱱᱛᱟᱲᱤ).
  • Select MinT.
  • The translation will be in English instead of Santali (Ol Chiki).

What happens?:
The translation will be in English instead of Santali (Ol Chiki).

What should have happened instead?:
It should be translated to Santali (Ol Chiki).

Event Timeline

Change 932681 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] Remove Santali from NLLB base model as it need wrong script code

https://gerrit.wikimedia.org/r/932681

Change 932682 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] MinT: Reconfigure santali as en<->santali only

https://gerrit.wikimedia.org/r/932682

Change 932682 merged by jenkins-bot:

[mediawiki/services/cxserver@master] MinT: Reconfigure santali as en<->santali only

https://gerrit.wikimedia.org/r/932682

Change 932681 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] Remove Santali from NLLB base model as it need wrong script code

https://gerrit.wikimedia.org/r/932681

Change 932683 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2023-06-26-050753-production

https://gerrit.wikimedia.org/r/932683

Change 932683 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2023-06-26-050753-production

https://gerrit.wikimedia.org/r/932683

@UOzurumba We have updated the config for Santali. Currently, only English -> Santali translation is supported by MinT.

Thanks, @KartikMistry, for the information. I will let the community know.

Change 933221 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2023-06-27-053706-production

https://gerrit.wikimedia.org/r/933221

Change 933221 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2023-06-27-053706-production

https://gerrit.wikimedia.org/r/933221

Mentioned in SAL (#wikimedia-operations) [2023-06-27T09:11:33Z] <kart_> Updated MinT to 2023-06-27-053706-production (T339896, T340236)

The current status is that only the en-sat language pair is supported.
My understanding is that, in order to support other pairs, we need to translate the code used for Santali into the one expected by NLLB-200. Should this be part of the current ticket, or is a separate follow-up ticket preferred?

Translating between a non-English language and Santali is supported by NLLB-200 using the sat_deva code (which is wrong, because the script is not Devanagari). But IndicTrans2 uses a larger and better corpus for Santali and offers en->sat and sat->en. I won't recommend using NLLB-200 for other combinations, because that requires complicating the configuration to use two different language codes for two models. From my conversations with the IndicTrans2 team, they are working on Indic->Indic translations.
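
To illustrate the configuration trade-off described above, here is a minimal sketch of what routing Santali across two models with different language codes could look like. Everything in it is an assumption made for illustration: the names `pick_model`, `model_language_code`, `INDICTRANS2_PAIRS` and `NLLB_CODE_OVERRIDES` are hypothetical and do not come from the MinT or cxserver code bases, and `sat_deva` is simply the code quoted in this thread.

```python
from typing import Optional

# Hypothetical sketch; not the actual MinT or cxserver configuration.

# Pairs served by IndicTrans2 for Santali (per this thread: en<->sat only).
INDICTRANS2_PAIRS = {("en", "sat"), ("sat", "en")}

# Per-model overrides from the wiki language code to the code a model expects.
# "sat_deva" is the code quoted in this discussion, used here as a placeholder.
NLLB_CODE_OVERRIDES = {"sat": "sat_deva"}


def pick_model(source: str, target: str,
               allow_nllb_fallback: bool = False) -> Optional[str]:
    """Return an illustrative model name for a pair, or None if unsupported."""
    if (source, target) in INDICTRANS2_PAIRS:
        return "indictrans2"
    if "sat" in (source, target):
        # Other Santali pairs stay unsupported unless the fallback is enabled.
        return "nllb200" if allow_nllb_fallback else None
    # Pairs not involving Santali are outside the scope of this sketch.
    return "default"


def model_language_code(model: str, wiki_code: str) -> str:
    """Translate a wiki language code into the code the chosen model expects."""
    if model == "nllb200":
        return NLLB_CODE_OVERRIDES.get(wiki_code, wiki_code)
    return wiki_code


print(pick_model("en", "sat"))
print(pick_model("hi", "sat"))
print(model_language_code("nllb200", "sat"))
```

With the fallback disabled, this mirrors the current status (only en<->sat is served). Enabling it is where the extra complexity appears: the same language then needs a second, model-specific code.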

Thanks for the context. I think it makes sense to wait and check whether the next developments of IndicTrans2 cover the needs of the community. At that point, if some needs are not covered (e.g., translations from French to Santali being common), we can reconsider how to expose NLLB-200, but I'd expect most translations to be from English or other Indic languages. So I am resolving this for now.