Enable MinT support for languages with no Wikipedia yet
Closed, Resolved (Public)

Description

MinT supports languages for which there is no Wikipedia yet. Although MinT won't be immediately usable in Wikipedia, those communities can still use the service to translate other content to their languages.

These are the languages selected for enabling, based on the languages supported by NLLB-200:

  1. Mesopotamian Arabic (acm) ✅
  2. Ta’izzi-Adeni Arabic (acq) ✅
  3. Tunisian Arabic (aeb) ✅
  4. South Levantine Arabic (ajp) ✅
  5. North Levantine Arabic (apc) ✅
  6. Najdi Arabic (ars) ✅
  7. Bemba (bem) ✅
  8. Chokwe (cjk) ✅
  9. Dyula (dyu) ✅
  10. Chhattisgarhi (hne) ✅
  11. Jingpho (kac) ✅
  12. Kamba (kam) ✅
  13. Central Kanuri (knc): tracked separately as "Enable support for Central Kanuri (knc) and the more general Kanuri (kr) language"
  14. Kabuverdianu (kea) ✅
  15. Kimbundu (kmb) ✅
  16. Luba-Kasai (lua) ✅
  17. Luo (luo) ✅
  18. Mizo (lus) ✅
  19. Magahi (mag) ✅
  20. Mossi (mos) ✅
  21. Nuer (nus) ✅
  22. Tamasheq (taq) ✅
  23. Central Atlas Tamazight (tzm) ✅
  24. Umbundu (umb) ✅
  25. Fon (fon) ✅ (Fon has since graduated from the Incubator and now has a Wikipedia)

The IndicTrans2 model also supports the following languages without a Wikipedia (enabled as part of T337656):

  • Bodo (brx/brx_Deva) – ✅
  • Dogri (doi/doi_Deva) – ✅

See also T89089: Make ContentTranslation work in the Wikimedia Incubator

Event Timeline

Pginer-WMF triaged this task as Medium priority. May 15 2023, 3:03 PM

Note that Central Kanuri is one of the major dialects of Kanuri (kr), which had a Wikipedia that has since been closed.

Thanks for the info @Bugreporter. For cases such as this, where the NLLB-200 model indicates the specific variant of the language, it is a bit unclear which code to use if there is no Wikipedia for either the more general language code or the specific variant. In this particular case, I guess it makes sense to support Kanuri with the kr code.

We may need to review the list to check whether other languages in it are in a similar situation, and decide the best code (or codes) to use.
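The code choice discussed above can be pictured as an alias table that maps the code a user selects to the code the model knows. This is only a sketch: the table, the function name `resolve_model_code`, and the sample set of supported codes are hypothetical and do not reflect MinT's actual configuration. Only the kr/knc pairing comes from this discussion.

```python
from typing import Optional, Set

# Hypothetical alias table: general language code -> variant code the model knows.
# Only the kr -> knc pair is taken from this task; nothing here is MinT's real config.
MODEL_CODE_ALIASES = {
    "kr": "knc",  # Kanuri -> Central Kanuri
}


def resolve_model_code(code: str, supported: Set[str]) -> Optional[str]:
    """Return the model-side code for `code`, or None if there is no match."""
    if code in supported:
        return code
    alias = MODEL_CODE_ALIASES.get(code)
    if alias is not None and alias in supported:
        return alias
    return None


# Illustrative subset of codes a model might support.
supported_by_model = {"knc", "lus", "taq"}

print(resolve_model_code("kr", supported_by_model))   # knc
print(resolve_model_code("knc", supported_by_model))  # knc
print(resolve_model_code("zz", supported_by_model))   # None
```

With a table like this, both the general code and the specific variant resolve to the same model, which is the behaviour suggested for Kanuri.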

When comparing the list in the description with the language-data library, all the language codes are there except for Najdi Arabic (ars).
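A comparison like the one above amounts to a set difference. In the sketch below, `TASK_CODES` is copied from the numbered list in the description, while `known_codes` is a stand-in for the codes defined in the language-data library. It is built here to reproduce the reported result rather than loaded from the library, and the helper name `missing_codes` is made up for illustration.

```python
# Codes from the numbered list in the task description.
TASK_CODES = {
    "acm", "acq", "aeb", "ajp", "apc", "ars", "bem", "cjk", "dyu", "hne",
    "kac", "kam", "knc", "kea", "kmb", "lua", "luo", "lus", "mag", "mos",
    "nus", "taq", "tzm", "umb", "fon",
}

# Stand-in (assumption): in practice this set would be loaded from the
# language-data library. Here it simply mirrors the finding reported above.
known_codes = TASK_CODES - {"ars"}


def missing_codes(wanted, known):
    """Return the wanted codes that are absent from the known set, sorted."""
    return sorted(set(wanted) - set(known))


print(missing_codes(TASK_CODES, known_codes))  # ['ars']
```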

When comparing the list in the description with this localization directory, these seem to be the languages with the most activity on translatewiki.net:

  • Mesopotamian Arabic (acm)
  • Mossi (mos)
  • Mizo (lus)
  • Magahi (mag)
  • Kanuri (kr), which can be supported with Central Kanuri (knc)
  • Central Atlas Tamazight (tzm)
  • Fon (fon)
  • Kabuverdianu (kea)

Change 948241 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/machinetranslation@master] Enable Kanuri (kr/knc) language

https://gerrit.wikimedia.org/r/948241

Change 948242 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/cxserver@master] Enable MinT support for languages with no Wikipedia yet

https://gerrit.wikimedia.org/r/948242

Change 948241 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] Enable Kanuri (kr/knc) language

https://gerrit.wikimedia.org/r/948241

Change 948242 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Enable MinT support for languages with no Wikipedia yet

https://gerrit.wikimedia.org/r/948242

Change 949619 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2023-08-14-091804-production

https://gerrit.wikimedia.org/r/949619

Change 949619 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2023-08-14-091804-production

https://gerrit.wikimedia.org/r/949619

Mentioned in SAL (#wikimedia-operations) [2023-08-17T08:31:49Z] <kart_> Updated cxserver to 2023-08-14-091804-production (T336683, T343211)

Change 950063 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2023-08-14-091403-production

https://gerrit.wikimedia.org/r/950063

Change 950063 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2023-08-14-091403-production

https://gerrit.wikimedia.org/r/950063

Mentioned in SAL (#wikimedia-operations) [2023-08-21T06:28:57Z] <kart_> Update MinT to 2023-08-14-091403-production (T336683)

Testing these languages in this test instance, I found that most worked as expected. However, some languages were not listed, and others were listed but got stuck in the translation process without returning a response.

Not listed:

  • North Levantine Arabic (apc)
  • Najdi Arabic (ars)

Listed, but not responding:

  • Kanuri (kr)
  • Mizo (lus)
  • Tamasheq (taq)

KartikMistry changed the task status from Open to In Progress. Aug 28 2023, 11:43 AM

Change 952844 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/cxserver@master] config: Fix language code for Mizo

https://gerrit.wikimedia.org/r/952844

Change 952845 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/machinetranslation@master] config: Add missing language codes and enable taq language

https://gerrit.wikimedia.org/r/952845

Thanks, Pau. I've submitted patches to fix the above issues, but I'm still looking into Kanuri (kr).

Change 952845 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] config: Add missing language codes and enable taq language

https://gerrit.wikimedia.org/r/952845

Change 952844 abandoned by KartikMistry:

[mediawiki/services/cxserver@master] config: Fix language code for Mizo

Reason:

This needs to be fixed in MinT.

https://gerrit.wikimedia.org/r/952844

Change 953754 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/machinetranslation@master] config: Fix language code for Mizo

https://gerrit.wikimedia.org/r/953754

Change 953754 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] config: Fix language code for Mizo

https://gerrit.wikimedia.org/r/953754

Change 954005 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2023-08-31-061147-production

https://gerrit.wikimedia.org/r/954005

Change 954005 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2023-09-04-051105-production

https://gerrit.wikimedia.org/r/954005

Mentioned in SAL (#wikimedia-operations) [2023-09-05T05:55:09Z] <kart_> Updated MinT to 2023-09-04-051105-production (T336683)

Thanks! I just checked, and most of the issues are resolved. The only one pending is Kanuri (kr), which is listed but not responding.

Change 965508 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/machinetranslation@master] config: Enable Kanuri as kr language code too

https://gerrit.wikimedia.org/r/965508

Change 965508 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] config: Enable Kanuri as kr language code too

https://gerrit.wikimedia.org/r/965508

Change 966170 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2023-10-16-101614-production

https://gerrit.wikimedia.org/r/966170

Change 966170 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2023-10-16-101614-production

https://gerrit.wikimedia.org/r/966170

@Pginer-WMF This should be fixed now.

Perfect. It works now!

I wish the Igala language were among the languages added here. It is still in the Incubator, but machine translation would be very helpful.

To enable support for Igala, we need a machine translation model that supports the language. Unfortunately, the models currently integrated in MinT do not seem to support Igala. There are some ways to help change that:

  • Find a freely licensed translation model that supports the language. We could then integrate it into MinT.
  • Help expand the multilingual content available in Igala so that future models can be created. This can be done by:
    • Using Content Translation to translate Wikipedia articles into Igala, once the Igala Wikipedia graduates from the Incubator (the tool is not available at the Incubator stage).
    • Contacting the Tatoeba project to enable support for Igala, and providing translations of sentences.
    • Sharing any freely licensed multilingual data that includes Igala content with the OPUS project.