
Explore using IndicTrans2 - better model supporting 22 Indic languages
Closed, Resolved (Public)

Description

The https://github.com/AI4Bharat/IndicTrans2 project used a larger corpus to train machine translation models for Indian languages. From a quick reading of the code, it uses an architecture similar to NLLB's. In my testing on the demo site https://models.ai4bharat.org/#/nmt/v2, the results were better than NLLB's; in particular, the grammar of the translated sentences looks better.

Since MinT supports multiple backend models and IndicTrans2 looks like a compatible model, we should explore this opportunity.

The following languages are supported (22 Indic languages plus English):

  1. Assamese (as/asm_Beng)
  2. Bangla (bn/ben_Beng)
  3. Bodo (brx/brx_Deva), no wiki yet
  4. Dogri (doi/doi_Deva), no wiki yet
  5. English (en/eng_Latn)
  6. Goan (gom/gom_Deva)
  7. Gujarati (gu/guj_Gujr)
  8. Hindi (hi/hin_Deva)
  9. Kannada (kn/kan_Knda)
  10. Kashmiri (ks/kas_Arab & kas_Deva)
  11. Maithili (mai/mai_Deva)
  12. Malayalam (ml/mal_Mlym)
  13. Manipuri (mni/mni_Beng & mni_Mtei)
  14. Marathi (mr/mar_Deva)
  15. Nepali (ne/npi_Deva)
  16. Oriya (or/ory_Orya)
  17. Panjabi (pa/pan_Guru)
  18. Sanskrit (sa/san_Deva)
  19. Santali (sat/sat_Olck)
  20. Sindhi (sd/snd_Arab & snd_Deva)
  21. Tamil (ta/tam_Taml)
  22. Telugu (te/tel_Telu)
  23. Urdu (ur/urd_Arab)
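
To make the pairing in the list above concrete, here is a minimal sketch of a lookup from wiki language codes to IndicTrans2 language tags. The table is copied from the list; the names `WIKI_TO_INDICTRANS2` and `indictrans2_tag` are hypothetical, and picking the first listed script as the default for multi-script languages (ks, mni, sd) is an assumption, not something MinT or IndicTrans2 is known to do.

```python
# Wiki language code -> IndicTrans2 tags, copied from the list above.
# Languages written in more than one script carry more than one tag.
WIKI_TO_INDICTRANS2 = {
    "as": ["asm_Beng"],
    "bn": ["ben_Beng"],
    "brx": ["brx_Deva"],
    "doi": ["doi_Deva"],
    "en": ["eng_Latn"],
    "gom": ["gom_Deva"],
    "gu": ["guj_Gujr"],
    "hi": ["hin_Deva"],
    "kn": ["kan_Knda"],
    "ks": ["kas_Arab", "kas_Deva"],
    "mai": ["mai_Deva"],
    "ml": ["mal_Mlym"],
    "mni": ["mni_Beng", "mni_Mtei"],
    "mr": ["mar_Deva"],
    "ne": ["npi_Deva"],
    "or": ["ory_Orya"],
    "pa": ["pan_Guru"],
    "sa": ["san_Deva"],
    "sat": ["sat_Olck"],
    "sd": ["snd_Arab", "snd_Deva"],
    "ta": ["tam_Taml"],
    "te": ["tel_Telu"],
    "ur": ["urd_Arab"],
}


def indictrans2_tag(wiki_code, script=None):
    """Return the IndicTrans2 tag for a wiki language code.

    If `script` is omitted, the first listed tag is used; that default
    is an assumption made for this sketch only.
    """
    tags = WIKI_TO_INDICTRANS2[wiki_code]  # KeyError for unsupported codes
    if script is None:
        return tags[0]
    for tag in tags:
        if tag.endswith("_" + script):
            return tag
    raise ValueError(f"{wiki_code} has no tag for script {script}")
```

For example, `indictrans2_tag("ml")` looks up Malayalam, and `indictrans2_tag("ks", "Deva")` selects the Devanagari variant of Kashmiri.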

asm
ben
hin
kas
sat
gom
guj
kan
mai
mal
mni
mar
npi
ory
pan
san
snd
tam
tel
urd
brx
doi


Support for translation between all combinations of Indic languages (not just from/to English) is covered in T352690: Evaluate the integration of the new IndicTrans model (IndicTrans2-M2M) into MinT

Event Timeline

Change 928008 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] Add IndicTrans2 support

https://gerrit.wikimedia.org/r/928008

As per https://github.com/AI4Bharat/IndicTrans2/issues/6, these models do not support direct Indic-to-Indic translation. The upstream demo achieves Indic->Indic translation in multiple steps, pivoting through English (Indic->en->Indic).
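
The multi-step approach described above can be sketched roughly as follows. This is only an illustration of pivoting through English, not the upstream demo's code or MinT's implementation: `translate_with_pivot` and the `translate` callable (taking text, source tag, and target tag) are hypothetical names standing in for whatever backend performs a single en<->Indic translation.

```python
from typing import Callable

# Hypothetical single-step translator: (text, source_tag, target_tag) -> text.
Translator = Callable[[str, str, str], str]

PIVOT = "eng_Latn"


def translate_with_pivot(text: str, src: str, tgt: str,
                         translate: Translator, pivot: str = PIVOT) -> str:
    """Translate src -> tgt, going through the pivot language when needed."""
    if src == tgt:
        return text
    # Pairs that involve the pivot are handled by the model in one step.
    if src == pivot or tgt == pivot:
        return translate(text, src, tgt)
    # Otherwise: Indic -> en, then en -> Indic.
    intermediate = translate(text, src, pivot)
    return translate(intermediate, pivot, tgt)
```

Note that an Indic->Indic request costs two model calls, and any error introduced in the first step carries over into the second.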

Change 928008 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] Add IndicTrans2 support

https://gerrit.wikimedia.org/r/928008

Change 929438 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] indictrans2 performance improvements

https://gerrit.wikimedia.org/r/929438

Change 929439 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2023-06-12-125157-production

https://gerrit.wikimedia.org/r/929439

Change 929438 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] indictrans2 performance improvements

https://gerrit.wikimedia.org/r/929438

Change 929439 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2023-06-13-061519-production

https://gerrit.wikimedia.org/r/929439

Mentioned in SAL (#wikimedia-operations) [2023-06-13T07:09:24Z] <kart_> Updated MinT to 2023-06-13-061519-production (T337656, T334465)

Pginer-WMF subscribed.

Once we have verified in T339896 that the languages supported by IndicTrans2 work in the context of Content Translation, we can resolve this.