
Enable MADLAD-400 in MinT test instance and Production for Wikipedia languages not supported by other services
Open, In Progress, Medium, Public

Description

The MADLAD-400 open-source translation model supports many languages. Initial testing suggests that its quality is not always high, but it can still be useful for languages that are not supported by other services. Community input can help identify when it is useful.

This ticket proposes enabling the MADLAD-400 model in the MinT test instance so that communities can try it. We selected Wikipedia languages that are not supported by any other translation service.

These are the languages selected:

(In bold languages with Content and Section translation enablement planned in T353510)

  1. Arpitan (frp)
  2. Kabardian (kbd)
  3. Moksha (mdf)
  4. Gorontalo (gor)
  5. Avar (av)
  6. Komi-Permyak (koi)
  7. Chechen (ce)
  8. Erzya (myv)
  9. Adyghe (ady)
  10. Newari (new)
  11. Kalmyk (xal)
  12. Jamaican Creole English (jam)
  13. Mon (mnw)
  14. Fiji Hindi (hif)
  15. Komi (kv)
  16. Tulu (tcy)
  17. Pampanga (pam)
  18. Tetum (tet)
  19. Karachay-Balkar (krc)
  20. Chamorro (ch)
  21. Gagauz (gag)
  22. Old English (ang)
  23. Aragonese (an) Apertium supports some pairs, MADLAD can provide support for other source languages
  24. Bavarian (bar)
  25. Bislama (bi) OpusMT supports translations from English, MADLAD can provide support for other source languages
  26. Cree (cr) cr_Latn code used by MADLAD-400!
  27. Manx (gv)
  28. Inuktitut (iu)
  29. Mirandese (mwl)
  30. Nan (nan) ! zh-min-nan code used in Wikipedia. nan_Latn_TW code used by MADLAD-400!
  31. Low German (nds)
  32. Low Saxon (nds-nl) nds_NL code used by MADLAD-400!
  33. Ossetic (os)
  34. Saraiki (skr)
  35. Sranan Tongo (srn)
  36. Tuvinian (tyv)
  37. Venda (ve) OpusMT supports translations from English, MADLAD can provide support for other source languages
  38. Wu Chinese (wuu)
  39. Moroccan Arabic (ary) Community objected to MinT using MADLAD-400. MADLAD-400 is not providing the right variant according to T339926
  40. Breton (br) Community objected to MinT using MADLAD-400. Apertium supports some pairs, MADLAD can provide support for other source languages
  41. Ido (io) Community objected to MinT using MADLAD-400
  42. Kara-Kalpak (kaa) Community objected to MinT using MADLAD-400
  43. Cornish (kw) Community objected to MinT using MADLAD-400
  44. Madurese (mad) Community objected to MinT using MADLAD-400
  45. Nias (nia) Community objected to MinT using MADLAD-400
  46. Serbo-Croatian (sh) Community objected to MinT using MADLAD-400
  47. Simple English (simple) Community objected to MinT using MADLAD-400
  48. Talysh (tly) Community objected to MinT using MADLAD-400. tly_IR code used by MADLAD-400, but Wikipedia seems to use the Latin script instead.
  49. Walloon (wa) Community objected to MinT using MADLAD-400
  50. Cantonese (yue) zh-yue code used in Wikipedia. As per T354666#9593836, MADLAD-400 has the same issues as other models not providing the right variant for the language (T333835).
  51. Romansh (rm) Community objected to MinT using MADLAD-400
  52. Saterland Frisian (stq) Community objected to MinT using MADLAD-400
  53. Kalaallisut (kl) Community objected to MinT using MADLAD-400
  54. Southern Altai (alt) Community objected to MinT using MADLAD-400
  55. Northern Sami (se) Community objected to MinT using MADLAD-400
  56. Navajo (nv) Community objected to MinT using MADLAD-400
  57. Zhuang (za) Not supported by MADLAD-400
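
The code differences noted in the list above can be written down as a small lookup. This is a minimal Python sketch for illustration only, not the actual MinT configuration: the names `WIKI_TO_MADLAD` and `to_madlad_code` are hypothetical, and the assumption that every other language uses the same code in Wikipedia and in MADLAD-400 is ours, not something the ticket states.

```python
# Wikipedia language code -> MADLAD-400 code, limited to the
# differences that the list above calls out explicitly.
WIKI_TO_MADLAD = {
    "cr": "cr_Latn",
    "zh-min-nan": "nan_Latn_TW",
    "nds-nl": "nds_NL",
    "tly": "tly_IR",
}


def to_madlad_code(wiki_code: str) -> str:
    """Return the MADLAD-400 code for a Wikipedia language code.

    Codes without a listed exception are assumed (hypothetically)
    to be identical in both systems.
    """
    return WIKI_TO_MADLAD.get(wiki_code, wiki_code)
```

For example, `to_madlad_code("nds-nl")` returns `"nds_NL"`, while a code with no listed exception, such as `"gv"`, is returned unchanged.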

Steps:

  • Enable all selected languages in the MinT test instance (not for Content/Section translation)
  • Communicate with the communities: invite them to try the MT and ask whether its quality is good enough for it to be available by default, as an option, or not at all.
    • For languages where Content and Section Translation are not enabled by default yet, the communication can be combined with the plans to enable them (T353510). That is, inform communities about both the enablement of Content Translation and the possibility of having MinT if the quality is good.
  • Ido Wikipedia (io)
    • Enable Content and Section Translation in Ido.
    • Do not enable MADLAD-400 MT support. A member of the community tested it and indicated that the quality is poor and that the translation is out of context (it adds made-up words or phrases that are not in the source article).
  • Low German Wikipedia (nds)
    • Enable Content and Section Translation and MADLAD-400 in this Wiki; there was no response or objection to enabling it.
  • Low Saxon Wikipedia (nds-nl)
    • Enable Content and Section Translation and MADLAD-400 in this Wiki; there was no response or objection to enabling it.
  • Mirandese Wikipedia (mwl)
    • Enable Content and Section Translation and MADLAD-400 in this Wiki; there was no response or objection to enabling it.
  • Simple English Wikipedia (simple)
    • The community objected to enabling CX, SX and machine translation because of the content structure permitted in Simple English Wikipedia, which would derail the project.
  • Aragonese Wikipedia (an)
    • Enable the MADLAD-400 model in this Wiki; there was no response or objection to enabling it.
  • Tuvinian Wikipedia (tyv)
    • Enable the MADLAD-400 model in this Wiki; there was no response or objection to enabling it.
  • Cornish Wikipedia (kw)
    • Community feedback is that the translation quality is poor. One person rated it 3 on a scale of 1 to 10 for Cornish grammar and spelling. It also adds made-up phrases that are not in the source content. Therefore, the model should not be enabled in their Wikipedia.
  • Kara-Kalpak Wikipedia (kaa)
    • Do not enable the MADLAD-400 MT. A member of the community indicated that the output is Uzbek, not Kara-Kalpak.
  • Cree Wikipedia (cr)
    • Enable the MADLAD-400 model in this Wiki; there was no response or objection to enabling it.
  • Inuktitut Wikipedia (iu)
    • Enable the MADLAD-400 model in this Wiki; there was no response or objection to enabling it.
  • Walloon Wikipedia (wa)
    • Do not enable the MADLAD-400 MT. A member of the community objected to machine translation because the model is not perfect and would create more work for admins who are already stretched.
  • Manx Wikipedia (gv)
    • Enable the MADLAD-400 model in this Wiki; there was no response or objection to enabling it.
  • Serbo-Croatian Wikipedia (sh)
    • Do not enable it; a community member's feedback is that the translation quality is poor and that it adds made-up phrases that are not in the source content.
  • Wu Chinese Wikipedia (wuu)
    • Enable the MADLAD-400 model in this Wiki; there was no response or objection to enabling it.
  • Sranan Tongo Wikipedia (srn)
    • Enable the MADLAD-400 model in this Wiki; there was no response or objection to enabling it.
  • Madurese Wikipedia (mad)
    • Do not enable it; a community member said that the translation is not accurate and that the output is in Indonesian.
  • Breton (br)
    • Do not enable it; a community member's feedback is that the translation quality is poor and not suitable as an aid.
  • Talysh (tly)
    • Do not enable it; a community member's feedback is that the translation quality is poor and that it uses Arabic script instead of Latin.
  • Bavarian (bar)
    • Enable the MADLAD-400 model in this Wiki; there was no response or objection to enabling it.
  • Bislama (bi)
    • Enable the MADLAD-400 model in this Wiki; there was no response or objection to enabling it.
  • Venda (ve)
    • Enable the MADLAD-400 model in this Wiki; there was no response or objection to enabling it.
  • Nias (nia)
    • Do not enable it; a community member's feedback is that the translation quality is poor and that it adds made-up words to the translation.
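
The per-wiki outcomes above follow one rule: enable MADLAD-400 where there was no response or objection, and leave it disabled where a community member objected. The Python sketch below only illustrates that rule; `Feedback`, `Decision` and `decide` are hypothetical names, and the sample entries paraphrase a few of the outcomes listed above rather than reproducing the full list.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    ENABLE = "enable"
    DO_NOT_ENABLE = "do not enable"


@dataclass(frozen=True)
class Feedback:
    wiki: str
    # None means the community gave no response or objection.
    objection: Optional[str] = None


def decide(feedback: Feedback) -> Decision:
    """Apply the rule used in this ticket: any objection blocks enablement."""
    return Decision.DO_NOT_ENABLE if feedback.objection else Decision.ENABLE


# A few illustrative entries, paraphrased from the list above.
sample = [
    Feedback("gv"),
    Feedback("bar"),
    Feedback("kaa", "output is Uzbek, not Kara-Kalpak"),
    Feedback("mad", "output is inaccurate and in Indonesian"),
]
decisions = {f.wiki: decide(f) for f in sample}
```

Keeping the objection text next to the decision makes it easier to revisit a language later if a newer model fixes the reported problem.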

A set of additional languages for which there is no Wikipedia is supported by MADLAD-400: T354675: Consider enabling MADLAD-400 in MinT for languages with no Wikipedia yet

Event Timeline

Pginer-WMF renamed this task from Enable MADLAD-400 in MinT test instance to Enable MADLAD-400 in MinT test instance for wikipedia languages not supported by other services. Jan 9 2024, 4:38 PM
Pginer-WMF renamed this task from Enable MADLAD-400 in MinT test instance for wikipedia languages not supported by other services to Enable MADLAD-400 in MinT test instance for Wikipedia languages not supported by other services.

Change 991110 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] MADLAD-400 : Configure languages

https://gerrit.wikimedia.org/r/991110

Change 991110 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] MADLAD-400 : Configure languages

https://gerrit.wikimedia.org/r/991110

Change 991469 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] MADLAD-400: Add Cantonese(zh-yue) support

https://gerrit.wikimedia.org/r/991469

Change 991578 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2024-01-18-051410-production

https://gerrit.wikimedia.org/r/991578

Change 991578 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2024-01-22-053144-production

https://gerrit.wikimedia.org/r/991578

Mentioned in SAL (#wikimedia-operations) [2024-01-22T07:28:09Z] <kart_> Updated MinT to 2024-01-22-053144-production (T355303, T338608, T353510, T354666)

Change 991469 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] MADLAD-400: Add Cantonese(zh-yue) support

https://gerrit.wikimedia.org/r/991469

Change 995170 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2024-01-30-080508-production

https://gerrit.wikimedia.org/r/995170

Change 1004805 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] madlad-400: Remove Zhuang(za) as it is not supported

https://gerrit.wikimedia.org/r/1004805

Change 1004805 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] madlad-400: Remove Zhuang(za) as it is not supported

https://gerrit.wikimedia.org/r/1004805

Change 995170 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2024-02-20-062448-production

https://gerrit.wikimedia.org/r/995170

Mentioned in SAL (#wikimedia-operations) [2024-02-21T05:45:00Z] <kart_> Updated MinT to 2024-02-20-062448-production (T333969, T354666)

Please do not consider yue (Cantonese) support at this time. Like NLLB/FLORES-200 this dataset again suffers from the common problem of "yue" data not being yue. To quote the paper itself, "pretty low quality; mostly not Canto". The evaluation scores alone are painfully low, and even worse than NLLB if I'm interpreting them correctly.

I tested it out on https://translate.wmcloud.org/ and unfortunately it seems like it can't handle anything longer. The output is a disappointing mix of zh_Hans and zh_Hant. I see that in https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/991469 zh-yue is incorrectly mapped to zh_Hant, which I would place the blame on; but still, its inability to even output zh_Hant properly is disheartening.


Please do not consider yue (Cantonese) support at this time. Like NLLB/FLORES-200 this dataset again suffers from the common problem of "yue" data not being yue. To quote the paper itself, "pretty low quality; mostly not Canto". The evaluation scores alone are painfully low, and even worse than NLLB if I'm interpreting them correctly.

Thanks for confirming this. When I saw that note in the paper I was not sure whether they detected the issue and filtered those items not in Cantonese, or they just identified the issue in the data to explain the unsolved problem.

@UOzurumba there are a couple of languages in the ticket for which there is no indication whether to enable MADLAD-400 or not. Can you confirm the status of conversations for:

  • Moroccan Arabic (ary)
  • Nias (nia)
UOzurumba updated the task description.

@UOzurumba there are a couple of languages in the ticket for which there is no indication whether to enable MADLAD-400 or not. Can you confirm the status of conversations for:

  • Moroccan Arabic (ary)
  • Nias (nia)

Sorry for the omission; I have updated the ticket to capture the conversation. The Moroccan Arabic community indicated in this ticket that the translation is in Standard Arabic, and a contributor to the Nias Wikipedia mentioned that the translation is of poor quality.

Change #1031758 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/mediawiki-config@master] Enable Content/Section translation in io, nds, nds-nl and, mwl

https://gerrit.wikimedia.org/r/1031758

Change #1031758 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable Content/Section translation in io, nds, nds-nl and, mwl

https://gerrit.wikimedia.org/r/1031758

Mentioned in SAL (#wikimedia-operations) [2024-05-15T07:19:07Z] <kartik@deploy1002> Started scap: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]]

Mentioned in SAL (#wikimedia-operations) [2024-05-15T07:21:51Z] <kartik@deploy1002> kartik: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-05-15T07:31:28Z] <kartik@deploy1002> Started scap: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]]

Mentioned in SAL (#wikimedia-operations) [2024-05-15T07:34:08Z] <kartik@deploy1002> kartik: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-05-15T07:49:35Z] <kartik@deploy1002> Finished scap: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]] (duration: 18m 06s)

Change #1031887 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/cxserver@master] Enable MinT for Wikipedia languages not supported by other services

https://gerrit.wikimedia.org/r/1031887

Please do not consider yue (Cantonese) support at this time. Like NLLB/FLORES-200 this dataset again suffers from the common problem of "yue" data not being yue. To quote the paper itself, "pretty low quality; mostly not Canto". The evaluation scores alone are painfully low, and even worse than NLLB if I'm interpreting them correctly.

Thanks for confirming this. When I saw that note in the paper I was not sure whether they detected the issue and filtered those items not in Cantonese, or they just identified the issue in the data to explain the unsolved problem.

I was revisiting the paper and browsing the data the other day, and I realised yue didn't seem to have made the final release (which makes sense given the human evaluation comment). Same goes for wuu, which seems missing from the released MADLAD dataset despite being mentioned in the paper.

Looking back at https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/991469 it's now clear to me that zh-yue (Cantonese) was incorrectly mapped to zh-Hant (Mandarin in Traditional script), which would explain the poor results. Cantonese is simply not supported by the MADLAD MT model.
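
One way to avoid this kind of silent fallback is to refuse a request for an unsupported language instead of mapping it to a related one. The following Python sketch is a hypothetical guard, not code from the machinetranslation service: `UNSUPPORTED_BY_MADLAD`, `UnsupportedLanguageError` and `require_supported` are invented names, and the set only contains the two languages this task reports as unsupported (zh-yue and za).

```python
# Languages this task reports as not supported by MADLAD-400.
UNSUPPORTED_BY_MADLAD = frozenset({"zh-yue", "za"})


class UnsupportedLanguageError(ValueError):
    """Raised when a language has no usable MADLAD-400 model code."""


def require_supported(wiki_code: str) -> str:
    """Return wiki_code unchanged, or raise if the task marks it unsupported.

    Raising here means the caller cannot silently translate into a
    different language (for example zh_Hant output for a zh-yue request).
    """
    if wiki_code in UNSUPPORTED_BY_MADLAD:
        raise UnsupportedLanguageError(
            f"{wiki_code} is not supported by MADLAD-400"
        )
    return wiki_code
```

With this guard, `require_supported("zh-yue")` raises `UnsupportedLanguageError`, so the missing support surfaces as an error rather than as Mandarin output.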

Change #1031887 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Enable MinT for Wikipedia languages not supported by other services

https://gerrit.wikimedia.org/r/1031887

Change #1034211 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2024-05-20-182409-production

https://gerrit.wikimedia.org/r/1034211

KartikMistry renamed this task from Enable MADLAD-400 in MinT test instance for Wikipedia languages not supported by other services to Enable MADLAD-400 in MinT test instance and Production for Wikipedia languages not supported by other services. Tue, May 21, 5:08 AM
KartikMistry updated the task description.

The enablement seems complete for most wikis. However, enabling MinT in Content/Section Translation is still pending for some languages:

  • Low German (nds)
  • Mirandese (mwl)
  • Min Nan Chinese (zh-min-nan)
  • Aragonese (an)
  • Cree (cr)
  • Inuktitut (iu)
  • Manx (gv)
  • Ossetic (os)
  • Bavarian (bar)

In addition, in the MinT test instance for Saraiki (skr) the results seem to be in English, while the Saraiki Wikipedia does not use a Latin script. We may want to double-check whether the model actually supports Saraiki, so that we either provide real Saraiki translations or disable it.
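
A quick automated check for this kind of problem is to look at which script dominates the output. The Python sketch below is a rough heuristic rather than a language identifier: it uses the first word of each character's Unicode name (for example LATIN or ARABIC) as a stand-in for the script, which is an approximation, and the function names are hypothetical. It can flag Latin output where Arabic script is expected, but it cannot tell Saraiki from other Arabic-script languages.

```python
import unicodedata
from collections import Counter
from typing import Optional


def script_counts(text: str) -> Counter:
    """Count letters by the first word of their Unicode character name."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def dominant_script(text: str) -> Optional[str]:
    """Return the most frequent script label, or None if there are no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None


def looks_like_wrong_script(output: str, expected: str) -> bool:
    """True when the output has letters and its dominant script is not the expected one."""
    found = dominant_script(output)
    return found is not None and found != expected
```

For Saraiki, `looks_like_wrong_script(mt_output, "ARABIC")` would flag English-looking output; the same check with `"LATIN"` would apply to the Talysh case reported above.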
