Enable MADLAD-400 in MinT test instance and Production for Wikipedia languages not supported by other services
Closed, Resolved (Public)

Description

The MADLAD-400 open source translation model supports many languages. Initial testing suggests that quality is not always high, but it can still be useful for languages not supported by other services. Community input can help identify when it is useful.

This ticket proposes enabling the MADLAD-400 model in the MinT test instance so that communities can try it. We selected Wikipedia languages that are not supported by any other translation service.

These are the languages selected:

(Languages in bold have Content and Section translation enablement planned in T353510)

  1. Arpitan (frp)
  2. Kabardian (kbd)
  3. Moksha (mdf)
  4. Gorontalo (gor)
  5. Avar (av)
  6. Komi-Permyak (koi)
  7. Chechen (ce)
  8. Erzya (myv)
  9. Adyghe (ady)
  10. Newari (new)
  11. Kalmyk (xal)
  12. Jamaican Creole English (jam)
  13. Mon (mnw)
  14. Fiji Hindi (hif)
  15. Komi (kv)
  16. Tulu (tcy)
  17. Pampanga (pam)
  18. Tetum (tet)
  19. Karachay-Balkar (krc)
  20. Chamorro (ch)
  21. Gagauz (gag)
  22. Old English (ang)
  23. Aragonese (an): Apertium supports some pairs; MADLAD can provide support for other source languages.
  24. Bavarian (bar)
  25. Bislama (bi): OpusMT supports translations from English; MADLAD can provide support for other source languages.
  26. Cree (cr): cr_Latn code used by MADLAD-400.
  27. Manx (gv)
  28. Inuktitut (iu)
  29. Mirandese (mwl)
  30. Nan (nan): zh-min-nan code used in Wikipedia; nan_Latn_TW code used by MADLAD-400.
  31. Low German (nds)
  32. Low Saxon (nds-nl): nds_NL code used by MADLAD-400.
  33. Ossetic (os)
  34. Saraiki (skr)
  35. Sranan Tongo (srn)
  36. Tuvinian (tyv)
  37. Venda (ve): OpusMT supports translations from English; MADLAD can provide support for other source languages.
  38. Wu Chinese (wuu)
  39. Moroccan Arabic (ary): Community objected to MinT using MADLAD-400. MADLAD-400 is not providing the right variant according to T339926.
  40. Breton (br): Community objected to MinT using MADLAD-400. Apertium supports some pairs; MADLAD can provide support for other source languages.
  41. Ido (io): Community objected to MinT using MADLAD-400.
  42. Kara-Kalpak (kaa): Community objected to MinT using MADLAD-400.
  43. Cornish (kw): Community objected to MinT using MADLAD-400.
  44. Madurese (mad): Community objected to MinT using MADLAD-400.
  45. Nias (nia): Community objected to MinT using MADLAD-400.
  46. Serbo-Croatian (sh): Community objected to MinT using MADLAD-400.
  47. Simple English (simple): Community objected to MinT using MADLAD-400.
  48. Talysh (tly): Community objected to MinT using MADLAD-400. tly_IR code used by MADLAD-400, but Wikipedia seems to use the Latin script instead.
  49. Walloon (wa): Community objected to MinT using MADLAD-400.
  50. Cantonese (yue): zh-yue code used in Wikipedia. As per T354666#9593836, MADLAD-400 has the same issue as other models: it does not provide the right variant for the language (T333835).
  51. Romansh (rm): Community objected to MinT using MADLAD-400.
  52. Saterland Frisian (stq): Community objected to MinT using MADLAD-400.
  53. Kalaallisut (kl): Community objected to MinT using MADLAD-400.
  54. Southern Altai (alt): Community objected to MinT using MADLAD-400.
  55. Northern Sami (se): Community objected to MinT using MADLAD-400.
  56. Navajo (nv): Community objected to MinT using MADLAD-400.
  57. Zhuang (za): Not supported by MADLAD-400.
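
Several entries above note that MADLAD-400 uses a different code from the Wikipedia language code. A minimal sketch of such a lookup follows, assuming a plain dictionary. The names `WIKI_TO_MADLAD` and `to_madlad_code` are hypothetical and not taken from the MinT code base; only the pairs stated in the list are included, and the pass-through fallback for other codes is an assumption.

```python
# Hypothetical lookup table; only the code pairs stated in the list above.
WIKI_TO_MADLAD = {
    "cr": "cr_Latn",
    "nan": "nan_Latn_TW",
    "zh-min-nan": "nan_Latn_TW",  # Wikipedia uses zh-min-nan for Nan
    "nds-nl": "nds_NL",
    "tly": "tly_IR",  # listed above, although Talysh was not enabled
}


def to_madlad_code(wiki_code: str) -> str:
    """Return the model-side code; fall back to the wiki code (assumption)."""
    return WIKI_TO_MADLAD.get(wiki_code, wiki_code)


if __name__ == "__main__":
    for code in ("cr", "zh-min-nan", "nds-nl", "gv"):
        print(code, "->", to_madlad_code(code))
```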

Steps:

  • Enable all selected languages in the MinT test instance (not for Content/Section translation)
  • Communicate with the communities. Invite them to try the MT and ask whether its quality is good enough to be available by default, as an option, or not at all.
    • For languages where Content and Section Translation are not yet enabled by default, the communication can be combined with the plans to enable them (T353510). That is, inform communities about both the enablement of Content Translation and the possibility of having MinT if the quality is good.
  • Ido Wikipedia (io)
    • Enable Content and Section Translation in Ido.
    • Don't enable MADLAD-400 MT support. A member of the community tested it and indicated that the quality is poor and that the translation is out of context (it adds made-up words or phrases that are not in the source article).
  • Low German Wikipedia (nds)
    • Enable Content and Section translation and MADLAD-400 in this wiki; there was no response or objection to enabling it.
  • Low Saxon Wikipedia (nds-nl)
    • Enable Content and Section translation and MADLAD-400 in this wiki; there was no response or objection to enabling it.
  • Mirandese Wikipedia (mwl)
    • Enable Content and Section translation and MADLAD-400 in this wiki; there was no response or objection to enabling it.
  • Simple English Wikipedia (simple)
    • The community objected to enabling CX, SX and machine translation because of the content structure permitted in Simple English Wikipedia, which would derail the project.
  • Aragonese Wikipedia (an)
    • Enable the MADLAD-400 model in this wiki; there was no response or objection to enabling it.
  • Tuvinian Wikipedia (tyv)
    • Enable the MADLAD-400 model in this wiki; there was no response or objection to enabling it.
  • Cornish Wikipedia (kw)
    • Do not enable the model in this wiki. The community feedback is that the translation quality is poor: one member rated the grammar and spelling of the Cornish output 3 on a scale of 1 to 10. It also adds made-up phrases that are not in the source content.
  • Kara-Kalpak Wikipedia (kaa)
    • Do not enable the MADLAD-400 MT. A member of the community indicated that the output is in Uzbek, not in the Kara-Kalpak language.
  • Cree Wikipedia (cr)
    • Enable the MADLAD-400 model in this wiki; there was no response or objection to enabling it.
  • Inuktitut Wikipedia (iu)
    • Enable the MADLAD-400 model in this wiki; there was no response or objection to enabling it.
  • Walloon Wikipedia (wa)
    • Do not enable the MADLAD-400 MT. A member of the community objected to having machine translation because the translation model is not perfect and would create more work for admins who are already stretched.
  • Manx Wikipedia (gv)
    • Enable the MADLAD-400 model in this wiki; there was no response or objection to enabling it.
  • Serbo-Croatian Wikipedia (sh)
    • Do not enable it. A community member's feedback is that the translation quality is poor; it also adds made-up phrases that are not in the source content.
  • Wu Chinese Wikipedia (wuu)
    • Enable the MADLAD-400 model in this wiki; there was no response or objection to enabling it.
  • Sranan Tongo Wikipedia (srn)
    • Enable the MADLAD-400 model in this wiki; there was no response or objection to enabling it.
  • Madurese Wikipedia (mad)
    • Do not enable it. A community member said that the translation is not accurate and that the output is in Indonesian.
  • Breton (br)
    • Do not enable it. A community member's feedback is that the translation quality is poor and not suitable as an aid.
  • Talysh (tly)
    • Do not enable it. A community member's feedback is that the translation quality is poor and that it uses Arabic script instead of Latin.
  • Bavarian (bar)
    • Enable the MADLAD-400 model in this wiki; there was no response or objection to enabling it.
  • Bislama (bi)
    • Enable the MADLAD-400 model in this wiki; there was no response or objection to enabling it.
  • Venda (ve)
    • Enable the MADLAD-400 model in this wiki; there was no response or objection to enabling it.
  • Nias (nia)
    • Do not enable it. A community member's feedback is that the translation quality is poor and that it adds made-up words to the translation.
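
The per-wiki decisions in the steps above can be summarised as data. The sketch below is illustrative only: `MADLAD_DECISIONS` and `enabled_languages` are hypothetical names, the values are transcribed from the steps (using the codes from the numbered list, so Walloon is `wa` and Breton is `br`), and this is not the format of the actual MinT or cxserver configuration.

```python
# Decisions transcribed from the steps above.
# True means "enable MADLAD-400 in this wiki", False means "do not enable".
MADLAD_DECISIONS = {
    "io": False, "nds": True, "nds-nl": True, "mwl": True, "simple": False,
    "an": True, "tyv": True, "kw": False, "kaa": False, "cr": True,
    "iu": True, "wa": False, "gv": True, "sh": False, "wuu": True,
    "srn": True, "mad": False, "br": False, "tly": False, "bar": True,
    "bi": True, "ve": True, "nia": False,
}


def enabled_languages(decisions: dict[str, bool]) -> list[str]:
    """Return the sorted wiki codes whose decision is to enable the model."""
    return sorted(code for code, enable in decisions.items() if enable)


if __name__ == "__main__":
    print(", ".join(enabled_languages(MADLAD_DECISIONS)))
```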

A set of additional languages for which there is no Wikipedia is supported by MADLAD-400: T354675: Consider enabling MADLAD-400 in MinT for languages with no Wikipedia yet

Related Objects

Mentioned In
T370749: Re-run the MT service usage report (2024)
T361582: Enable Content and Section translation on Wikipedias without current machine translation support to facilitate the support in the future
T365230: Post-creation work for dtpwiki
T361597: Fix the mobile experience for a second group of Wikipedias where Content Translation is in beta
T333969: Enable Opus models for languages lacking other Machine Translation options
T338608: Support requesting translations from a specific model in MinT
T355303: Adjust multiple model support on MinT test instance
T355296: Enable MADLAD-400 in MinT test instance for languages only supported by one external service
T339926: The NLLB-200 MT engine in MinT returns standard Arabic translation instead of Moroccan Darija in Moroccan Arabic Wikipedia
T354675: Consider enabling MADLAD-400 in MinT for languages with no Wikipedia yet
T353510: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT using MADLAD-400
Mentioned Here
T365230: Post-creation work for dtpwiki
T333969: Enable Opus models for languages lacking other Machine Translation options
T338608: Support requesting translations from a specific model in MinT
T355303: Adjust multiple model support on MinT test instance
T333835: Disable machine translation for Cantonese
T354675: Consider enabling MADLAD-400 in MinT for languages with no Wikipedia yet
T339926: The NLLB-200 MT engine in MinT returns standard Arabic translation instead of Moroccan Darija in Moroccan Arabic Wikipedia
T353510: Enable Content and Section translation on some Wikipedias with potential to be supported with MinT using MADLAD-400

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-01-22T07:28:09Z] <kart_> Updated MinT to 2024-01-22-053144-production (T355303, T338608, T353510, T354666)

Change 991469 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] MADLAD-400: Add Cantonese(zh-yue) support

https://gerrit.wikimedia.org/r/991469

Change 995170 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update MinT to 2024-01-30-080508-production

https://gerrit.wikimedia.org/r/995170

Change 1004805 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/machinetranslation@master] madlad-400: Remove Zhuang(za) as it is not supported

https://gerrit.wikimedia.org/r/1004805

Change 1004805 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] madlad-400: Remove Zhuang(za) as it is not supported

https://gerrit.wikimedia.org/r/1004805

Change 995170 merged by jenkins-bot:

[operations/deployment-charts@master] Update MinT to 2024-02-20-062448-production

https://gerrit.wikimedia.org/r/995170

Mentioned in SAL (#wikimedia-operations) [2024-02-21T05:45:00Z] <kart_> Updated MinT to 2024-02-20-062448-production (T333969, T354666)

Please do not consider yue (Cantonese) support at this time. Like NLLB/FLORES-200 this dataset again suffers from the common problem of "yue" data not being yue. To quote the paper itself, "pretty low quality; mostly not Canto". The evaluation scores alone are painfully low, and even worse than NLLB if I'm interpreting them correctly.

I tested it out on https://translate.wmcloud.org/ and unfortunately it seems like it can't handle anything longer. The output is a disappointing mix of zh_Hans and zh_Hant. I see that in https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/991469 zh-yue is incorrectly mapped to zh_Hant, which I would place the blame on; but still, its inability to even output zh_Hant properly is disheartening.

image.png (1×1 px, 63 KB)

Thanks for confirming this. When I saw that note in the paper I was not sure whether they detected the issue and filtered those items not in Cantonese, or they just identified the issue in the data to explain the unsolved problem.

@UOzurumba there are a couple of languages in the ticket for which there is no indication whether to enable MADLAD-400 or not. Can you confirm the status of conversations for:

  • Moroccan Arabic (ary)
  • Nias (nia)
UOzurumba updated the task description.

Sorry for the omission; I have updated the ticket to capture the conversation. The Moroccan Arabic community indicated in this ticket that the translation is in Standard Arabic, and a contributor to Nias Wikipedia mentioned that the translation is of poor quality.

Change #1031758 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/mediawiki-config@master] Enable Content/Section translation in io, nds, nds-nl and, mwl

https://gerrit.wikimedia.org/r/1031758

Change #1031758 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable Content/Section translation in io, nds, nds-nl and, mwl

https://gerrit.wikimedia.org/r/1031758

Mentioned in SAL (#wikimedia-operations) [2024-05-15T07:19:07Z] <kartik@deploy1002> Started scap: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]]

Mentioned in SAL (#wikimedia-operations) [2024-05-15T07:21:51Z] <kartik@deploy1002> kartik: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-05-15T07:31:28Z] <kartik@deploy1002> Started scap: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]]

Mentioned in SAL (#wikimedia-operations) [2024-05-15T07:34:08Z] <kartik@deploy1002> kartik: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-05-15T07:49:35Z] <kartik@deploy1002> Finished scap: Backport for [[gerrit:1031758|Enable Content/Section translation in io, nds, nds-nl and, mwl (T354666)]] (duration: 18m 06s)

Change #1031887 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/cxserver@master] Enable MinT for Wikipedia languages not supported by other services

https://gerrit.wikimedia.org/r/1031887

I was revisiting the paper and browsing the data the other day, and I realised yue didn't seem to have made the final release (which makes sense given the human evaluation comment). Same goes for wuu, which seems missing from the released MADLAD dataset despite being mentioned in the paper.

Looking back at https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslation/+/991469 it's now clear to me that zh-yue (Cantonese) was incorrectly mapped to zh-Hant (Mandarin in Traditional script), which would explain the poor results. Cantonese is simply not supported by the MADLAD MT model.
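
A mapping like the one described above, where the target code belongs to a different language, could in principle be caught by an automated check. The following is a minimal sketch under stated assumptions: `LEGACY_WIKI_CODES`, `primary_subtag` and `mapping_changes_language` are hypothetical names, the legacy-code table contains only the two Wikipedia codes mentioned in this task, and comparing primary language subtags is a heuristic rather than a full language-tag comparison.

```python
# Legacy Wikipedia codes mentioned in this task and their language subtags.
LEGACY_WIKI_CODES = {"zh-yue": "yue", "zh-min-nan": "nan"}


def primary_subtag(code: str) -> str:
    """Return the first subtag of a code such as 'nds_NL' or 'zh-Hant'."""
    return code.replace("_", "-").split("-")[0].lower()


def mapping_changes_language(wiki_code: str, model_code: str) -> bool:
    """True when the model code names a different language than the wiki code."""
    wiki_lang = LEGACY_WIKI_CODES.get(wiki_code, primary_subtag(wiki_code))
    return wiki_lang != primary_subtag(model_code)


if __name__ == "__main__":
    for wiki, model in (("zh-yue", "zh_Hant"), ("zh-min-nan", "nan_Latn_TW")):
        print(wiki, model, mapping_changes_language(wiki, model))
```

With this heuristic, the zh-yue to zh_Hant pair is flagged because `yue` and `zh` differ, while script or region suffixes such as `_Latn` or `_NL` are ignored.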

Change #1031887 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Enable MinT for Wikipedia languages not supported by other services

https://gerrit.wikimedia.org/r/1031887

Change #1034211 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2024-05-20-182409-production

https://gerrit.wikimedia.org/r/1034211

KartikMistry renamed this task from Enable MADLAD-400 in MinT test instance for Wikipedia languages not supported by other services to Enable MADLAD-400 in MinT test instance and Production for Wikipedia languages not supported by other services. May 21 2024, 5:08 AM
KartikMistry updated the task description.

The enablement seems complete for most wikis. However, MinT still needs to be enabled in Content/Section Translation for some languages:

  • Low German (nds)
  • Mirandese (mwl)
  • Min Nan Chinese (zh-min-nan)
  • Aragonese (an)
  • Cree (cr)
  • Inuktitut (iu)
  • Manx (gv)
  • Ossetic (os)
  • Bavarian (bar)

In addition, in the MinT test instance the results for Saraiki (skr) seem to be in English, while the Saraiki Wikipedia does not use a Latin script. We may want to double-check whether the model actually supports Saraiki in order to provide Saraiki translations, or disable it.

translate.wmcloud.org_(Wiki Tablet) (14).png (768×1 px, 87 KB)
skr.wikipedia.org_wiki_%D9%BE%DB%81%D9%84%D8%A7_%D9%BE%D8%B1%D8%AA(Wiki Tablet).png (768×1 px, 160 KB)
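
One rough way to double-check output like this is to measure how much of it is written in Latin script. The sketch below is only a heuristic and not part of MinT: `latin_letter_ratio` is a hypothetical helper, it relies on Unicode character names from Python's standard `unicodedata` module, and the sample string is the Saraiki page title decoded from the screenshot filename above.

```python
import unicodedata


def latin_letter_ratio(text: str) -> float:
    """Return the share of alphabetic characters whose Unicode name starts with LATIN."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN")
    )
    return latin / len(letters)


if __name__ == "__main__":
    # A ratio close to 1.0 for a Saraiki target would suggest the wrong script.
    print(latin_letter_ratio("This looks like English output"))
    print(latin_letter_ratio("پہلا پرت"))
```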

Change #1034211 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2024-05-20-182409-production

https://gerrit.wikimedia.org/r/1034211

Mentioned in SAL (#wikimedia-operations) [2024-05-27T06:25:06Z] <kart_> Updated cxserver to 2024-05-20-182409-production (T354666, T365230)

After the recent deployment, MinT is showing for most of the above languages when used in Content/Section Translation. However, that is not the case for Min Nan Chinese (zh-min-nan). A potential mismatch between the Wikipedia code (zh-min-nan) and the language code used by the model may be the cause.

Note how MinT is not listed when making a translation to Min Nan in Section Translation:

Screenshot 2024-05-27 at 16.24.45.png (737×557 px, 49 KB)

Change #1037947 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/cxserver@master] MinT: Add nan and map code with zh_min_nan

https://gerrit.wikimedia.org/r/1037947

Tested again with this translation and MinT is still not showing for Min Nan Chinese:

zh-min-nan.m.wikipedia.org_w_index.php_title=Special_ContentTranslation&active-list=suggestions&from=en&to=nan&page=Alpha%20Centauri(Wiki Mobile).png (568×320 px, 22 KB)

Change #1037947 merged by jenkins-bot:

[mediawiki/services/cxserver@master] MinT: Add nan and map code with zh_min_nan

https://gerrit.wikimedia.org/r/1037947

Change #1054340 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2024-07-15-100650-production

https://gerrit.wikimedia.org/r/1054340

Change #1054340 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2024-07-15-100650-production

https://gerrit.wikimedia.org/r/1054340

Mentioned in SAL (#wikimedia-operations) [2024-07-16T06:18:35Z] <kart_> Updated cxserver to 2024-07-15-100650-production (T354666)

We just enabled it, but the translation quality seems poor in zh-min-nan (it shows non-Chinese characters). You can check and confirm.

Pginer-WMF closed this task as Resolved. Edited Jul 16 2024, 8:46 AM

The Min Nan Wikipedia seems to use the Latin script for the language, so I think we can keep MinT enabled until we get more information from the community.