Review code mappings for MinT
Closed, Resolved (Public)

Description

Language codes are used to identify languages for Wikipedia and for the translation models used in MinT. Since there are different standards for language codes (using 2 or 3 letters) and exceptional cases, there is code in the system to map the language codes. This task will identify instances that may need fixing.

Mapping from Wikipedia codes to NLLB-200 codes

The mapping in this file may need revision for the following Wikipedia codes, which should map to NLLB-200 language codes:

  • "fil": "tgl_Latn". Expected: "tl": "tgl_Latn". Tagalog Wikipedia uses the "tl" code.
  • "gaz": "gaz_Latn". Expected "om": "gaz_Latn". Oromo Wikipedia uses the "om" code. The issue is present only in the "nllb200-600M" model. The "nllb-wikipedia" model uses the "om" code as expected, so MinT currently works for this language.
  • "npi": "npi_Deva". Expected "ne": "npi_Deva". Nepali Wikipedia uses "ne" code.
  • "swh": "swh_Latn". Expected "sw": "swh_Latn". Swahili Wikipedia uses "sw" code.
  • "als": "als_Latn". Expected "sq": "als_Latn". NLLB-200 uses "als" for Tosk Albanian. The Albanian Wikipedia (for all Albanian variants?) uses "sq". Note that Wikipedia uses the non-standard "als" code for Swiss German, which is unrelated to Tosk Albanian.
  • "plt": "plt_Latn". Expected: "mg": "plt_Latn". The Malagasy Wikipedia uses "mg" code.
  • "zsm": "zsm_Latn". Expected: "ms": "zsm_Latn". The Malay Wikipedia uses "ms" code.
  • "ory": "ory_Orya". Expected "or": "ory_Orya". Oriya Wikipedia uses "or" code.
  • "fuv": "fuv_Latn". Expected "ff": "fuv_Latn". Fulfulde Wikipedia uses "ff" code.
  • "dik": "dik_Latn". Expected "din": "dik_Latn". Dinka Wikipedia uses "din" code.
  • "azj": "azj_Latn". Expected "az": "azj_Latn". Azerbaijani Wikipedia uses "az" code.
  • "bho": "bho_Deva". Expected "bh": "bho_Deva". Bhojpuri Wikipedia uses "bh" code.
  • "kmr": "kmr_Latn". Expected "ku": "kmr_Latn". Kurdish Wikipedia uses "ku" code.
  • "lvs": "lvs_Latn". Expected "lv": "lvs_Latn". Latvian Wikipedia uses "lv" code.
  • "khk": "khk_Cyrl". Expected "mn": "khk_Cyrl". Mongolian Wikipedia uses "mn" code.
  • "pes": "pes_Arab". Expected "fa": "pes_Arab". Persian Wikipedia uses "fa" code.
  • "uzn": "uzn_Latn". Expected "uz": "uzn_Latn". Uzbek Wikipedia uses "uz" code.
  • "ydd": "ydd_Hebr". Expected "yi": "ydd_Hebr". Yiddish Wikipedia uses "yi" code.
  • "pbt": "pbt_Arab". Expected "ps": "pbt_Arab". Pashto Wikipedia uses "ps" code.
  • "quy": "quy_Latn". Expected "qu": "quy_Latn". Quechua Wikipedia uses "qu" code.
  • "ayr": "ayr_Latn". Expected "ay": "ayr_Latn". Aymara Wikipedia uses the "ay" code. The issue is present only in the "nllb200-600M" model. The "nllb-wikipedia" model uses the "ay" code as expected, so MinT currently works for this language.
  • Add: "arz": "arz_Arab" since support for Egyptian Arabic was missing.
  • Add: "ary": "ary_Arab" since support for Moroccan Arabic was missing.
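As a rough illustration of the expected result, the entries above could be keyed by Wikipedia code as in the Python sketch below. The names `WIKI_TO_NLLB` and `to_nllb_code` are hypothetical and not taken from the MinT codebase; the actual mapping file may be structured differently.

```python
from typing import Optional

# Hypothetical table keyed by Wikipedia language code, following the
# "Expected" entries listed in this task. Not the actual MinT file.
WIKI_TO_NLLB = {
    "tl": "tgl_Latn",   # Tagalog
    "om": "gaz_Latn",   # Oromo
    "ne": "npi_Deva",   # Nepali
    "sw": "swh_Latn",   # Swahili
    "sq": "als_Latn",   # Albanian (NLLB-200 "als" is Tosk Albanian)
    "mg": "plt_Latn",   # Malagasy
    "ms": "zsm_Latn",   # Malay
    "or": "ory_Orya",   # Odia
    "ff": "fuv_Latn",   # Fulfulde
    "din": "dik_Latn",  # Dinka
    "az": "azj_Latn",   # Azerbaijani
    "bh": "bho_Deva",   # Bhojpuri
    "ku": "kmr_Latn",   # Kurdish
    "lv": "lvs_Latn",   # Latvian
    "mn": "khk_Cyrl",   # Mongolian
    "fa": "pes_Arab",   # Persian
    "uz": "uzn_Latn",   # Uzbek
    "yi": "ydd_Hebr",   # Yiddish
    "ps": "pbt_Arab",   # Pashto
    "qu": "quy_Latn",   # Quechua
    "ay": "ayr_Latn",   # Aymara
    "arz": "arz_Arab",  # Egyptian Arabic (newly added)
    "ary": "ary_Arab",  # Moroccan Arabic (newly added)
}


def to_nllb_code(wiki_code: str) -> Optional[str]:
    """Return the NLLB-200 code for a Wikipedia code, or None if unmapped."""
    return WIKI_TO_NLLB.get(wiki_code)
```

In this sketch, looking up "als" returns None on purpose: on Wikipedia that code refers to Swiss German, so only "sq" should resolve to "als_Latn".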

List of supported languages

The list of supported languages in this file is expected to use the Wikipedia language codes. The following entries may need adjusting:

  • bjn is duplicated.
  • gaz. Expected: om
  • npi. Expected: ne
  • fil. Expected: tl
  • swh. Expected: sw
  • als. Expected: sq
  • plt. Expected: mg
  • zsm. Expected: ms
  • ory. Expected: or
  • fuv. Expected: ff
  • dik. Expected: din
  • azj. Expected: az
  • bho. Expected: bh
  • kmr. Expected: ku
  • lvs. Expected: lv
  • khk. Expected: mn
  • pes. Expected: fa
  • uzn. Expected: uz
  • ydd. Expected: yi
  • pbt. Expected: ps
  • quy. Expected: qu
  • ayr. Expected: ay
  • Add: arz under nllb200-600M since support for Egyptian Arabic was missing.
  • Add: ary under nllb200-600M since support for Moroccan Arabic was missing.
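A check along the following lines could help catch this kind of problem in a supported-languages list. It is only a sketch: `NLLB_TO_WIKI` and `review_supported` are hypothetical names, and the real list in the MinT configuration may need to be loaded and traversed differently.

```python
from collections import Counter

# Hypothetical lookup from NLLB-200 style codes to the Wikipedia codes
# expected in this task. Not part of the MinT codebase.
NLLB_TO_WIKI = {
    "gaz": "om", "npi": "ne", "fil": "tl", "swh": "sw", "als": "sq",
    "plt": "mg", "zsm": "ms", "ory": "or", "fuv": "ff", "dik": "din",
    "azj": "az", "bho": "bh", "kmr": "ku", "lvs": "lv", "khk": "mn",
    "pes": "fa", "uzn": "uz", "ydd": "yi", "pbt": "ps", "quy": "qu",
    "ayr": "ay",
}


def review_supported(codes):
    """Return (duplicated codes, {non-Wikipedia code: expected code})."""
    counts = Counter(codes)
    # Codes that appear more than once in the list.
    duplicates = sorted(code for code, n in counts.items() if n > 1)
    # Codes that should be replaced by their Wikipedia equivalent.
    replacements = {
        code: NLLB_TO_WIKI[code] for code in counts if code in NLLB_TO_WIKI
    }
    return duplicates, replacements
```

For example, `review_supported(["bjn", "khk", "bjn", "en"])` would report "bjn" as duplicated and suggest replacing "khk" with "mn".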

Event Timeline

Pginer-WMF triaged this task as Medium priority. May 11 2023, 4:37 PM

Maybe also fix ory and or. They both refer to Odia. For some reason, we seem to have both in NLLB. Wikimedia uses or, and it should be used as much as possible.

Thanks Amir! I added items to adjust Odia too.

Change 919828 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/machinetranslation@master] Update code mappings

https://gerrit.wikimedia.org/r/919828

For Norwegian Bokmål, the configuration uses "nb" as the code, but Wikipedia redirects "nb" to "no". Based on previous issues with Norwegian variants, it would be worth checking which code is best to use here.

Change 919828 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] Update language code mappings

https://gerrit.wikimedia.org/r/919828

Change 920250 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Updated MinT to 2023-05-16-112045-production

https://gerrit.wikimedia.org/r/920250

Change 920250 merged by jenkins-bot:

[operations/deployment-charts@master] Updated MinT to 2023-05-16-112045-production

https://gerrit.wikimedia.org/r/920250

@KartikMistry I think the following change was not actually made:

List of supported languages

The list of supported languages in this file is expected to use the Wikipedia language codes. It may need to adjust the following:
[...]

  • khk. Expected mn

Mongolian is expected to be listed there using the Wikipedia code (mn) instead of the NLLB-200 code (khk).

Change 924966 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[mediawiki/services/machinetranslation@master] config.yaml: Fix language code for Mongolian

https://gerrit.wikimedia.org/r/924966

Change 924966 merged by jenkins-bot:

[mediawiki/services/machinetranslation@master] config.yaml: Fix language code for Mongolian

https://gerrit.wikimedia.org/r/924966

Change 924915 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] MinT: Update to 2023-06-01-041041-production

https://gerrit.wikimedia.org/r/924915

Change 924915 merged by jenkins-bot:

[operations/deployment-charts@master] MinT: Update to 2023-06-01-041041-production

https://gerrit.wikimedia.org/r/924915

Mentioned in SAL (#wikimedia-operations) [2023-06-01T06:16:00Z] <kart_> Updated MinT to 2023-06-01-041041-production (T336525)

Thanks, Pau, for noticing this. Fixed and deployed.