Language codes identify languages both on Wikipedia and in the translation models used by MinT. Because the standards differ (two-letter versus three-letter codes) and there are exceptional cases, the system includes code that maps between them. This task identifies the mappings that may need fixing.
Mapping from Wikipedia codes to NLLB-200 codes
The mapping in this file may need revision for the following Wikipedia codes that map to NLLB-200 languages:
- "fil": "tgl_Latn". Expected: "tl": "tgl_Latn". Tagalog Wikipedia uses the "tl" code.
- "gaz": "gaz_Latn". Expected: "om": "gaz_Latn". Oromo Wikipedia uses the "om" code. The issue is only present in the "nllb200-600M" model. The "nllb-wikipedia" model uses the "om" code as expected, so MinT currently works for the language.
- "npi": "npi_Deva". Expected "ne": "npi_Deva". Nepali Wikipedia uses "ne" code.
- "swh": "swh_Latn". Expected "sw": "swh_Latn". Swahili Wikipedia uses "sw" code.
- "als": "als_Latn". Expected "sq": "als_Latn". NLLB-200 uses "als" for Tosk Albanian. The Albanian Wikipedia (for all Albanian variants?) uses "sq". Note that Wikipedia uses the non-standard "als" code for Swiss German, which is unrelated to Tosk Albanian.
- "plt": "plt_Latn". Expected: "mg": "plt_Latn". The Malagasy Wikipedia uses "mg" code.
- "zsm": "zsm_Latn". Expected: "ms": "zsm_Latn". The Malay Wikipedia uses "ms" code.
- "ory": "ory_Orya". Expected "or": "ory_Orya". Oriya Wikipedia uses "or" code.
- "fuv": "fuv_Latn". Expected "ff": "fuv_Latn". Fulfulde Wikipedia uses "ff" code.
- "dik": "dik_Latn". Expected "din": "dik_Latn". Dinka Wikipedia uses "din" code.
- "azj": "azj_Latn". Expected "az": "azj_Latn". Azerbaijani Wikipedia uses "az" code.
- "bho": "bho_Deva". Expected "bh": "bho_Deva". Bhojpuri Wikipedia uses "bh" code.
- "kmr": "kmr_Latn". Expected "ku": "kmr_Latn". Kurdish Wikipedia uses "ku" code.
- "lvs": "lvs_Latn". Expected "lv": "lvs_Latn". Latvian Wikipedia uses "lv" code.
- "khk": "khk_Cyrl". Expected "mn": "khk_Cyrl". Mongolian Wikipedia uses "mn" code.
- "pes": "pes_Arab". Expected "fa": "pes_Arab". Persian Wikipedia uses "fa" code.
- "uzn": "uzn_Latn". Expected "uz": "uzn_Latn". Uzbek Wikipedia uses "uz" code.
- "ydd": "ydd_Hebr". Expected "yi": "ydd_Hebr". Yiddish Wikipedia uses "yi" code.
- "pbt": "pbt_Arab". Expected "ps": "pbt_Arab". Pashto Wikipedia uses "ps" code.
- "quy": "quy_Latn". Expected "qu": "quy_Latn". Quechua Wikipedia uses "qu" code.
- "ayr": "ayr_Latn". Expected "ay": "ayr_Latn". Aymara Wikipedia uses the "ay" code. The issue is only present in the "nllb200-600M" model. The "nllb-wikipedia" model uses the "ay" code as expected, so MinT currently works for the language.
- Add: "arz": "arz_Arab" since support for Egyptian Arabic was missing.
- Add: "ary": "ary_Arab" since support for Moroccan Arabic was missing.
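The renames and additions listed above could be applied mechanically rather than by hand. The following is a minimal sketch, not the actual MinT code: the names `WIKIPEDIA_CODE_FOR`, `ADDITIONS`, and `fix_mapping` are hypothetical, and it assumes the mapping is a plain dictionary from language code to NLLB-200 code.

```python
# Hypothetical table: key currently used in the mapping -> Wikipedia code expected.
WIKIPEDIA_CODE_FOR = {
    "fil": "tl", "gaz": "om", "npi": "ne", "swh": "sw", "als": "sq",
    "plt": "mg", "zsm": "ms", "ory": "or", "fuv": "ff", "dik": "din",
    "azj": "az", "bho": "bh", "kmr": "ku", "lvs": "lv", "khk": "mn",
    "pes": "fa", "uzn": "uz", "ydd": "yi", "pbt": "ps", "quy": "qu",
    "ayr": "ay",
}

# Entries reported as missing in this task.
ADDITIONS = {"arz": "arz_Arab", "ary": "ary_Arab"}


def fix_mapping(mapping):
    """Return a copy of `mapping` keyed by Wikipedia language codes.

    `mapping` is assumed to be a dict of language code -> NLLB-200 code.
    Keys listed in WIKIPEDIA_CODE_FOR are renamed; other keys are kept.
    Missing entries from ADDITIONS are added without overwriting.
    """
    fixed = {}
    for code, nllb_code in mapping.items():
        fixed[WIKIPEDIA_CODE_FOR.get(code, code)] = nllb_code
    for code, nllb_code in ADDITIONS.items():
        fixed.setdefault(code, nllb_code)
    return fixed


if __name__ == "__main__":
    current = {"fil": "tgl_Latn", "en": "eng_Latn"}
    print(fix_mapping(current))
```

Note that a blanket rename of "als" to "sq" would need a manual check first, given the open question above about which Albanian variants the "sq" Wikipedia covers.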
List of supported languages
The list of supported languages in this file is expected to use the Wikipedia language codes. The following entries may need adjusting:
- bjn is duplicated.
- gaz. Expected: om
- npi. Expected: ne
- fil. Expected: tl
- swh. Expected: sw
- als. Expected: sq
- plt. Expected: mg
- zsm. Expected: ms
- ory. Expected: or
- fuv. Expected: ff
- dik. Expected: din
- azj. Expected: az
- bho. Expected: bh
- kmr. Expected: ku
- lvs. Expected: lv
- khk. Expected: mn
- pes. Expected: fa
- uzn. Expected: uz
- ydd. Expected: yi
- pbt. Expected: ps
- quy. Expected: qu
- ayr. Expected: ay
- Add: arz under nllb200-600M since support for Egyptian Arabic was missing.
- Add: ary under nllb200-600M since support for Moroccan Arabic was missing.
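Problems like the duplicated entry and the non-Wikipedia codes above could be caught with a small check. This is only a sketch under assumptions: `audit_languages` is a hypothetical helper, and the supported languages are assumed to be available as a flat list of code strings per model.

```python
from collections import Counter


def audit_languages(codes, renames):
    """Report duplicated codes and codes that should be renamed.

    `codes` is a list of language codes (assumed format).
    `renames` maps a code found in the list to the Wikipedia code expected.
    Returns (sorted list of duplicated codes, dict of code -> expected code).
    """
    counts = Counter(codes)
    duplicates = sorted(code for code, n in counts.items() if n > 1)
    needs_rename = {code: renames[code] for code in counts if code in renames}
    return duplicates, needs_rename


if __name__ == "__main__":
    supported = ["bjn", "en", "bjn", "gaz", "npi"]
    expected = {"gaz": "om", "npi": "ne"}
    duplicates, needs_rename = audit_languages(supported, expected)
    print("Duplicated:", duplicates)
    print("Rename:", needs_rename)
```

Running a check of this kind for each model's list separately would also show whether an issue affects only "nllb200-600M" or other models as well.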