Add language-mapping data
Closed, Resolved · Public · 5 Estimated Story Points

Description

Add a /public/langs.json file that maps Wikimedia language codes (ISO 639-1 or BCP 47?) to each engine's codes. The presence of a mapping also indicates that the engine supports the language, so we'll be able to restrict the multiselect widget to the languages the selected engine supports.

For example:

{
    "en": {
        "tesseract": "eng",
        "google": "en"
    },
    "bn": {
        "tesseract": "ben",
        "google": "bn"
    },
    "uz": {
        "tesseract": "uzb",
        "google": "uz"
    },
    "uz-cyrl": {
        "tesseract": "uzb_cyrl",
        "google": "uz-Cyrl"
    },
    "he": {
        "tesseract": "heb",
        "google": "iw"
    },
    "it_old ??": {
        "tesseract": "ita_old"
    }
}

This way, we'll be able to display a localized list of available languages in the wiki UI.
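
As a rough illustration of how such a file could be consumed, here is a minimal Python sketch. The function names `engine_code` and `supported_langs` are hypothetical, and the data is a small subset based on the example above plus `frm` (which is Tesseract-only according to the discussion further down); the real langs.json and the tool's own code may look different.

```python
import json

# Illustrative subset only; not the real contents of langs.json.
LANGS_JSON = """
{
    "en": {"tesseract": "eng", "google": "en"},
    "he": {"tesseract": "heb", "google": "iw"},
    "uz-cyrl": {"tesseract": "uzb_cyrl", "google": "uz-Cyrl"},
    "frm": {"tesseract": "frm"}
}
"""


def engine_code(langs, wiki_code, engine):
    """Return the engine-specific code, or None if there is no mapping."""
    return langs.get(wiki_code, {}).get(engine)


def supported_langs(langs, engine):
    """Return the sorted Wikimedia codes that have a mapping for the engine."""
    return sorted(code for code, engines in langs.items() if engine in engines)


langs = json.loads(LANGS_JSON)
print(engine_code(langs, "he", "google"))
print(supported_langs(langs, "google"))
```

A missing engine key doubles as "not supported", which is what lets the same file drive both the code translation and the per-engine language list.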

Outstanding question: how do we deal with language codes such as ita_old, which have no direct mapping in ISO 639-1?

We also need to include localized language names, shown in the user's current interface language and filtered by engine.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Reedy renamed this task from Add langaguge-mapping JSON file to Add language-mapping JSON file. · May 13 2021, 1:28 AM
ldelench_wmf renamed this task from Add language-mapping JSON file to Add language-mapping data. · May 13 2021, 11:34 PM
ldelench_wmf updated the task description.
ldelench_wmf set the point value for this task to 5.
ldelench_wmf moved this task from New & TBD Tickets to Up Next (May 6-17) on the Community-Tech board.

PR: https://github.com/wikimedia/wikimedia-ocr/pull/33

This also addresses T281866: Wikimedia OCR: "-" and "_" being stripped from language codes and supersedes the work done for T282073: Add API endpoint to retrieve supported languages.


How the language list was built
Languages that were not included

Unavailable in newer versions of Tesseract:

  • Danish - Fraktur (contrib) (dan_frak)
  • German - Fraktur (contrib) (deu_frak)
  • Kurdish (Arabic Script) (kur)
  • Slovak - Fraktur (contrib) (slk_frak)
  • Tagalog (new - Filipino) (tgl)

Unavailable on Tesseract, but 'mapped' on Google (these should probably be included, but we'll have to figure out what Google maps them to):

  • Asturianu (ast)
  • Balinese (ban)
  • Bamanankan (bm)
  • Bashkir (ba)
  • Baso Minangkabau (min)
  • Bemba (bem)
  • Diné Bizaad (nv)
  • Gaelg (gv)
  • IsiXhosa (xh)
  • Kajin M̧ajeļ (mh)
  • Lingala (ln)
  • Luganda (lg)
  • Malagasy (mg)
  • Mapudungung (arn)
  • Nāhuatl (nah)
  • Ndonga (ng)
  • Norrœnt (non)
  • Hawaiian (haw)
  • Plattdüütsch (nds)
  • Rumantsch (rm)
  • Scots (sco)
  • Сахалыы (sah)
  • Чăваш (cv)
  • Удмурт (udm)
  • 𐌲𐌿𐍄𐌹𐍃𐌺 (got)

Languages listed at wikisource.org that are unsupported by both Google and Tesseract:

  • Alemannisch (gsw)
  • Aragonés (an)
  • Armâneascâ (rup)
  • Avestan (ae)
  • Bân-lâm-gú (nan)
  • Banjar (bjn)
  • Boarisch (bar)
  • Bolak (?)
  • بلوچی (bal)
  • ChiShona (sn)
  • Chono (?)
  • Crnogorski (cnr)
  • Davvisámegiella (se)
  • Dolnoserbšćina (dsb)
  • Estremeñu (ext)
  • Furlan (fur)
  • Gagauz (gag)
  • 客家語 / Hak-kâ-ngî (hak)
  • Hinónoʼeitíít (arp)
  • Hornjoserbšćina (hsb)
  • Ido (io)
  • Interlingua (ia)
  • Istriota (ist)
  • Judeo-Español (lad)
  • Karjala (krl)
  • Kaszëbsczi (csb)
  • Kemi Sami (sjk)
  • Kernewek (kw)
  • Kurdî / كوردي (ku)
  • Ladin (lld)
  • Laraʼ (lra)
  • Lese (les)
  • Lëtzebuergesch (lb)
  • Ligure (lij)
  • Limburgs (li)
  • Līvõ kēļ (liv)
  • Livvikarjalan (olo)
  • Lumbaart (lmo)
  • Lojban (jbo)
  • 閩北語 / Mâing-bă̤-ngṳ̌ (mnp)
  • 閩東語 / Mìng-dĕ̤ng-ngṳ (cdo)
  • Mirandés (mwl)
  • Morisien (mfe)
  • Napulitano (nap)
  • Nordfriisk (frr)
  • Norn (nrn)
  • Ɔl Maa (mas)
  • Onödowága (see)
  • Palau (pau)
  • Pāḷi / पाळि (pi)
  • Picard (pcd)
  • Piemontèis (pms)
  • Plautdietsch (pdt)
  • Wenska rec (pox)
  • Pó-sing-gṳ̂ (cpx)
  • Reo Tahiti (tah)
  • Rheifränggisch (pfl)
  • Rumârește (ruo)
  • Sahsisk (osx)
  • Salırça (slr)
  • Sardu (sc)
  • Seeltersk (stq)
  • Sicilianu (scn)
  • Sukuma (suk)
  • Tetun (tet)
  • Tupynã'mbá (tpn)
  • Tutonish (?)
  • Vèneto (vec)
  • Vepsän kel' (vep)
  • Volapük (vo)
  • Wymysöryś (wym)
  • Zazaki (diq)
  • Walon (wa)
  • Аҧсуа (ab)
  • Адыгэбзэ (ady)
  • Кӣллт са̄мь кӣлл (sjd)
  • Кыргызча (ky)
  • Кырык мары (mrj)
  • Кърымчах (jct)
  • Мокшень (mdf)
  • Нанай (gld)
  • Олык марий (mhr)
  • Перем Коми (koi)
  • Романы (rml)
  • Слове́нскїй (chu)
  • Словѣньскъ / ⰔⰎⰑⰂⰡⰐⰠⰔⰍⰟ (cu)
  • Эрзянь (myv)
  • Αρbε̰ρίσ̈τε (aat)
  • Ποντιακά (pnt)
  • მარგალური (xmf)
  • 𒀝𒂵𒌈 (akk)
  • Ϯⲁⲥⲡⲓ ̀ⲛⲣⲉⲙ̀ⲛⲬⲏⲙⲓ (cop)
  • বিষ্ণুপ্রিয়া মণিপুরী (bpy)
  • बडो (brx)
  • ꓡꓲꓢꓴ (lis)
  • मैथिली (mai)
  • कांगड़ी (xnr)
  • ᐃᓄᒃᑎᑐᑦ / Inuktitut (iu)
  • 文言文 (lzh)
  • ᡤᡳᠰᡠᠨ (mnc)
  • うちなーぐち (ryu)
  • ᡤᡞᠰᡠᠨ (sjo)
  • 𗼇𗟲 (txg)
  • 粵文 (yue)
  • عثمانلوجه (ota)
  • پنجابی (pnb)
  • ئۇيغۇرچە (ug)
  • Ancient Egyptian (egy)
  • Old Persian (peo)

Everything has been merged.

QA notes: this is the same chunk of work as T282073. See above (T282760#7130811) on how I made the list of languages. This was very tedious work, but I double-checked and I think we got all relevant languages added (including Google's experimental languages, and a few "mapped" languages where there are Tesseract equivalents). I will admit that it is still possible that I made a typo or two.

The question on what to do, if anything, about the outstanding "mapped" languages still stands.

Note this work effectively solves T281866 too since we only accept ISO 639-1 now.

dom_walden subscribed.

Tesseract

I compared our list from https://ocr-test.wmcloud.org/api/available_langs?engine=tesseract to what Tesseract claims to support in https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html.

@MusikAnimal I am unsure about some of the below.

Languages we list in our API but I am unsure about Tesseract support:

  • de-frk is apparently not supported in Tesseract 4.0.0. But works if I submit with this language.
  • fro perhaps should be frm (see https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes)? fro apparently is not supported in Tesseract
  • gv (or glv in ISO-639-2) is not listed by Tesseract. Get a 500 error when I try to use it.
  • ku (kur in ISO-639-2) not listed as supported by version 4.0.0. But it works if I submit with this language.
  • tl (tgl in ISO-639-2) not listed as supported by version 4.0.0. But works if I submit with this language.
  • zh, zh-hans, zh-hant not listed as supported by Tesseract, but chi_sim ("Chinese - Simplified") and chi_tra ("Chinese - Traditional") are. All three work if I submit them.

Languages Tesseract lists but we do not:

  • chi_sim, chi_tra see above. Someone who knows more about this might have to comment. I think zh-hans = Chinese Simplified and zh-hant = Chinese Traditional.
  • fil ("Filipino") this has no ISO-639-1 equivalent.
  • frk ("German - Fraktur") I could not find this as either ISO-639-1 or ISO-639-2. In ISO-639-3 it is listed as "Frankish" (https://iso639-3.sil.org/code/frk)
  • frm I think we mistakenly list this as fro. See above
  • kmr ("Kurmanji (Kurdish - Latin Script)") I cannot find this listed as ISO-639-1/2/3.
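
The comparison above was done by hand, but the same check can be sketched as a set difference. This is only an illustration: `compare_with_engine` is a hypothetical helper, and both the mapping and the engine list below are small hand-picked samples from this thread, not the full API output or the full Tesseract data-file list.

```python
def compare_with_engine(langs, engine, engine_supported):
    """Compare the engine codes we map to against the codes the engine lists.

    Returns (codes we map to that the engine does not list,
             codes the engine lists that nothing maps to), both sorted.
    """
    mapped = {engines[engine] for engines in langs.values() if engines.get(engine)}
    supported = set(engine_supported)
    return sorted(mapped - supported), sorted(supported - mapped)


# Hand-picked sample data for illustration only.
sample_langs = {
    "en": {"tesseract": "eng"},
    "de-frk": {"tesseract": "frk"},
    "fro": {"tesseract": "fro"},
}
sample_tesseract_list = ["eng", "frk", "frm", "fil"]

not_listed, not_mapped = compare_with_engine(
    sample_langs, "tesseract", sample_tesseract_list
)
print(not_listed)
print(not_mapped)
```

Comparing engine codes rather than Wikimedia codes avoids false alarms for entries such as de-frk, where the code we accept differs from the one passed to the engine.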

Google

I compared our list in https://ocr-test.wmcloud.org/api/available_langs?engine=google to what Google claims to support in https://cloud.google.com/vision/docs/languages.

We match all the "supported" and "experimental" languages and some of the "mapped" languages, except:

  • "ru-PETR1708" which is "supported" by Google, but we don't list it and I cannot find it in ISO-639-1/2/3

I found this bug while testing this: T284654.

Test environment: https://ocr-test.wmcloud.org Version 0.5.0-3-g1bf762c

Thank you for the thorough review! I needed a second set of eyes :) This was very tedious work.

MusikAnimal I am unsure about some of the below.

Sorry, I should have better explained how the mapping works. We only accept ISO 639-1 codes plus some special edge cases; the full list is in langs.json. The code you enter may therefore be transformed before it's passed to the engine.

I've noted the changes I made based on your feedback.

  • de-frk is apparently not supported in Tesseract 4.0.0. But works if I submit with this language.

It's a made-up language code to be consistent with what we're used to on the wiki. Deutsch should start with de, so I went with de-frk which gets mapped to frk, which does appear to be supported by Tesseract 4.0.

Good catch! fro and frm are different language variants. The former is supported only by Google, the latter only by Tesseract. I've corrected this.

  • gv (or glv in ISO-639-2) is not listed by Tesseract. Get a 500 error when I try to use it.

For some reason I left that blank in langs.json, which is why you get a 500. It is only available as a "mapped" language in Google, so I've removed it entirely from langs.json.
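
A blank entry like this could be caught before deployment with a small consistency check. The sketch below is an assumption about what "blank" might look like (an empty engine code or an entry with no engines at all); `find_blank_mappings` is a hypothetical helper, not part of the tool.

```python
def find_blank_mappings(langs):
    """Return (wiki_code, engine) pairs whose engine code is missing or empty.

    An entry with no engines at all is reported as (wiki_code, None).
    """
    problems = []
    for wiki_code, engines in langs.items():
        if not engines:
            problems.append((wiki_code, None))
            continue
        for engine, code in engines.items():
            if not isinstance(code, str) or not code.strip():
                problems.append((wiki_code, engine))
    return problems


# Illustrative data: the exact shape of the original gv entry is assumed.
sample = {
    "en": {"tesseract": "eng", "google": "en"},
    "gv": {"tesseract": ""},
}
print(find_blank_mappings(sample))
```

Running a check like this in a unit test would turn a runtime 500 into a failing build.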

  • ku (kur in ISO-639-2) not listed as supported by version 4.0.0. But it works if I submit with this language.

https://ku.wikipedia.org suggests Wikimedia's Kurdish (ku) is written in Latin script, so I have it mapped to kmr for Tesseract. I have, however, added kur to the list so users are able to choose Arabic script for Tesseract.

  • tl (tgl in ISO-639-2) not listed as supported by version 4.0.0. But works if I submit with this language.

This gets mapped to fil which is supported by Tesseract 4. I think the actual language it's supposed to be is tgl, as you say, but that is not supported by Tesseract 4, so I guess my thinking was that was better than nothing… but I could be wrong in making that assumption.

  • zh, zh-hans, zh-hant not listed as supported by Tesseract, but chi_sim ("Chinese - Simplified") and chi_tra ("Chinese - Traditional") are. All three work if I submit them.

All three I believe are correctly mapped in langs.json.

Languages Tesseract lists but we do not:

  • chi_sim, chi_tra see above. Someone who knows more about this might have to comment. I think zh-hans = Chinese Simplified and zh-hant = Chinese Traditional.

Same as previous.

  • fil ("Filipino") this has no ISO-639-1 equivalent.

I think it's actually tl. That's what Google uses as well as Wikimedia (see https://tl.wiktionary.org for example).

de-frk maps to frk.

  • frm I think we mistakenly list this as fro. See above

Yup, I've got that fixed.

  • kmr ("Kurmanji (Kurdish - Latin Script)") I cannot find this listed as ISO-639-1/2/3.

This is ku according to us (i.e. https://ku.wikipedia.org), which maps to kmr for Tesseract.

Google
...
We match all the "supported" and "experimental" languages and some of the "mapped" languages, except:

  • "ru-PETR1708" which is "supported" by Google, but we don't list it and I cannot find it in ISO-639-1/2/3

I've added it.

I found this bug while testing this: T284654.

Already fixed by Daimona! Thanks for finding that.

Thanks for looking into those for me @MusikAnimal.

Good catch! fro and frm are different language variants. The former is supported only by Google, the latter only by Tesseract. I've corrected this.

Does this need to be added to LANG_NAMES? It is blank at the moment.

I am not sure what the correct name is. According to https://en.wikipedia.org/wiki/Middle_French it should be françois; franceis. According to https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?code_ID=146 it should be français moyen (1400-1600).

Yes! Done with https://github.com/wikimedia/wikimedia-ocr/pull/43. I went with moyen français (1400-1600) per https://fr.wikipedia.org/wiki/Moyen_fran%C3%A7ais

Thanks. As this is such a small change, I will just move this into Done.