Add language-mapping data
Closed, Resolved · Public · 5 Estimated Story Points

Assigned To

Authored By

	Samwilson
	May 13 2021, 1:25 AM

Description

Add a /public/langs.json file that maps Wikimedia language codes (ISO 639-1 or BCP 47?) to each engine's codes. The mapping will also indicate which engines support a language, so we'll be able to switch the multiselect widget to show only the supported ones.

For example:

{
    "en": {
        "tesseract": "eng",
        "google": "en"
    },
    "bn": {
        "tesseract": "ben",
        "google": "bn"
    },
    "uz": {
        "tesseract": "uzb",
        "google": "uz"
    },
    "uz-cyrl": {
        "tesseract": "uzb_cyrl",
        "google": "uz-Cyrl"
    },
    "he": {
        "tesseract": "heb",
        "google": "iw"
    },
    "it_old ??": {
        "tesseract": "ita_old"
    }
}

This way, we'll be able to display a localized list of available languages in the wiki UI.
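
A minimal sketch of how such a mapping could be consumed, assuming the JSON has been loaded into a dict. The function names `engine_code` and `supported_langs` are hypothetical and not taken from the wikimedia-ocr code base; the sample data is a subset of the example above, and `it_old` is only a placeholder key since the right code for it is still an open question.

```python
import json

# Sample data in the shape proposed above (a subset, for illustration only).
# "it_old" is a placeholder key; the task leaves the right code open.
LANGS_JSON = """
{
    "en": {"tesseract": "eng", "google": "en"},
    "he": {"tesseract": "heb", "google": "iw"},
    "uz-cyrl": {"tesseract": "uzb_cyrl", "google": "uz-Cyrl"},
    "it_old": {"tesseract": "ita_old"}
}
"""

LANGS = json.loads(LANGS_JSON)


def engine_code(lang, engine):
    """Return the engine-specific code for a Wikimedia code, or None if unsupported."""
    return LANGS.get(lang, {}).get(engine)


def supported_langs(engine):
    """Return the sorted Wikimedia codes that the given engine can handle."""
    return sorted(code for code, engines in LANGS.items() if engine in engines)
```

With this shape, a missing engine key doubles as the "not supported" signal, so the widget could be filled from `supported_langs("google")` without a separate support table.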

Outstanding question: how do we deal with language codes such as ita_old which don't have direct mappings in ISO639-1?

We also need to include localized language names in the user's current interface language, depending on the engine.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		ldelench_wmf	T282052 SPIKE: Investigate complexity of language sharing in OCRs
		Resolved		MusikAnimal	T282760 Add language-mapping data

Event Timeline

Samwilson created this task. May 13 2021, 1:25 AM

Restricted Application added a project: Community-Tech. May 13 2021, 1:25 AM

Restricted Application added a subscriber: Aklapper.

Reedy renamed this task from Add langaguge-mapping JSON file to Add language-mapping JSON file. May 13 2021, 1:28 AM

ldelench_wmf renamed this task from Add language-mapping JSON file to Add language-mapping data. May 13 2021, 11:34 PM

ldelench_wmf updated the task description. (Show Details)

ldelench_wmf set the point value for this task to 5.

ldelench_wmf moved this task from New & TBD Tickets to Up Next (May 6-17) on the Community-Tech board.

Samwilson merged a task: T281913: Wikisource OCR: add language validation support to experimental & mapped languages. May 14 2021, 12:12 AM

Samwilson mentioned this in T281913: Wikisource OCR: add language validation support to experimental & mapped languages.

Samwilson added subscribers: ifried, NRodriguez.

MusikAnimal claimed this task. May 27 2021, 12:11 AM

MusikAnimal edited projects, added Community-Tech (CommTech-Sprint-1); removed Community-Tech.

MusikAnimal moved this task from Ready 🎬 to In Development 💻 on the Community-Tech (CommTech-Sprint-1) board.

MusikAnimal mentioned this in T282073: Add API endpoint to retrieve supported languages. May 27 2021, 4:29 PM

MusikAnimal mentioned this in T281866: Wikimedia OCR: "-" and "_" being stripped from language codes. Jun 2 2021, 10:28 PM

PR: https://github.com/wikimedia/wikimedia-ocr/pull/33

This also addresses T281866: Wikimedia OCR: "-" and "_" being stripped from language codes and supersedes the work done for T282073: Add API endpoint to retrieve supported languages.

How the language list was built

I went by the list of existing Wikisources at https://wikisource.org/wiki/Main_Page#Languages_at_Wikisource (including those exclusive to Multilingual Wikisource)
Included all supported and experimental languages listed at https://cloud.google.com/vision/docs/languages, as well as some mapped languages (where there were Tesseract equivalents)
Included all languages supported by Tesseract: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
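
The exclusion groups listed in this comment follow from simple set operations on those three sources. A rough sketch, with tiny sample sets standing in for the real lists (which were compiled by hand from the pages linked above); the variable names and the `classify` helper are hypothetical:

```python
# Tiny illustrative samples; the real lists were compiled by hand.
wikisource = {"en", "gv", "vo", "ast"}   # languages with a Wikisource
tesseract = {"en"}                        # languages with Tesseract data files
google_supported = {"en"}                 # Google "supported" or "experimental"
google_mapped = {"gv", "ast"}             # Google "mapped" only


def classify(lang):
    """Place a language in one of the groups used in this comment."""
    if lang in tesseract or lang in google_supported:
        return "included"
    if lang in google_mapped:
        return "mapped on Google only"
    return "unsupported by both"


# Group the Wikisource languages by category, in a stable order.
groups = {}
for lang in sorted(wikisource):
    groups.setdefault(classify(lang), []).append(lang)
```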

Languages that were not included

Unavailable in newer versions of Tesseract:

Danish - Fraktur (contrib) (dan_frak)
German - Fraktur (contrib) (deu_frak)
Kurdish (Arabic Script) (kur)
Slovak - Fraktur (contrib) (slk_frak)
Tagalog (new - Filipino) (tgl)

Unavailable on Tesseract, but 'mapped' on Google (these should probably be included, but we'll have to figure out what Google maps them to):

Asturianu (ast)
Balinese (ban)
Bamanankan (bm)
Bashkir (ba)
Baso Minangkabau (min)
Bemba (bem)
Diné Bizaad (nv)
Gaelg (gv)
IsiXhosa (xh)
Kajin M̧ajeļ (mh)
Lingala (ln)
Luganda (lg)
Malagasy (mg)
Mapudungun (arn)
Nāhuatl (nah)
Ndonga (ng)
Norrœnt (non)
Hawaiian (haw)
Plattdüütsch (nds)
Rumantsch (rm)
Scots (sco)
Сахалыы (sah)
Чăваш (cv)
Удмурт (udm)
𐌲𐌿𐍄𐌹𐍃𐌺 (got)

Languages listed at wikisource.org that are unsupported by both Google and Tesseract:

Alemannisch (gsw)
Aragonés (an)
Armâneascâ (rup)
Avestan (ae)
Bân-lâm-gú (nan)
Banjar (bjn)
Boarisch (bar)
Bolak (?)
بلوچی (bal)
ChiShona (sn)
Chono (?)
Crnogorski (cnr)
Davvisámegiella (se)
Dolnoserbšćina (dsb)
Estremeñu (ext)
Furlan (fur)
Gagauz (gag)
客家語 / Hak-kâ-ngî (hak)
Hinónoʼeitíít (arp)
Hornjoserbšćina (hsb)
Ido (io)
Interlingua (ia)
Istriota (ist)
Judeo-Español (lad)
Karjala (krl)
Kaszëbsczi (csb)
Kemi Sami (sjk)
Kernewek (kw)
Kurdî / كوردي (ku)
Ladin (lld)
Laraʼ (lra)
Lese (les)
Lëtzebuergesch (lb)
Ligure (lij)
Limburgs (li)
Līvõ kēļ (liv)
Livvikarjalan (olo)
Lumbaart (lmo)
Lojban (jbo)
閩北語 / Mâing-bă̤-ngṳ̌ (mnp)
閩東語 / Mìng-dĕ̤ng-ngṳ (cdo)
Mirandés (mwl)
Morisien (mfe)
Napulitano (nap)
Nordfriisk (frr)
Norn (nrn)
Ɔl Maa (mas)
Onödowága (see)
Palau (pau)
Pāḷi / पाळि (pi)
Picard (pcd)
Piemontèis (pms)
Plautdietsch (pdt)
Wenska rec (pox)
Pó-sing-gṳ̂ (cpx)
Reo Tahiti (tah)
Rheifränggisch (pfl)
Rumârește (ruo)
Sahsisk (osx)
Salırça (slr)
Sardu (sc)
Seeltersk (stq)
Sicilianu (scn)
Sukuma (suk)
Tetun (tet)
Tupynã'mbá (tpn)
Tutonish (?)
Vèneto (vec)
Vepsän kel' (vep)
Volapük (vo)
Wymysöryś (wym)
Zazaki (diq)
Walon (wa)
Аҧсуа (ab)
Адыгэбзэ (ady)
Кӣллт са̄мь кӣлл (sjd)
Кыргызча (ky)
Кырык мары (mrj)
Кърымчах (jct)
Мокшень (mdf)
Нанай (gld)
Олык марий (mhr)
Перем Коми (koi)
Романы (rml)
Слове́нскїй (chu)
Словѣньскъ / ⰔⰎⰑⰂⰡⰐⰠⰔⰍⰟ (cu)
Эрзянь (myv)
Αρbε̰ρίσ̈τε (aat)
Ποντιακά (pnt)
მარგალური (xmf)
𒀝𒂵𒌈 (akk)
Ϯⲁⲥⲡⲓ ̀ⲛⲣⲉⲙ̀ⲛⲬⲏⲙⲓ (cop)
বিষ্ণুপ্রিয়া মণিপুরী (bpy)
बडो (brx)
ꓡꓲꓢꓴ (lis)
मैथिली (mai)
कांगड़ी (xnr)
ᐃᓄᒃᑎᑐᑦ / Inuktitut (iu)
文言文 (lzh)
ᡤᡳᠰᡠᠨ (mnc)
うちなーぐち (ryu)
ᡤᡞᠰᡠᠨ (sjo)
𗼇𗟲 (txg)
粵文 (yue)
عثمانلوجه (ota)
پنجابی (pnb)
ئۇيغۇرچە (ug)
Ancient Egyptian (egy)
Old Persian (peo)

ldelench_wmf added a parent task: T282052: SPIKE: Investigate complexity of language sharing in OCRs. Jun 3 2021, 5:22 PM

ldelench_wmf mentioned this in T282052: SPIKE: Investigate complexity of language sharing in OCRs.

Follow-up PR at https://github.com/wikimedia/wikimedia-ocr/pull/34

ldelench_wmf moved this task from CommTech-Sprint-1 to CommTech-Sprint-2 on the Community-Tech board. Jun 7 2021, 4:06 PM

ldelench_wmf edited projects, added Community-Tech (CommTech-Sprint-2); removed Community-Tech (CommTech-Sprint-1).

ldelench_wmf moved this task from Ready 🎬 to Review/Feedback 💬 on the Community-Tech (CommTech-Sprint-2) board.

ldelench_wmf moved this task from Backlog to 🌟Top Priority on the Wikimedia OCR board. Jun 7 2021, 9:14 PM

Everything has been merged.

QA notes: this is the same chunk of work as T282073. See above (T282760#7130811) on how I made the list of languages. This was very tedious work, but I double-checked and I think we got all relevant languages added (including Google's experimental languages, and a few "mapped" languages where there are Tesseract equivalents). I will admit that it is still possible that I made a typo or two.

The question on what to do, if anything, about the outstanding "mapped" languages still stands.

Note this work effectively solves T281866 too since we only accept ISO 639-1 now.

Tesseract

I compared our list from https://ocr-test.wmcloud.org/api/available_langs?engine=tesseract to what Tesseract claims to support in https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html.

@MusikAnimal I am unsure about some of the below.

Languages we list in our API but I am unsure about Tesseract support:

de-frk is apparently not supported in Tesseract 4.0.0. But works if I submit with this language.
fro perhaps should be frm (see https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes)? fro apparently is not supported in Tesseract
gv (or glv in ISO-639-2) is not listed by Tesseract. Get a 500 error when I try to use it.
ku (kur in ISO-639-2) not listed as supported by version 4.0.0. But it works if I submit with this language.
tl (tgl in ISO-639-2) not listed as supported by version 4.0.0. But works if I submit with this language.
zh, zh-hans, zh-hant not listed as supported by Tesseract, but chi_sim ("Chinese - Simplified") and chi_tra ("Chinese - Traditional") are. All three work if I submit them.

Languages Tesseract lists but we do not:

chi_sim, chi_tra see above. Someone who knows more about this might have to comment. I think zh-hans = Chinese Simplified and zh-hant = Chinese Traditional.
fil ("Filipino") this has no ISO-639-1 equivalent.
frk ("German - Fraktur") I could not find this as either ISO-639-1 or ISO-639-2. In ISO-639-3 it is listed as "Frankish" (https://iso639-3.sil.org/code/frk)
frm I think we mistakenly list this as fro. See above
kmr ("Kurmanji (Kurdish - Latin Script)") I cannot find this listed as ISO-639-1/2/3.
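
A comparison like this one could be partly automated. A sketch, assuming the mapping and the engine's published code list are already available as Python objects; `diff_engine_support` is a hypothetical helper, and the sample values are illustrative rather than a copy of the real langs.json.

```python
def diff_engine_support(mapping, engine, published):
    """Compare the engine codes we map to against the codes the engine publishes.

    Returns (ours_only, engine_only): codes we send that the engine does not
    list, and codes the engine lists that nothing in our mapping points to.
    """
    # Collect every non-empty code our mapping would pass to this engine.
    ours = {engines[engine] for engines in mapping.values() if engines.get(engine)}
    return sorted(ours - published), sorted(published - ours)


# Illustrative sample only.
sample_mapping = {
    "en": {"tesseract": "eng", "google": "en"},
    "fro": {"tesseract": "fro", "google": "fro"},
    "de-frk": {"tesseract": "frk"},
}
sample_published = {"eng", "frk", "frm"}

ours_only, engine_only = diff_engine_support(sample_mapping, "tesseract", sample_published)
```

Comparing engine codes rather than Wikimedia codes avoids false alarms for entries such as de-frk, which only exist on our side of the mapping.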

Google

I compared our list in https://ocr-test.wmcloud.org/api/available_langs?engine=google to what Google claims to support in https://cloud.google.com/vision/docs/languages.

We match all the "supported" and "experimental" languages and some of the "mapped" languages, except:

"ru-PETR1708" which is "supported" by Google, but we don't list it and I cannot find it in ISO-639-1/2/3

I found this bug while testing this: T284654.

Test environment: https://ocr-test.wmcloud.org Version 0.5.0-3-g1bf762c

In T282760#7145928, @dom_walden wrote:

Thank you for the thorough review! I needed a second set of eyes :) This was very tedious work.

MusikAnimal I am unsure about some of the below.

Sorry, I should have better explained how the mapping works. We only accept ISO 639-1 and some special edge cases. The full list is in langs.json, so a language code you enter might get transformed when it's passed to the engine.

I've noted the changes I've made based on your feedback.

de-frk is apparently not supported in Tesseract 4.0.0. But works if I submit with this language.

It's a made-up language code to be consistent with what we're used to on the wiki. Deutsch should start with de, so I went with de-frk which gets mapped to frk, which does appear to be supported by Tesseract 4.0.

fro perhaps should be frm (see https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes)? fro apparently is not supported in Tesseract

Good catch! fro and frm are different language variants. The former is supported only by Google, the latter only by Tesseract. I've corrected this.

gv (or glv in ISO-639-2) is not listed by Tesseract. Get a 500 error when I try to use it.

I for some reason left that blank in langs.json, which is why you get a 500. It is only available as a "mapped" language in Google, so I've removed it entirely from langs.json
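
A blank engine code of this kind can be caught before it reaches the engine. A minimal sketch, assuming the mapping is loaded as a dict; `find_blank_codes` is a hypothetical helper, and the sample entry only imitates the blank gv value described here.

```python
def find_blank_codes(mapping):
    """Return (lang, engine) pairs whose engine code is empty or not a string."""
    problems = []
    for lang, engines in mapping.items():
        for engine, code in engines.items():
            if not isinstance(code, str) or not code.strip():
                problems.append((lang, engine))
    return sorted(problems)


# Illustrative sample: "gv" has a blank Tesseract code.
sample = {
    "en": {"tesseract": "eng", "google": "en"},
    "gv": {"tesseract": ""},
}
blank = find_blank_codes(sample)
```

Running a check like this over the whole file in a unit test would flag such entries at review time instead of as a 500 at request time.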

ku (kur in ISO-639-2) not listed as supported by version 4.0.0. But it works if I submit with this language.

https://ku.wikipedia.org suggests Wikimedia's Kurdish (ku) uses Latin script, so I have it mapped to kmr for Tesseract. I have, however, added kur to the list, so users are able to choose Arabic script for Tesseract.

tl (tgl in ISO-639-2) not listed as supported by version 4.0.0. But works if I submit with this language.

This gets mapped to fil which is supported by Tesseract 4. I think the actual language it's supposed to be is tgl, as you say, but that is not supported by Tesseract 4, so I guess my thinking was that was better than nothing… but I could be wrong in making that assumption.

zh, zh-hans, zh-hant not listed as supported by Tesseract, but chi_sim ("Chinese - Simplified") and chi_tra ("Chinese - Traditional") are. All three work if I submit them.

All three I believe are correctly mapped in langs.json.

Languages Tesseract lists but we do not:

chi_sim, chi_tra see above. Someone who knows more about this might have to comment. I think zh-hans = Chinese Simplified and zh-hant = Chinese Traditional.

Same as previous.

fil ("Filipino") this has no ISO-639-1 equivalent.

I think it's actually tl. That's what Google uses as well as Wikimedia (see https://tl.wiktionary.org for example).

frk ("German - Fraktur") I could not find this as either ISO-639-1 or ISO-639-2. In ISO-639-3 it is listed as "Frankish" (https://iso639-3.sil.org/code/frk)

de-frk maps to frk.

frm I think we mistakenly list this as fro. See above

Yup, I've got that fixed.

kmr ("Kurmanji (Kurdish - Latin Script)") I cannot find this listed as ISO-639-1/2/3.

This is ku according to us (i.e. https://ku.wikipedia.org), which maps to kmr for Tesseract.

Google
...
We match all the "supported" and "experimental" languages and some of the "mapped" languages, except:

"ru-PETR1708" which is "supported" by Google, but we don't list it and I cannot find it in ISO-639-1/2/3

I've added it.

I found this bug while testing this: T284654.

Already fixed by Daimona! Thanks for finding that.

Additional follow-up PR at https://github.com/wikimedia/wikimedia-ocr/pull/40

PR 40 merged.

Thanks for looking into those for me @MusikAnimal.

In T282760#7147517, @MusikAnimal wrote:

In T282760#7145928, @dom_walden wrote:

fro perhaps should be frm (see https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes)? fro apparently is not supported in Tesseract

Good catch! fro and frm are different language variants. The former is supported only by Google, the latter only by Tesseract. I've corrected this.

Does this need to be added to LANG_NAMES? It is blank at the moment.

I am not sure what the correct name is. According to https://en.wikipedia.org/wiki/Middle_French it should be françois; franceis. According to https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?code_ID=146 it should be français moyen (1400-1600).

In T282760#7148008, @dom_walden wrote:

Good catch! fro and frm are different language variants. The former is supported only by Google, the latter only by Tesseract. I've corrected this.

Does this need to be added to LANG_NAMES? It is blank at the moment.

I am not sure what the correct name is. According to https://en.wikipedia.org/wiki/Middle_French it should be françois; franceis. According to https://www.loc.gov/standards/iso639-2/php/langcodes_name.php?code_ID=146 it should be français moyen (1400-1600).

Yes! Done with https://github.com/wikimedia/wikimedia-ocr/pull/43. I went with moyen français (1400-1600) per https://fr.wikipedia.org/wiki/Moyen_fran%C3%A7ais

Merged.

In T282760#7156232, @MusikAnimal wrote:

Yes! Done with https://github.com/wikimedia/wikimedia-ocr/pull/43. I went with moyen français (1400-1600) per https://fr.wikipedia.org/wiki/Moyen_fran%C3%A7ais

Thanks. As this is such a small change, I will just move this into Done.

MusikAnimal closed this task as Resolved. Jun 18 2021, 6:51 PM
