Improve categorisation for languages that do not have ISO 639-3 code
Open, Needs Triage, Public

Description

For recordings in languages that do not have an ISO 639-3 code, categorisation needs to be improved. Currently such files end up in either "Category:Lingua Libre pronunciation-pas applicable" (https://commons.wikimedia.org/wiki/File:LL-Q942602-Davidgrosclaude-capleugi%C3%A8r.wav) or "Category:Lingua Libre pronunciation-other" (https://commons.wikimedia.org/wiki/File:LL-Q36759-Assassas77-%E5%8D%B5.wav).

For these languages, one idea would be to use a category like "Category:Lingua Libre pronunciation-XXX", where XXX is the name of the language in English.

In addition, Lingua Libre should check whether several languages without an ISO 639-3 code share the same name. If they do, the category should include the name of the country (which is stored in the Wikidata item of the language). The category would then be "Category:Lingua Libre pronunciation-XXX (YYY)", with XXX the language name in English and YYY the name of the country in English.
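The naming rule proposed above could be sketched as follows. This is only an illustration, not Lingua Libre's actual code: the `Language` record, the `category_names` function, and the assumption that languages with a code keep a category ending in that code are all hypothetical. The sample data in the usage note is made up.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Optional

PREFIX = "Category:Lingua Libre pronunciation-"


class Language(NamedTuple):
    """Hypothetical record holding what would be read from the Wikidata item."""
    qid: str
    name_en: str
    country_en: Optional[str]
    iso639_3: Optional[str]


def category_names(languages: Iterable[Language]) -> Dict[str, str]:
    """Map each language QID to a Commons category name.

    Languages with an ISO 639-3 code use the code (assumed current behaviour).
    Languages without one use their English name, plus the country in
    parentheses when another code-less language has the same name.
    """
    languages = list(languages)
    # Count English names among code-less languages only, to detect clashes.
    name_counts = Counter(l.name_en for l in languages if l.iso639_3 is None)

    result = {}
    for lang in languages:
        if lang.iso639_3:
            suffix = lang.iso639_3
        elif name_counts[lang.name_en] > 1 and lang.country_en:
            suffix = f"{lang.name_en} ({lang.country_en})"
        else:
            suffix = lang.name_en
        result[lang.qid] = PREFIX + suffix
    return result
```

For example, two code-less languages both named "Examplish", one with country "Freedonia" and one with "Sylvania", would get "Category:Lingua Libre pronunciation-Examplish (Freedonia)" and "Category:Lingua Libre pronunciation-Examplish (Sylvania)", while a uniquely named language keeps the plain "XXX" form. The sketch leaves one case open: two languages that share both name and country would still collide.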

Event Timeline

Pamputt created this task. Nov 3 2018, 10:25 AM
Yug added a subscriber: Yug.
Yug removed 0x010C as the assignee of this task. Edited Dec 23 2018, 7:59 PM
Yug added a subscriber: 0x010C.

Note: 0x010C started to use the ISO 639-3 code mis on the page https://lingualibre.fr/datasets/.

mis (originally an abbreviation for 'miscellaneous') is intended for languages which have not (yet) been included in the ISO standard.

See https://en.wikipedia.org/wiki/ISO_639-3#Non-language_codes.

Given that the filenames themselves start with the language-specific Lingua Libre QID, we could indeed categorise all these audio files in Category:Lingua Libre pronunciation-mis : )

Sascha added a subscriber: Sascha. Jan 11 2019, 10:53 AM

Have you considered using IETF BCP 47 language tags instead of ISO 639-3? Every language with an ISO code also has an IETF code (usually the same, since IETF draws on ISO 639, among others). Unlike ISO 639, however, IETF tags allow finer-grained distinctions. That’s why internet standards such as HTTP, HTML and XML use IETF BCP 47 instead of ISO 639. For example, Brazilian Portuguese, Sursilvan and Zürich German have IETF language tags but no ISO code. If Lingua Libre is asked to support a language without an IETF code, you can request the addition of a language tag.

Sascha added a subscriber: GerardM. Jan 11 2019, 8:45 PM

For languages that have no language code yet, perhaps Lingua Libre could use “mis-x-Q12345” (where Q12345 would be the Wikidata item for the language of the pronunciation audio). That would be a syntactically valid IETF BCP 47 tag, and you wouldn’t lump unrelated languages into the same category. Once the language does get a code, some bot could change the categories of uploaded files on Wikimedia Commons. @GerardM, what do you think?
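A minimal sketch of how such a tag might be built from a Wikidata QID. The function names are hypothetical, and using the tag as the category suffix is only an assumption. One caveat is worth checking before adopting this scheme: BCP 47 (RFC 5646) limits each private-use subtag to 1 to 8 alphanumeric characters, so a QID with more than seven digits would not fit in a single subtag.

```python
import re

# A Wikidata item ID: "Q" followed by a positive integer without leading zeros.
_QID = re.compile(r"Q[1-9][0-9]*")
# RFC 5646: each subtag after the "x" singleton is 1 to 8 alphanumerics.
_PRIVATE_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def private_use_tag(qid: str) -> str:
    """Return 'mis-x-<QID>' for a language that has no code yet."""
    if not _QID.fullmatch(qid):
        raise ValueError(f"not a Wikidata item ID: {qid!r}")
    if not _PRIVATE_SUBTAG.fullmatch(qid):
        # Longer QIDs would need a different encoding; left open here.
        raise ValueError(f"{qid!r} is too long for one private-use subtag")
    return f"mis-x-{qid}"


def category_for(qid: str) -> str:
    """Hypothetical category name using the tag as its suffix."""
    return "Category:Lingua Libre pronunciation-" + private_use_tag(qid)
```

With this sketch, `private_use_tag("Q12345")` yields `mis-x-Q12345`, while a ten-character ID such as `Q123456789` is rejected rather than silently producing an invalid tag.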

What do ISO codes have to do with Commons categorisation? That is a non-issue as far as I am concerned.

This approach seems interesting. But if we decide to change the categorisation that way, we should start using BCP 47 tags instead of ISO 639-3 language codes everywhere else, for consistency. As it would be a major change, I'll open an RfC on Lingua Libre to make sure there is no opposition.