
Enable all ISO 639-3 codes on Wikidata
Open, Needs Triage | Public | Feature

Description

Feature summary (what you would like to be able to do):
Enable all ISO 639-3 codes on Wikidata, and also Glottolog codes if possible. Preferably, we can do this for all of Wikidata, but Wikidata:Lexicographical data is a good place to start.
Wikidata currently supports only a few hundred language codes, whereas there are about 7,000 languages spoken in the world.
This creates a situation where editors from regions such as Africa and many parts of Asia cannot add their languages to Wikidata without having to resort to the wildcard code [mis].

A list of ISO 639-3 codes imported from Ethnologue (22nd edition) can be found at:
https://en.wiktionary.org/wiki/Appendix:ISO_639-3_codes

This list of codes can be imported.
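
As a rough illustration of what such an import would involve, the sketch below compares a list of ISO 639-3 codes against a set of already supported codes and returns the ones that would still need to be enabled. The function names and the sample data are hypothetical and are not taken from Wikidata or WikibaseLexeme; it also assumes the supported codes are expressed as ISO 639-3 identifiers. The special-purpose codes (mis, mul, und, zxx) and the local-use range qaa to qtz are skipped, since they do not name individual languages.

```python
import re

# ISO 639-3 identifiers are exactly three lowercase ASCII letters.
ISO_639_3_PATTERN = re.compile(r"[a-z]{3}")

# Special-purpose ISO 639-3 codes that do not name an individual language.
SPECIAL_CODES = {"mis", "mul", "und", "zxx"}


def is_private_use(code):
    """Return True for codes in the reserved local-use range qaa..qtz."""
    return "qaa" <= code <= "qtz"


def codes_to_enable(iso_codes, supported_codes):
    """Return the sorted ISO 639-3 codes that are well formed but not yet supported."""
    supported = set(supported_codes)
    missing = set()
    for raw in iso_codes:
        code = raw.strip().lower()
        if not ISO_639_3_PATTERN.fullmatch(code):
            continue  # not a three-letter code, e.g. an ISO 639-1 code
        if code in SPECIAL_CODES or is_private_use(code):
            continue  # placeholder or local-use code, not a real language
        if code not in supported:
            missing.add(code)
    return sorted(missing)


# Hypothetical sample data, not the real code lists.
iso_sample = ["nup", "nbl", "nde", "nso", "eng", "mis", "qaa", "NUP ", "en"]
supported_sample = {"eng", "fra"}
print(codes_to_enable(iso_sample, supported_sample))
```

With the sample data this prints ['nbl', 'nde', 'nso', 'nup']: the duplicate "NUP " is normalised, and "en", "mis" and "qaa" are filtered out.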

Glottolog codes can be imported from:
https://glottolog.org/glottolog/language

Glottolog codes include dialects, which ISO does not support, and also cover many recently described languages that have no ISO codes.

Steps to reproduce (a list of clear steps to create the situation that made you report this, including full links if applicable):

See below.

Use case(s) (describe the actual underlying problem which you want to solve, and not only a solution):

Creating Wikidata items for the world's 6,000+ little-known languages is currently very clunky, which discourages users from contributing data in those languages.

I tried creating a lexeme for bàbò (the Nupe (Q36720) word for Lagenaria siceraria (Q1277255)), but technical restrictions prevented me from doing so at first, since the ISO code [nup] is not currently supported by Wikidata. Initially, I also could not add the Nupe name to Lagenaria siceraria (Q1277255), since Wikidata items cannot be linked to Incubator pages. In the future, I would like to add lexemes for dozens of African languages that do not yet have any officially launched wikis, but it appears that Wikidata cannot yet support this.

At Wikidata talk:Lexicographical data, So9q said that lexemes for which we do not yet have selectable language codes can be given "mis" as language code. He created bàbò (L585993) as a test. The template for the "create new lexeme"-page could be improved.

For example, see a list of fish and plant species names in the Day language. The goal is to enable Day, or any other language with an ISO code, to be added without having to resort to [mis].
https://en.wikipedia.org/wiki/Day_language

Event Timeline

The above-mentioned ISO 639 (1, 2 or 3), Glottolog, Ethnologue (https://www.ethnologue.com/) and the like do indeed have some issues, but I think some support (e.g., populating Wikidata with any one or all of them) would be preferable to the status quo of “no support, but with a ‘mis’ hack or a ‘code in the spelling variant textbox’ hack after error messages in the Wikidata interface”. Alternatively, Wikidata could at least directly support, without error messages, those languages that are in some way an official language in the country or countries where they are spoken, and those languages for which there is already a Wikipedia.

For instance, isiNdebele / Ndebele (Southern: ISO 639-3 nbl, ISO 639-1 nr; Northern: ISO 639-3 nde, ISO 639-1 nd), one of the 11 official languages of South Africa, is not recognised by Wikidata for adding lexeme data, even though it has Wikipedia entries (south and north, https://en.wikipedia.org/wiki/Ndebele) and a Wikidata entry as a single language, https://www.wikidata.org/wiki/Q13155057. The same holds (even more so?) for another official language of South Africa, Sepedi/Pedi/Northern Sotho/Sesotho sa Leboa (https://www.wikidata.org/wiki/Q33890), which does have its own Wikipedia, https://nso.wikipedia.org/wiki/Letlakala_la_pele, but somehow still gives an "unrecognised language" error on the “create a new lexeme” page after I selected the language from the drop-down list. Putting ‘nso’ in the spelling variant box after the error message made it work, but that is a workaround.
I’ve tried these with the lexemes gauta and phiri (nso) and umuntu (isiNdebele).

If the wikis want to be more inclusive, recognising languages without interface errors and workarounds is imperative.

(For context: I stumbled upon this issue whilst trying to add sample lexeme data to Wikidata for the NLG component of Abstract Wikipedia.)

I would like to support the idea that all ISO 639-3 language codes be supported by Wikidata (and Abstract Wikipedia). Notwithstanding @mrephabricator's comments, this is the de facto standard for enumerating all the world's languages, and the Ethnologue, on which the standard is based, is generally accepted as a scientifically solid resource (even though it may contain some errors). The ideological background of SIL International is in my opinion irrelevant, but I must note that the claim that they have no linguistic background is completely false. In fact, this organization has conducted extensive linguistic fieldwork in numerous parts of the world, and many of its members are trained linguists, the most famous one being Kenneth Pike.

Instead of relying on a slow process to add languages, we should rely on the existing standard. If there are some specific errors we want to avoid, we can (after discussion) block the use of some codes.

Change 828887 had a related patch set uploaded (by Ariel Gutman; author: L10n-bot):

[mediawiki/extensions/WikibaseLexeme@master] Add language codes of Ndebele (Northern/Southern)

https://gerrit.wikimedia.org/r/828887