
Add English names for languages which don't yet have one
Closed, ResolvedPublic

Description

There are a number of languages which currently don't display an English name when used on Wikidata, e.g. many of the examples on this page.

Could the following English names be added?

  • abe: "Western Abenaki"
  • ady-cyrl: "Adyghe (Cyrillic script)"
  • aeb-arab: "Tunisian Arabic (Arabic script)"
  • aeb-latn: "Tunisian Arabic (Latin script)"
  • azb: "South Azerbaijani"
  • bxr: "Buryat" [added as "Russia Buriat"]
  • dty: "Doteli"
  • ett: "Etruscan"
  • fkv: "Kven"
  • lbe: "Lak"
  • kbd-cyrl: "Kabardian (Cyrillic script)"
  • ko-kp: "Korean (North Korea)"
  • koy: "Koyukon"
  • ku-arab: "Kurdish (Arabic script)"
  • lld: "Ladin"
  • mo: "Moldovan"
  • moe: "Montagnais"
  • nl-informal: "Dutch (informal address)"
  • nys: "Noongar"
  • nod: "Northern Thai"
  • otk: "Old Turkish"
  • roa-tara: "Tarantino"
  • rwr: "Marwari (India)"
  • shi-latn: "Tachelhit (Latin script)"
  • shi-tfng: "Tachelhit (Tifinagh script)"
  • sje: "Pite Sami"
  • tzl: "Talossan"
  • zh-mo: "Chinese (Macau)"
  • zh-my: "Chinese (Malaysia)"

The special code "mis" is also missing an English name. http://www-01.sil.org/iso639-3/documentation.asp?id=mis calls it "Uncoded languages" but perhaps something like "other language" or "unsupported language" would be better for the way it's used in Wikidata.

Also, while I'm requesting updates, I think the following three should be changed:

  • bbc-latn: Change "Batak Toba" to "Batak Toba (Latin script)"
  • gan-hans: Change "Simplified Gan script" to "Gan (Simplified)"
  • gan-hant: Change "Traditional Gan script" to "Gan (Traditional)"

"Batak Toba" is currently used as the name for both bbc and bbc-latn so they aren't distinguishable. All other -latn codes include "(Latin script)" in the name.

For Gan, the current names sound really odd (the phrasing "... script" is used for script names, not languages). We normally put script information in brackets after the language name, so that's what I've suggested here. It would also match the way cjy-hans/cjy-hant are named in this file.
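A clash like the bbc/bbc-latn one can be found mechanically. The following is a minimal sketch, not code from the CLDR extension: `find_shared_names` is a hypothetical helper, and the dictionary is only an excerpt built from the names quoted in this task.

```python
from collections import defaultdict


def find_shared_names(names):
    """Return {name: [codes]} for every display name used by more than one code."""
    codes_by_name = defaultdict(list)
    for code, name in names.items():
        codes_by_name[name].append(code)
    # Keep only the names that are attached to two or more codes.
    return {
        name: sorted(codes)
        for name, codes in codes_by_name.items()
        if len(codes) > 1
    }


# Illustrative excerpt only; values mirror the current names quoted above.
english_names = {
    "bbc": "Batak Toba",
    "bbc-latn": "Batak Toba",
    "gan-hans": "Simplified Gan script",
    "gan-hant": "Traditional Gan script",
}

shared = find_shared_names(english_names)
print(shared)
```

Running the same check over the full name list would show whether any other pair of codes is indistinguishable.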

Event Timeline

thiemowmde subscribed.

Language names are managed in a project called CLDR, see http://cldr.unicode.org. MediaWiki, UniversalLanguageSelector, Wikibase and so on use this via a tiny extension (https://www.mediawiki.org/wiki/Extension:CLDR). The preferred way of adding missing language names is by filing a ticket at http://unicode.org/cldr/trac. I did that once with no problem, and would like to encourage you to do so as well.

The second way is to make changes to the file https://phabricator.wikimedia.org/diffusion/ECLD/browse/master/LocalNames/LocalNamesEn.php via Gerrit patches. This can serve as a temporary workaround until a new CLDR version containing the requested changes is released. (Don't forget to report the request at CLDR, and link to the ticket in your Gerrit patch or an inline comment.)

Additionally, MediaWiki does have a setting called "…ExtraLanguageNames". We are adding a few language names especially for Wikidata. See https://phabricator.wikimedia.org/diffusion/OMWC/browse/master/wmf-config/InitialiseSettings.php;1fd6734383b393eedf0004cc59f33f388ca89c5a$16360. I will paste the relevant snippet here:

'abe' => 'wôbanakiôdwawôgan', // T150633
'din' => 'dinka',           // T75563
'kea' => 'Kabuverdianu',    // T127435
'nod' => 'ᨣᩴᩤᨾᩮᩥᩬᨦ',            // T93880
'ota' => 'لسان توركى',      // T59342
'rwr' => 'मारवाड़ी',           // T61905
'sje' => 'bidumsámegiella', // T146707
'smj' => 'julevsámegiella', // T146707

As you can see this does not add the English names of these languages, but the name in the language itself.
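To make the consequence concrete, here is a rough sketch of what an English-language user would see. It assumes a simple fallback order (English name, then native name, then the bare code); that order and the `display_name` helper are assumptions made for illustration, not the actual MediaWiki lookup logic.

```python
def display_name(code, english_names, native_names):
    """Pick a label for `code`: English name first, then native name, then the code."""
    return english_names.get(code) or native_names.get(code) or code


# Native names taken from the snippet above.
native_names = {"abe": "wôbanakiôdwawôgan", "sje": "bidumsámegiella"}
english_names = {}  # no English entry exists yet

# Without an English entry, the native name is all that can be shown.
before = display_name("abe", english_names, native_names)

# After adding the requested English name, it takes precedence.
english_names["abe"] = "Western Abenaki"
after = display_name("abe", english_names, native_names)

print(before, "->", after)
```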

I had a look at that tracker and found http://unicode.org/cldr/trac/ticket/9137 where two of the codes here (fkv and sje) were already requested but rejected. Given the following comment, it seems like it would be a waste of time to request the addition of more languages there:

> We agreed to document there is no intent for CLDR to have the English names of all languages (there are over 7,000) of them, and point to http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry as a source for any extra ones that people need.

More detailed comments:

nl-informal and roa-tara are Wikimedia inventions, so they would definitely need to go into LocalNamesEn.php.

bxr and mo also seem to have been rejected a few years ago in http://unicode.org/cldr/trac/ticket/6763

All of the language-only codes I listed are in the subtag registry mentioned in the comment, but that only provides English names so adding support for that like they suggest might not be worth the effort (versus just adding the ones we need to LocalNamesEn.php).

The country variants (ko-kp, zh-mo, zh-my) could be generated from the language name we already have plus the country name from CLDR. There's also kk-cn, kk-kz, kk-tr, zh-cn, zh-hk, zh-sg, zh-tw (which are already in LocalNamesEn.php). That seems like a good idea, because then they would be automatically translated into lots of languages instead of most languages falling back to English or the native name.
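A minimal sketch of that idea, assuming the tag is always `language-country` and that both name tables are available as plain dictionaries. The function name and the tables are hypothetical, and the territory spellings are copied from the names requested in this task rather than from CLDR itself, which may spell them differently.

```python
def country_variant_name(tag, language_names, territory_names):
    """Build e.g. "Korean (North Korea)" from the tag "ko-kp"."""
    language, territory = tag.split("-", 1)
    # Territory codes are conventionally upper case (KP, MO, MY).
    return f"{language_names[language]} ({territory_names[territory.upper()]})"


# Illustrative tables only; a real implementation would read these from CLDR data.
language_names = {"ko": "Korean", "zh": "Chinese"}
territory_names = {"KP": "North Korea", "MO": "Macau", "MY": "Malaysia"}

for tag in ("ko-kp", "zh-mo", "zh-my"):
    print(tag, "=>", country_variant_name(tag, language_names, territory_names))
```

Passing in the tables for another interface language would produce the translated form with no extra per-variant entries, which is the benefit described above.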

Theoretically, the script variants (ady-cyrl, aeb-arab, aeb-latn, kbd-cyrl, ku-arab, shi-latn, shi-tfng, plus 34 others already in LocalNamesEn.php) could also be generated, but the script names in CLDR do not include the word "script". That's not ideal because some of the scripts share the same name as a language (e.g. ku-arab would become "Kurdish (Arabic)", the meaning of which is not very clear).
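The ambiguity is easy to reproduce. The sketch below assumes `language-script` tags and plain dictionaries for both name tables; the helper and the optional `suffix` parameter are hypothetical, and appending " script" is an English-only workaround that would not carry over to translations.

```python
def script_variant_name(tag, language_names, script_names, suffix=""):
    """Build e.g. "Kurdish (Arabic)" from "ku-arab"; `suffix` is appended to the script name."""
    language, script = tag.split("-", 1)
    # Script codes are conventionally title case (Arab, Latn, Tfng).
    return f"{language_names[language]} ({script_names[script.capitalize()]}{suffix})"


# Illustrative tables only, using names that appear in this task.
language_names = {"ku": "Kurdish", "shi": "Tachelhit"}
script_names = {"Arab": "Arabic", "Latn": "Latin", "Tfng": "Tifinagh"}

plain = script_variant_name("ku-arab", language_names, script_names)
explicit = script_variant_name("ku-arab", language_names, script_names, suffix=" script")

print(plain)     # script name alone, which can be read as a language name
print(explicit)  # the form currently used in LocalNamesEn.php
```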

Thanks for adding zh-mo, since there are some differences between Hong Kong and Macau vocabulary, as pointed out on zhwiki.

I doubt that "Chinese (Malaysia)" (zh-my) is still useful, since there is unlikely to be any difference between it and Singaporean Chinese (zh-sg); if someone can point one out, I would be grateful. Maybe it's not worth dropping zh-my, though? I don't have enough time to look into it.

@Lydia_Pintscher Is Shizhao's action above valid? The main topic of this task appears to be missing English names for WD language tags (which is why it fits MediaWiki-extensions-CLDR).

> it seems like it would be a waste of time to request the addition of more languages there

That's incorrect. When I asked on behalf of Wikimedia, the names were added (maybe because I'm Wikimedia's CLDR ST manager?). CLDR's position is simply that they add names *only if* member orgs intend to actually use them and translate them (or at least one of them).

I just need a list of language names which

  • have a standard language code (not some Wikimedia-specific thing) and
  • are actually in use (for Wikidata, this could mean having at least a few thousand labels/descriptions in that language, I guess?).

Then I will request their addition to CLDR.

> That's incorrect. When I asked on behalf of Wikimedia, the names were added (maybe because I'm Wikimedia's CLDR ST manager?).

Wasn't http://unicode.org/cldr/trac/ticket/9137 your request? The codes requested there were not added. :/

Anyway, see P5634 for the current number of labels/descriptions/aliases for valid IETF language tags which don't appear to be in CLDR (excluding als, bh and nrm, which are not used for the right language). Those are extracted from this query: https://quarry.wmflabs.org/query/19791.

I used http://www.unicode.org/cldr/charts/dev/by_type/locale_display_names.languages__a-d_.html (and the other pages for the rest of the alphabet) as a reference for which languages are in CLDR. If that's not the right thing to use, then please let me know what I should be looking at instead.

I wasn't able to find a way to query for monolingual text languages (too slow for the query service).

> Wasn't http://unicode.org/cldr/trac/ticket/9137 your request? The codes requested there were not added. :/

My request was accepted, then another user came and added some confusion. It's important to file carefully-scoped tickets and not submit a mess to CLDR.

> Anyway, see P5634 for the current number of labels/descriptions/aliases for valid IETF language tags which don't appear to be in CLDR (excluding als, bh and nrm, which are not used for the right language). Those are extracted from this query: https://quarry.wmflabs.org/query/19791.

Thanks. Based on this, I think a reasonable first step would be to submit languages used over 10k times; plus languages which are included in MediaWiki core and have at least some translations (maybe over 10 % in core messages?). Some of the other language codes are just experiments, whose viability is under test.
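The proposed cut could be expressed roughly as follows. This is only a sketch of the two criteria named above: the function name, the placeholder codes and every number in the sample data are invented for illustration and are not taken from P5634 or from MediaWiki core statistics.

```python
def cldr_candidates(usage_counts, core_translation_ratio,
                    min_uses=10_000, min_ratio=0.10):
    """Return codes used more than `min_uses` times, plus codes whose share of
    translated core messages exceeds `min_ratio`."""
    by_usage = {code for code, uses in usage_counts.items() if uses > min_uses}
    by_core = {code for code, ratio in core_translation_ratio.items() if ratio > min_ratio}
    return sorted(by_usage | by_core)


# Placeholder data: made-up codes and made-up numbers.
usage_counts = {"code-a": 25_000, "code-b": 300, "code-c": 10_000}
core_translation_ratio = {"code-b": 0.40, "code-c": 0.05}

selected = cldr_candidates(usage_counts, core_translation_ratio)
print(selected)
```

Both thresholds are keyword arguments because the comment above presents them as tentative ("maybe over 10 %").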

> I used http://www.unicode.org/cldr/charts/dev/by_type/locale_display_names.languages__a-d_.html (and the other pages for the rest of the alphabet) as a reference for which languages are in CLDR. If that's not the right thing to use, then please let me know what I should be looking at instead.

This is correct, yes. Personally I usually use the table at http://www.unicode.org/cldr/charts/latest/summary/root.html

Change 424556 had a related patch set uploaded (by Raimond Spekking; owner: Raimond Spekking):
[mediawiki/extensions/cldr@master] Add English names for languages which don't yet have one

https://gerrit.wikimedia.org/r/424556

Pginer-WMF triaged this task as Medium priority. Apr 11 2018, 3:52 PM

Change 424556 merged by jenkins-bot:
[mediawiki/extensions/cldr@master] Add English names for languages which don't yet have one

https://gerrit.wikimedia.org/r/424556

Change 511048 had a related patch set uploaded (by Siebrand; owner: Siebrand):
[mediawiki/extensions/cldr@master] Add 8 languages to be used by structured data but not in CLDR

https://gerrit.wikimedia.org/r/511048

Change 511048 merged by jenkins-bot:
[mediawiki/extensions/cldr@master] Add 8 languages to be used by structured data but not in CLDR

https://gerrit.wikimedia.org/r/511048