[Story] Replace bad, but currently necessary language codes
Open, Medium, Public

Assigned To

None

Authored By

	adrianheine
	Jan 28 2016, 1:09 PM

Description

eml does not exist, it should be egl (or rgn): T36217
map-bms uses a very generic primary language subtag, it could for example use jv-bms instead
mo does not exist, it should be ro-Cyrl-md: T18889
als is described as Alemannic, but the language code for that is actually gsw and als is Tosk Albanian: T6793 and T169450
sr-ec should be sr-cyrl; sr-el should be sr-latn: T117845
nrm should rather be nrf: T25216
roa-tara uses a very generic primary language subtag, it could for example use nap-x-tara instead.
cbk-zam could be just cbk: T124657
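
As a rough illustration, the list above can be read as a lookup table from the current code to a suggested replacement. This is only a sketch: `SUGGESTED_REPLACEMENTS` and `suggest_replacement` are hypothetical names, not part of MediaWiki or Wikibase, and where the list offers alternatives (egl or rgn for eml) the table simply picks the first.

```python
# Hypothetical lookup table derived from the task description.
# It is NOT the mapping used by MediaWiki core or Wikibase.
SUGGESTED_REPLACEMENTS = {
    "eml": "egl",            # or "rgn", see T36217
    "map-bms": "jv-bms",
    "mo": "ro-Cyrl-md",      # see T18889
    "als": "gsw",            # als is really Tosk Albanian
    "sr-ec": "sr-cyrl",
    "sr-el": "sr-latn",
    "nrm": "nrf",
    "roa-tara": "nap-x-tara",
    "cbk-zam": "cbk",
}


def suggest_replacement(code: str) -> str:
    """Return the suggested replacement, or the code unchanged if none is listed."""
    return SUGGESTED_REPLACEMENTS.get(code.lower(), code)
```

For example, `suggest_replacement("als")` gives `"gsw"`, while a code that is not in the table, such as `"de"`, comes back unchanged.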

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T124286 [Epic] Wikidata language support
		Open		None	T125073 [Story] Replace bad, but currently necessary language codes

Event Timeline

adrianheine created this task. Jan 28 2016, 1:09 PM

adrianheine raised the priority of this task to Medium.

adrianheine updated the task description.

adrianheine added a project: Wikidata.

adrianheine subscribed.

Restricted Application added a subscriber: Aklapper. Jan 28 2016, 1:09 PM

adrianheine mentioned this in T125063: [Task] Remove universally bad language codes from the set of language codes available for monolingual text values. Jan 28 2016, 1:09 PM

adrianheine mentioned this in T125066: [Task] Add already-requested language codes to set of language codes available for monolingual text values. Jan 28 2016, 3:37 PM

Fomafix subscribed. Jan 28 2016, 4:50 PM

There's also:

nrm - currently described as Norman, but that code is assigned to Narum. It's not clear whether Norman has its own code. The closest is nrf (Jèrriais, Guernésiais) which are two of the dialects. It was created in http://www-01.sil.org/iso639-3/chg_detail.asp?id=2014-024 where someone requested jrs for Jèrriais but ISO 639 decided against assigning a code specifically for Jèrriais because they consider it and Guernésiais to be dialects of the same language. Instead they created nrf. That implies to me that nrf is supposed to mean Norman even if that's not one of the names they list for the language.

cbk-zam - Chavacano de Zamboanga, a variety of Chavacano that doesn't have its own code or language subtag
roa-tara - Tarantino, which also doesn't have its own code or language subtag

The code for Serbian is sr (or srp for the 3-letter version, but we currently use 2-letter codes when available). src is Logudorese Sardinian. :) The labels for sr-el and sr-ec are simply "Serbian (Latin script)" and "Serbian (Cyrillic script)", I would have expected sr-latn and sr-cyrl because it doesn't say it has to be the Ekavian variant and there are no options for other variants.

The country code for Moldova is MD (MO is Macau).
The situation for mo is kinda weird. The (now closed) Moldovan Wikipedia is entirely in Cyrillic, and apparently the pages were copies of articles from the Romanian Wikipedia converted to Cyrillic, so any of the labels which came from there are ro-cyrl. Then there are a couple of thousand Latin labels for mo, most of which are identical to the current Romanian label. All the ones I've looked at so far which aren't the same are cases where a bot copied ro to mo ages ago and ro was later updated. I wonder if it would actually be better to create ro-cyrl for the Cyrillic ones and merge the remaining mo things into ro? (In most cases we don't have separate variants for different countries, and in the few cases we do, they're really hard to maintain, so I would be in favour of avoiding ro-md unless it's really needed.)

Thanks for your feedback, @Nikki. I added nrm to the list. As for cbk-zam and roa-tara, they could be cbk-x-zam and roa-x-tara to be valid, right? I fixed sr in the description. The comments in languages/Names.php say it's »Serbian Cyrillic ekavian« and »Serbian Latin ekavian«. I also updated ro-mo.

In general, this is not only about terms but also about monolingual text values, and the best way to handle these codes might be different in both cases.

Yeah, cbk-x-zam and roa-x-tara should be valid (I tested them on http://r12a.github.io/apps/subtags/ and it agrees).
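
For reference, here is a minimal sketch of why tags like cbk-x-zam fit the BCP 47 syntax: everything after a singleton `x` is a private-use sequence of subtags with 1 to 8 alphanumeric characters each. The pattern below is a simplification I am assuming for illustration (it leaves out extended language subtags, extensions and grandfathered tags), and `is_well_formed` and `with_private_use` are hypothetical helper names. It only checks syntax; it does not consult the subtag registry the way the tool linked above does.

```python
import re

# Simplified BCP 47 pattern. Assumption: extended language subtags,
# extension subtags and grandfathered tags are deliberately left out.
_SIMPLE_LANGTAG = re.compile(
    r"""
    [a-z]{2,3}                                  # primary language subtag
    (?:-[a-z]{4})?                              # optional script subtag
    (?:-(?:[a-z]{2}|[0-9]{3}))?                 # optional region subtag
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*    # optional variant subtags
    (?:-x(?:-[a-z0-9]{1,8})+)?                  # optional private-use sequence
    """,
    re.VERBOSE | re.IGNORECASE,
)


def is_well_formed(tag: str) -> bool:
    """Syntactic check only; says nothing about whether subtags are registered."""
    return _SIMPLE_LANGTAG.fullmatch(tag) is not None


def with_private_use(language: str, suffix: str) -> str:
    """Build a tag of the form '<language>-x-<suffix>' and check its syntax."""
    tag = f"{language}-x-{suffix}"
    if not is_well_formed(tag):
        raise ValueError(f"not a well-formed tag: {tag!r}")
    return tag
```

Note that a purely syntactic check is not enough to catch the existing codes: roa-tara also passes it, because a four-letter second subtag is read as a script, even though no such script is registered.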

For Serbian, even if the comments in one of the source files say it's supposed to be the Ekavian variety, I would expect users to go by what the user interface says (which doesn't seem to mention Ekavian anywhere). It'd be helpful if we could find a Serbian speaker who would know whether it really is only used for Ekavian...

I was mostly talking about terms because they're more common than monolingual text statements. :) I can't think of anything where I would expect them to be treated differently though, other than the special codes (mul, zxx, etc) which don't make much sense for terms.

I remembered some more invalid codes: de-formal, nl-informal and simple. They're UI languages but occasionally people use them for content. If they stop being allowed for content, we should replace them with de, nl and en respectively. If they continue being allowed for content, simple would become en-simple, but there are no subtags for formal/informal, so I guess they would have to be something like de-x-formal and nl-x-informal.

By the way, language names are not always localised (e.g. in English nl shows up as "Dutch" but nl-informal shows up as "Nederlands (informeel)"). Is that a bug, or do they need translating somewhere (and if so, where)?

In T125073#2001862, @Nikki wrote:

I was mostly talking about terms because they're more common than monolingual text statements. :) I can't think of anything where I would expect them to be treated differently though, other than the special codes (mul, zxx, etc) which don't make much sense for terms.

From my point of view, the two use cases are very different. In monolingual text values, we have to model some part of reality, so we have to be flexible enough to support everything reality throws at us. For example, if the official name of some organization is defined to be in no, we have to accept that value. For terms, on the other hand, we do something on our own, and we can decide which language codes we want to use for which language. So I would expect monolingual text values to be much more permissive than terms.

I remembered some more invalid codes: de-formal, nl-informal and simple. They're UI languages but occasionally people use them for content. If they stop being allowed for content, we should replace them with de, nl and en respectively. If they continue being allowed for content, simple would become en-simple, but there are no subtags for formal/informal, so I guess they would have to be something like de-x-formal and nl-x-informal.

We just removed them in T125063.

By the way, language names are not always localised (e.g. in English nl shows up as "Dutch" but nl-informal shows up as "Nederlands (informeel)"). Is that a bug, or do they need translating somewhere (and if so, where)?

That needs to be translated in file LocalNames/LocalNamesEn.php in the CLDR extension.

Liuxinyu970226 subscribed. Feb 14 2016, 2:32 AM

Regarding cbk-zam, there's a site renaming request (T124657). I assume we can drop the cbk-x-zam suggestion and just use cbk in labels?

Regarding mo/ro-mo/ro-md/ro-cyrl see https://meta.wikimedia.org/wiki/Proposals_for_closing_projects/Deletion_of_Moldovan_Wikipedia_2

@XXN: I had seen that, although as I described above, the current situation in Wikidata is different, since we have a mixture of Latin script and Cyrillic script terms for mo (where the Latin ones largely come from a bot copying the ro label and the Cyrillic ones largely come from mowiki page names). What do you think of what I proposed? (move Cyrillic terms to ro-cyrl, merge Latin terms with ro)
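
To make the proposal concrete, here is a rough sketch of how existing mo terms could be triaged by script. It assumes that checking Unicode character names is good enough to tell Cyrillic from Latin labels; `dominant_script` and `triage_mo_label` are hypothetical helpers, and the returned strings are only descriptive labels, not actions any existing tool performs.

```python
import unicodedata
from typing import Optional


def dominant_script(label: str) -> str:
    """Classify a label as 'Cyrl', 'Latn' or 'other' by counting its letters."""
    counts = {"Cyrl": 0, "Latn": 0}
    for ch in label:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["Cyrl"] += 1
        elif name.startswith("LATIN"):
            counts["Latn"] += 1
    if counts["Cyrl"] == 0 and counts["Latn"] == 0:
        return "other"
    return "Cyrl" if counts["Cyrl"] >= counts["Latn"] else "Latn"


def triage_mo_label(mo_label: str, ro_label: Optional[str]) -> str:
    """Suggest what to do with one mo term, given the ro term (if any)."""
    script = dominant_script(mo_label)
    if script == "Cyrl":
        return "move to ro-cyrl"
    if script == "Latn":
        if ro_label is None:
            return "copy to ro (needs review)"
        if mo_label == ro_label:
            return "drop (same as ro)"
        return "review (differs from ro)"
    return "review (unknown script)"
```

Latin labels that differ from the ro label are flagged for review rather than merged automatically, since those appear to be the cases where ro was updated after the bot copy.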

@Nikki Are there cases where mo terms exist in Latin script but no ro terms do? I think not, so such mo terms can safely be removed.
In any case, anything merged from mo to ro needs verification by native Romanian speakers.

Are we forced to change these language codes right now? I ask because the current proposal to delete the Moldovan Wikipedia could help us a lot in deciding what to do. The normal and expected result of that proposal is the deletion of all mo Wikimedia projects, after which the Wikidata language code mo and all its values can be deleted at once.

There is a small problem with moving mo terms to ro-cyrl: the Romanian Cyrillic alphabet was used before 1862, and it is *not the same* as the Moldovan Cyrillic alphabet used between 1924 and 1989. This move is not recommended.

In my opinion, the best approach is to wait for a decision on the proposal to delete the Moldovan projects. In any case, until then there are several Wikimedia projects with the mo subdomain language code, so the sitelink codes and the label/description/alias codes need to stay synchronized.

@XXN First, there is no urgency whatsoever. I don't currently plan on doing this story; it's just for future reference. Second, even if I made this change, existing data on Wikidata.org would continue to work; we would just prevent saving with these language codes.
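
A minimal sketch of that behaviour, assuming a simple split between a read path and a write path: stored values with the old codes still load, and only new saves are rejected. All names here (`DEPRECATED_CODES`, `load_term`, `save_term`, `DeprecatedLanguageCode`) are made up for illustration and are not Wikibase code.

```python
# Hypothetical set of codes from the task description that would no longer
# be accepted when saving. Not taken from any real configuration.
DEPRECATED_CODES = frozenset({
    "eml", "map-bms", "mo", "als", "sr-ec", "sr-el",
    "nrm", "roa-tara", "cbk-zam",
})


class DeprecatedLanguageCode(ValueError):
    """Raised when a new value is saved with a code that is being phased out."""


def load_term(code: str, text: str) -> tuple:
    """Read path: existing data keeps working, whatever its code."""
    return (code, text)


def save_term(code: str, text: str) -> tuple:
    """Write path: refuse new values that use a deprecated code."""
    if code.lower() in DEPRECATED_CODES:
        raise DeprecatedLanguageCode(f"language code {code!r} can no longer be saved")
    return (code, text)
```

With this split, `load_term("mo", ...)` still succeeds for stored data, while `save_term("mo", ...)` raises.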

hoo subscribed. Mar 6 2016, 5:31 PM

Danny_B added a project: Story. May 23 2016, 11:43 AM

roa-tara would more specifically be nap-x-tara, since it is a dialect of Neapolitan.

cscott updated the task description. Jun 29 2018, 6:14 PM

Fomafix updated the task description. Jun 30 2018, 10:42 AM

I've got a patch to fix the BCP 47 mappings in core: https://gerrit.wikimedia.org/r/442200

I'm hoping that if/when that's merged, we can remove some of the redundancy in Wikibase and have Wikidata just use the core code to do the remappings.

cscott updated the task description. Sep 25 2018, 8:08 PM

Liuxinyu970226 updated the task description. Oct 14 2018, 2:22 AM

Glrx mentioned this in T279874: SVG language tag als reported as Swiss German; gsw is Swiss German. Apr 11 2021, 11:19 PM

Glrx mentioned this in T271595: SVG translate tool replaces all fields with "$1" (style element needs at least one trailing character). Apr 20 2021, 7:16 PM

Esc3300 added a project: Language codes. Jun 11 2021, 2:29 PM

Nikki mentioned this in T321852: Add language codes sr-cyrl and sr-latn on Wikidata. Oct 27 2022, 6:11 PM

Winston_Sung moved this task from Backlog to MediaWiki core on the Language codes board. Apr 19 2023, 4:54 PM

mrephabricator subscribed. Jul 13 2023, 4:58 PM