
[Story] Replace bad, but currently necessary language codes
Open, Medium, Public

Description

  • eml does not exist, it should be egl (or rgn): T36217
  • map-bms uses a very generic primary language subtag, it could for example use jv-bms instead
  • mo does not exist, it should be ro-Cyrl-md: T18889
  • als is described as Alemannic, but the language code for that is actually gsw and als is Tosk Albanian: T6793 and T169450
  • sr-ec should be sr-cyrl; sr-el should be sr-latn: T117845
  • nrm should rather be nrf: T25216
  • roa-tara uses a very generic primary language subtag, it could for example use nap-x-tara instead.
  • cbk-zam could be just cbk: T124657
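As a rough illustration, the suggestions in this list can be written down as a lookup table. This is only a sketch in Python: the names `SUGGESTED_REPLACEMENTS` and `suggested_code` are hypothetical and not part of Wikibase or MediaWiki, and where the list offers alternatives (egl or rgn, "could for example") only one candidate is shown.

```python
# Hypothetical lookup table built from the suggestions in the list above.
# Several entries are only one of the candidates under discussion.
SUGGESTED_REPLACEMENTS = {
    "eml": "egl",            # or rgn
    "map-bms": "jv-bms",     # "could for example"
    "mo": "ro-Cyrl-md",
    "als": "gsw",
    "sr-ec": "sr-cyrl",
    "sr-el": "sr-latn",
    "nrm": "nrf",
    "roa-tara": "nap-x-tara",  # "could for example"
    "cbk-zam": "cbk",
}


def suggested_code(code: str) -> str:
    """Return the suggested replacement, or the code unchanged if none is listed."""
    return SUGGESTED_REPLACEMENTS.get(code.lower(), code)


if __name__ == "__main__":
    for old in ("als", "sr-ec", "de"):
        print(old, "->", suggested_code(old))
```

Codes that are not in the table, such as de, pass through unchanged.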

Event Timeline

adrianheine raised the priority of this task to Medium.
adrianheine updated the task description.
adrianheine added a project: Wikidata.
adrianheine subscribed.

There's also:

  • nrm - currently described as Norman, but that code is assigned to Narum. It's not clear whether Norman has its own code. The closest is nrf (Jèrriais, Guernésiais) which are two of the dialects. It was created in http://www-01.sil.org/iso639-3/chg_detail.asp?id=2014-024 where someone requested jrs for Jèrriais but ISO 639 decided against assigning a code specifically for Jèrriais because they consider it and Guernésiais to be dialects of the same language. Instead they created nrf. That implies to me that nrf is supposed to mean Norman even if that's not one of the names they list for the language.
  • cbk-zam - Chavacano de Zamboanga, a variety of Chavacano that doesn't have its own code or language subtag
  • roa-tara - Tarantino, which also doesn't have its own code or language subtag

The code for Serbian is sr (or srp for the 3-letter version, but we currently use 2-letter codes when available). src is Logudorese Sardinian. :) The labels for sr-el and sr-ec are simply "Serbian (Latin script)" and "Serbian (Cyrillic script)", I would have expected sr-latn and sr-cyrl because it doesn't say it has to be the Ekavian variant and there are no options for other variants.

The country code for Moldova is MD (MO is Macau).
The situation for mo is kinda weird. The (now closed) Moldovan Wikipedia is entirely in Cyrillic, and apparently the pages were copies of articles from the Romanian Wikipedia converted to Cyrillic, so any of the labels which came from there are ro-cyrl. Then there are a couple of thousand Latin labels for mo, most of which are identical to the current Romanian label. All the ones I've looked at so far which aren't the same are cases where a bot copied ro to mo ages ago and ro was later updated. I wonder if it would actually be better to create ro-cyrl for the Cyrillic ones and merge the remaining mo things into ro? (in most cases we don't have separate variants for different countries, and in the few cases we do, they're really hard to maintain, so I would be in favour of avoiding ro-md unless it's really needed)

adrianheine set Security to None.

Thanks for your feedback, @Nikki. I added nrm to the list. As for cbk-zam and roa-tara, they could be cbk-x-zam and roa-x-tara to be valid, right? I fixed sr in the description. The comments in languages/Names.php say it's »Serbian Cyrillic ekavian« and »Serbian Latin ekavian«. I also updated ro-mo.

In general, this is not only about terms but also about monolingual text values, and the best way to handle these codes might be different in both cases.

Yeah, cbk-x-zam and roa-x-tara should be valid (I tested them on http://r12a.github.io/apps/subtags/ and it agrees).
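To show why the -x- matters, here is a minimal sketch of how a BCP 47 parser classifies subtags purely by their position and shape. It is not the tool linked above and not MediaWiki code; `parse_tag` is a hypothetical helper, it ignores extension singletons and grandfathered tags, and it does no registry lookup, so it only tells you how a tag would be read, not whether its subtags are registered.

```python
import re


def parse_tag(tag: str) -> dict:
    """Split a language tag into components by subtag shape alone (simplified)."""
    parts = tag.lower().split("-")
    parsed = {"language": None, "extlang": [], "script": None,
              "region": None, "variants": [], "private": []}

    # Primary language subtag: 2 or 3 letters (longer forms are not handled here).
    if not re.fullmatch(r"[a-z]{2,3}", parts[0]):
        raise ValueError(f"unsupported primary subtag in {tag!r}")
    parsed["language"] = parts[0]
    i = 1

    # Up to three extended language subtags: exactly 3 letters each.
    while (i < len(parts) and len(parsed["extlang"]) < 3
           and re.fullmatch(r"[a-z]{3}", parts[i])):
        parsed["extlang"].append(parts[i])
        i += 1

    # Script: exactly 4 letters.
    if i < len(parts) and re.fullmatch(r"[a-z]{4}", parts[i]):
        parsed["script"] = parts[i]
        i += 1

    # Region: 2 letters or 3 digits.
    if i < len(parts) and re.fullmatch(r"[a-z]{2}|[0-9]{3}", parts[i]):
        parsed["region"] = parts[i]
        i += 1

    # Variants: 5-8 alphanumerics, or 4 characters starting with a digit.
    while (i < len(parts)
           and re.fullmatch(r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}", parts[i])):
        parsed["variants"].append(parts[i])
        i += 1

    # Private use: everything after the "x" singleton, 1-8 alphanumerics each.
    if i < len(parts) and parts[i] == "x":
        private = parts[i + 1:]
        if not private or not all(re.fullmatch(r"[a-z0-9]{1,8}", p) for p in private):
            raise ValueError(f"malformed private-use part in {tag!r}")
        parsed["private"] = private
        i = len(parts)

    if i != len(parts):
        raise ValueError(f"cannot parse {tag!r} with this simplified grammar")
    return parsed


if __name__ == "__main__":
    for tag in ("cbk-zam", "cbk-x-zam", "roa-tara", "roa-x-tara", "sr-ec"):
        print(tag, parse_tag(tag))
```

Under this reading, the zam in cbk-zam lands in the extended-language slot, the tara in roa-tara lands in the script slot, and the ec in sr-ec lands in the region slot. With -x- in front, the same subtags are read as private use, which is what was intended.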

For Serbian, even if the comments in one of the source files say it's supposed to be the Ekavian variety, I would expect users to go by what the user interface says (which doesn't seem to mention Ekavian anywhere). It'd be helpful if we could find a Serbian speaker who would know whether it really is only used for Ekavian...

I was mostly talking about terms because they're more common than monolingual text statements. :) I can't think of anything where I would expect them to be treated differently though, other than the special codes (mul, zxx, etc) which don't make much sense for terms.

I remembered some more invalid codes: de-formal, nl-informal and simple. They're UI languages but occasionally people use them for content. If they stop being allowed for content, we should replace them with de, nl and en respectively. If they continue being allowed for content, simple would become en-simple, but there are no subtags for formal/informal, so I guess they would have to be something like de-x-formal and nl-x-informal.

By the way, language names are not always localised (e.g. in English nl shows up as "Dutch" but nl-informal shows up as "Nederlands (informeel)‎"), is that a bug or do they need translating somewhere? (and if so, where?)

I was mostly talking about terms because they're more common than monolingual text statements. :) I can't think of anything where I would expect them to be treated differently though, other than the special codes (mul, zxx, etc) which don't make much sense for terms.

From my point of view, the two use cases are very different. In monolingual text values, we have to model some part of reality, so we have to be flexible enough to support everything reality throws at us. For example, if the official name of some organization is defined to be in no, we have to accept that value. For terms, on the other hand, we are building something of our own, and we can decide which language codes we want to use for which language. So I would expect monolingual text values to be much more permissive than terms.
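A minimal sketch of that distinction, assuming two separate allow-lists: the set contents and the names `TERM_LANGUAGES`, `MONOLINGUAL_ONLY` and `is_allowed` are hypothetical and only use codes mentioned in this thread. They are not the actual Wikibase configuration.

```python
# Hypothetical allow-lists for illustration; real lists would be far longer.
TERM_LANGUAGES = {"de", "en", "nl", "ro", "sr"}

# Codes mentioned in this thread as plausible for monolingual text only.
MONOLINGUAL_ONLY = {"no", "mul", "zxx"}


def is_allowed(code: str, context: str) -> bool:
    """Check a language code against the list for the given context."""
    if context == "term":
        return code in TERM_LANGUAGES
    if context == "monolingualtext":
        # Monolingual text accepts everything terms accept, plus extra codes.
        return code in TERM_LANGUAGES | MONOLINGUAL_ONLY
    raise ValueError(f"unknown context: {context!r}")


if __name__ == "__main__":
    print(is_allowed("no", "term"))
    print(is_allowed("no", "monolingualtext"))
```

With lists like these, no would be rejected for a label but accepted for a monolingual text value.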

I remembered some more invalid codes: de-formal, nl-informal and simple. They're UI languages but occasionally people use them for content. If they stop being allowed for content, we should replace them with de, nl and en respectively. If they continue being allowed for content, simple would become en-simple, but there are no subtags for formal/informal, so I guess they would have to be something like de-x-formal and nl-x-informal.

We just removed them in T125063.

By the way, language names are not always localised (e.g. in English nl shows up as "Dutch" but nl-informal shows up as "Nederlands (informeel)‎"), is that a bug or do they need translating somewhere? (and if so, where?)

That needs to be translated in file LocalNames/LocalNamesEn.php in the CLDR extension.

Regarding cbk-zam, there's a site renaming request, T124657. I assume we can then drop the cbk-x-zam suggestion and just use cbk in labels?

@XXN: I had seen that, although as I described above, the current situation in Wikidata is different, since we have a mixture of Latin script and Cyrillic script terms for mo (where the Latin ones largely come from a bot copying the ro label and the Cyrillic ones largely come from mowiki page names). What do you think of what I proposed? (move Cyrillic terms to ro-cyrl, merge Latin terms with ro)

@Nikki Are there cases where mo terms exist in Latin script but no ro terms do? I think not, so such mo terms can safely be removed.
In any case, anything merged from mo to ro needs verification by native Romanian speakers.

Are we forced to change these language codes right now? I ask because the current proposal for deletion of the Moldovan Wikipedia can help us a lot in deciding what to do. The normal and expected outcome of that proposal is the deletion of all mo Wikimedia projects, after which the Wikidata language code mo and all its values can be deleted at once.

There is a small problem with moving mo terms to ro-cyrl: the Romanian Cyrillic alphabet was used before 1862 and it is *not the same* as the Moldovan Cyrillic alphabet used between 1924 and 1989. This move is not recommended.

In my view, the best approach is to wait for a decision on the proposal for deletion of the Moldovan projects. In any case, until then there are several Wikimedia projects with the mo subdomain language code, and the sitelink codes need to stay synchronized with the label/description/alias codes.

@XXN First, there is no urgency whatsoever. I don't currently plan on doing this story; it's just for future reference. Second, even if I made this change, existing data on Wikidata.org would continue to work; we would just prevent saving of these language codes.
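To make "continue to work, but prevent saving" concrete, here is a small sketch, assuming a toy in-memory store. `TermStore` and its methods are hypothetical names for illustration and have nothing to do with the actual Wikibase storage code.

```python
class TermStore:
    """Toy store: old codes stay readable, but new saves with them are rejected."""

    def __init__(self, blocked_codes):
        self._blocked = set(blocked_codes)
        self._terms = {}

    def load_existing(self, code: str, text: str) -> None:
        # Simulates data that was saved before the code was blocked.
        self._terms[code] = text

    def get(self, code: str):
        # Reading never checks the block list, so existing data keeps working.
        return self._terms.get(code)

    def save(self, code: str, text: str) -> None:
        # Only new writes are validated against the block list.
        if code in self._blocked:
            raise ValueError(f"language code {code!r} can no longer be saved")
        self._terms[code] = text


if __name__ == "__main__":
    store = TermStore(blocked_codes={"mo"})
    store.load_existing("mo", "existing label")
    print(store.get("mo"))
    try:
        store.save("mo", "new label")
    except ValueError as err:
        print("rejected:", err)
```

The point of the design is that validation sits only on the write path, so nothing already stored has to be migrated first.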

roa-tara would more specifically be nap-x-tara, since it is a dialect of Neapolitan.

I've got a patch to fix the BCP 47 mappings in core: https://gerrit.wikimedia.org/r/442200

I'm hoping that if/when that's merged, we can remove some of the redundancy in wikibase and have wikidata just use the core code to do the remappings.