[Bug] Disallow (or resolve) dummy language codes.
Open, High, Public

Description

MediaWiki defines some "dummy" language codes as aliases for other "real" language codes. These are defined in the $wgDummyLanguageCodes variable.

Wikibase currently treats such "dummy" codes as real language codes in some contexts, while resolving them to the "real" code in other places, leading to confusion such as T102295, where there are conflicting labels for the "no" and the "nb" codes, even though "no" is an alias for "nb".

Wikibase should either disallow such codes completely in all API input, or should resolve them to real codes. Resolving them is convenient for compatibility, but leads to unclear semantics when both codes are used. Disallowing them breaks compatibility with existing data.
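The two options described above can be illustrated with a minimal sketch. This is not Wikibase code: the function name, the `policy` parameter, and the exception class are hypothetical, and the alias table contains only the "no" to "nb" pair mentioned in this task rather than the real contents of $wgDummyLanguageCodes.

```python
# Illustrative alias table. Only the "no" -> "nb" pair comes from the task
# description; the real list lives in MediaWiki's $wgDummyLanguageCodes.
DUMMY_LANGUAGE_CODES = {"no": "nb"}


class DummyLanguageCodeError(ValueError):
    """Raised when a dummy code is submitted under the "disallow" policy."""


def normalize_language_code(code, policy="resolve"):
    """Return the language code that should be stored for `code`.

    policy="resolve":  silently map a dummy code to its real code.
    policy="disallow": reject dummy codes outright.
    """
    real_code = DUMMY_LANGUAGE_CODES.get(code)
    if real_code is None:
        return code  # not a dummy code, nothing to do
    if policy == "resolve":
        return real_code
    if policy == "disallow":
        raise DummyLanguageCodeError(
            f"{code!r} is an alias for {real_code!r}; use the real code"
        )
    raise ValueError(f"unknown policy: {policy!r}")
```

Either policy removes the ambiguity as long as it is applied to all API input; the confusion described above comes from mixing the two behaviours.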

See also: T101086: Standardize invalid language codes for Babel extension

Event Timeline

daniel raised the priority of this task from to High.
daniel updated the task description. (Show Details)
daniel added subscribers: daniel, Aklapper, Danmichaelo.
JanZerebecki renamed this task from Disallow (or resolve) dummy language codes. to [Bug] Disallow (or resolve) dummy language codes.. Sep 10 2015, 8:51 PM
JanZerebecki set Security to None.

Are there entries with "dummy" languages codes in Wikidata? Or are they already removed or replaced?

Are there entries with "dummy" languages codes in Wikidata? Or are they already removed or replaced?

There are plenty of items using language codes which aren't supposed to be used (e.g. over 30,000 using "simple" for a label, description or alias, almost 3000 using "no"). I have seen bots occasionally fix them, but it doesn't seem to be a regular thing.

I think the deployment should start by blocking the addition of new labels, descriptions and aliases with dummy language codes. Then the bots can fix the current entries by replacing the dummy code with the right language code, or by removing the entry when an entry with the right language code already exists.
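A rough sketch of what such a bot pass could look like for the terms of a single item follows. The function name and the data shape (a plain dict from language code to text) are assumptions made for illustration, as is the choice to report differing texts as conflicts for manual review instead of deleting them silently.

```python
# Illustrative alias table; only "no" -> "nb" is taken from this task.
DUMMY_LANGUAGE_CODES = {"no": "nb"}


def migrate_dummy_terms(terms):
    """Move terms stored under dummy codes to the corresponding real codes.

    `terms` maps language code -> text (e.g. the labels of one item).
    Returns (cleaned_terms, conflicts). A conflict is recorded when both
    the dummy and the real code carry different texts; in that case the
    real-code text is kept and the pair is reported for manual review.
    """
    cleaned = {
        code: text
        for code, text in terms.items()
        if code not in DUMMY_LANGUAGE_CODES
    }
    conflicts = []
    for dummy, real in DUMMY_LANGUAGE_CODES.items():
        if dummy not in terms:
            continue
        text = terms[dummy]
        if real not in cleaned:
            cleaned[real] = text  # move the entry to the real code
        elif cleaned[real] != text:
            conflicts.append((dummy, real, text, cleaned[real]))
        # identical text under both codes: the dummy entry is simply dropped
    return cleaned, conflicts
```

Running this only makes sense once new dummy-code entries are blocked, otherwise the cleanup has to be repeated.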

Any progress on this?

I've now handled all the conflict cases for "no" (where the labels/descriptions differed between "no" and "nb"), so there are no more "no" labels or descriptions at Wikidata right now.

But if users are still allowed to add labels and descriptions in "no", they will soon reappear! Meaning more cleanup work… so it would be really great if creation of new labels/descriptions in "no" could be blocked!

Multichill subscribed.

I ran into this because I imported data which was tagged as "NOR", a valid ISO 639-2 language code that maps to the ISO 639-1 language code "no"; see https://en.wikipedia.org/wiki/Norwegian_language . Norwegian is a valid macrolanguage (see https://en.wikipedia.org/wiki/ISO_639_macrolanguage ) and wouldn't be the first macrolanguage to be included: we also have ar (Arabic) and ne (Nepali) as valid language codes.

To me it looks a bit weird to exclude one macrolanguage just because we have some Wikipedia legacy here.

For Norwegian, I strongly encourage @Multichill and other Dutch users to stop the "one Norwegian" glitch; rather, all of us Wikimedians around the world should always separate it into one of

  1. Bokmål (nb)
  2. Høgnorsk (nn-hognorsk)
  3. Nynorsk (nn)
  4. Riksmål (nb-riksmal)

I don't have a separate opinion regarding Arabic and Nepali.

For Norwegian, I strongly encourage @Multichill and other Dutch users to stop the "one Norwegian" glitch; rather, all of us Wikimedians around the world should always separate it into one of
<snip>

I don't like being called out in this way, please don't ever do that again. Calling a valid ISO 639-1/ISO 639-2 language a glitch is a very contentious way to describe it.

@Multichill:

a valid ISO 639-1/ISO 639-2 language

So you believe that we should allow Klingon (tlh) content back, because of https://iso639-3.sil.org/code/tlh?

Anyway

I don't like being called out in this way, please don't ever do that again.

The one who says this can see what @jeblad commented:

I'm not quite sure what kind of titles Multichill adds, but the differences in spelling and grammar can be substantial between Nynorsk and Bokmål. For example the w:en:Øye Stave Church is called w:nn:Øye stavkyrkje in Nynorsk and w:no:Øye stavkirke in Bokmål. Note the prefix at Norwegian Bokmål Wikipedia, which is the same as the macro code, but it should be "nb".
Most government agencies, government buildings, and similar, are named so the form is similar in Nynorsk and Bokmål. If they can't be named this way they have similar names in both language forms. Only the equal forms should be identified with the macro code, if used at all.

If there are benefits to saving content under macrolanguage codes, then why not Kalenjin (kln), Marwari (mwr)...?

no need to ping twn users here, this isn't twn related

Hmm, Amire should be here to clarify whether there's a need for "translating" variable parameters or not

I guess the form used in T102533#4422722 is meant to be the "private use" variant. I'm not sure we should use that form at all, but if we use it then it should probably be "no-hognorsk" and "no-riksmal" as they are not subsets of either Nynorsk or Bokmål. I would rather use a variant code, like "no-hn" and "no-rm", and only after a proper process to standardize its use.

Bokmål has been adjusted to accommodate most of Riksmål, and the Riksmål people usually use "nb" as the language code, but the more archaic forms of Riksmål in particular aren't a strict subset of Bokmål. Some of the users at Norwegian Bokmål Wikipedia use the more archaic form.

The same holds for Høgnorsk.

I'm tempted to agree that all valid language codes, that is T102533#4422786, should be accepted. If some would like to add names and labels for Klingon, that is T102533#4422825, then let them do that.

Mapping the language code "no" to "nb" was a quick fix many years ago, that reflects the most common use case and solves the problem that "no" is a macro code for Norwegian language. There are two official written forms of Norwegian (Bokmål and Nynorsk), and two non-official written forms (Høgnorsk and Riksmål). Due to history the Norwegian Bokmål Wikipedia has been using the prefix "no" even if the correct one now seems to be "nb". The prefix is often misinterpreted as the language code, and thus user scripts and gadgets often added labels and descriptions at Wikidata with the language set to "no". The quick fix back then was to set the content language to "nb" and persuade the script editors to use the content language instead of the prefix.

A better solution could be to accept macro codes, and allow fallbacks according to language similarity. For example "Oslo" could then use a macro code "gem" and all Germanic languages would have a proper label. Even if the label were the same, the descriptions could be different. Most labels on Wikidata are identical for the North Germanic languages. The same does not hold for the descriptions.

This is a really nice idea, but it implies that languages in opposing countries (read open war in real life) would use the same codes. That could lead to on-going edit wars, which is not so fun.

And it also implies a traversal of the language hierarchy to figure out the valid language codes, i.e. translating a language code to a long list of valid macro codes, which could be unfeasible in actual code.
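To make the traversal concrete, here is a small sketch of a macro-code fallback lookup. The parent links are hypothetical and chosen only to match the examples in this thread ("no" as the macro code for nb/nn, "gem" for Germanic, with "gmq", the ISO 639-5 code for North Germanic, in between); a real hierarchy would have to be derived from the ISO 639-3 and 639-5 data, and the function names are made up.

```python
# Hypothetical parent links for illustration only.
PARENT_CODE = {
    "nb": "no",
    "nn": "no",
    "no": "gmq",   # North Germanic (ISO 639-5)
    "gmq": "gem",  # Germanic (ISO 639-5)
}


def fallback_chain(code):
    """Return `code` followed by its macro codes, most specific first."""
    chain = [code]
    seen = {code}
    while chain[-1] in PARENT_CODE:
        parent = PARENT_CODE[chain[-1]]
        if parent in seen:  # guard against cycles in the table
            break
        chain.append(parent)
        seen.add(parent)
    return chain


def pick_label(labels, code):
    """Return the first label found along the fallback chain, or None."""
    for candidate in fallback_chain(code):
        if candidate in labels:
            return labels[candidate]
    return None
```

With this table, `pick_label({"gem": "Oslo"}, "nn")` walks nn, no, gmq, gem and finds the label stored under the macro code. The lookup itself is cheap; the hard part is agreeing on the hierarchy.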

@Liuxinyu970226 and @Multichill, the language dispute in Norway has been on-going for more than a century. If we can't agree in Norway, I doubt that a few heated posts here will solve it. Let's focus on a usable solution. :D

nn-hognorsk is a valid language tag (and the IETF subtag registration for hognorsk specifies nn as the prefix). There is already a request at T148887 for supporting it for monolingual text.

There isn't a variant subtag for Riksmål as far as I can tell, so I doubt nb-riksmal (or no-riksmal) would be acceptable (unless someone gets the subtag added officially).

no-hn and no-rm would definitely not be acceptable because they clash with country codes, and while "Norwegian as used in Honduras" is highly unlikely to be needed in the future, Wikimedia has been trying to move away from incorrect uses of language codes because of the issues they cause.
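The clash can be seen from how BCP 47 distinguishes subtags purely by their shape. The following is a simplified sketch, not a full parser: it only looks at subtags that follow the primary language subtag, ignores ordering rules, extlang, extensions and private use, and does not check the IANA registry. The function name is hypothetical.

```python
import re


def classify_subtag(subtag):
    """Classify a subtag that follows the primary language subtag.

    Shapes follow the BCP 47 (RFC 5646) syntax in simplified form:
      script  = 4 letters
      region  = 2 letters or 3 digits
      variant = 5-8 alphanumerics, or 4 characters starting with a digit
    Anything else is reported as "other".
    """
    s = subtag.lower()
    if re.fullmatch(r"[a-z]{4}", s):
        return "script"
    if re.fullmatch(r"[a-z]{2}|[0-9]{3}", s):
        return "region"
    if re.fullmatch(r"[a-z0-9]{5,8}|[0-9][a-z0-9]{3}", s):
        return "variant"
    return "other"
```

Under these rules "hn" and "rm" can only be read as region subtags, while "hognorsk" has the shape of a variant subtag (whether a variant is actually registered is a separate question).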

If we do want no for labels, I think it would be impractical until the issues causing labels coming from nowiki to use no rather than nb are fixed.

Please note "Language tags in HTML and XML" and how proper language tags are generated (i.e. the following pattern)

language-extlang-script-region-variant-extension-privateuse

Also note that I wrote "only after a proper process to standardize its use" in T102533#4425465.

I have no doubt that someone has registered "nn-hognorsk", as Høgnorsk resembles Nynorsk, but it is actually quite different in my opinion. Høgnorsk is claimed to be the early form Landsmål, which later became Nynorsk in 1938.

For those interested; there are 49 articles written in Høgnorsk at nnwiki.

Last I heard about it some years back was that there were no new articles written in Høgnorsk at nnwiki.

I believe this is the wrong place to discuss the pretty weird language situation in Norway, so please use some other macro language as an example! =)

@Nikki I would love to know how to request a new BCP47 language tag?
@jeblad

639-3 | 639-2/639-5 | 639-1 | Language Name(s) | Scope | Language Type

a
aka | aka | ak | Akan | Macrolanguage | Living
sqi | 639-2/T: sqi, 639-2/B: alb | sq | Albanian | Macrolanguage | Living
ara | ara | ar | Arabic | Macrolanguage | Living
aym | aym | ay | Aymara | Macrolanguage | Living
aze | aze | az | Azerbaijani | Macrolanguage | Living

b
bal | bal |  | Baluchi | Macrolanguage | Living
bik | bik |  | Bikol | Macrolanguage | Living
bnc |  |  | Bontok | Macrolanguage | Living
bua | bua |  | Buriat | Macrolanguage | Living

c
zho | 639-2/T: zho, 639-2/B: chi | zh | Chinese | Macrolanguage | Living
chm | chm |  | Mari (Russia) | Macrolanguage | Living
cre | cre | cr | Cree | Macrolanguage | Living

d
del | del |  | Delaware | Macrolanguage | Living
den | den |  | Slave (Athapascan) | Macrolanguage | Living
din | din |  | Dinka | Macrolanguage | Living
doi | doi |  | Dogri (macrolanguage) | Macrolanguage | Living

e
est | est | et | Estonian | Macrolanguage | Living

f
fas | 639-2/T: fas, 639-2/B: per | fa | Persian | Macrolanguage | Living
ful | ful | ff | Fulah | Macrolanguage | Living

g
gba | gba |  | Gbaya (Central African Republic) | Macrolanguage | Living
gon | gon |  | Gondi | Macrolanguage | Living
grb | grb |  | Grebo | Macrolanguage | Living
grn | grn | gn | Guarani | Macrolanguage | Living

h
hai | hai |  | Haida | Macrolanguage | Living
hbs |  | sh (deprecated) | Serbo-Croatian | Macrolanguage | Living
hmn | hmn |  | Hmong, Mong | Macrolanguage | Living

i
iku | iku | iu | Inuktitut | Macrolanguage | Living
ipk | ipk | ik | Inupiaq | Macrolanguage | Living

j
jrb | jrb |  | Judeo-Arabic | Macrolanguage | Living

k
kau | kau | kr | Kanuri | Macrolanguage | Living
kln |  |  | Kalenjin | Macrolanguage | Living
kok | kok |  | Konkani (macrolanguage) | Macrolanguage | Living
kom | kom | kv | Komi | Macrolanguage | Living
kon | kon | kg | Kongo | Macrolanguage | Living
kpe | kpe |  | Kpelle | Macrolanguage | Living
kur | kur | ku | Kurdish | Macrolanguage | Living

l
lah | lah |  | Lahnda | Macrolanguage | Living
lav | lav | lv | Latvian | Macrolanguage | Living
luy |  |  | Luyia, Oluluyia | Macrolanguage | Living

m
man | man |  | Manding, Mandingo | Macrolanguage | Living
msa | 639-2/T: msa, 639-2/B: may | ms | Malay (macrolanguage) | Macrolanguage | Living
mlg | mlg | mg | Malagasy | Macrolanguage | Living
mon | mon | mn | Mongolian | Macrolanguage | Living
msa | 639-2/T: msa, 639-2/B: may | ms | Malay (macrolanguage) | Macrolanguage | Living
mwr | mwr |  | Marwari | Macrolanguage | Living

n
nep | nep | ne | Nepali (macrolanguage) | Macrolanguage | Living
nor | nor | no | Norwegian | Macrolanguage | Living

o
oji | oji | oj | Ojibwa | Macrolanguage | Living
ori | ori | or | Oriya (macrolanguage) | Macrolanguage | Living
orm | orm | om | Oromo | Macrolanguage | Living

p
fas | 639-2/T: fas, 639-2/B: per | fa | Persian | Macrolanguage | Living
pus | pus | ps | Pashto, Pushto | Macrolanguage | Living

q
que | que | qu | Quechua | Macrolanguage | Living

r
raj | raj |  | Rajasthani | Macrolanguage | Living
rom | rom |  | Romany | Macrolanguage | Living

s
sqi | 639-2/T: sqi, 639-2/B: alb | sq | Albanian | Macrolanguage | Living
srd | srd | sc | Sardinian | Macrolanguage | Living
swa | swa | sw | Swahili (macrolanguage) | Macrolanguage | Living
syr | syr |  | Syriac | Macrolanguage | Living

t
tmh | tmh |  | Tamashek | Macrolanguage | Living

u
uzb | uzb | uz | Uzbek | Macrolanguage | Living

y
yid | yid | yi | Yiddish | Macrolanguage | Living

z
zap | zap |  | Zapotec | Macrolanguage | Living
zha | zha | za | Chuang, Zhuang | Macrolanguage | Living
zho | 639-2/T: zho, 639-2/B: chi | zh | Chinese | Macrolanguage | Living
zza | zza |  | Dimili, Dimli (macrolanguage), Kirdki, Kirmanjki (macrolanguage), Zaza, Zazaki | Macrolanguage | Living

@Nikki I would love to know how to request a new BCP47 language tag?

They are requested by posting a registration form to the ietf-languages@iana.org mailing list. You can subscribe to the mailing list and view the archives at https://www.ietf.org/mailman/listinfo/ietf-languages and https://tools.ietf.org/html/bcp47#appendix-B has two examples of the registration form.

Is there any evidence that this will be fixed as soon as possible? If not, then I think the priority should be downgraded to Normal/Low.