duplicate/invalid language codes
Open, Normal, Public

Details

Reference
bz42396

Event Timeline

bzimport raised the priority of this task to Normal.
bzimport set Reference to bz42396.
bzimport added a subscriber: Unknown Object (MLST).
Merl created this task. Nov 23 2012, 7:55 PM

This bug is probably too general to be useful (perhaps transform into a tracking bug?), but as we have another equally general report let me copy it here:


Small update: I went through the language list at

https://github.com/mkroetzsch/wda/blob/master/includes/epTurtleFileWriter.py#L472

and added a number of TODOs to the most obvious problematic cases. Typical problems are:

  • Malformed language codes ('tokipona')
  • Correctly formed language codes without any official meaning (e.g., 'cbk-zam')
  • Correctly formed codes with the wrong meaning (e.g., 'sr-ec': Serbian from Ecuador?!)
  • Language codes with redundant information (e.g., 'kk-cyrl' should be the same as 'kk' according to IANA, but we have both)
  • Use of macrolanguages instead of languages (e.g., "zh" is not "Mandarin" but just "Chinese"; I guess we mean Mandarin; less sure about Kurdish ...)
  • Language codes with incomplete information (e.g., "sr" should be "sr-Cyrl" or "sr-Latn", both of which already exist; same for "zh" and "zh-Hans"/"zh-Hant", but also for "zh-HK" [is this simplified or traditional?]).
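The first category (malformed codes) can be separated from the others mechanically. Below is a minimal sketch in Python, assuming a simplified subset of the BCP 47 grammar: a 2-3 letter primary subtag, optional extlang, script, region, variant and private-use subtags. The name `is_well_formed` is hypothetical and not taken from the linked script. The check is purely syntactic, so it accepts 'sr-ec' and 'cbk-zam'; whether a well-formed code carries the intended meaning still requires a lookup in the IANA subtag registry.

```python
import re

# Simplified BCP 47 shape (assumption: primary subtag limited to 2-3 letters,
# which is stricter than the full grammar; grandfathered tags are not handled).
_WELL_FORMED = re.compile(
    r"[a-z]{2,3}"                                # primary language subtag
    r"(?:-[a-z]{3}){0,3}"                        # up to three extlang subtags
    r"(?:-[a-z]{4})?"                            # script, e.g. Cyrl
    r"(?:-(?:[a-z]{2}|[0-9]{3}))?"               # region, e.g. HK or 419
    r"(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*"  # variants
    r"(?:-x(?:-[a-z0-9]{1,8})+)?",               # private use, e.g. x-old
    re.IGNORECASE,
)


def is_well_formed(tag: str) -> bool:
    """Return True if the tag matches the simplified syntax above."""
    return _WELL_FORMED.fullmatch(tag) is not None


if __name__ == "__main__":
    for tag in ("tokipona", "cbk-zam", "sr-ec", "kk-cyrl", "zh-HK"):
        print(tag, is_well_formed(tag))
```

Only 'tokipona' is rejected here; the remaining problem categories are semantic and cannot be caught by a pattern alone.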

matej_suchanek set Security to None.
Restricted Application added a project: Wikidata. Jul 20 2015, 2:36 PM
Fomafix reopened this task as Open. Jul 21 2015, 1:11 PM
Fomafix added a subscriber: Fomafix.

Reopened. It is not fixed. It is still possible to add unwanted values via API:

It seems fixed to me. I just made this edit with uselang=be-x-old: https://www.wikidata.org/w/index.php?title=Q1&diff=112330190&oldid=112313552

Here you got the correct language code be-tarask because MediaWiki core converts the URL parameter uselang=be-x-old to the user interface language be-tarask.

Restricted Application added a subscriber: Aklapper. Jul 21 2015, 1:11 PM
Koavf added a subscriber: Koavf. Aug 25 2015, 6:36 AM

No. T39459 requests restricting the user interface to known languages. The database is already restricted to known languages, except for duplicate language codes like als/gsw, be-x-old/be-tarask, ...

This task requests restricting the database, including for API requests, so that the unwanted language codes defined in wgDummyLanguageCodes are disallowed.
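A minimal sketch of what such a write-time check could look like, in Python for illustration only (Wikibase itself is PHP). The mapping `DUMMY_CODES` is a hypothetical stand-in that contains just the two pairs named above; it is not a copy of wgDummyLanguageCodes, and the function and exception names are assumptions.

```python
# Hypothetical stand-in for a deprecated-to-canonical mapping; only the two
# pairs mentioned in this task are listed.
DUMMY_CODES = {
    "als": "gsw",
    "be-x-old": "be-tarask",
}


class DeprecatedLanguageCode(ValueError):
    """Raised when a term is submitted under a deprecated language code."""


def check_term_language(code: str) -> str:
    """Return the normalized code, or raise if it is a deprecated duplicate."""
    normalized = code.strip().lower()
    if normalized in DUMMY_CODES:
        canonical = DUMMY_CODES[normalized]
        raise DeprecatedLanguageCode(
            f"language code '{normalized}' is deprecated; use '{canonical}'"
        )
    return normalized


if __name__ == "__main__":
    print(check_term_language("gsw"))
    try:
        check_term_language("be-x-old")
    except DeprecatedLanguageCode as error:
        print(error)
```

Rejecting the request, rather than silently rewriting the code, keeps the caller aware of the canonical code; resolving it automatically would be the other option discussed in T102533.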

T39459 requests restricting the user interface to known languages

No, it doesn't.

It is not possible to add a label/description/alias with language code xyzzy. Neither via GUI nor via API. When you try to do it, you get an error message.

When you open the GUI with uselang=xyzzy you get a UI which gives you input elements for label/description/alias in language xyzzy. You showed me exactly this in T39459#1468881.

Esc3300 added a subscriber: Esc3300. Nov 7 2016, 7:40 PM

Can we close this and just make new tasks for anything that is still outstanding?

Can we close this and just make new tasks for anything that is still outstanding?

You cannot close this task as resolved, because it is not solved. But you can merge it with a similar task, for example T102533: [Bug] Disallow (or resolve) dummy language codes.

Solving this task requires several subtasks. The first should be to disallow adding new entries with deprecated language codes.

Liuxinyu970226 added a comment. Edited Nov 11 2016, 12:49 AM

In T44396#2787104, @Fomafix wrote:

You cannot close this task as resolved, because it is not solved. But you can merge it with a similar task, for example T102533: [Bug] Disallow (or resolve) dummy language codes.

Solving this task requires several subtasks. The first should be to disallow adding new entries with deprecated language codes.

Good point! But

You can not create a relationship to object "PHID-TASK-hubexo6f7fq5spgvmjqd" because objects can not be related to themselves.

...

Restricted Application removed a subscriber: Liuxinyu970226. Nov 11 2016, 12:51 AM
Liuxinyu970226 added a comment. Edited Nov 11 2016, 3:49 PM

<del>Then what's MLST here?</del>

NOTE: This is tracked at: T122677

From "bzimport added a subscriber: Unknown Object (MLST)."? Probably wikibugs-l; IIRC they removed this functionality from Phabricator.

Restricted Application added a subscriber: PokestarFan. Jul 25 2017, 3:29 AM

So there are currently more than 30,000 invalid terms in Wikidata, mostly in als, es-formal, no and simple. Doing cleanup again and again is pointless.

Language codes in question:

als
bat-smg
bh
de-formal
es-formal
fiu-vro
hu-formal
nl-informal
no
roa-rup
simple
zh-classical
zh-min-nan
zh-yue

They are all supported by MediaWiki but should be blacklisted in Wikibase.
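To size the cleanup before a blacklist is in place, the existing terms can be tallied per offending code. The sketch below assumes the terms are available as `(entity_id, language, text)` tuples, for example extracted from a dump; that input shape and the name `count_blacklisted` are assumptions for illustration, not an existing Wikibase API.

```python
from collections import Counter
from typing import Iterable, Tuple

# The language codes listed above.
BLACKLISTED = frozenset({
    "als", "bat-smg", "bh", "de-formal", "es-formal", "fiu-vro",
    "hu-formal", "nl-informal", "no", "roa-rup", "simple",
    "zh-classical", "zh-min-nan", "zh-yue",
})

# Assumed input shape: (entity_id, language, text)
Term = Tuple[str, str, str]


def count_blacklisted(terms: Iterable[Term]) -> Counter:
    """Count terms per blacklisted language code."""
    return Counter(
        language for _entity_id, language, _text in terms
        if language in BLACKLISTED
    )


if __name__ == "__main__":
    sample = [
        ("Q1", "als", "Universum"),
        ("Q1", "gsw", "Universum"),
        ("Q2", "simple", "Earth"),
        ("Q2", "no", "Jorden"),
        ("Q3", "als", "Läbe"),
    ]
    for language, count in count_blacklisted(sample).most_common():
        print(language, count)
```

The sample rows are made up; running the same tally over real data would show which codes account for most of the invalid terms.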