Add a configuration variable that allows disabling language codes for labels, descriptions, and aliases
Open, High, Public, Feature

Description

Problem:
There are various lists defining language codes for Wikibase. Some values in these lists are not suitable for the termbox on Wikidata.org (codes for labels/descriptions/aliases). See T44396 for the general problem. Despite efforts to clean this up by bot, the numbers are currently at 500,000 (June 2021, see T44396#7150919).

Suggested solution:
The idea is to add a configuration variable that allows disabling language codes for the termbox that are not suitable in this context.

This doesn't touch the use of these language codes for other purposes (like lexemes or monolingual strings). For the latter, we already have a similar implementation; see DifferenceContentLanguages with DefaultMonolingualTextLanguages.

Example:

  • Sample: "no" isn't used on Wikidata, but is in the domain name for no.wikipedia.org. Wikidata uses just "nb".
  • Other codes for an initial version of the variable: 'bat-smg' (→'sgs'), 'bh' (→'bho'), 'fiu-vro' (→'vro'), 'roa-rup' (→'rup'), 'simple' (→'en'), 'zh-classical' (→'lzh'), 'zh-min-nan' (→'nan'), 'zh-yue' (→'yue'), 'be-x-old' (→ 'be-tarask'), 'shy' (→ 'shy-latn'), 'de-formal', 'es-formal', 'hu-formal', 'nl-informal'

Acceptance criteria:

  • there is a configuration variable that allows disabling language codes for "labels, descriptions, and aliases" (everywhere including in the API, Special:SetLabel, etc.)
  • there should be a default configuration that makes sense for Wikibase instances in general (e.g. including "simple")
  • edge cases are cared for
    • Deletion of existing disabled language code values should still be possible
    • Reverts should still be possible, even if a disabled language code was used in the old revision.
    • When a disallowed code is used as the UI language, item labels, the page title and termbox all use the correct language instead (e.g. UI language de-formal should behave the same as de)
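The acceptance criteria above could be sketched roughly as follows. This is a minimal Python illustration, not Wikibase code: the names `ALL_LANGUAGES`, `DISABLED_TERM_LANGUAGES`, `term_languages` and `validate_term_edit` are hypothetical, the language lists are shortened samples, and the handling of reverts (the caller has already established that the edit restores an older revision) is an assumption.

```python
# Hypothetical sketch; none of these names exist in Wikibase.
ALL_LANGUAGES = {"en", "de", "de-formal", "nb", "no", "simple"}

# Assumed sample value of the proposed configuration variable.
DISABLED_TERM_LANGUAGES = {"no", "simple", "de-formal"}


def term_languages(all_languages, disabled):
    """Languages in which new labels, descriptions and aliases may be added."""
    return set(all_languages) - set(disabled)


def validate_term_edit(old_terms, new_terms, is_revert=False):
    """Return the language codes that are not allowed for new or changed terms.

    old_terms and new_terms map a language code to a term value.
    - Removing a term in a disabled language is always allowed.
    - Leaving an existing term untouched is allowed.
    - A revert may restore terms in disabled languages (assumption: the
      caller has verified that new_terms matches an older revision).
    """
    if is_revert:
        return []
    allowed = term_languages(ALL_LANGUAGES, DISABLED_TERM_LANGUAGES)
    offending = []
    for code, value in new_terms.items():
        if code in allowed:
            continue
        if old_terms.get(code) == value:
            continue  # existing value left as it was
        offending.append(code)
    return sorted(offending)
```

In this sketch an empty result means the edit is acceptable; a real implementation would presumably report the offending codes in an API error.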

Original:
There are various lists defining language codes for Wikibase. Some values in these lists are not suitable for the termbox on Wikidata.org. See T44396 for the general problem.

The idea is to add a configuration variable that allows disabling such language codes. Deletion of existing values should still be possible.

  • Sample: "no" isn't used on Wikidata, but is the domain name for no.wikipedia.org. Wikidata uses just "nb".
  • Other codes for an initial version of the variable: "bat-smg", "bh", "fiu-vro", "roa-rup", "simple", "zh-classical", "zh-min-nan", "zh-yue", 'de-formal', 'es-formal', 'hu-formal', 'nl-informal'

Despite efforts to clean this up by bot, the numbers are currently at 500,000 (June 2021, see T44396#7150919).

This doesn't touch their use for lexemes or monolingual strings. For the latter, see DifferenceContentLanguages with DefaultMonolingualTextLanguages.

Event Timeline

I think the list of disallowed languages can be something like:

    [
        'no', // T284808, use nb
        'be-x-old', // T284808, use be-tarask
    ]

That is just the list used for monolingual language codes, as done in lib/includes/WikibaseContentLanguages.php.
But I think it's best for @Lydia_Pintscher and her team to say how easy it is to implement the option and whether the list is usable.

Esc3300 updated the task description. (Show Details)

For backwards compatibility you may want:

  1. We automatically rewrite some language codes to others (e.g. be-x-old to be-tarask) with a warning
  2. Throw an error if (1) one request has different content for the be-x-old and be-tarask terms, and probably also if (2) the request tries to overwrite existing content using the wrong language code

If such language codes are removed entirely, it may be a breaking change.
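The rewrite-with-warning proposal in this comment could look roughly like the sketch below. It is a minimal Python illustration under stated assumptions, not Wikibase code: `REPLACEMENTS`, `ConflictingTermsError` and `normalize_terms` are hypothetical names, and the mapping only contains two of the examples from the task description.

```python
# Hypothetical mapping, taken from examples in the task description.
REPLACEMENTS = {"be-x-old": "be-tarask", "no": "nb"}


class ConflictingTermsError(ValueError):
    """Raised when a rewrite would silently discard or overwrite a term."""


def normalize_terms(submitted, existing=None):
    """Rewrite deprecated language codes; return (terms, warnings).

    submitted maps language codes to values from one request.
    existing maps language codes to values already stored on the entity.
    """
    existing = existing or {}
    result = {}
    warnings = []
    for code, value in submitted.items():
        target = REPLACEMENTS.get(code, code)
        if target != code:
            warnings.append(f"{code} rewritten to {target}")
        # (1) the same request carries different values for both codes
        if target in result and result[target] != value:
            raise ConflictingTermsError(f"conflicting values for {target}")
        # (2) a rewritten value would overwrite different existing content
        if target != code and existing.get(target, value) != value:
            raise ConflictingTermsError(f"{code} would overwrite {target}")
        result[target] = value
    return result, warnings
```

Whether rewriting is desirable at all is exactly what the following comments dispute; the sketch only shows what the two numbered points would mean in practice.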

  1. We automatically rewrite some language codes to others (e.g. be-x-old to be-tarask) with a warning

Not sure if we should actually encourage using incorrect language codes. This would be a change from the current solution where this doesn't happen.

If such language codes are removed entirely, it may be a breaking change.

For future code additions, this is something one has to bear in mind.

Esc3300 changed the subtype of this task from "Task" to "Feature Request". Jun 17 2021, 7:05 AM

@Lydia_Pintscher what priority should we give to this? I'd suggest Medium to High.

Don't we already have this? I don't think we allow adding terms for simple for example? Or am I missing something?

Don't we already have this? I don't think we allow adding terms for simple for example? Or am I missing something?

Nope, there are 181k simple labels.

Don't we already have this? I don't think we allow adding terms for simple for example? Or am I missing something?

Only for monolingual codes. See https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/Wikibase/+/refs/heads/master/lib/includes/WikibaseContentLanguages.php#254

I think the need was mentioned when the configuration for monolingual texts was implemented, but no one followed up on it.

Manuel renamed this task from add a configuration variable that disallows some language codes for labels/descriptions/aliases (termbox) to Add a configuration variable that allows disabling language codes for the termbox. Tue, Jul 20, 9:42 AM
Manuel updated the task description. (Show Details)

@Lucas_Werkmeister_WMDE: You mentioned some edge cases that we need to consider here. In case an important edge case is still missing, could you please let me know or add it directly?

I think the edge case I was thinking of was which language would be used on nowiki when getting the label of an item in “the wiki’s language”, but it looks like that already uses the nb label (tested on my user sandbox there, though that test will stop working the next time someone resets the sandbox item).

But I have a different question: the task description keeps talking about the “termbox”; is this task specifically about the termbox user interface (either the old one or the new / mobile one), or is it supposed to prevent labels, descriptions and aliases in these language codes, everywhere (including in the API, Special:SetLabel, etc.)?

is this task specifically about the termbox user interface (either the old one or the new / mobile one), or is it supposed to prevent labels, descriptions and aliases in these language codes, everywhere (including in the API, Special:SetLabel, etc.)?

Ideally the latter, though if the former is dealt with as a consequence of handling the latter then that's great too. (Perhaps I'm not the only one that uses "termbox" as a two-syllable equivalent to the nine-syllable "labels, descriptions, and aliases".)

Manuel renamed this task from Add a configuration variable that allows disabling language codes for the termbox to Add a configuration variable that allows disabling language codes for labels, descriptions, and aliases. Wed, Jul 21, 7:32 AM
Manuel updated the task description. (Show Details)

Thank you @Mahir256 and @Lucas_Werkmeister_WMDE! I have changed the description accordingly.

Looked at in story time, but not picked up today and will be tied into some other language related things in the coming week.

Manuel updated the task description. (Show Details)

Language codes uses termbox as a synonym for labels, descriptions, aliases.

On a practical level, disabling additions through the API would probably lead to the largest improvement for Wikidata.org.

I'm not really convinced by @Nikki 's addition of today:

  • When a disallowed code is used as the UI language, item labels, the page title and termbox all use the correct language instead (e.g. UI language de-formal should behave the same as de)

If I understand this correctly, it should do some conversion automatically.

  • If this is meant for codes used for labels/descriptions/aliases, I think this should be avoided; otherwise people will keep using invalid language codes. The information can still be displayed.
  • If it's merely for the GUI, this may be another task.

But I have a different question: the task description keeps talking about the “termbox”; is this task specifically about the termbox user interface (either the old one or the new / mobile one), or is it supposed to prevent labels, descriptions and aliases in these language codes, everywhere (including in the API, Special:SetLabel, etc.)?

I think it needs to be everywhere, otherwise we're still going to get people adding them on Special:NewItem, via gadgets (e.g. Label Lister), via QuickStatements, via bots, etc.

I'm not really convinced by @Nikki 's addition of today:

  • When a disallowed code is used as the UI language, item labels, the page title and termbox all use the correct language instead (e.g. UI language de-formal should behave the same as de)

If I understand this correctly, it should do some conversion automatically.

  • If this is meant for codes used for labels/descriptions/aliases, I think this should be avoided; otherwise people will keep using invalid language codes. The information can still be displayed.
  • If it's merely for the GUI, this may be another task.

UI language means the language you have selected for the user interface. You can change that by clicking on the language name next to your username at the top of the page, and you can also set which language you prefer on the "User profile" tab of the preferences (globally, if you wish).

Right now, UI languages are automatically label languages, and the language you select as your UI language is used to select which label to show as the page title, which labels to show in links to entities and which language comes first in the termbox. This ticket involves changing it so that UI languages are no longer automatically valid label languages. Therefore, one of the acceptance criteria should be that nothing breaks when the UI language is one that can't be used for labels.
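To illustrate the behaviour described here (UI language de-formal behaving the same as de), a fallback lookup could be sketched like this. It is a minimal Python illustration only: `FALLBACKS`, `TERM_LANGUAGES`, `label_language_for_ui` and `pick_label` are hypothetical names, the fallback chains are samples based on the codes listed in the task description, and falling back to "en" at the end is an assumption.

```python
# Assumed fallback chains; illustrative only.
FALLBACKS = {"de-formal": ["de"], "no": ["nb"], "simple": ["en"]}

# Assumed set of languages that remain valid for labels.
TERM_LANGUAGES = {"de", "en", "nb"}


def label_language_for_ui(ui_language):
    """Return the label language to use for a given UI language."""
    candidates = [ui_language] + FALLBACKS.get(ui_language, [])
    for code in candidates:
        if code in TERM_LANGUAGES:
            return code
    return "en"  # assumption: final fallback


def pick_label(labels, ui_language):
    """Return (language, label) for the page title, or None if missing."""
    code = label_language_for_ui(ui_language)
    if code in labels:
        return code, labels[code]
    return None
```

With this lookup, a user whose UI language is de-formal would see the de label as the page title, and the termbox would list de first, without de-formal itself being a valid label language.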

The feature request doesn't propose to change existing UI functionality.

It is something that could be improved, also for locales that aren't interface languages, e.g. uselang=fr-be, but this isn't directly linked to the feature request above.