
[Story] Show all available languages in monolingual text value's suggester
Closed, Resolved | Public | 8 Estimated Story Points

Assigned To
Authored By
adrianheine
Jan 26 2016, 10:05 AM
Referenced Files
F34121977: Screenshot from 2021-02-25 09-53-12.png
Feb 25 2021, 8:55 AM
F34121970: image.png
Feb 25 2021, 8:42 AM
F34121972: image.png
Feb 25 2021, 8:42 AM
F11177787: Screenshot_20171204_172808.png
Dec 4 2017, 4:31 PM
F11177785: Screenshot_20171204_172854.png
Dec 4 2017, 4:31 PM
Tokens
"Love" token, awarded by Lokal_Profil. "Pterodactyl" token, awarded by Liuxinyu970226. "Pterodactyl" token, awarded by Charlie_WMDE.

Description

As an editor I want to enter values for Properties with datatype monolingual text in any available language in order to record complete data.

Problem:
The language suggester for monolingual text does not show some accepted languages in its dropdown despite it being possible to save statements with these values. This is confusing for users.

Example:
You can store statements for monolingual text values with language code cho but it is not shown in the dropdown when entering the language code.

Screenshots/mockups:

Screenshot_20171204_172854.png (252×805 px, 26 KB)

Screenshot_20171204_172808.png (147×819 px, 11 KB)

BDD
GIVEN a special language code
WHEN entering a monolingual text value
AND entering the special language code in the language field
THEN it is recognized
AND shows up in the suggester

Acceptance criteria:

  • all accepted language codes show up in the dropdown for monolingual text values
  • at least the language code is displayed, if possible the autonym (the language name in the language itself) as well or ideally the translated language name + code
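The display rule in the second criterion can be sketched as a small fallback chain. This is only an illustration in Python, not the Wikibase implementation; the function name `suggester_label` and its parameters are hypothetical.

```python
from typing import Optional


def suggester_label(
    code: str,
    translated: Optional[str] = None,
    autonym: Optional[str] = None,
) -> str:
    """Build the text shown for one dropdown entry.

    Preference order, following the acceptance criteria:
    translated name + code, then autonym + code, then the bare code.
    """
    name = translated or autonym
    if name:
        return f"{name} ({code})"
    # No name known at all: still show the code so the entry is selectable.
    return code
```

With this rule an accepted code such as `cho` always appears in the list, even when no name is available for it.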

Event Timeline

I can't think of any place where we would show only language codes. Usually it is one of these:

  1. autonyms only
    • E.g. Universal Language Selector, Interlanguage list, translatable pages
  2. language code + translated language name with fallback to autonym
    • E.g. action=info, Special:PageLanguage
  3. language code + autonym
    • E.g. Advanced search beta feature, Special:Preferences

Also, falling back to English is not foolproof, as the English name might also be unavailable until it is added.

Not sure if anyone is still tracking this, but I ran into this today, and it doesn't seem to work at all with any Native American languages.

@Amire80 @Sascha Do any of you have experience with adding languages/language names in CLDR? Is that a complex or long process?

I reported a few CLDR issues, and some of them were resolved, but I can't say I'm exceptionally good at getting them to resolve my issues or at adding new languages. I think that @Nemo_bis may be more experienced in this particular area, however.

I am not either of those people, but my comment at T151269#2822033 seems relevant here. CLDR have already rejected some of our requests because they don't want to add lots of language names. There's a suggestion at T168799 to create our own extension instead.

The easiest way to add a new language to CLDR is preparing ‘seed’ files in XML format;

When reading data from CLDR, consider injecting the English names from the IANA language subtag registry as a fallback when a language is missing from CLDR. That would immediately give at least an English name to every language in existence (provided it has an ISO/IETF language code). Another good data source for enriching CLDR might be Wikidata, via property P305 (IETF language tag).
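A rough sketch of that fallback, assuming the registry text has already been downloaded. The record layout (records separated by `%%` lines, with `Type`, `Subtag` and `Description` fields) is the registry's published format; the function names and the `cldr_names` dictionary are hypothetical.

```python
import re
from typing import Dict


def parse_iana_language_names(registry_text: str) -> Dict[str, str]:
    """Map each language subtag to its first Description field."""
    names: Dict[str, str] = {}
    for record in re.split(r"^%%[ \t]*$", registry_text, flags=re.MULTILINE):
        fields: Dict[str, str] = {}
        for line in record.splitlines():
            if line.startswith((" ", "\t")):
                continue  # folded continuation line; ignored in this sketch
            key, sep, value = line.partition(":")
            # Keep only the first occurrence (records may repeat Description).
            if sep and key.strip() not in fields:
                fields[key.strip()] = value.strip()
        if fields.get("Type") == "language" and "Subtag" in fields:
            names.setdefault(fields["Subtag"], fields.get("Description", ""))
    return names


def english_name(code: str, cldr_names: Dict[str, str], iana_names: Dict[str, str]) -> str:
    """Prefer CLDR, fall back to IANA, and finally to the bare code."""
    return cldr_names.get(code) or iana_names.get(code) or code
```

CLDR stays authoritative in this sketch; the registry only fills gaps, so an existing CLDR name is never overwritten.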

Disclaimer: I volunteer at Unicode CLDR and am the maintainer for some minor parts of its codebase. So in my personal experience, the process has been super smooth.. :-)

@Nikki, can you send me your CLDR tickets that got rejected? I’d like to understand the reason, it sounds surprising.

Thank you all for your feedback!
@Sascha Your experience with CLDR could definitely be useful for Wikidata, since we're struggling with displaying names of languages that are not entered yet in CLDR.

For example, Numidian (nxm) has been added as an available language for monolingual text in Wikidata, but when I try to use it it's not appearing in the suggestion list, causing confusion for users who may think that the language is not available.

(quick way to test it: go to the sandbox, add a new statement with the property title, then enter a test value and finally type "nxm" or "num" in the language field that appears: Numidian is not suggested. However, if you type nxm and save the statement, it's correctly saved and "Numidian" is displayed)

The code "nxm" seems to be unavailable in CLDR https://www.unicode.org/repos/cldr/trunk/seed/main/nxm.xml

I'd like to make an experiment with this example: try adding this language to CLDR, and see if this action solves our problem on Wikidata. Would anyone be willing to try submitting data about Numidian to CLDR? :)

Sure, but it will take a while until the next official release of CLDR so you'd have to read the CLDR data from the development branch ("trunk"). I do wonder, though, if you could read the IANA registry in addition to CLDR and use IANA as fallback for the English names when CLDR has no data yet. Then, you would immediately get an English name for every language with an ISO 639 or IETF BCP 47 code, so you'd add support for a couple thousand languages at once.

Can you point me to the source repository where you are currently reading CLDR?

@Nikki, can you send me your CLDR tickets that got rejected? I’d like to understand the reason, it sounds surprising.

The ones I'm aware of are the ones I mentioned in T151269#2822033 and the comment directly after it.

Oh, all you need from CLDR is an English label? Nothing else? In that case, this Wikidata query might be helpful:

SELECT ?code ?itemLabel
WHERE  {
  ?item wdt:P305 ?code
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

It shouldn't be difficult to write a script that fetches the current CLDR data file and patches it with labels from Wikidata. (Also in other languages than English). Even easier might be to change the source code of the tool that you're currently using to read CLDR; is that tool publicly available?
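The patching step might look roughly like this, assuming the query result has been fetched in the standard SPARQL JSON format and the existing CLDR names are available as a dictionary. The names `patch_names` and `cldr_names` are hypothetical, and lowercasing the codes is an assumption about how the consuming tool keys its languages.

```python
import re
from typing import Dict, Iterable, Mapping


def patch_names(
    cldr_names: Mapping[str, str],
    bindings: Iterable[Mapping[str, Mapping[str, str]]],
) -> Dict[str, str]:
    """Return a copy of cldr_names with missing codes filled from Wikidata.

    `bindings` is the `results.bindings` list of a SPARQL JSON response
    for the query above (variables ?code and ?itemLabel).
    """
    patched = dict(cldr_names)
    for row in bindings:
        code = row["code"]["value"].lower()  # assumption: lowercase keys
        label = row["itemLabel"]["value"]
        # The label service returns the bare item ID when no label exists.
        if re.fullmatch(r"Q\d+", label):
            continue
        # Only fill gaps; never overwrite a name that CLDR already has.
        patched.setdefault(code, label)
    return patched
```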

In T124758#4126411, thiemowmde wrote:

A temporary workaround could be to add the additional language codes to the list of suggested languages. These entries would only show the code, but no language name. That's obviously far from perfect, but much better than nothing. Have a look at the JavaScript class wikibase.WikibaseContentLanguages. It currently simply returns the UniversalLanguageSelector's language list, but excludes a few that are also excluded in the backend (see WikibaseRepo::getMonolingualTextLanguages). This means there is already some duplication going on in the backend and frontend! This duplication could either be expanded, or resolved by introducing a MediaWiki-ResourceLoader module that returns the list of languages allowed in monolingual values.

A list of monolingual language codes is now available, though not as a ResourceLoader module, but via the action API, as meta=wbcontentlanguages. And as far as I can tell, we actually have English names for all those languages:

$ curl -G -s \
    -d action=query \
    -d meta=wbcontentlanguages \
    -d wbclcontext=monolingualtext \
    -d wbclprop='code|name' \
    -d format=json \
    -d formatversion=2 \
    https://www.wikidata.org/w/api.php | \
  jq -c '.query.wbcontentlanguages | .[] | select(.name == null)' | \
  wc -l
0
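For scripts that already hold the parsed JSON response, the same check as the `jq` filter could be written in Python. This is a sketch; the function name `codes_without_name` is hypothetical, and it accepts the `wbcontentlanguages` value as either an object keyed by code or a list, since `jq`'s `.[]` iterates over both.

```python
from typing import Any, List, Mapping


def codes_without_name(response: Mapping[str, Any]) -> List[str]:
    """List language codes whose entry has no (or a null) name."""
    langs = response["query"]["wbcontentlanguages"]
    entries = langs.values() if isinstance(langs, dict) else langs
    # Like `select(.name == null)`, this matches missing and null names.
    return [entry["code"] for entry in entries if entry.get("name") is None]
```

An empty list corresponds to the `0` printed by the pipeline above.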

It looks like @Raymond periodically adds them to our CLDR MediaWiki extension, as an addition to the upstream CLDR data (example change). I don’t know why they’re not displayed on the monolingual statements themselves (example statement, currently shows “sjn” instead of “Sindarin”), but we seem to have them in some form or other. (I guess this also answers @Sascha’s last question?)

Yes, I monitor the addition of new languages and add them to CLDR as soon as possible. In your example I see the word "Sindarin", but not while typing the language name or language code into the input field. I am not sure if this is a regression/new bug.

Btw: My (more or less) complete test item is https://test.wikidata.org/wiki/Q149653

In your example I see the word "Sindarin".

Ah – after I purged the English page, I see “Sindarin” as well, so that’s actually working, it was just cached from before your addition. I assume you were looking at the page in German, and the German version wasn’t cached yet.

But not while typing the language name or language code into the input field.

Yes, that’s what this bug is about :) we currently don’t have those extra language codes and names client-side.

Change 425785 abandoned by Thiemo Kreuz (WMDE):
[mediawiki/extensions/Wikibase@master] [WIP] Expose additional monolingual languages to LanguageSelector

Reason:

https://gerrit.wikimedia.org/r/425785

There are even a bunch of languages we can add labels for which don't show up in the list, despite not being explicitly excluded, e.g. aa, cho, dag, es-419, ho, hz, ng, rn, shi-latn, uz-cyrl, uz-latn...

Task Inspection note:
The list of languages without a translated language name is here https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/711874256d9d00d5cd3ed1e2d3e82391aaac735c/lib/includes/WikibaseContentLanguages.php#84

In order to send the language codes to JS we could maybe use the resource loader packageFiles mechanism: https://www.mediawiki.org/wiki/ResourceLoader/Package_files#Generated_content

Since the language dropdown for lexeme senses uses the same list, that has the same problem. The codes I mentioned in #6753045 don't show up, nor do any lexeme-specific languages like ctg, fro, nrf-je, az-cyrl.

Screenshot from @Masssly:

Dropdown list in Gloss language codes for Senses does not show Dagbanli, but accepts the "dag" code anyway. This is not a problem when adding Forms. (2×2 px, 214 KB)

Change 665145 had a related patch set uploaded (by Jakob; owner: Jakob):
[data-values/value-view@master] LanguageSelector.tests: refactor for readability

https://gerrit.wikimedia.org/r/665145

Change 665146 had a related patch set uploaded (by Jakob; owner: Jakob):
[data-values/value-view@master] LanguageSelector: make language names optional, but not languages

https://gerrit.wikimedia.org/r/665146

Change 665309 had a related patch set uploaded (by Jakob; owner: Jakob):
[mediawiki/extensions/Wikibase@master] Show all available languages in monolingual text lang suggester

https://gerrit.wikimedia.org/r/665309

Change 665315 had a related patch set uploaded (by Jakob; owner: Jakob):
[mediawiki/extensions/WikibaseLexeme@master] Show all available languages in Gloss lang suggester

https://gerrit.wikimedia.org/r/665315

Change 665145 merged by jenkins-bot:
[data-values/value-view@master] LanguageSelector.tests: refactor for readability

https://gerrit.wikimedia.org/r/665145

Change 665146 merged by jenkins-bot:
[data-values/value-view@master] LanguageSelector: make language names optional

https://gerrit.wikimedia.org/r/665146

Change 666000 had a related patch set uploaded (by Jakob; owner: Jakob):
[mediawiki/extensions/WikibaseLexeme@master] InvalidLanguageIndicator: inject valid languages

https://gerrit.wikimedia.org/r/666000

Change 665315 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Show all available languages in Gloss lang suggester

https://gerrit.wikimedia.org/r/665315

Change 665309 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Show all available languages in monolingual text lang suggester

https://gerrit.wikimedia.org/r/665309

Change 666000 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] InvalidLanguageIndicator: inject valid languages

https://gerrit.wikimedia.org/r/666000

Change 666132 had a related patch set uploaded (by Jakob; owner: Jakob):
[mediawiki/extensions/WikibaseLexeme@master] Move dynamic source file callback out of resource.php

https://gerrit.wikimedia.org/r/666132

Change 666132 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Move dynamic source file callback out of resource.php

https://gerrit.wikimedia.org/r/666132

Amy and I tested it on test. Our observation:

  • It works fine for dag and a few other codes we tested \o/
  • There are some issues with other codes that we found when testing with ctg. It is not showing up in the selector. The publish link however turns blue, indicating that it'd be accepted. When clicking publish it is then rejected. See the screenshots below.

image.png (223×1 px, 19 KB)

image.png (223×1 px, 18 KB)

@Lydia_Pintscher Could it be that ctg isn't a monolingual text language? I found it in the list of additional lexeme term languages but not in the list of monolingual text languages.

Update: Yes, according to T271589 ctg was only added as a lexeme term language, so this works as designed. On a side note, one of the patches here also made all available lexeme term languages pop up in their respective language selectors, where you'll now also find ctg:

Screenshot from 2021-02-25 09-53-12.png (170×254 px, 13 KB)

Yeah I think it's fine and expected that it isn't accepted. However then the publish link should not turn from gray to blue, right?

The publish link always turns blue for any input on the language selector. This was done previously to allow languages that aren't part of the dropdown to be entered, so this language selector works exactly the same way as it did before, just with a more complete list of languages. I agree that it makes a lot less sense now that all allowed languages are actually in the dropdown.
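If the selector were changed to enable publishing only for allowed codes, the check could be as simple as the following sketch. It illustrates the idea only and does not describe the current behavior; `can_publish` and `allowed_codes` are hypothetical names, and normalizing the input by trimming and lowercasing is an assumption.

```python
from typing import Collection


def can_publish(language_input: str, allowed_codes: Collection[str]) -> bool:
    """Enable the publish link only for a code in the allowed list."""
    return language_input.strip().lower() in allowed_codes
```

With a monolingual text list that contains `dag` but not `ctg`, the link would stay gray for `ctg` instead of turning blue and failing on save.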

If this ticket is going to be closed, which ticket covers showing the language names in the dropdown?

And why are the language names missing for the ones I listed in #6753045 anyway? Those are all ones which are available for labels (see the language selector on https://test.wikidata.org/wiki/Special:NewItem) and I don't know why they weren't showing up in the list in the first place.

@amy_rc ^ Can you create a new ticket for Nikki's comment?

Change 668713 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Remove outdated comment in getDefaultMonolingualTextLanguages()

https://gerrit.wikimedia.org/r/668713

Change 668713 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Remove outdated comment in getDefaultMonolingualTextLanguages()

https://gerrit.wikimedia.org/r/668713