
[Story] Show all available languages in monolingual text value's suggester
Open, Normal, Public

Description

Right now you can use more language codes for monolingual text values than what the suggester knows. This will get worse with T124757. As a user, I want to see all available options in the suggester.

Event Timeline

adrianheine raised the priority of this task to Normal. · Jan 26 2016, 10:05 AM
adrianheine updated the task description.
adrianheine added a project: Wikidata.
adrianheine added a subscriber: adrianheine.
Restricted Application added a subscriber: Aklapper. · Jan 26 2016, 10:05 AM

From story time meeting on 2016-02-23:

  • Have an API module for looking up language codes and associated language names
  • API module needs to support different language sets (e.g. for monolingual text, terms, etc.)
  • Parameters needed by the module: UI language, search string, desired language set
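A minimal sketch of the lookup described in the meeting notes above, assuming an in-memory table. The names `LANGUAGE_SETS` and `lookup_languages`, the set keys, and the sample data are all hypothetical; the notes only fix the three inputs (UI language, search string, desired language set).

```python
# Hypothetical sample data: language set -> language code -> UI language -> name.
LANGUAGE_SETS = {
    "monolingualtext": {
        "de": {"en": "German", "de": "Deutsch"},
        "nxm": {"en": "Numidian"},
        "sjn": {"en": "Sindarin"},
    },
    "term": {
        "de": {"en": "German", "de": "Deutsch"},
    },
}


def lookup_languages(ui_language, search, language_set):
    """Return sorted (code, name) pairs from one language set.

    A language matches when its code, or its name in the UI language,
    starts with the search string (case-insensitive). The name is None
    when no name is known in the UI language.
    """
    needle = search.casefold()
    results = []
    for code, names in LANGUAGE_SETS[language_set].items():
        name = names.get(ui_language)
        candidates = [code] + ([name] if name else [])
        if any(c.casefold().startswith(needle) for c in candidates):
            results.append((code, name))
    return sorted(results)
```

Keeping the language set as an explicit parameter means the same lookup can serve monolingual text and terms without the caller needing to know where each list comes from.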
Charlie_WMDE added a subscriber: Charlie_WMDE.

+1 from UX-perspective. It is very confusing for the user that a valid language doesn't show up in the suggester with no apparent reason provided.

@Lydia_Pintscher What needs to be done to push this forward?

A mock-up for the developers is needed.

Restricted Application added a project: Design. · Sep 14 2017, 11:42 AM
Lokal_Profil added a subscriber: Lokal_Profil.
Charlie_WMDE moved this task from Incoming WD to To Do on the WMDE-Design board. · Nov 13 2017, 2:11 PM

@Lydia_Pintscher before I make one I just wanted to make sure we're talking about the same thing here.

The ticket asks for all available languages to show up in the monolingual language drop down, which is currently not the case.

This is the drop-down I'm talking about:

And this is a language code which exists but can't be added because it doesn't appear in the drop-down:

What would you need in the mock-up for this to go into development? I could simply add some languages that are not currently available. Beyond that, shouldn't they just be listed alongside the rest of the language codes, or is there something I'm missing here?

The issue we have is that for some of the codes we do not have a language name, so we cannot show a name in the dropdown, only the code. So how should they be shown in the list?

T109459 is the task for a better UI for this in general.

@Lydia_Pintscher then my question is: why do some of the languages not have a language name? And is that maybe something we should change?

Because they don't come from where the other ones come from and we can't add translation in this place.

I'm sorry. I think I still don't understand this. Where do "the other ones" come from, where do "they" come from, and why do they need a translation? Shouldn't every monolingual language have an item? Also, the field currently cannot find languages when the name is typed in the "wrong" language, as the property suggester would, i.e. when searching for "Deutsch" in the English interface, it won't find German.

daniel added a comment. · Dec 7 2017, 3:41 PM

@Charlie_WMDE our current language list comes from MediaWiki's UI i18n; it has nothing to do with Items. We have a way to allow additional codes, but no good way to define localized language names for them. Language names for the standard UI languages come from the Unicode CLDR library.

All that makes sense if you think of item labels - we use them to adapt the display of items to the user's UI language. All this makes no sense at all if you think of lexemes. Hence the confusion. And somewhere in the middle, you have monolingual text.

This issue keeps confusing our users. And indeed, as @adrianheine pointed out in the task description, this gets worse every time we add a new language code that does not show up in the suggester.

A temporary workaround could be to add the additional language codes to the list of suggested languages. These entries would only show the code, but no language name. That's obviously far from perfect, but much better than nothing. Have a look at the JavaScript class wikibase.WikibaseContentLanguages. It currently simply returns the UniversalLanguageSelector's language list, but excludes a few that are also excluded in the backend (see WikibaseRepo::getMonolingualTextLanguages). This means there is already some duplication going on in the backend and frontend! This duplication could either be expanded, or resolved by introducing a MediaWiki-ResourceLoader module that returns the list of languages allowed in monolingual values.
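The workaround described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `build_suggester_entries` and its parameters are hypothetical names, and the real logic lives in the JavaScript class and backend method mentioned in the comment, not in Python.

```python
def build_suggester_entries(uls_names, excluded_codes, additional_codes):
    """Merge the standard language list with extra, name-less codes.

    uls_names:        mapping of language code -> display name (the standard list)
    excluded_codes:   codes that the backend does not allow
    additional_codes: extra allowed codes for which no name is known
    """
    # Start from the standard list, minus the excluded codes.
    entries = {
        code: name
        for code, name in uls_names.items()
        if code not in excluded_codes
    }
    # Additional codes show the bare code as their label; a code that
    # already has a proper name keeps that name.
    for code in additional_codes:
        entries.setdefault(code, code)
    return entries
```

For example, with a hypothetical standard list `{"de": "German", "en": "English"}` and the additional code `"nxm"`, the result contains `"nxm": "nxm"` next to the two named entries.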

Change 425785 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: Thiemo Kreuz (WMDE)):
[mediawiki/extensions/Wikibase@master] [WIP] Expose additional monolingual languages to LanguageSelector

https://gerrit.wikimedia.org/r/425785

Okay, recap:

We only display languages that are in CLDR. Where possible, they appear (and are searchable) in the language set in the user's preferences and by language code. If no translation exists, they appear in English in the dropdown.

When selected and saved, a value appears as "text string (name of language in English)", regardless of whether CLDR has a translation for it and of how it appeared to the user in the dropdown.

Languages that are not on this list don't appear at all, but can secretly be saved.

Things that need to happen:

  • Languages that are not in CLDR but have been approved by the Phabricator process need to show up in the drop-down as well
  • What is shown in the drop down and what appears after having saved should be consistent
  • Decide what to do with languages that don't have a translation

In the meeting we talked about preferably showing the language name in the language that is selected in the settings. If that's not available I think it makes sense to show the language code rather than showing it in the language of the language because that will often not be readable. Reverting to English as a default may seem sensible for us, but this would put non-English speakers at a disadvantage.

I suggest another meeting to decide this. @Lydia_Pintscher

I can't think of any place where we would show only language codes. Usually it is one of these:

  1. autonyms only
    • E.g. Universal Language Selector, Interlanguage list, translatable pages
  2. language code + translated language name with fallback to autonym
    • E.g. action=info, Special:PageLanguage
  3. language code + autonym
    • E.g. Advanced search beta feature, Special:Preferences

Also, falling back to English is not foolproof, as it might also be not available until added.
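The three display patterns listed above can be sketched as small formatting helpers. All names and the sample data are hypothetical, and the choice to show the bare code when no name is known at all is an assumption, not something the comment specifies.

```python
# Hypothetical sample data.
AUTONYMS = {"de": "Deutsch", "fi": "suomi"}
TRANSLATED = {"en": {"de": "German"}}  # UI language -> code -> translated name


def autonym_only(code, autonyms):
    """Pattern 1: autonym only (bare code if no autonym is known)."""
    return autonyms.get(code, code)


def code_with_translated_name(code, ui_language, translated, autonyms):
    """Pattern 2: code + translated name, falling back to the autonym."""
    name = translated.get(ui_language, {}).get(code) or autonyms.get(code)
    return f"{code} - {name}" if name else code


def code_with_autonym(code, autonyms):
    """Pattern 3: code + autonym."""
    name = autonyms.get(code)
    return f"{code} - {name}" if name else code
```

None of these helpers falls back to English, which matches the concern that an English name may itself be missing.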

Zache added a subscriber: Zache. · Jun 6 2018, 5:30 PM

Not sure if anyone is still tracking this, but I ran into this today, and it doesn't seem to work at all with any Native American languages.

Tarrow added a subscriber: Tarrow. · Dec 4 2018, 9:12 AM
Mvolz added a subscriber: Mvolz. · Feb 27 2019, 2:03 PM

@Amire80 @Sascha Do any of you have experience with adding languages/language names in CLDR? Is that a complex or long process?

I reported a few CLDR issues, and some of them were resolved, but I can't say I'm exceptionally good at getting them to resolve my issues or at adding new languages. I think that @Nemo_bis may be more experienced in this particular area, however.

Nikki added a comment. · Mar 13 2019, 7:08 PM

@Amire80 @Sascha Do any of you have experience with adding languages/language names in CLDR? Is that a complex or long process?

I am not either of those people, but my comment at T151269#2822033 seems relevant here. CLDR have already rejected some of our requests because they don't want to add lots of language names. There's a suggestion at T168799 to create our own extension instead.

The easiest way to add a new language to CLDR is preparing ‘seed’ files in XML format;

When reading data from CLDR, consider injecting the English names from the IANA language subtag registry as a fallback when a language is missing from CLDR. That would immediately give at least an English name to every language in existence (provided it has an ISO/IETF language code). Another good data source for enriching CLDR might be Wikidata, via property P305 (IETF language tag).

Disclaimer: I volunteer at Unicode CLDR and am the maintainer for some minor parts of its codebase. So in my personal experience, the process has been super smooth. :-)

@Nikki, can you send me your CLDR tickets that got rejected? I’d like to understand the reason, it sounds surprising.

Thank you all for your feedback!
@Sascha Your experience with CLDR could definitely be useful for Wikidata, since we're struggling with displaying names of languages that are not entered yet in CLDR.

For example, Numidian (nxm) has been added as an available language for monolingual text in Wikidata, but when I try to use it it's not appearing in the suggestion list, causing confusion for users who may think that the language is not available.

(quick way to test it: go to the sandbox, add a new statement with the property title, then enter a test value and finally type "nxm" or "num" in the language field that appears: Numidian is not suggested. However, if you type nxm and save the statement, it's correctly saved and "Numidian" is displayed)

The code "nxm" seems to be unavailable in CLDR: https://www.unicode.org/repos/cldr/trunk/seed/main/nxm.xml

I'd like to run an experiment with this example: try adding this language to CLDR, and see if that solves our problem on Wikidata. Would anyone be willing to try submitting data about Numidian to CLDR? :)

Sascha added a comment. · Edited · Mar 14 2019, 2:52 PM

Sure, but it will take a while until the next official release of CLDR so you'd have to read the CLDR data from the development branch ("trunk"). I do wonder, though, if you could read the IANA registry in addition to CLDR and use IANA as fallback for the English names when CLDR has no data yet. Then, you would immediately get an English name for every language with an ISO 639 or IETF BCP 47 code, so you'd add support for a couple thousand languages at once.
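The IANA fallback suggested above might look roughly like this. It is a sketch, not a tested integration: the function names are hypothetical, `SAMPLE_REGISTRY` is a made-up excerpt written in the registry's record format (records separated by `%%`, with `Type`, `Subtag` and `Description` fields), and folded continuation lines are simply ignored.

```python
import re

# Made-up excerpt in the record format of the IANA language subtag registry.
SAMPLE_REGISTRY = """\
File-Date: 2019-01-01
%%
Type: language
Subtag: de
Description: German
%%
Type: language
Subtag: nxm
Description: Numidian
%%
Type: script
Subtag: Latn
Description: Latin
"""


def parse_iana_language_names(registry_text):
    """Map language subtags to their first Description field."""
    names = {}
    for record in re.split(r"^%%\s*$", registry_text, flags=re.MULTILINE):
        fields = {}
        for line in record.splitlines():
            # Skip blank lines and folded continuation lines.
            if not line or line[0].isspace() or ":" not in line:
                continue
            key, _, value = line.partition(":")
            # Keep only the first value of a repeated field.
            fields.setdefault(key.strip(), value.strip())
        if (
            fields.get("Type") == "language"
            and "Subtag" in fields
            and "Description" in fields
        ):
            names[fields["Subtag"]] = fields["Description"]
    return names


def english_name(code, cldr_names, iana_names):
    """Prefer the CLDR name; use the IANA description only as a fallback."""
    return cldr_names.get(code) or iana_names.get(code)
```

With the sample above, a code such as `nxm` that is missing from a hypothetical CLDR mapping would still resolve to "Numidian".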

Can you point me to the source repository where you are currently reading CLDR?

Nikki added a comment. · Mar 14 2019, 4:57 PM

@Nikki, can you send me your CLDR tickets that got rejected? I’d like to understand the reason, it sounds surprising.

The ones I'm aware of are the ones I mentioned in T151269#2822033 and the comment directly after it.

Oh, all you need from CLDR is an English label? Nothing else? In that case, this Wikidata query might be helpful:

# List every IETF language tag (P305) together with the English label of its item.
SELECT ?code ?itemLabel
WHERE  {
  ?item wdt:P305 ?code .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

It shouldn't be difficult to write a script that fetches the current CLDR data file and patches it with labels from Wikidata (in languages other than English, too). Even easier might be to change the source code of the tool that you're currently using to read CLDR; is that tool publicly available?
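A rough sketch of the patching step, assuming the query result has already been fetched in the standard SPARQL JSON results format and the CLDR names are available as a plain mapping. `patch_names` is a hypothetical helper; fetching and writing the actual CLDR data file is left out.

```python
import re

# The label service returns the bare item ID (e.g. "Q123") when an item
# has no label in the requested language; such rows are skipped.
ITEM_ID = re.compile(r"Q\d+")


def patch_names(cldr_names, sparql_result):
    """Return a copy of cldr_names with missing codes filled from Wikidata."""
    patched = dict(cldr_names)
    for row in sparql_result["results"]["bindings"]:
        code = row["code"]["value"]
        label = row["itemLabel"]["value"]
        if ITEM_ID.fullmatch(label):
            continue
        # Existing CLDR names always win over Wikidata labels.
        patched.setdefault(code, label)
    return patched
```

Letting existing CLDR names take precedence keeps the patch additive, so it only affects languages that currently have no name at all.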

In T124758#4126411, thiemowmde wrote:

A temporary workaround could be to add the additional language codes to the list of suggested languages. These entries would only show the code, but no language name. That's obviously far from perfect, but much better than nothing. Have a look at the JavaScript class wikibase.WikibaseContentLanguages. It currently simply returns the UniversalLanguageSelector's language list, but excludes a few that are also excluded in the backend (see WikibaseRepo::getMonolingualTextLanguages). This means there is already some duplication going on in the backend and frontend! This duplication could either be expanded, or resolved by introducing a MediaWiki-ResourceLoader module that returns the list of languages allowed in monolingual values.

A list of monolingual language codes is now available, though not as a ResourceLoader module, but via the action API, as meta=wbcontentlanguages. And as far as I can tell, we actually have English names for all those languages:

$ curl -G -s \
    -d action=query \
    -d meta=wbcontentlanguages \
    -d wbclcontext=monolingualtext \
    -d wbclprop='code|name' \
    -d format=json \
    -d formatversion=2 \
    https://www.wikidata.org/w/api.php | \
  jq -c '.query.wbcontentlanguages | .[] | select(.name == null)' | \
  wc -l
0

It looks like @Raymond periodically adds them to our CLDR MediaWiki extension, as an addition to the upstream CLDR data (example change). I don’t know why they’re not displayed on the monolingual statements themselves (example statement, currently shows “sjn” instead of “Sindarin”), but we seem to have them in some form or other. (I guess this also answers @Sascha’s last question?)

It looks like @Raymond periodically adds them to our CLDR MediaWiki extension, as an addition to the upstream CLDR data (example change). I don’t know why they’re not displayed on the monolingual statements themselves (example statement, currently shows “sjn” instead of “Sindarin”), but we seem to have them in some form or other. (I guess this also answers @Sascha’s last question?)

Yes, I monitor the addition of new languages and add them to CLDR as soon as possible. In your example I see the word "Sindarin", but not while typing the language name or language code into the input field. I am not sure if this is a regression/new bug.

Btw: My (more or less) complete test item is https://test.wikidata.org/wiki/Q149653

In your example I see the word "Sindarin".

Ah – after I purged the English page, I see “Sindarin” as well, so that’s actually working, it was just cached from before your addition. I assume you were looking at the page in German, and the German version wasn’t cached yet.

But not while typing the language name or language code into the input field.

Yes, that’s what this bug is about :) we currently don’t have those extra language codes and names client-side.