
[TECH] Use LanguageNameUtils::ALL for monolingual text and lexemes
Closed, ResolvedPublicFeature

Description

Currently for monolingual text and lexemes, Wikibase uses the defaults for LanguageNameUtils, which only returns "defined" languages (whatever that means). If it instead requested all known languages using LanguageNameUtils::ALL, it would include all the codes known to the CLDR extension, including the ones from CldrNamesEn.php.

  • This would make another 230+ languages available, reducing the number of languages we have to dump under mis (related: T289776)
  • There are existing requests for at least 18 of these: T313782, T332265, T332256, T214238, T332258, T320984, T316004, T332262, T332259, T314458, T317497 (akk, hit), T332255 (bum, ken, sba), T321957 (dum), T321979 (mga, sga)
  • Most if not all of the extra languages for monolingual text and lexemes would no longer be necessary (Wikibase does not add language names for its extra languages, so they all have to be added to the CLDR extension too).
  • Monolingual text and lexemes would use the same set of languages: T320889
  • If it includes any language codes that we decide we don't want, there is already a way to exclude codes for monolingual text (link) and T320887 requests the same for lexemes.
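
Roughly, the proposal in code form (a minimal sketch using core's LanguageNameUtils service; not the actual patch):

    use MediaWiki\Languages\LanguageNameUtils;
    use MediaWiki\MediaWikiServices;

    $langNameUtils = MediaWikiServices::getInstance()->getLanguageNameUtils();

    // Default today: only the languages MediaWiki itself defines.
    $defined = $langNameUtils->getLanguageNames( 'en', LanguageNameUtils::DEFINED );

    // Proposed: every code MediaWiki knows about, including the extra
    // ones contributed by the cldr extension.
    $all = $langNameUtils->getLanguageNames( 'en', LanguageNameUtils::ALL );

    // The 230+ additional codes mentioned above.
    $extraCodes = array_keys( array_diff_key( $all, $defined ) );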

Acceptance Criteria:

Related Objects

Mentioned In
T344244: Add language code "hoc" (Ho) for Wikidata labels
T344662: Add language code pau (Palauan) for monolingual text and lexemes
T346470: Add monolingual language code dak
T148887: Add monolingual language code nn-hognorsk for Høgnorsk
T333424: Add language code gsg (German Sign Language) for monolingual text
T321956: Add language codes gml (Middle Low German), peo (Old Persian), gmy (Mycenaean Greek), cop (Coptic) for lexemes
T350177: Add language code ain (Ainu) for lexemes and monolingual text
T297350: [GOAL] Improve experience around adding new language codes for Wikidata
T320889: Use the same list of languages for monolingual text and lexemes
T321979: Add language codes oco (Old Cornish), cnx (Middle Cornish), owl (Old Welsh), wlm (Middle Welsh) for lexemes and monolingual text
T321957: Add language codes odt (Old Dutch), ofs (Old Frisian), osx (Old Saxon), frk (Frankish) for lexemes and monolingual text
T332255: Add language codes bse, mhk, tui for monolingual text and lexemes
T317497: Add monolingual and lexeme language codes akk and hit
T314458: Add item termbox label support for trw
T332259: Add language code srr (Serer) for monolingual text and lexemes
T332262: Add language code shu (Chadian Arabic) for monolingual text and lexemes
T316004: Add item termbox label support for Rajasthani (raj)
T332258: Add language code lua (Luba-Lulua) for monolingual text and lexemes
T214238: Add es-ES Language to Wikidata
T332256: Add language code dyu (Dyula) for monolingual text and lexemes
T332265: Add language code bik (Bikol) for monolingual text and lexemes
T313782: Allow support for terms (label, description, aliases) for bal
T345083: MUL - Change the copy to "default values" in different places
T351504: Remove non-BCP47 language code dlc (Dalecarlian) from cldr extension
T346167: Add monolingual text code "pks" for Pakistan Sign Language
T273627: Remove wmgExtraLanguageNames from Wikimedia production
Mentioned Here
T351504: Remove non-BCP47 language code dlc (Dalecarlian) from cldr extension
T334349: Reduce exposure of MediaWiki internal language codes
T322139: Special:NewLexeme and wbcontentlanguages in the API do not use the same language names for additional languages
T190129: Consolidate language metadata into a 'language-data' library and use in MediaWiki
T281067: merge CLDR extension to core
T231755: Local language name should be translatable in translatewiki.net
T168799: Integrate IANA language registry with language-data and MediaWiki (let MediaWiki "knows" all languages with ISO 639-1/2/3 codes)
T312845: [Process] Add new language codes to Wikidata
T214238: Add es-ES Language to Wikidata
T289776: Enable all ISO 639-3 codes on Wikidata
T313782: Allow support for terms (label, description, aliases) for bal
T314458: Add item termbox label support for trw
T316004: Add item termbox label support for Rajasthani (raj)
T317497: Add monolingual and lexeme language codes akk and hit
T320887: Language codes that are explicitly not allowed for monolingual text should also not be allowed for lexemes
T320889: Use the same list of languages for monolingual text and lexemes
T320984: Add language code mnc (Manchu) for lexemes
T321957: Add language codes odt (Old Dutch), ofs (Old Frisian), osx (Old Saxon), frk (Frankish) for lexemes and monolingual text
T321979: Add language codes oco (Old Cornish), cnx (Middle Cornish), owl (Old Welsh), wlm (Middle Welsh) for lexemes and monolingual text
T332255: Add language codes bse, mhk, tui for monolingual text and lexemes
T332256: Add language code dyu (Dyula) for monolingual text and lexemes
T332258: Add language code lua (Luba-Lulua) for monolingual text and lexemes
T332259: Add language code srr (Serer) for monolingual text and lexemes
T332262: Add language code shu (Chadian Arabic) for monolingual text and lexemes
T332265: Add language code bik (Bikol) for monolingual text and lexemes

Event Timeline


I assume we always want to request the same language here, rather than make this depend on the user / request language; should it be the wiki content language (en on Wikidata), a hard-coded one (e.g. en or qqq), or something else?

On second thought – it should probably be en, since the language names will also fall back to en, not the wiki content language. If we used the content language, then a wiki with a non-en content language might have extra language codes (e.g. en-uk or az-arab) with no language names available for some request languages, which doesn’t sound great.

Thx for the ping, Thimo!

I am all for simplifying the current process, as it is inconsistent and hard to maintain.

@Lydia_Pintscher could there be unintended consequences with going the route described in this task?

Yeah I am still kinda attached to the current process but I also must face the fact that it's not working. So I'm fine with doing this.

Task Review Notes:

  • This is probably not a full-blown epic; the particular requirements for this task can be achieved relatively simply. However, we should anticipate a few follow-up tasks that might come out of it.
  • Specifically, one follow-up could be to consolidate language name sources, but any problems arising from this will be quite obvious as they occur.

Prio Notes:

  • Affects end users / production
  • Does not affect monitoring
  • Does not (really) affect development efforts
  • Affects onboarding efforts (After this change we will not have to onboard new hires to the language addition process and how to review it)
  • Affects additional stakeholders (langcom)
ItamarWMDE renamed this task from Use LanguageNameUtils::ALL for monolingual text and lexemes to [SW] Use LanguageNameUtils::ALL for monolingual text and lexemes. Sep 12 2023, 1:18 PM
ItamarWMDE moved this task from WikibaseLexeme to [DOT] Prioritized on the wmde-wikidata-tech board.
ItamarWMDE added a project: Wikidata Dev Team.

I might be getting this wrong, but as I understand the proposal, it would make the currently established processes for how languages on wikidata.org are managed, requested, and confirmed (briefly described in T312845) obsolete.

The documentation would need to be updated but it wouldn't make it completely obsolete. LanguageNameUtils::ALL is all of the language codes that MediaWiki knows about (≈ those it uses itself plus those which CLDR has locale data for), but that's still only a fraction of all valid ISO 639/BCP 47 language codes (languageinfo has 978, ISO 639-3 has 7916), so people would still need a way to request missing codes.
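
For reference, the languageinfo count can be reproduced with a query like the following (illustrative; meta=languageinfo is the core API module mentioned above):

    https://www.wikidata.org/w/api.php?action=query&meta=languageinfo&liprop=code|bcp47&formatversion=2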

Whether requests for things that are still missing should be accepted by Wikidata first or go straight to the CLDR extension depends on whether the people maintaining the CLDR extension are OK with people making requests for missing languages there.

The basic idea is that there is an "official" working group that intentionally reviews and accepts new languages one by one, only when they are actually needed.

That is still how it's intended to work and it still doesn't work well. The people who are being asked to review language codes one by one do not want to. People who request codes still have to wait months, if not years. Both @jhsoby and @Amire80 have asked why we can't just enable all ISO 639-3 codes instead of enabling them one by one (or something to that effect), and that's what editors have asked for too (T289776).

  • This would make another 230+ languages available, reducing the number of languages we have to dump under mis (related: T289776)

And if T168799: Integrate IANA language registry with language-data and MediaWiki (let MediaWiki "knows" all languages with ISO 639-1/2/3 codes) happens, that would take us the rest of the way to T289776: Enable all ISO 639-3 codes on Wikidata, right?

I don't know. Does T289776 include labels or not?

I limited this request to monolingual text and lexemes because almost every valid language code would be useful in Wikidata for those (lexemes: any known word in the language; monolingual text: the native label on the language itself, usage examples on lexemes, etc.). People are going to add that data whether the right code is available or not, so if MediaWiki already knows a language code exists, I think it makes sense to allow it.

From a technical side, I don’t see major issues with this proposal. But we might want to consolidate language name sources; currently, we have some wikibase-lexeme-language-name-* messages in WikibaseLexeme (but not used by Wikibase), and also some language names in the cldr extension (LocalNames/ directory). Maybe we can make Wikibase fall back to the language code and also track the missing language name, so we can have a Grafana dashboard for the most frequently used language codes without names. But I think that doesn’t need to block this task.
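
A rough sketch of that fallback-and-tracking idea (hypothetical wiring; the metric name is invented):

    $services = MediaWikiServices::getInstance();
    $langNameUtils = $services->getLanguageNameUtils();

    $name = $langNameUtils->getLanguageName( $code, 'en', LanguageNameUtils::ALL );
    if ( $name === '' ) {
        // 'wikibase.missing_language_name' is an invented metric name.
        $services->getStatsdDataFactory()->increment( 'wikibase.missing_language_name' );
        $name = $code; // fall back to showing the raw language code
    }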

MediaWiki normally shows the language code if it can't find a name, so I don't think Wikibase would need to do anything special there, would it?

If I'm not mistaken, it should already be possible to determine which ones are missing using wbcontentlanguages (although I recently added all the missing names so you'd need to test it locally).
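
An illustrative query for that (assuming the wbcontentlanguages parameters below; untested):

    https://www.wikidata.org/w/api.php?action=query&meta=wbcontentlanguages&wbclcontext=monolingualtext&wbclprop=code|name&uselang=en&formatversion=2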

I would be happy to see the names consolidated; they're inconsistent at the moment (T322139). It's difficult to translate the names in the CLDR extension, though perhaps it could be made translatable on translatewiki.net (like I suggested in this year's community wishlist).

The additional cldr language codes are only added when asking for language names in a specific language, and the returned language codes vary slightly depending on which language you ask for:
[...]
(de and bar have additionally en-uk, with bar presumably inheriting it from de via language fallback; pt’s extra language code is az-arab.) I assume we always want to request the same language here, rather than make this depend on the user / request language; should it be the wiki content language (en on Wikidata), a hard-coded one (e.g. en or qqq), or something else?

Hm, that doesn't sound good. Is that actually a bug in the CLDR extension? I would expect the set of language codes to be the same regardless of the language being used and that not being the case sounds like it would cause problems eventually. Perhaps it should have tests to make sure none of the files have extra codes that don't exist for English, or perhaps it should ignore any codes that aren't defined for all languages? Making the extension translatable would help here too, I imagine.

I notice I'm still a bit confused as to where CLDR is getting its languages from: partly from core, partly from a manually maintained list (LocalNamesXx.php), but there are also comments like # Added to Core, not part of CLDR, T287345. What is the CLDR mentioned in the CLDR extension itself?

There is a slight ambiguity in the task description that I didn’t realize before. If we take it literally, and only pass LanguageNameUtils::ALL as the second getLanguageNames() argument while leaving the first argument the same (LanguageNameUtils::AUTONYMS, the default), then we won’t actually see any difference.

That would be due to

		if ( $inLanguage !== self::AUTONYMS ) {
			# TODO: also include for self::AUTONYMS, when this code is more efficient
			// @phan-suppress-next-line PhanTypeMismatchArgumentNullable False positive
			$this->hookRunner->onLanguageGetTranslatedLanguageNames( $names, $inLanguage );
		}

in LanguageNameUtils.php. That means when requesting Autonyms, the extra languages from CLDR are not loaded.
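
In sketch form (assuming $langNameUtils is a LanguageNameUtils instance):

    // With AUTONYMS, the hook above is skipped, so the extra cldr codes
    // never appear:
    $names = $langNameUtils->getLanguageNames( LanguageNameUtils::AUTONYMS, LanguageNameUtils::ALL );

    // Asking for names in a concrete language runs the hook and includes
    // the extra cldr codes:
    $names = $langNameUtils->getLanguageNames( 'en', LanguageNameUtils::ALL );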

There seems to be a mistake in the description. The languages in CldrNamesEn.php are the MediaWiki ones (that is what rebuild.php uses); the additional languages that we care about would seem to be the ones coming from LocalNamesEn.php and parallel files, right?

The additional cldr language codes are only added when asking for language names in a specific language, and the returned language codes vary slightly depending on which language you ask for:
[...]
(de and bar have additionally en-uk, with bar presumably inheriting it from de via language fallback; pt’s extra language code is az-arab.) I assume we always want to request the same language here, rather than make this depend on the user / request language; should it be the wiki content language (en on Wikidata), a hard-coded one (e.g. en or qqq), or something else?

Hm, that doesn't sound good. Is that actually a bug in the CLDR extension? I would expect the set of language codes to be the same regardless of the language being used and that not being the case sounds like it would cause problems eventually. Perhaps it should have tests to make sure none of the files have extra codes that don't exist for English, or perhaps it should ignore any codes that aren't defined for all languages? Making the extension translatable would help here too, I imagine.

en-uk (together with en-gb) was added in "Add some German translation" (I0ce22dfc). CLDR seems to be de facto used as a repository for names for language codes that happen to be used by people, not as an authoritative source for language codes. Are we OK with using it anyway?

Also, I note that a lot of language names that have been added there seem to include a comment # used by Wikidata, T123456. So we may still want a process to add more, given that our current process is how we got to this list.

Further Async Storywriting notes:

Needs ACs: aside from the one for actually doing the thing, also one or more for updating the docs/policy/process, which exist at least in the following places:

Also, there should be an AC to go through the existing language-related tasks, figure out which are still needed, maybe update them, and close the ones no longer needed after this one here is done.

Thank you, I will add the AC you mentioned, but let others who are more experienced with CLDR try to clarify the ambiguities you found.

CLDR seems to be de facto used as a repository for names for language codes that happen to be used by people, not as an authoritative source for language codes. Are we OK with using it anyway?

The CLDR extension currently provides:

Names.php, together with $wmgExtraLanguageNames, provides autonyms of languages.

The language-data library provides more autonyms of languages (all current languages in Names.php are in language-data, but not vice versa). Currently the main use of the library is the frontend language selector (UniversalLanguageSelector), but it is proposed to replace Names.php (T190129) and also the CLDR extension (T281067).

  • This would make another 230+ languages available, reducing the number of languages we have to dump under mis (related: T289776)

And if T168799: Integrate IANA language registry with language-data and MediaWiki (let MediaWiki "knows" all languages with ISO 639-1/2/3 codes) happens, that would take us the rest of the way to T289776: Enable all ISO 639-3 codes on Wikidata, right?

I don't know. Does T289776 include labels or not?

Hm, unclear. But it’s a good point that this task is not supposed to include labels, since a simple implementation of it (like I was playing around with earlier) would affect labels as well; I’ve added that to the task description.

The additional cldr language codes are only added when asking for language names in a specific language, and the returned language codes vary slightly depending on which language you ask for:
[...]
(de and bar have additionally en-uk, with bar presumably inheriting it from de via language fallback; pt’s extra language code is az-arab.) I assume we always want to request the same language here, rather than make this depend on the user / request language; should it be the wiki content language (en on Wikidata), a hard-coded one (e.g. en or qqq), or something else?

Hm, that doesn't sound good. Is that actually a bug in the CLDR extension? I would expect the set of language codes to be the same regardless of the language being used and that not being the case sounds like it would cause problems eventually. Perhaps it should have tests to make sure none of the files have extra codes that don't exist for English, or perhaps it should ignore any codes that aren't defined for all languages? Making the extension translatable would help here too, I imagine.

Yeah, that should probably be fixed in the cldr extension. But I’ve convinced myself now that we should ask for the en language names, so as far as I’m concerned this variation is no longer a problem for Wikibase ^^

ItamarWMDE renamed this task from [SW] Use LanguageNameUtils::ALL for monolingual text and lexemes to Use LanguageNameUtils::ALL for monolingual text and lexemes. Sep 18 2023, 3:34 PM
ItamarWMDE renamed this task from Use LanguageNameUtils::ALL for monolingual text and lexemes to [TECH] Use LanguageNameUtils::ALL for monolingual text and lexemes.
WARNING: Currently LanguageNameUtils::ALL returns MediaWiki internal language codes, but Wikibase should use BCP 47 language codes instead.

Wikibase already uses MediaWiki internal language codes (e.g. simple instead of en-simple). Do you know of any particular non-standard language codes in LanguageNameUtils::ALL that Wikibase doesn’t already use at the moment? (I’d expect the non-DEFINED language codes, i.e. the ones that we would add by using ALL, to be more likely to follow the standard.)

So I wrote "should", as it's already some kind of tech debt.
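
For illustration, core's LanguageCode::bcp47() already maps some internal codes to BCP 47 forms (a sketch, assuming these entries are in LanguageCode::NON_STANDARD_LANGUAGE_CODE_MAPPING):

    // Assumed behaviour of the core mapping; verify against
    // LanguageCode::NON_STANDARD_LANGUAGE_CODE_MAPPING.
    echo LanguageCode::bcp47( 'simple' );   // en-simple
    echo LanguageCode::bcp47( 'be-x-old' ); // be-tarask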

Change 974655 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/Wikibase@master] Use LanguageNameUtils::ALL for monolingual text languages

https://gerrit.wikimedia.org/r/974655

Change 974656 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/WikibaseLexeme@master] Support all monolingual text languages for Lexemes

https://gerrit.wikimedia.org/r/974656

It seems this would allow the following 38 language codes, which are not well-known BCP 47 codes:

ar-001
ccp-beng
cja-arab
cja-cham
cja-latn
cjm-arab
cjm-cham
cjm-latn
cjy-hans
cjy-hant
dlc
eo-hsistemo
eo-xsistemo
es-es
es-mx
fa-af
fr-ch
ha-arab
hi-latn
lad-hebr
nl-be
nn-hognorsk
pt-ao1990
pt-colb1945
pt-pt
rhg-rohg
ro-md
ruq-grek
sat-beng
sat-latn
sat-orya
shy-arab
shy-tfng
sr-me
sux-latn
sux-xsux
sw-cd
syl-beng

For now I chose to manually exclude these. If we want to support some of these, we can manually map them to another BCP 47 code for RDF output using the canonicalLanguageCodes Wikibase configuration.
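
Such a mapping would presumably look something like this (a LocalSettings.php sketch; the es-es → es-ES pair is an invented example):

    // Hypothetical example mapping; merge instead of overwrite so any
    // default mappings are kept.
    $wgWBRepoSettings['canonicalLanguageCodes'] = array_merge(
        $wgWBRepoSettings['canonicalLanguageCodes'] ?? [],
        [ 'es-es' => 'es-ES' ]
    );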

Also, in Lexeme we currently support a number of languages that Wikibase does not support (some of these explicitly excluded in WikibaseContentLanguages::getDefaultMonolingualTextLanguages, some as part of the list above):

bat-smg
be-x-old
ccp-beng
de-formal
eo-hsistemo
eo-xsistemo
es-formal
fiu-vro
ha-arab
hu-formal
lad-hebr
nl-informal
nn-hognorsk
pt-ao1990
pt-colb1945
rhg-rohg
roa-rup
sat-beng
sat-latn
sat-orya
simple
sux-latn
sux-xsux
syl-beng
zh-classical
zh-min-nan
zh-yue

To ensure backwards compatibility, I chose to keep all of these for lexemes. This implies that Lexeme will still support additional languages that are not supported for monolingual text (but all languages we support for monolingual text are supported for lexemes).

As these questions are still open, I chose to ignore the documentation-related acceptance criteria for now.

It seems this would allow the following 38 language codes, which are not well-known BCP 47 codes:

...
cjy-hans
cjy-hant
...

At least these are known language codes; the Language Subtag Registry just doesn't list them with the script code:

https://translatewiki.net/wiki/Portal:Cjy

Please note that the Language Subtag Registry currently doesn't list all language codes written in different scripts with a script code.

It seems this would allow the following 38 language codes, which are not well-known BCP 47 codes:

ar-001
ccp-beng
cja-arab
cja-cham
cja-latn
cjm-arab
cjm-cham
cjm-latn
cjy-hans
cjy-hant
dlc
eo-hsistemo
eo-xsistemo
es-es
es-mx
fa-af
fr-ch
ha-arab
hi-latn
lad-hebr
nl-be
nn-hognorsk
pt-ao1990
pt-colb1945
pt-pt
rhg-rohg
ro-md
ruq-grek
sat-beng
sat-latn
sat-orya
shy-arab
shy-tfng
sr-me
sux-latn
sux-xsux
sw-cd
syl-beng

I think almost all of these are valid.

  • ar-001: 001 is a known subtag (“world” region) for any language tag; ar is a known language tag (Arabic)
  • ccp-beng, cja-arab, cja-cham, cja-latn, cjm-arab, cjm-cham, cjm-latn, cjy-hans, cjy-hant, ha-arab, hi-latn, lad-hebr, rhg-rohg, ruq-grek, sat-beng, sat-latn, sat-orya, shy-arab, shy-tfng, sux-latn, sux-xsux, syl-beng:
    • Beng, Arab, Cham, Latn, Hans, Hant, Hebr, Rohg, Grek, Orya, Tfng, Xsux are all known subtags (various scripts)
    • ccp, cja, cjm, cjy, ha, hi, lad, rhg, ruq, sat, shy, sux, syl are all known language tags
  • es-es, es-mx, fa-af, fr-ch, nl-be, pt-pt, ro-md, sr-me, sw-cd:
    • ES, MX, AF, CH, BE, PT, MD, ME, CD are all known subtags (various regions)
    • es, fa, fr, nl, pt, ro, sr, sw are all known language tags
  • eo-hsistemo, eo-xsistemo: hsistemo and xsistemo are known subtags for eo (Esperanto spelling systems)
    • nn-hognorsk, pt-ao1990, pt-colb1945: hognorsk, ao1990 and colb1945 are known variant subtags, like the eo-…sistemo ones just above

The only one I can’t make any sense of is dlc, which MediaWiki says is “Dalecarlian”, yet the Dalecarlian language article doesn’t mention a dlc language code. Apparently the cldr extension (but not the actual CLDR) took it from Ethnologue in 2008.

The only one I can’t make any sense of is dlc, which MediaWiki says is “Dalecarlian”, yet the Dalecarlian language article doesn’t mention a dlc language code. Apparently the cldr extension (but not the actual CLDR) took it from Ethnologue in 2008.

I think this one should be removed from the CLDR extension. It's not a valid code and when I checked the other day, I wasn't able to find any current use of that code, nor anyone requesting it in the first place. It was just suddenly added in https://github.com/wikimedia/mediawiki-extensions-cldr/commit/45e8e42c040a5be96f380d48aea52819db7f1c7e with no explanation.

Some history of the dlc code:

Before ISO 639-3 was a thing, Ethnologue used its own set of codes (using uppercase letters). The 14th edition (2000) had an entry for Dalecarlian under the code DLC (link). For the 15th edition (2005), they switched to using codes from the draft ISO 639-3 standard (using lowercase letters). The 15th edition had an entry for dlc, saying that was also the ISO 639-3 code (link). ISO 639-3 was published in 2007 and did not include an entry for dlc or "Dalecarlian". The 16th edition (2009) did not include dlc any more (link). ISO 639-3 codes were not added to the IETF/IANA/BCP 47 subtag registry until much later, in 2009, so it didn't get included there either.

I assume dlc was in the draft version of ISO 639-3 and was removed before it was officially published, and then was removed from Ethnologue too, but it had already been added to LocalNamesEn.php by then.

Alright, filed T351504: Remove non-BCP47 language code dlc (Dalecarlian) from cldr extension for that. Then I think we can filter out dlc in Wikibase (until it’s gone from the cldr extension), but keep the other language codes.
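
In sketch form (assuming a flat array of codes; the actual patch may differ):

    // Drop dlc until the cldr extension removes it (T351504);
    // $languageCodes stands in for the configured list of monolingual
    // text language codes.
    $languageCodes = array_diff( $languageCodes, [ 'dlc' ] );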

Moving back to In Development, there’s no review ongoing at the moment.

Change 974655 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Use LanguageNameUtils::ALL for monolingual text languages

https://gerrit.wikimedia.org/r/974655

Change 974656 merged by jenkins-bot:

[mediawiki/extensions/WikibaseLexeme@master] Support all monolingual text languages for Lexemes

https://gerrit.wikimedia.org/r/974656

Arian_Bozorg subscribed.

Looks like this is all good! Thanks so much :)

Change 990753 had a related patch set uploaded (by Nikki; author: Nikki):

[mediawiki/extensions/Wikibase@master] Exclude qqq from monolingual text languages

https://gerrit.wikimedia.org/r/990753

Change 991061 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Nikki):

[mediawiki/extensions/Wikibase@wmf/1.42.0-wmf.14] Exclude qqq from monolingual text languages

https://gerrit.wikimedia.org/r/991061

Change 990753 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Exclude qqq from monolingual text languages

https://gerrit.wikimedia.org/r/990753

Change 991061 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@wmf/1.42.0-wmf.14] Exclude qqq from monolingual text languages

https://gerrit.wikimedia.org/r/991061

Mentioned in SAL (#wikimedia-operations) [2024-01-17T15:01:23Z] <logmsgbot> lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:991061|Exclude qqq from monolingual text languages (T341409)]]

Mentioned in SAL (#wikimedia-operations) [2024-01-17T15:02:56Z] <logmsgbot> lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde: Backport for [[gerrit:991061|Exclude qqq from monolingual text languages (T341409)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-01-17T15:09:23Z] <logmsgbot> lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:991061|Exclude qqq from monolingual text languages (T341409)]] (duration: 07m 59s)