
Consolidate language metadata into a 'language-data' library and use in MediaWiki
Open, Medium, Public

Assigned To
None
Authored By
Nikerabbit
Mar 20 2018, 9:40 AM

Description

Affected components: MediaWiki core, Universal Language Selector (and, indirectly, many extensions such as Wikibase, which would get better language data)
Engineer(s) or team for initial implementation: Language Team.
Code steward: Language Team.

Motivation

There is a growing need to have language metadata readily and efficiently available in different places inside and outside MediaWiki.

Definitions

Metadata in this context means information about languages that is often needed when handling them but is not strictly localisation. This metadata frequently needs to be available efficiently for a large set of languages at once. Currently it consists of the following (a combined record is sketched after this list):

  • language autonym (for language lists)
  • writing direction (for language lists, for displaying text tagged in a particular language)
  • writing script (for ULS language lists, possibly for fonts too)
  • regions where the language is spoken (for ULS language lists, automatic language selection/suggestion with GeoIP)
  • fallback languages (for the fallback system to work without creating a circular dependency)

Description of issues

What we currently have in MediaWiki core is only the autonyms (Names.php), accessible via Language::fetchLanguageNames( Language::AS_AUTONYMS ). Writing direction and fallbacks are only available by constructing the appropriate Language objects or by using Language::getLocalisationCache()->getItem(). This causes a related problem: in order to set fallbacks or the writing direction for a language, MessagesXX.php must be created, which in turn makes the language selectable as an interface language in Special:Preferences. There is a need to have this information available without making languages selectable as interface languages.
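For reference, a rough sketch of how this data is reached today (exact signatures vary between MediaWiki versions, and Language::factory() has since been superseded by the LanguageFactory service):

  // Autonyms: cheaply available for all interface languages via Names.php.
  $autonyms = Language::fetchLanguageNames( Language::AS_AUTONYMS );

  // Direction and fallbacks: require a full Language object or the
  // localisation cache, i.e. they only work for interface languages.
  $he = Language::factory( 'he' );
  $dir = $he->getDir(); // 'rtl'
  $fallbacks = Language::getFallbacksFor( 'frr' );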

An existing database for this is https://github.com/wikimedia/language-data/, which already contains all of the above except language fallbacks.

Currently there is overhead from updating the language metadata in multiple places: language-data itself, jquery.i18n (fallbacks), jquery.uls (copy of language-data), the UniversalLanguageSelector extension (copies of jquery.uls and jquery.i18n) and MediaWiki core (Names.php, MessagesXX.php).

Exploration

To make the language metadata easy to use, and to reduce the overhead of updating data in multiple places, the following actions are proposed:

1. T218639: Make language-data installable as a proper library

  • Make it installable via Composer and/or npm (see the note after this list)
  • Make proper releases
  • Consider moving it to Gerrit
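For illustration, once releases exist the library could be consumed via Composer on the PHP side and via npm on the JavaScript side (e.g. something along the lines of composer require wikimedia/language-data; the package name here is an assumption, not a decision).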

2. Bring language-data to MediaWiki core

  • Add the library as dependency
  • Decide which format to use (YAML or JSON)
  • Determine whether additional caching or formats are required for performance (e.g. storing the data as PHP code; a sketch follows)
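One possible caching approach, sketched below under the assumption that the source data ships as JSON: materialise it once into a static PHP file so the opcode cache keeps it in memory and no per-request parsing is needed (file names are hypothetical).

  // Build step (e.g. a maintenance script); file names are hypothetical.
  $data = json_decode( file_get_contents( 'language-data.json' ), true );
  file_put_contents(
      'LanguageDataCache.php',
      "<?php\nreturn " . var_export( $data, true ) . ";\n"
  );

  // Runtime: a plain require, no JSON parsing per request.
  $languageData = require 'LanguageDataCache.php';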

3. Add a mechanism for local overrides

  • Similar to how plurals can be overridden for CLDR data
  • Support two use cases: MediaWiki customisations (e.g. qqq, qqx, en-rtl) and site/farm-specific customisations outside git (see the sketch below)
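A minimal sketch of what a site-level override could look like, assuming a hypothetical $wgLanguageDataOverrides setting (the actual mechanism and names are still to be decided):

  // LocalSettings.php – hypothetical setting, not an existing global.
  $wgLanguageDataOverrides = [
      'en-rtl' => [
          'autonym'   => 'English (rtl)',
          'direction' => 'rtl',
      ],
  ];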

4. Replace Names.php with language-data

  • Keep the existing public APIs, but replace the data

5. Add/update PHP APIs to expose data from language-data

  • Similar to the above, but for the rest of the data
  • Consider adding a new API (PHP class) to access all the metadata in a uniform way, e.g. to get the direction of a language that is not available as an interface language (see the sketch below)
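A rough sketch of what such a uniform accessor could look like; the class and method names below are hypothetical, not an existing MediaWiki API.

  // Hypothetical service interface for uniform access to the metadata.
  interface LanguageData {
      public function getAutonym( string $code ): ?string;
      public function getDirection( string $code ): ?string; // 'ltr' or 'rtl'
      public function getScript( string $code ): ?string;    // ISO 15924 code
      public function getRegions( string $code ): array;
      public function getFallbacks( string $code ): array;
  }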

6. Add/update Action APIs to expose data from language-data
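For context, newer MediaWiki versions already expose part of this data through the languageinfo query module; a request roughly like the following returns autonyms, directions and fallbacks for selected codes (check the live API documentation for exact parameter names):

  api.php?action=query&meta=languageinfo&liprop=autonym|dir|fallbacks&licode=he|frr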

7. Move language fallbacks to language-data

  • First copy everything to language-data, then bring the updated language-data into core, and finally update LocalisationCache to use the fallbacks from there

8. Update ULS to use language-data from core if available

  • ULS should be able to use language-data from core if available (to pick up local customisations) before falling back to the shipped version
  • Optionally the shipped version can be dropped at a later point in time (but that requires stripping it out of jquery.uls, which will still need it)

Related Objects

Event Timeline


Change 415187 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/core@master] Generate RTL languages via maintenance script

https://gerrit.wikimedia.org/r/415187

Change 415187 abandoned by Jdlrobson:
Generate RTL languages via maintenance script

https://gerrit.wikimedia.org/r/415187

One thing that needs to be figured out is how to add languages to language-data when we don't know the native name of the language. Right now the design of language-data makes an autonym a requirement for a language to be added, but there are many languages that could (and should) be supported in e.g. Wikidata and mul.Wikisource for which we don't have one, making it impossible to add them properly.

We already have languages on the list that do not use a native name. I doubt there are many cases where we cannot find one with some searching.

  • Consider whether it is necessary to add new Action APIs to support frontend requirements

When Phab tasks are filed for that point, remember to add MediaWiki-Action-API so I'll see it.

Krinkle renamed this task from Consolidate language metadata into language-data and use it in MediaWiki core to Consolidate language metadata into a 'language-data' library and use in MediaWiki. Mar 20 2019, 7:19 PM

@Nikerabbit Is this a project that would be implemented by the Language Team? And are you currently interested in wider input and/or approval? It appears to be in pretty good shape, but I'm unsure whether it's ready to be implemented later, or whether there are still unanswered questions or uncertainties.

As a general exercise, I'd like the RFC to identify potential stakeholders or affected parties that we want to hear from at minimum. E.g. which changes would be breaking for whom (if any), and would it affect anyone's workflows; if so, whose?

Yes, the plan is that the Language team would implement this in the near future. Feedback on the general approach would be welcome (creating a library that is then brought into core and integrated). There are some uncertainties in the details, such as the file format and caching with regard to performance. My gut feeling is that there is no absolute need to make breaking changes in any of the APIs, but there might be places where we see a better way of exposing this data. I would like to know whether the current level of detail is sufficient for input and/or approval. I would also like to know if people have other wishes that closely relate to this work; for example, @cscott has been working on bringing our language codes closer to standard ones. For me, the ability to add languages to MediaWiki core without making them available as interface languages is one such wanted outcome. This might surface some meta discussion about more closely defining the list of languages available in each context (like for translatable wiki pages).

The workflow for adding new languages to MediaWiki would change a bit, but that is already mostly on the Language team and translatewiki.net. One stakeholder would be Wikidata, which needs this data for its additional languages that are not interface languages. Other stakeholders could be various (product?) teams and volunteers building frontend features (using MediaWiki APIs) or external tools, who need this data but have so far been implementing their own solutions or scraping data from MediaWiki core (mostly language fallbacks).

Notes from the unconference session:

Attendees: Niklas, Amir, Kelson, Leszek, Emmanuel

Intro:
List of places that contain language metadata:
  MW: languages/data/Names.php
  2x mobile apps: language lists for RTL
  MobileFrontend/Minerva: RTL languages
  Wikidata monolingual codes
  MW: special CSS rules for line-height in some languages (should be for writing systems)
  Wikidata Lexeme
  Override for ULS
  CLDR

https://github.com/wikimedia/language-data

Contains:
  ISO code
  autonym
  where it is spoken (continent)
  writing system (incl. directionality)
  fallbacks (planned)

Notes:
Amir: The following should be merged into a central place:
  MW: languages/data/Names.php
  2x mobile apps: language lists for RTL
  MobileFrontend/Minerva: RTL languages
  Wikidata monolingual codes
  MW: special CSS rules for line-height in some languages (should be for writing systems)
Q: Why not use ICU?
A: Might not have all languages? Slow to upgrade.
ACTION: Consider further use of / integration with ICU?
Panlex people claim Unicode contains grammatical rules for various languages. Would these also be in CLDR?
Amir: not sure, would need to check.
Amir: Why does Wikidata maintain a custom list of language codes for monolingual codes?
Leszek: to allow using language codes on top of the list provided by MediaWiki in Wikidata statements.
?: Okay, so we have this library. Why not use other standard language libraries? Those are backed by big consortia, which could update and maintain the data.
ACTION: PHP binding for language-data
A: We might have more languages
A: Also, corporate parties are generally not interested in smaller languages, as these might not have monetary value
Niklas: Wikimedia is actually a member of Unicode. We also have a contact person at CLDR.
Niklas: CLDR might also require that a language has a written code
Amir: Also, for MediaWiki we don't want all languages from CLDR (e.g. extinct ones)
Emmanuel: What does
ACTION: Mark which languages in language-data can be content languages for MW
N: We should make it clear which lists serve which context. If we just merge all the lists together, we would make it even harder to understand which language/language code is suited for which context
ACTION: Share knowledge on how Kiwix uses ICU.
Why does Wikidata have its own restricted language list?
ACTION: Document the policy for adding things to language-data
Discussed specifics of Wikidata Lexicographical Data. It currently does allow adding data in non-MW language codes (using the "mis" language code)
There are better sources defining language codes/languages than MediaWiki, like Ethnologue
N: How many of those different language lists do we need?
  1. MW content languages
  2. Languages that would be translation targets
  3. Wikidata monolingual languages
  More?
How about a Sumerian-language Wikisource, which is currently not a MW language
Language allows defining language codes with dashes, which are considered variants
A: Maybe we could have a matrix/table: language code – allowed for content, allowed for localisation, allowed for Wikidata
ACTION: Task for polite grammar de, nl, hu, jv, su
It is difficult for third-party software like Kiwix when non-standard language codes are used
What language list does the Commons app use?
We use the device language; users can also change it
When you support structured data on Commons, how are you going to match this language code with the possibly non-standard Wikibase language code?
N: This is also a problem in MW, as structured data can use language codes that are MW-allowed languages
A: What about fallbacks, that is also a kind of metadata. Do we have a task to add fallback data to language-data?
N: It is in the task T190129. The provided list of fallbacks should probably be reviewed, as some of them might not make sense in certain use cases?
ACTION: add fallback information to language-data
The language-data library is maintained/owned by the WMF Language team
When you are not logged in and go to Wikidata, the UI is in English
Q: When do we get Wikidata monolingual language codes into language-data?
There should be a way to distinguish language code lists between different "contexts"

@Nikerabbit Could you speak to who's impacted and in what way?

A few example questions to think about:

  • Do we currently take contributions to this data? If so where?
  • Does it currently all originate from (multiple places within) core, or also from places outside of it?
  • Does LangEng consider itself owner/steward of all of those? Or is there someone/something we may want to inform, consult or collaborate with?
  • Would there be changes to how the data is currently accessed by anyone? E.g. PHP access within core/extensions, JS access, API access (to the extent that it is available/exposed today). If so, what does that look like before/after?
  • Are there things in the logical shape of the data that are expected to change or be discontinued?
  • Would the changes be in any way (positive or otherwise) observable on-wiki from the UI, wikitext parsing, or in some other way? (Aside from JS)
Krinkle triaged this task as Medium priority. Jul 2 2020, 11:40 PM

I'll respond briefly as a comment for now, because this task is not part of the current sprint.

A few example questions to think about:

  • Do we currently take contributions to this data? If so where?

Now: People can submit patches against MediaWiki core. Per established practices, they cannot add info about languages which cannot be used as an interface language.
Future: People can submit patches to https://github.com/wikimedia/language-data – LangEng can take care of integrating updates into core. Other users would probably have to update themselves.

  • Does it currently all originate from (multiple places within) core, or also from places outside of it?

Autonyms come from Names.php in core and, for some languages, from the local overrides in the CLDR extension – it is also possible to add languages using $wgExtraLanguageNames. Writing direction and fallbacks come from MessagesXX.php files (only for languages supported as interface languages). Script information is not available, nor are regions.
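For example, a wiki can already register an extra language name locally with $wgExtraLanguageNames, but this only affects the name list, not direction or fallbacks (the code used below is made up):

  // LocalSettings.php
  $wgExtraLanguageNames = [ 'x-example' => 'Example language' ];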

Translated language names come from CLDR, and that would not change.

The data in the language-data library is sourced from multiple places: mostly MediaWiki core, with many additional languages, and it is enhanced with country-level data from CLDR.

  • Does LangEng consider itself owner/steward of all of those? Or is there someone/something we may want to inform, consult or collaborate with?

We do not currently maintain core i18n (it's pretty stable), but we would maintain this library and do the initial integration. I believe the Language Committee is aware of new languages.

  • Would there be changes to how the data is currently accessed by anyone? E.g. PHP access within core/extensions, JS access, API access (to the extent that it is available/exposed today). If so, what does that look like before/after?

We can integrate our library as the backend for existing methods (in PHP, which are used by the API and the frontend), but we would also likely consider creating a service for LanguageData (data as defined in the scope).

  • Are there things in the logical shape of the data that are expected to change or be discontinued?

The biggest change is dropping the requirement of being an interface language before we can know the basic info of a language. More information would be available for all languages. One identified challenge is providing suitable sets of languages for different contexts, as not all languages are appropriate in all contexts. We have not clearly formulated a solution for this.

LocalisationCache would no longer be the backing store for this data. This could alleviate performance concerns about accessing basic info for multiple languages (if there are any left).

  • Would the changes be in any way (positive or otherwise) observable on-wiki from the UI, wikitext parsing, or in some other way? (Aside from JS)

As a side effect, it would hopefully be clearer where the set of available languages for each context is defined, something that has been unclear. The {{#language}} parser function would know more languages than it currently does.

Also, this would make it unnecessary to separately register new languages in MediaWiki after they have been exported from translatewiki.net for the first time. The library would provide the other necessary info in most cases, and it would be updated ahead of time (because translatewiki.net needs this info to enable translations).

Bugreporter subscribed.

In my opinion this should be prioritised – see the dilemma in T273627/T277836 and my comment at T201509#4488401.

Date and time formats are also language data and would probably benefit from being included in this library (allowing a proper solution for T223772).