
Consolidate language metadata into a 'language-data' library and use in MediaWiki
Open, Medium, Public

Assigned To
None
Authored By
Nikerabbit
Mar 20 2018, 9:40 AM

Description

Affected components: MediaWiki core, Universal Language Selector (and, indirectly, many extensions such as Wikibase, which would get better language data)
Engineer(s) or team for initial implementation: Language Team.
Code steward: Language Team.

Motivation

There is a growing need to have language metadata readily and efficiently available in different places inside and outside MediaWiki.

Definitions

Metadata in this context means information about languages that is often needed when handling them but is not strictly localisation. This metadata frequently needs to be available efficiently for a large set of languages at once. Currently it consists of the following (a combined record is sketched after this list):

  • language autonym (for language lists)
  • writing direction (for language lists, for displaying text tagged in a particular language)
  • writing script (for ULS language lists, possibly for fonts too)
  • regions where the language is spoken (for ULS language lists, automatic language selection/suggestion with GeoIP)
  • fallback languages (for the fallback system to work without creating a circular dependency)

Description of issues

What we currently have in MediaWiki core is only the autonyms (Names.php), accessible via Language::fetchLanguageNames( Language::AS_AUTONYMS ). Writing direction and fallbacks are only available by constructing the appropriate Language objects or by using Language::getLocalisationCache()->getItem(). This causes a related problem: in order to set fallbacks or the writing direction for a language, MessagesXX.php must be created, which in turn makes the language selectable as an interface language in Special:Preferences. There is a need to have this information available without making languages selectable as interface languages.
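For reference, a rough sketch of how this data is reached today (exact signatures vary between MediaWiki versions, and Language::factory() has since been superseded by the LanguageFactory service):

  // Autonyms: cheaply available for all interface languages via Names.php.
  $autonyms = Language::fetchLanguageNames( Language::AS_AUTONYMS );

  // Direction and fallbacks: require a full Language object or the
  // localisation cache, i.e. they only work for interface languages.
  $he = Language::factory( 'he' );
  $dir = $he->getDir(); // 'rtl'
  $fallbacks = Language::getFallbacksFor( 'frr' );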

An existing database for this is https://github.com/wikimedia/language-data/, which already contains all of the above except language fallbacks.

Currently there is overhead from updating the language metadata in multiple places: language-data itself, jquery.i18n (fallbacks), jquery.uls (copy of language-data), the UniversalLanguageSelector extension (copies of jquery.uls and jquery.i18n) and MediaWiki core (Names.php, MessagesXX.php).

Exploration

To make the language metadata easy to use, and to reduce the overhead of updating data in multiple places, the following actions are proposed:

1. T218639: Make language-data installable as a proper library

  • Make it installable via Composer and/or npm (see the note after this list)
  • Make proper releases
  • Consider moving it to Gerrit
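For illustration, once releases exist the library could be consumed via Composer on the PHP side and via npm on the JavaScript side (e.g. something along the lines of composer require wikimedia/language-data; the package name here is an assumption, not a decision).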

2. Bring language-data to MediaWiki core

  • Add the library as dependency
  • Decide which format to use (YAML or JSON)
  • Determine whether additional caching or formats are required for performance (e.g. storing the data as PHP code; a sketch follows)
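One possible caching approach, sketched below under the assumption that the source data ships as JSON: materialise it once into a static PHP file so the opcode cache keeps it in memory and no per-request parsing is needed (file names are hypothetical).

  // Build step (e.g. a maintenance script); file names are hypothetical.
  $data = json_decode( file_get_contents( 'language-data.json' ), true );
  file_put_contents(
      'LanguageDataCache.php',
      "<?php\nreturn " . var_export( $data, true ) . ";\n"
  );

  // Runtime: a plain require, no JSON parsing per request.
  $languageData = require 'LanguageDataCache.php';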

3. Add a mechanism for local overrides

  • Similar to how plurals can be overridden for CLDR data
  • Support two use cases: MediaWiki customisations (e.g. qqq, qqx, en-rtl) and site/farm-specific customisations outside git (see the sketch below)
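A minimal sketch of what a site-level override could look like, assuming a hypothetical $wgLanguageDataOverrides setting (the actual mechanism and names are still to be decided):

  // LocalSettings.php – hypothetical setting, not an existing global.
  $wgLanguageDataOverrides = [
      'en-rtl' => [
          'autonym'   => 'English (rtl)',
          'direction' => 'rtl',
      ],
  ];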

4. Replace Names.php with language-data

  • Keep the existing public APIs, but replace the data

5. Add/update PHP APIs to expose data from language-data

  • Similar to the above, but for the rest of the data
  • Consider adding a new API (PHP class) to access all the metadata in a uniform way, e.g. to get the direction of a language that is not available as an interface language (see the sketch below)
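A rough sketch of what such a uniform accessor could look like; the class and method names below are hypothetical, not an existing MediaWiki API.

  // Hypothetical service interface for uniform access to the metadata.
  interface LanguageData {
      public function getAutonym( string $code ): ?string;
      public function getDirection( string $code ): ?string; // 'ltr' or 'rtl'
      public function getScript( string $code ): ?string;    // ISO 15924 code
      public function getRegions( string $code ): array;
      public function getFallbacks( string $code ): array;
  }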

6. Add/update Action APIs to expose data from language-data
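For context, newer MediaWiki versions already expose part of this data through the languageinfo query module; a request roughly like the following returns autonyms, directions and fallbacks for selected codes (check the live API documentation for exact parameter names):

  api.php?action=query&meta=languageinfo&liprop=autonym|dir|fallbacks&licode=he|frr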

7. Move language fallbacks to language-data

  • First copy everything to language-data, then bring the updated language-data into core, and finally update LocalisationCache to use the fallbacks from there

8. Update ULS to use language-data from core if available

  • ULS should be able to use language-data from core if available (to pick up local customisations) before falling back to the shipped version
  • Optionally the shipped version can be dropped at a later point in time (but that requires stripping it out of jquery.uls, which will still need it)

Related Objects

Event Timeline


Change 415187 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/core@master] Generate RTL languages via maintenance script

https://gerrit.wikimedia.org/r/415187

Change 415187 abandoned by Jdlrobson:
Generate RTL languages via maintenance script

https://gerrit.wikimedia.org/r/415187

One thing that needs to be figured out is how to add languages to language-data when we don't know the native name of the language. Right now the design of language-data makes an autonym a requirement for a language to be added, but there are many languages that could (and should) be supported in e.g. Wikidata and mul.Wikisource for which we don't have one, making it impossible to add them properly.

We already have languages on the list that do not use a native name. I doubt there are many cases where we cannot find one with some searching.

  • Consider whether it is necessary to add new Action APIs to support frontend requirements

When Phab tasks are filed for that point, remember to add MediaWiki-Action-API so I'll see it.

Krinkle renamed this task from Consolidate language metadata into language-data and use it in MediaWiki core to Consolidate language metadata into a 'language-data' library and use in MediaWiki. Mar 20 2019, 7:19 PM

@Nikerabbit Is this a project that would be implemented by the Language Team? And are you currently interested in wider input and/or approval? It appears to be in pretty good shape, but I'm unsure whether it's ready to be implemented later, or whether there are still unanswered questions or uncertainties.

As a general exercise, I'd like the RFC to identify potential stakeholders or affected parties that we want to hear from at minimum. E.g. which changes would be breaking for whom (if any), and would it affect anyone's workflows; if so, whose?

Yes, the plan is that the Language team would implement this in the near future. Feedback on the general approach would be welcome (creating a library that is then brought into core and integrated). There are some uncertainties in the details, such as the file format and caching with regard to performance. My gut feeling is that there is no absolute need to make breaking changes in any of the APIs, but there might be places where we see a better way of exposing this data. I would like to know whether the current level of detail is sufficient for input and/or approval. I would also like to know if people have other wishes that closely relate to this work; for example, @cscott has been working on bringing our language codes closer to standard ones. For me, the ability to add languages to MediaWiki core without making them available as interface languages is one such wanted outcome. This might surface some meta discussion about more closely defining the list of languages available in each context (like for translatable wiki pages).

The workflow for adding new languages to MediaWiki would change a bit, but that is already mostly on the Language team and translatewiki.net. One stakeholder would be Wikidata, which needs this data for its additional languages that are not interface languages. Other stakeholders could be various (product?) teams and volunteers building frontend features (using MediaWiki APIs) or external tools, who need this data but have so far been implementing their own solutions or scraping data from MediaWiki core (mostly language fallbacks).

Notes from the unconference session:

Attendees: Niklas, Amir, Kelson, Leszek, Emmanuel

Intro:
List of places that contain language metadata:
  MW: languages/data/Names.php
  2x mobile apps: language lists for RTL
  MobileFrontend/Minerva: RTL languages
  Wikidata monolingual codes
  MW: special CSS rules for line-height in some languages (should be for writing systems)
  Wikidata Lexeme
  Override for ULS
  CLDR

https://github.com/wikimedia/language-data

Contains:
  ISO code
  autonym
  where it is spoken (continent)
  writing system (incl. directionality)
  fallbacks (planned)

Notes:
Amir: The following should be merged into a central place:
  MW: languages/data/Names.php
  2x mobile apps: language lists for RTL
  MobileFrontend/Minerva: RTL languages
  Wikidata monolingual codes
  MW: special CSS rules for line-height in some languages (should be for writing systems)
Q: Why not use ICU?
A: Might not have all languages? Slow to upgrade.
ACTION: Consider further use of / integration with ICU?
Panlex people claim Unicode contains grammatical rules for various languages. Would these also be in CLDR?
Amir: not sure, would need to check.
Amir: Why does Wikidata maintain a custom list of language codes for monolingual codes?
Leszek: to allow using language codes on top of the list provided by MediaWiki in Wikidata statements.
?: Okay, so we have this library. Why not use other standard language libraries? Those are backed by big consortia, which could update and maintain the data.
ACTION: PHP binding for language-data
A: We might have more languages
A: Also, corporate parties are generally not interested in smaller languages, as these might not have monetary value
Niklas: Wikimedia is actually a member of Unicode. We also have a contact person at CLDR.
Niklas: CLDR might also require that a language has a written code
Amir: Also, for MediaWiki we don't want all languages from CLDR (e.g. extinct ones)
Emmanuel: What does
ACTION: Mark which languages in language-data can be content languages for MW
N: We should make it clear which lists serve which context. If we just merge all the lists together, we would make it even harder to understand which language/language code is suited for which context
ACTION: Share knowledge on how Kiwix uses ICU.
Why does Wikidata have its own restricted language list?
ACTION: Document the policy for adding things to language-data
Discussed specifics of Wikidata Lexicographical Data. It currently does allow adding data in non-MW language codes (using the "mis" language code)
There are better sources defining language codes/languages than MediaWiki, like Ethnologue
N: How many of those different language lists do we need?
  1. MW content languages
  2. Languages that would be translation targets
  3. Wikidata monolingual languages
  More?
How about a Sumerian-language Wikisource, which is currently not a MW language
Language allows defining language codes with dashes, which are considered variants
A: Maybe we could have a matrix/table: language code – allowed for content, allowed for localisation, allowed for Wikidata
ACTION: Task for polite grammar de, nl, hu, jv, su
It is difficult for third-party software like Kiwix when non-standard language codes are used
What language list does the Commons app use?
We use the device language; users can also change it
When you support structured data on Commons, how are you going to match this language code with the possibly non-standard Wikibase language code?
N: This is also a problem in MW, as structured data can use language codes that are MW-allowed languages
A: What about fallbacks, that is also a kind of metadata. Do we have a task to add fallback data to language-data?
N: It is in the task T190129. The provided list of fallbacks should probably be reviewed, as some of them might not make sense in certain use cases?
ACTION: add fallback information to language-data
The language-data library is maintained/owned by the WMF Language team
When you are not logged in and go to Wikidata, the UI is in English
Q: When do we get Wikidata monolingual language codes into language-data?
There should be a way to distinguish language code lists between different "contexts"

@Nikerabbit Could you speak to who's impacted and in what way?

A few example questions to think about:

  • Do we currently take contributions to this data? If so where?
  • Does it currently all originate from (multiple places within) core, or also from places outside of it?
  • Does LangEng consider itself owner/steward of all of those? Or is there someone/something we may want to inform, consult or collaborate with?
  • Would there be changes to how the data is currently accessed by anyone? E.g. PHP access within core/extensions, JS access, API access (to the extent that it is available/exposed today). If so, what does that look like before/after?
  • Are there things in the logical shape of the data that are expected to change or be discontinued?
  • Would the changes be in any way (positive or otherwise) observable on-wiki from the UI, wikitext parsing, or in some other way? (Aside from JS)
Krinkle triaged this task as Medium priority. Jul 2 2020, 11:40 PM

I'll respond briefly as a comment for now, because this task is not part of the current sprint.

A few example questions to think about:

  • Do we currently take contributions to this data? If so where?

Now: People can submit patches against MediaWiki core. Per established practices, they cannot add info about languages which cannot be used as an interface language.
Future: People can submit patches to https://github.com/wikimedia/language-data – LangEng can take care of integrating updates into core. Other users would probably have to update themselves.

  • Does it currently all originate from (multiple places within) core, or also from places outside of it?

Autonyms come from Names.php in core and, for some languages, from the local overrides in the CLDR extension – it is also possible to add languages using $wgExtraLanguageNames. Writing direction and fallbacks come from MessagesXX.php files (only for languages supported as interface languages). Script information is not available, nor are regions.
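For example, a wiki can already register an extra language name locally with $wgExtraLanguageNames, but this only affects the name list, not direction or fallbacks (the code used below is made up):

  // LocalSettings.php
  $wgExtraLanguageNames = [ 'x-example' => 'Example language' ];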

Translated language names come from CLDR, and that would not change.

The data in the language-data library is sourced from multiple places: mostly MediaWiki core, with many additional languages, and it is enhanced with country-level data from CLDR.

  • Does LangEng consider itself owner/steward of all of those? Or is there someone/something we may want to inform, consult or collaborate with?

We do not currently maintain core i18n (it's pretty stable), but we would maintain this library and do the initial integration. I believe the Language Committee is aware of new languages.

  • Would there be changes to how the data is currently accessed by anyone? E.g. PHP access within core/extensions, JS access, API access (to the extent that it is available/exposed today). If so, what does that look like before/after?

We can integrate our library as the backend for existing methods (in PHP, which are used by the API and the frontend), but we would also likely consider creating a service for LanguageData (data as defined in the scope).

  • Are there things in the logical shape of the data that are expected to change or be discontinued?

The biggest change is dropping the requirement of being an interface language before we can know the basic info of a language. More information would be available for all languages. One identified challenge is providing suitable sets of languages for different contexts, as not all languages are appropriate in all contexts. We have not clearly formulated a solution for this.

LocalisationCache would no longer be the backing store for this data. This could alleviate performance concerns about accessing basic info for multiple languages (if there are any left).

  • Would the changes be in any way (positive or otherwise) observable on-wiki from the UI, wikitext parsing, or in some other way? (Aside from JS)

As a side effect, it would hopefully be clearer where the set of available languages for each context is defined, something that has been unclear. The {{#language}} parser function would know more languages than it currently does.

Also, this would make it unnecessary to separately register new languages in MediaWiki after they have been exported from translatewiki.net for the first time. The library would provide the other necessary info in most cases, and it would be updated ahead of time (because translatewiki.net needs this info to enable translations).

Bugreporter subscribed.

In my opinion this should be prioritised – see the dilemma in T273627/T277836 and my comment at T201509#4488401.

Date and time formats are also language data and would probably benefit from being included in this library (allowing a proper solution for T223772).