
Consolidate language metadata into a 'language-data' library and use in MediaWiki
Open, Needs Triage, Public

Description

There is a growing need to have language metadata readily and efficiently available in different places inside and outside MediaWiki.

Definitions

Metadata in this context means information about languages that is often required to handle languages but is not exactly about localisation. This metadata often needs to be efficiently available for a large set of languages. Currently the metadata consists of:

  • language autonym (for language lists)
  • writing direction (for language lists, for displaying text tagged in a particular language)
  • writing script (for ULS language lists, possibly for fonts too)
  • regions where language is spoken (for ULS language lists, automatic language selection/suggestion with GeoIP)
  • fallback languages (for the fallback system to work without creating a circular dependency)
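
As an illustration only, the five fields above could be modelled as one record per language. This is a minimal sketch in Python, not the actual language-data format or a MediaWiki API: the `LanguageMetadata` name, the field layout and the sample region codes are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class LanguageMetadata:
    """Hypothetical record holding the metadata listed above for one language."""
    code: str                        # language code, e.g. "fi"
    autonym: str                     # name of the language in the language itself
    direction: str                   # "ltr" or "rtl"
    script: str                      # writing script, e.g. "Latn"
    regions: Tuple[str, ...] = ()    # regions where the language is spoken
    fallbacks: Tuple[str, ...] = ()  # fallback language codes, in order


# Illustrative sample values; the region codes are placeholders.
SAMPLE: Dict[str, LanguageMetadata] = {
    "fi": LanguageMetadata("fi", "suomi", "ltr", "Latn", ("EU",)),
    "he": LanguageMetadata("he", "עברית", "rtl", "Hebr", ("ME",)),
    "de-at": LanguageMetadata("de-at", "Österreichisches Deutsch", "ltr", "Latn",
                              ("EU",), ("de",)),
}

# A language list only needs the autonym and direction of each entry.
for meta in SAMPLE.values():
    print(f"{meta.code}: {meta.autonym} ({meta.direction})")
```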

Description of issues

What we currently have in MediaWiki is only autonyms (Names.php), accessible via Language::fetchLanguageNames( Language::AS_AUTONYMS ). Writing direction and fallbacks are available only by constructing appropriate Language objects or using Language::getLocalisationCache()->getItem(). This causes a related problem: in order to set fallbacks or writing direction for a language, MessagesXX.php must be created, which in turn makes the language available for selection as an interface language in Special:Preferences. There is a need to have this information available without making languages available as interface languages.

An existing database for that is https://github.com/wikimedia/language-data/ which already contains all of the above except language fallbacks.

Currently we have an overhead of updating the language metadata in multiple places: language-data itself, jquery.i18n (fallbacks), jquery.uls (copy of language-data), UniversalLanguageSelector extension (copy of jquery.uls and jquery.i18n) and MediaWiki core (Names.php, MessagesXX.php).

Proposed implementation plan

To make the language metadata easy to use, and to reduce the overhead of updating data in multiple places, the following actions are proposed:

1. T218639: Make language-data installable as a proper library

  • Make it installable via composer and/or npm
  • Make proper releases
  • Consider moving it to Gerrit

2. Bring language-data to MediaWiki core

  • Add the library as dependency
  • Decide which format to use (YAML or JSON)
  • Determine if additional caching or formats are required for performance (e.g. store as PHP code)

3. Add a mechanism for local overrides

  • Similar to how we can override plurals for CLDR data
  • Support two use cases: MediaWiki customisations (e.g. qqq, qqx, en-rtl) and site/farm-specific customisations (outside git)
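
A minimal sketch of how such layered overrides could work, assuming each layer is a mapping from language code to the fields it adds or changes. The function name, the layer structure and all sample values are hypothetical; the task does not specify a merge strategy.

```python
from typing import Dict

Layer = Dict[str, Dict[str, object]]


def apply_overrides(base: Layer, *layers: Layer) -> Layer:
    """Merge override layers onto base data; later layers win, field by field."""
    merged: Layer = {code: dict(fields) for code, fields in base.items()}
    for layer in layers:
        for code, fields in layer.items():
            # Unknown codes are added, known codes are updated per field.
            merged.setdefault(code, {}).update(fields)
    return merged


# Illustrative data only.
upstream = {"en": {"autonym": "English", "direction": "ltr"}}
mediawiki_layer = {"en-rtl": {"autonym": "English (rtl)", "direction": "rtl"}}
site_layer = {"en": {"autonym": "English (site override)"}}

result = apply_overrides(upstream, mediawiki_layer, site_layer)
print(result["en"])      # direction kept from upstream, autonym from the site layer
print(result["en-rtl"])  # entry added by the MediaWiki layer
```

Merging per field rather than per language lets a site change one property without repeating the rest of the upstream entry.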

4. Replace Names.php with language-data

  • Keep the existing public APIs, but replace the data

5. Add/update PHP APIs to expose data from language-data

  • Similar to the above, but with the rest of the data
  • Consider adding a new API (PHP class) to access all the metadata in a uniform way (e.g. to access the direction of a language that is not available as an interface language)
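
To make the idea of uniform access concrete, here is a rough sketch in Python rather than PHP. The class and method names are invented for illustration and do not correspond to any existing or planned MediaWiki class. The point it shows is that metadata lookups do not depend on the language being an interface language.

```python
from typing import Dict, Iterable


class LanguageMetadataLookup:
    """Hypothetical uniform accessor for language metadata."""

    def __init__(self, metadata: Dict[str, Dict[str, str]],
                 interface_languages: Iterable[str]) -> None:
        self._metadata = metadata
        self._interface_languages = frozenset(interface_languages)

    def is_known(self, code: str) -> bool:
        return code in self._metadata

    def is_interface_language(self, code: str) -> bool:
        return code in self._interface_languages

    def get_direction(self, code: str) -> str:
        # Raises KeyError for unknown codes instead of guessing a default.
        return self._metadata[code]["direction"]

    def get_autonym(self, code: str) -> str:
        return self._metadata[code]["autonym"]


# "x-demo" is a made-up code standing in for a language that has
# metadata but is not offered as an interface language.
lookup = LanguageMetadataLookup(
    metadata={
        "en": {"autonym": "English", "direction": "ltr"},
        "x-demo": {"autonym": "Demo", "direction": "rtl"},
    },
    interface_languages=["en"],
)
print(lookup.is_interface_language("x-demo"), lookup.get_direction("x-demo"))
```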

6. Add/update Action APIs to expose data from language-data

7. Move language fallbacks to language-data

  • First copy everything to language-data, then bring in updated language-data, then update LocalisationCache to use fallbacks from there
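
Once fallbacks live in plain data, a fallback chain can be computed without constructing Language objects. The sketch below is one possible approach under stated assumptions: it expands fallbacks transitively, guards against cycles, and appends "en" as the final fallback. None of these choices are specified by this task, and the codes "a" and "b" are placeholders used to show the cycle guard.

```python
from typing import Dict, List


def fallback_chain(code: str, fallbacks: Dict[str, List[str]],
                   final: str = "en") -> List[str]:
    """Return the ordered fallback codes for `code`, excluding `code` itself."""
    chain: List[str] = []
    seen = {code}
    queue = list(fallbacks.get(code, []))
    while queue:
        candidate = queue.pop(0)
        if candidate in seen:
            continue  # already handled; this also breaks cycles
        seen.add(candidate)
        chain.append(candidate)
        queue.extend(fallbacks.get(candidate, []))
    if final not in seen:
        chain.append(final)
    return chain


# Illustrative data; "a" and "b" deliberately point at each other.
FALLBACKS = {"de-at": ["de"], "de": [], "a": ["b"], "b": ["a"]}

print(fallback_chain("de-at", FALLBACKS))
print(fallback_chain("a", FALLBACKS))
```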

8. Update ULS to use language-data from core if available

  • ULS should be able to use language-data from core if available (to pick up local customisations) before falling back to the shipped version
  • Optionally, the shipped version can be dropped at a later point in time (but that requires stripping it from jquery.uls, which will still require it)
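
The preference order described in this step could look roughly like the following. This is a sketch in Python for consistency with the other examples, not ULS code: the provider callable, the return shape and the sample data are all assumptions.

```python
from typing import Callable, Dict, Optional, Tuple

LanguageData = Dict[str, dict]


def load_language_data(core_provider: Callable[[], Optional[LanguageData]],
                       shipped: LanguageData) -> Tuple[LanguageData, str]:
    """Prefer data exposed by core (with local customisations), else the shipped copy."""
    core = core_provider()
    if core:
        return core, "core"
    return shipped, "shipped"


# Illustrative data only.
SHIPPED = {"en": {"direction": "ltr"}}
CORE = {"en": {"direction": "ltr"}, "x-local": {"direction": "rtl"}}

print(load_language_data(lambda: CORE, SHIPPED)[1])  # core data is available
print(load_language_data(lambda: None, SHIPPED)[1])  # older core without the data
```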

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 20 2018, 9:40 AM

Change 415187 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/core@master] Generate RTL languages via maintenance script

https://gerrit.wikimedia.org/r/415187

Change 415187 abandoned by Jdlrobson:
Generate RTL languages via maintenance script

https://gerrit.wikimedia.org/r/415187

jhsoby added a subscriber: jhsoby. Aug 27 2018, 1:49 PM

One thing that needs to be figured out is how to add languages to language-data when we don't know the native name of the language. Right now, the design of language-data requires a native name for a language to be added, but there are many languages that could (and should) be supported in e.g. Wikidata and mul.Wikisource for which we don't have one, making it impossible to add them properly.

We already have languages on the list that do not use a native name. I doubt there are many cases where we cannot find one with some searching.

cscott added a subscriber: cscott.
Nikerabbit updated the task description.
daniel moved this task from Inbox to Under discussion on the TechCom-RFC board. Jan 17 2019, 6:27 AM
Anomie added a subscriber: Anomie. Jan 19 2019, 3:14 PM
  • Consider whether it is necessary to add new Action APIs to support frontend requirements

When Phab tasks are filed for that point, remember to add MediaWiki-API so I'll see it.

Krinkle renamed this task from Consolidate language metadata into language-data and use it in MediaWiki core to Consolidate language metadata into a 'language-data' library and use in MediaWiki. Mar 20 2019, 7:19 PM

@Nikerabbit Is this a project that would be implemented by the Language Team? And are you currently interested in wider input and/or approval? It appears to be in pretty good shape, but I'm unsure whether it's ready to be implemented later, or whether there are still unanswered questions or uncertainties.

As a general exercise, I'd like the RFC to identify potential stakeholders or affected parties that we want to hear from at minimum. E.g. which changes would be breaking for whom (if any), and would it affect anyone's workflows, if so whose?

Yes, the plan is that the Language team would implement this in the near future. Feedback on the general approach (creating a library that is then brought into core and integrated) would be welcome. There are some uncertainties in the details, such as the file format and caching with regard to performance. My gut feeling is that there is no absolute need to make breaking changes in any of the APIs, but there might be places where we see a better way of exposing this data. I would like to know whether the current level of detail is sufficient for input and/or approval. I would also like to know if people have other wishes that closely relate to this work. For example, @cscott has been working on bringing our language codes closer to standard ones. For me, the ability to add languages to MediaWiki core without making them available as interface languages is one such wanted outcome. This might surface some meta discussion about more closely defining the list of languages available in each context (like for translatable wiki pages).

The workflow for adding new languages to MediaWiki would change a bit, but that is already mostly handled by the Language team and translatewiki.net. One stakeholder would be Wikidata, which needs this data for its additional languages that are not interface languages. Other stakeholders could be various (product?) teams and volunteers building frontend features (using MediaWiki APIs) or external tools. The latter need this data but have so far been implementing it their own way or scraping it from MediaWiki core (mostly language fallbacks).