
Expose the graph of language fallbacks in an API
Open, Needs Triage · Public

Description

As a tool developer, it would be very useful to access the language fallback graph from an API.
Ideally, in one API call, we would retrieve the entire graph:

{
   "frc": "fr",
   "de-formal": "de",
   "en-gb": "en",
   "pms": "it",
   ...
}

Currently, the only way to retrieve this seems to be to parse the localization files of MediaWiki and extract the parent language from each such file. However, the language fallback graph evolves as new languages are added, so it would be more convenient to expose that in an API. There is also the possibility of querying individual wikis to retrieve the parent languages one-by-one:
https://pt.wikipedia.org/wiki/Especial:ApiSandbox#action=query&format=json&meta=siteinfo&formatversion=2&siprop=general
From this result we obtain the fallback chain "pt" -> "pt-br" -> "en". But it's impractical to use this to extract the entire graph. It also will not work for retrieving fallbacks for language codes that don't have a corresponding wiki.
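The per-wiki approach can at least be scripted. A minimal sketch, assuming the `fallback` field that `siprop=general` returns (the response below is an illustrative, trimmed sample rather than a live API call):

```python
# Sketch of extracting a wiki's fallback chain from a meta=siteinfo response.
# The sample below is trimmed to the relevant fields.
sample_siteinfo = {
    "query": {
        "general": {
            "lang": "pt",
            "fallback": [{"code": "pt-br"}, {"code": "en"}],
        }
    }
}

def fallback_chain(siteinfo):
    """Return the wiki's own language code followed by its fallback codes."""
    general = siteinfo["query"]["general"]
    return [general["lang"]] + [entry["code"] for entry in general["fallback"]]

print(fallback_chain(sample_siteinfo))  # ['pt', 'pt-br', 'en']
```

This only yields one chain per request, which is why it does not scale to the whole graph.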

This could either be exposed by MediaWiki itself (given that the notion of language fallback lives there already) or by Wikibase (given that it already has a wbcontentlanguages API to expose the supported languages).

Somewhat related task: T197255.

Event Timeline

Pintoch created this task. Feb 27 2019, 1:37 PM
Restricted Application added a subscriber: Aklapper. Feb 27 2019, 1:37 PM
Mvolz updated the task description.
Mvolz added a project: Wikidata.
Mvolz awarded a token. Feb 27 2019, 1:47 PM
Anomie added a subscriber: Anomie.

As a tool developer, it would be very useful to access the language fallback graph from an API.

What exactly would you do with this information? i.e. what's the actual use case that makes you file this request?

A potential blocker to this is the same as for T74153 (under its original title, "meta=siteinfo should allow client to identify RTL languages"): actually collecting the fallbacks for all languages looks like it would require loading the language data for all languages, which seems likely to be prohibitive from a performance perspective.

or by Wikibase (given that it already has a wbcontentlanguages API to expose the supported languages).

Personally I don't think this would fit very well there, but I'll leave it to the Wikibase team to make the call on that point.

Pintoch added a subscriber: Mvolz. Edited Feb 27 2019, 4:40 PM

What exactly would you do with this information? i.e. what's the actual use case that makes you file this request?

I would use the resulting graph in https://tools.wmflabs.org/openrefine-wikidata/ . This tool is basically a wrapper over the Wikibase API (and a bit of SPARQL) to comply with OpenRefine's reconciliation API. The tool can be configured to work in any language and would benefit from knowing about the language fallback graph to retrieve labels, descriptions and aliases. The current code that does that is here:
https://github.com/wetneb/openrefine-wikibase/blob/master/wdreconcile/language.py
(it's a crude approximation, where every language falls back on English).
I will let @Mvolz comment on her own use case.

Concerning performance, of course this would be only retrieved quite rarely and cached on my side. So maybe a web API is overkill for that - but it would ideally be good to be able to download that structured graph from somewhere. It does not have to be served by the MediaWiki instance itself.
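A client-side cache along these lines is straightforward once the graph is available. A minimal sketch, assuming the parent map from the task description (the `FALLBACK_PARENT` slice and the `best_label` helper are hypothetical, not part of any existing API):

```python
from functools import lru_cache

# Hypothetical slice of the fallback graph, in the shape sketched in the
# task description; a real client would download and cache the full map.
FALLBACK_PARENT = {
    "frc": "fr",
    "de-formal": "de",
    "en-gb": "en",
    "pt": "pt-br",
    "pt-br": "en",
}

@lru_cache(maxsize=None)
def fallback_chain(code):
    """Follow parent links to build the full chain, guarding against cycles."""
    chain = [code]
    seen = {code}
    while code in FALLBACK_PARENT:
        code = FALLBACK_PARENT[code]
        if code in seen:  # defensive: the data should be acyclic
            break
        chain.append(code)
        seen.add(code)
    return tuple(chain)

def best_label(labels, code):
    """Pick the first label available along the fallback chain."""
    for lang in fallback_chain(code):
        if lang in labels:
            return labels[lang]
    return labels.get("en")  # last resort, as in the current crude approximation
```

For example, `best_label({"pt-br": "Berlim", "en": "Berlin"}, "pt")` returns `"Berlim"` instead of unconditionally falling back on English.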

Mvolz added a comment. Feb 27 2019, 6:25 PM

@Lucas_Werkmeister_WMDE would it make sense to expose the commented out equivalence information in https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/lib/includes/WikibaseContentLanguages.php#L135 ? In case we get back any language codes like those?

As far as I know, all those codes are MediaWiki-internal – for example, zh-classical is just some name that was picked for Classical Chinese at a time when the lzh language code (Literary Chinese) had not been assigned yet. Do you think we’re likely to encounter these language codes outside of MediaWiki?

Mvolz added a comment. Feb 27 2019, 6:43 PM

As far as I know, all those codes are MediaWiki-internal – for example, zh-classical is just some name that was picked for Classical Chinese at a time when the lzh language code had not been assigned yet. Do you think we’re likely to encounter these language codes outside of MediaWiki?

Ok, so they're safe to ignore, thanks!

I'm thinking this might make sense as two different tickets: for your use case, @Pintoch, you want the graph of the actual content-language fallbacks for all the individual wikis, correct? And this would be a MediaWiki API thing.

And then for my use case, what I really want is the correct singular fallback (only two levels deep, so not much of a graph) going from the superset of wbclcontext=monolingualtext language codes to the subset of wbclcontext=term codes, and this would go in the wbcontentlanguages API. @Lucas_Werkmeister_WMDE would you be amenable to merging such a thing?
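That two-deep mapping is small enough to sketch. A hypothetical illustration of the lookup, not the actual Wikibase implementation (`TERM_LANGUAGES` and `PARENT` are made-up sample data):

```python
def term_fallback(code, term_languages, parent):
    """Map a monolingual-text code to a term language: the code itself if it
    is already a term language, else its direct parent if that is, else None."""
    if code in term_languages:
        return code
    fallback = parent.get(code)
    if fallback in term_languages:
        return fallback
    return None

# Made-up sample data for illustration.
TERM_LANGUAGES = {"fr", "de", "en"}
PARENT = {"frc": "fr", "de-formal": "de", "en-gb": "en"}

print(term_fallback("frc", TERM_LANGUAGES, PARENT))  # fr
print(term_fallback("de", TERM_LANGUAGES, PARENT))   # de
```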

A potential blocker to this is the same as for T74153 (under its original title, "meta=siteinfo should allow client to identify RTL languages"): actually collecting the fallbacks for all languages looks like it would require loading the language data for all languages, which seems likely to be prohibitive from a performance perspective.

With a small patch to Wikibase’s meta=contentlanguages API, I’m able to get the fallbacks for all monolingual text languages (486 language codes, not all of them supported interface languages) or all term languages (442 language codes, equivalent to MediaWiki’s interface languages as far as I’m aware) in less than one second. I think this is already optimized somewhere in LocalisationCache.

Change 493723 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Add fallbacks to wbcontentlanguages

https://gerrit.wikimedia.org/r/493723

Hm, okay, if the localization cache isn’t used (I always have it enabled, so I didn’t think to check this at first) then the request times out :(

Anomie added a comment. Mar 1 2019, 5:55 PM

I tried the following via shell.php to get some idea of the timing.

$t0 = microtime( true );
// Resolve the strict fallback chain for every known language code.
foreach ( Language::fetchLanguageNames() as $code => $name ) {
    Language::getFallbacksFor( $code, Language::STRICT_FALLBACKS );
}
$t1 = microtime( true );
echo ( $t1 - $t0 ) . "s\n"; // elapsed time

On mwmaint1002, it took 0.04s. On my laptop, it took 148.14s for the first run, and 1.56s for a second, and back to 141s after I deleted the files in $wgCacheDirectory.

What I'm tempted to do is add a meta=languageinfo module, taking a list of language codes, to return information about arbitrary languages, and apply continuation after it has used a total of 2 seconds.
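A client for such a module would need to merge continued responses. A sketch under the assumption that the proposed module pages its results the way other `meta` modules do (the response shapes and the `licontinue` parameter name are assumptions, since the module does not exist yet):

```python
def merge_languageinfo(pages):
    """Merge the per-language maps from a sequence of continued responses."""
    merged = {}
    for page in pages:
        merged.update(page.get("query", {}).get("languageinfo", {}))
    return merged

# Two illustrative response pages, as the proposed module might return them
# after hitting its 2-second budget and applying continuation.
page1 = {
    "continue": {"licontinue": "de-formal", "continue": "-||"},
    "query": {"languageinfo": {"frc": {"fallbacks": ["fr"]}}},
}
page2 = {
    "query": {"languageinfo": {"de-formal": {"fallbacks": ["de"]}}},
}

graph = merge_languageinfo([page1, page2])
```

In a live client, the loop would re-issue the request with the `continue` parameters from each response until no `continue` block remains.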