Create an optimized language-fallback system for Maps internationalization based on investigations
Closed, ResolvedPublic

Description

Based on the investigation in T192700: Investigate an optimized language-fallback system for Maps internationalization, and on continued testing, we've implemented a per-label language fallback system for maps. The system is designed for maps that appear in informational articles on wikis in different languages. That purpose has been the main driver for how fallback behaves for each label; the behavior will be tweaked if needed.

The algorithm

The fallback stages are as follows, where the process stops when a value is found:

  1. Look for value in the requested language
  2. Look for value in a language (or languages) that are specifically defined as fallback languages
  3. Look for a transliterated value
  4. Look for label in the local language

If no value is found, display no label.
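
The four stages can be sketched roughly as follows. This is a minimal illustration rather than the actual implementation: the function name, the `name:<code>` key layout (borrowed from OSM tagging), and the shape of the `fallbacks` mapping are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def resolve_label(
    tags: Dict[str, str],
    requested: str,
    fallbacks: Dict[str, List[str]],
    script: str,
) -> Optional[str]:
    """Return the first label found by the four stages, or None.

    `tags` is assumed to hold OSM-style keys such as "name", "name:en"
    or "name:sr-Latn"; `script` is the script code of the requested
    language (for example "Latn"). All names here are hypothetical.
    """
    # Stage 1: value in the requested language.
    value = tags.get(f"name:{requested}")
    if value:
        return value

    # Stage 2: explicitly declared fallback languages, in order.
    for lang in fallbacks.get(requested, []):
        value = tags.get(f"name:{lang}")
        if value:
            return value

    # Stage 3: any value whose key carries the requested script suffix.
    suffix = f"-{script}"
    for key, value in tags.items():
        if key.startswith("name:") and key.endswith(suffix) and value:
            return value

    # Stage 4: the local-language label; None means "display no label".
    return tags.get("name") or None
```

The process stops at the first stage that yields a value, so a label with only a plain `name` is still shown, while a feature with no name at all gets no label.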

Wikipedia itself already has a language fallback system for its interface translations; several languages have language fallbacks in case a specific translation is not found. We have collected this fallback structure into a JSON file and are using it as an initial fallback stage.

Specifications for the stages

Each stage assumes that the map is intended for an informational article (rather than for travel needs, as with Google Maps and similar services) and relies on the language information and fallbacks that MediaWiki already uses in general.

1. Value in the requested language

By default, this language is the language of the wiki. It is possible to override this language setting when using <mapframe> by using a parameter lang=.

For example, the map below:
<mapframe text="Downtown [[wikipedia:San Francisco|San Francisco]]" width=250 height=250 zoom=13 latitude=37.8013 longitude=-122.3988 />

If this map is posted in English Wikipedia, the default requested language (for any region specified) would be English (en). If the map is posted on Hebrew Wikipedia, the default language (for any region specified) would be Hebrew (he).

However, adding lang="es" would override the requested language to request the labels in Spanish, no matter where the map is posted.

(This will be available soon) Adding lang="local" will force the system to ignore any requested language and display all labels in the local languages. This reproduces the behavior that existed before the i18n improvement.

2. Value in specified language fallback

MediaWiki uses language fallbacks already, especially for our UI translations. Not all languages have declared fallbacks, but those that do have a specific one that was specified by the communities and the Language Team. We have collected those directly from MediaWiki into a fallback JSON file: https://github.com/kartotherian/babel/blob/master/lib/fallbacks.json

If the specific language is not found, the system will look in that file to see if there are official fallback languages, and will attempt to get values in those languages.
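
As a rough sketch of this lookup, assuming the file maps a language code to an ordered list of fallback codes (the real lib/fallbacks.json may be shaped differently, and the entries below are illustrative rather than copied from it):

```python
import json
from typing import Dict, List

# Illustrative stand-in for the fallback file; not its real contents.
FALLBACKS_JSON = '{"yi": ["he"], "gan": ["gan-hant", "zh-hant", "zh-hans"]}'


def fallback_chain(requested: str, fallbacks: Dict[str, List[str]]) -> List[str]:
    """Return the languages to try in order: requested first, then fallbacks."""
    declared = fallbacks.get(requested, [])
    return [requested] + [lang for lang in declared if lang != requested]


fallbacks = json.loads(FALLBACKS_JSON)
```

A language with no entry in the mapping simply yields a chain containing only itself, so the lookup moves straight on to the transliteration stage.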

3. Transliterated values

In OSM, some values are specified with a script suffix, like -Latn, -Cyrl and -Arab, and some carry romanized data, marked with the _rm suffix.

In this step, we look at the script of the originally requested language (note: not the fallback languages, though those usually share the same script) and then look for any value whose suffix matches that script. If the requested language uses the Latin script, we add a sub-step where we also explicitly ask for romanized versions, like ja_rm for romanized Japanese or ko_rm for romanized Korean (both of these are part of the top 25 translated labels list).
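
A minimal sketch of this stage, assuming OSM-style keys such as `name:sr-Latn` and `name:ja_rm`; the function name and the way the script code is passed in are hypothetical:

```python
from typing import Dict, Optional


def transliterated_label(tags: Dict[str, str], script: str) -> Optional[str]:
    """Find a value whose key ends with the script suffix of the requested language.

    `script` is the script code of the originally requested language,
    e.g. "Latn", "Cyrl" or "Arab".
    """
    suffix = "-" + script
    for key, value in tags.items():
        if key.startswith("name:") and key.endswith(suffix) and value:
            return value

    # Sub-step for Latin-script requests only: accept romanized values.
    if script == "Latn":
        for key, value in tags.items():
            if key.startswith("name:") and key.endswith("_rm") and value:
                return value

    return None
```

With these assumptions, a label tagged only with `name` and `name:ja_rm` would resolve to the romanized value for an English request, but would fall through to the local-language stage for a Russian one.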

4. Local language

If no label was found so far, we fall back to what OSM defines as the local language.

In some cases, OSM data has local language defined without that language having a definition of itself with the actual language code that it belongs to. For example, you may have a label in the USA (where the local language is considered English) that has local value "Portland" but does not have a label that is specifically declared to be in English. (More specifically, name field is filled in but not name:en)

See T192662: name:<local name code> is not always available in OSM

This is one of the reasons why we are trying not to be too forceful in creating more and more overrides before falling back to the language the system considers local.
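
The "Portland" case can be illustrated with a small sketch. The helper names are hypothetical and the tag layout follows the OSM name / name:en convention mentioned above:

```python
from typing import Dict, Optional


def explicit_label(tags: Dict[str, str], lang: str) -> Optional[str]:
    """Return the label explicitly declared for `lang`, if any."""
    return tags.get(f"name:{lang}") or None


def label_or_local(tags: Dict[str, str], lang: str) -> Optional[str]:
    """Prefer the explicit label; otherwise use the local-language name."""
    return explicit_label(tags, lang) or tags.get("name") or None


# The local name is English, but nothing declares it as name:en.
portland = {"name": "Portland"}
```

Here a request for English finds nothing explicit, even though the local name is in English, and only the final local-language stage produces "Portland".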

Code

Event Timeline

Adding lang="local" will force the system to ignore any requested language

Did you consider using a value that conforms to the standard mechanism to create private use language codes, such as x-local or perhaps even mul?

@Mooeypoo, please see Nikerabbit's question above. Meanwhile, I think this is done and am moving to QA.

I wasn't familiar with those, and there didn't seem to be a full standard that I could find; local seems to be good enough in the sense that it's understandable for users (they'll need to fill it in when writing wikitext for <mapframe>) while being safe enough to never be an actual language code.

I think x-local and mul are less straightforward for users... is it crucial to try and change it?

We've announced and discussed lang=local pretty heavily, with a whole post about it and when you might use it on the project page. So I'd rather not change now unless we've really done something that is definitely wrong.

I think x-local and mul are less straightforward for users... is it crucial to try and change it?

I don't think it is worth changing it now. My advice would be to consider carefully in the future before introducing usages of values which do not follow the language code and language tag standards.

Checked in testwiki only (lang attribute does not work in betalabs currently).

The four stages described in the ticket are working as expected.
However, two important points to consider:
(1) The https://github.com/kartotherian/babel/blob/master/lib/fallbacks.json file defines no fallbacks for some common languages, e.g. Russian, Ukrainian, Greek, Belarusian, Bahasa (Indonesian), Hebrew, etc. English is not a fallback for any language. As a result, local names are displayed most of the time.

(2) I could not find the cases when the following happens:

Look for a transliterated value

I suspect that there is very limited support for such cases.

To illustrate the limitations of the present method of fallbacks:
The following case (a screenshot of Google Maps showing a Russian city viewed in Mongolian) would not occur on maps.wikimedia - the fallbacks present there are to English and to Latin transliteration:

Screen Shot 2018-05-04 at 9.43.27 AM.png (733×730 px, 356 KB)

Not having more extensive fallbacks results in mostly local-language labels:
Viewing a map of Israel in Ukrainian will show no fallbacks (because none are defined):

Screen Shot 2018-05-04 at 3.03.40 PM.png (590×620 px, 427 KB)

In English and in Russian there will be many more labels translated in those respective languages:

Screen Shot 2018-05-04 at 3.05.20 PM.png (566×541 px, 378 KB)

Screen Shot 2018-05-04 at 3.04.30 PM.png (574×563 px, 411 KB)

QA Recommendation: Product should weigh in

I think x-local and mul are less straightforward for users... is it crucial to try and change it?

I don't think it is worth changing it now. My advice would be to consider carefully in the future before introducing usages of values which do not follow the language code and language tag standards.

Fair point. We did consider other values, like und and any, which exist in a Unicode proposal, but they seemed a lot less clear for a user-facing interface, considering the user needs to know what to type in. Neither of them is quite what 'local' does, either, so they seemed unsuitable. (Also, that proposal didn't seem to have passed yet, so it was unclear whether those values really are considered preferable or standard.)

But this is a very fair point; whatever changes we consider next time that relate to this should involve a bigger effort to include more input.

(1) The https://github.com/kartotherian/babel/blob/master/lib/fallbacks.json file defines no fallbacks for some common languages, e.g. Russian, Ukrainian, Greek, Belarusian, Bahasa (Indonesian), Hebrew, etc. English is not a fallback for any language. As a result, local names are displayed most of the time.

This is something we should consider alongside the communities: the file was produced from the base i18n interface fallbacks that we use in MediaWiki. Any language that falls back directly to English is not in that file, because in the interface everything eventually falls back to English.

In maps, however, the situation is a little more complex, since we don't want to *always* fall back to English even in languages that do that in interface, since local labels exist (as opposed to nothing in interface).

Another issue specific to maps is that an "English" fallback would, in the vast majority of cases, be applied instead of the local language; in effect, adding 'en' to the default fallback for a language will probably show 80%+ of the map in English. It's a fairly significant change.

We could absolutely add "en" (or any other language) as a fallback for any language through the fallbacks.json file, but we should not make that decision lightly. We should have a clear process whereby the community decides whether they want a map-specific fallback for their language, and if there's community consensus, adding those to the file is straightforward.

The important bit here, in my opinion, is that for these things we have to have some community agreement. @CKoerner_WMF might be able to chip in with a better and clearer idea of what that might look like?

(2) I could not find the cases when the following happens:

Look for a transliterated value

I suspect that there is very limited support for such cases.

We should try to identify cases with -Cyrl or -Latn etc. in OSM labels, and look for those specifically on the map. @Pnorman can we easily find an example of a couple of places where a -Cyrl label exists but a Cyrillic language is missing (say, a label that doesn't have "ru" but has "foo-Cyrl") for more straightforward testing?

The important bit here, in my opinion, is that for these things we have to have some community agreement

Agree. It's also super hard to preemptively ask communities what they would prefer. Posting something to every Village Pump would give us far less feedback than we'd desire.

I'm not saying that's what you're asking here @MSchottlender-WMF, but just a truism as I think about this. :)

I tried to add clear documentation on how community members can request a change to the fallback chain. This would, presumably, also include requests to not fall back to English. I think this would suffice in most cases. If/When a community discovers a fallback they do not want, they can build consensus to change, and file a task for an update to the list. Does that sound agreeable?

https://www.mediawiki.org/wiki/Help:Extension:Kartographer#Map_language_fallbacks

I think this is perfect. Thanks for writing and cleaning those up!

@Pnorman can we easily find an example of a couple of places where a -Cyrl label exists but a Cyrillic language is missing (say, a label that doesn't have "ru" but has "foo-Cyrl") for more straightforward testing?

I don't know of any areas that meet those criteria, but if you could state the criteria more precisely I can search through a database.