Specify language fallback
Closed, Resolved, Public

Description

How will language fallback work, i.e. in which order should languages be tried when a label does not exist in the requested language? How should a label in a different language be displayed so that it is visually clear which language it is in? How does this tie in with bug T38426?

See Also:
T40399: Multiple fallbacks kills server
T40439: Language factory uses too simplistic caching
T57998: Wikidata have to provide a syntax to show the properties in a language different from the wiki's content language
T39459: [Task] Don't try to add labels in non-existing languages: restrict to Language::isKnownLanguageTag

Details

Reference
bz36430

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:26 AM
bzimport set Reference to bz36430.
bzimport added a subscriber: Unknown Object (MLST).
Denny created this task. May 2 2012, 12:29 PM

How do we handle the fact that we know the user's languages from the preferences? How do we tie this into the fallback chain?

Consider real-world constraints, i.e. those the parser cache imposes on the display of items.

Keep in mind bug #37461: in some cases conversion is needed, and in some cases there is no need to indicate that a label is in a specific language.

Think about handling multilingual content. For example, the item about the question mark may well have "?" as its label. Even if some languages would want to use "question mark" or "Fragezeichen", in others "?" will probably be better than a label in a foreign language. Another example is ".com". Perhaps the MUL code could be used for this.

There is also the more complex question of names. For example, "Berlin" is "Berlin" in a large number of languages, so it seems a waste not to use this fact. Perhaps there could be a way to specify which language is the "default" language of the item, so that other languages could draw from it where possible. In some cases, additional conversion might be needed.

Note also that some languages might have circular fallback, for example simple → en and en → simple.

For _global_ fallback chains, Language::getFallbacksFor( $langCode ) can be used, especially for the content. Likewise, the _user_ can have a defined fallback chain, and we can use this first and then build on it to create a complete fallback chain.
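As a rough illustration of the idea above, the following Python sketch starts from the user's language and the user's own chain, then extends the result with global fallbacks. It is not the Wikibase implementation: the table GLOBAL_FALLBACKS and the function build_fallback_chain are hypothetical names, and the table entries are illustrative stand-ins for what Language::getFallbacksFor would return. Tracking which codes are already in the chain is what keeps circular fallbacks such as simple → en → simple from looping.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the global fallback data. The entries are
# illustrative only and are not read from MediaWiki.
GLOBAL_FALLBACKS: Dict[str, List[str]] = {
    "simple": ["en"],
    "en": ["simple"],      # circular on purpose
    "pt": ["pt-br"],
    "pt-br": ["pt"],       # circular on purpose
    "nb": ["nn", "da", "sv"],
}


def build_fallback_chain(
    user_lang: str,
    user_chain: Optional[List[str]] = None,
    global_fallbacks: Dict[str, List[str]] = GLOBAL_FALLBACKS,
    final: str = "en",
) -> List[str]:
    """Return an ordered, duplicate-free list of language codes to try."""
    chain: List[str] = []

    def add(code: str) -> None:
        # Skipping codes that are already present makes cycles harmless.
        if code not in chain:
            chain.append(code)

    # 1. The user's language and the user's own chain come first.
    add(user_lang)
    for code in user_chain or []:
        add(code)

    # 2. Extend with the global fallbacks of every code in the chain.
    #    The chain grows while we walk it, so iterate by index.
    i = 0
    while i < len(chain):
        for code in global_fallbacks.get(chain[i], []):
            add(code)
        i += 1

    # 3. Assumption: the final catch-all language is appended last.
    add(final)
    return chain


if __name__ == "__main__":
    print(build_fallback_chain("pt-br", ["es"]))
```

The breadth-first expansion keeps the user's explicit choices ahead of anything derived from the global table; a different ordering policy would be equally easy to express.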

I am not sure that for label fallback we want to use the fallback that is used by the interface localization. The interface forms a language hierarchy on the assumption that an interface message that is not defined in one language will always be defined in a parent language, or ultimately in English. Here we may well have the case that an item has a label in Simple English but not in English.

There is a patchset that can be used for discussion at https://gerrit.wikimedia.org/r/#/c/15433/

This uses the current fallbacks from the Language class (see $fallback in the message files) if all requested languages fail. Typically the current fallback, which goes straight to English, will be extended somewhat. For example, Norwegian (bokmål) will use "nn,da,sv", Norwegian (nynorsk) will use "nb,sv,da", Swedish will use "nn,nb,da" and Danish will use "nb,nn,sv". All will have English added to the list by default.

If fallbacks are turned on and a label (or description) comes from a fallback, it is flagged as such. That is, a structure like the following is returned (the query is for labels in the "nb" language, but a label is only found in "en"):

http://localhost/repo/api.php?action=wbgetitems&ids=4&languages=nb&format=jsonfm
{
	"items": {
		"4": {
			"id": 4,
			"labels": {
				"en": {
					"language": "en",
					"value": "Etnedal",
					"fallback": ""
				}
			}
		}
	}
}

All languages specified in the call must fail before the fallbacks are used. If they are used, the language list from the call is replaced, one language at a time, by the language list from the fallbacks. If every fallback list fails, no labels (or descriptions) are returned.
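The selection rule described here can be sketched in Python as follows. This is only an approximation under stated assumptions, not the code from the patchset: select_labels and its parameters are hypothetical names, and the sketch assumes that the first fallback hit is returned on its own and marked with an empty "fallback" key, as in the sample response.

```python
from typing import Dict, List


def select_labels(
    labels: Dict[str, str],
    requested: List[str],
    fallbacks_for: Dict[str, List[str]],
    use_fallback: bool = True,
) -> Dict[str, Dict[str, str]]:
    """Pick labels for the requested languages, using fallbacks only
    when none of the requested languages has a label."""
    # Direct hits are returned unflagged.
    direct = {
        lang: {"language": lang, "value": labels[lang]}
        for lang in requested
        if lang in labels
    }
    if direct or not use_fallback:
        return direct

    # Every requested language failed: replace each one in turn by its
    # own fallback list and return the first label found, flagged.
    for lang in requested:
        for candidate in fallbacks_for.get(lang, []):
            if candidate in labels:
                return {
                    candidate: {
                        "language": candidate,
                        "value": labels[candidate],
                        "fallback": "",
                    }
                }

    # All fallback lists failed as well: nothing is returned.
    return {}


if __name__ == "__main__":
    item_labels = {"en": "Etnedal"}
    chains = {"nb": ["nn", "da", "sv", "en"]}
    print(select_labels(item_labels, ["nb"], chains))
```

With the sample data, the request for "nb" yields the "en" label carrying the "fallback" flag, which mirrors the structure shown in the API response.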

It is only implemented for labels and descriptions in wbgetitems; fallbacks make no sense for modules that set labels and descriptions.

It seems to me that there might be some confusion between two kinds of language fallback, so I'd like to clarify.

One kind is fallback between variants of the same language, like sr-el → sr or zh-hant → zh. These are the variants specified with the 'variant' query parameter on Wikipedia, except that on Wikidata there may be additional fallbacks like simple → en. This kind should be invisible to the user, who doesn't need to know which variant is actually in the database.

It is an entirely different kind of fallback when we simply don't have the content in the user's language or any variant of it and supply another language that we assume the user knows. In this case, the language of the fallback should be visibly displayed to the user.
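The distinction between the two kinds of fallback can be made concrete with a small Python sketch. The names VARIANT_FALLBACKS and resolve_label are hypothetical, and the table only contains the pairs mentioned in this comment. The function returns the label together with the language code to show to the user, or None when the fallback should stay invisible.

```python
from typing import Dict, List, Optional, Tuple

# Variant-level fallbacks taken from the examples in this comment.
# The table is an illustrative assumption, not real configuration.
VARIANT_FALLBACKS: Dict[str, List[str]] = {
    "sr-el": ["sr"],
    "zh-hant": ["zh"],
    "simple": ["en"],
}


def resolve_label(
    labels: Dict[str, str],
    user_lang: str,
    foreign_fallbacks: List[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (label, language_to_display).

    language_to_display is None when the label is in the user's language
    or one of its variants, so nothing needs to be shown to the user.
    """
    # Kind 1: same language or a variant of it, invisible to the user.
    for code in [user_lang] + VARIANT_FALLBACKS.get(user_lang, []):
        if code in labels:
            return labels[code], None

    # Kind 2: a different language, which must be visibly marked.
    for code in foreign_fallbacks:
        if code in labels:
            return labels[code], code

    return None, None


if __name__ == "__main__":
    print(resolve_label({"en": "Belgrade"}, "sr-el", ["en"]))
```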

What's implemented is global fallbacks initiated by the user language, because that can be cached: it creates a unique page wherever it is used for that specific user language.

There are two types of fallbacks, like you said, one for similar languages and one for forms in other writing systems. The first one needs handling now; the latter is not so urgent. The latter also builds on the first, as we need to get the correct Language object to be able to create the variants. The latter is also not prioritized for the moment.

In an ideal world with complete code we should try to find the global languages for a label, then limit them to the user's chosen languages, then figure out which language transform we should use. If that fails, we use the user's language as a starting point for a fallback chain and use that chain to find a label, then limit that to the user's chosen languages, and then figure out which language transform to use. If everything fails, we could then try all languages in the user's preference list, and if they all fail, we could fall back to using the item identifier itself.

The problem with that is basically workload: not only must the page be generated, but it will also be a user-specific page. As we have now more or less decided to turn off caching, the last point is really not an issue.

For now the user's own language is used as the starting point of the fallback chain in RecentChanges (and other places), but that could be changed to the user's preferred languages. It is not clear how the preferred languages can be turned into a unique ordered list. In the wbgetitems API call the supplied languages are tried first, and then the fallbacks are tried if the flag "fallback" is set. By setting the language with "uselang", starting points other than English can be used.

Reedy added a comment. Nov 3 2012, 3:02 PM

Note, the lack of fallbacks also causes problems with main pages in en-gb, pt-br, etc.

See https://www.wikidata.org/w/index.php?title=User_talk:Reedy&oldid=310868h

jeblad added a comment. Nov 3 2012, 3:37 PM

We also need a way to set up project-specific language fallbacks, as our fallback chains may not be what other projects would prefer.

(In reply to comment #3)

Note also that some languages might have circular fallback, for example simple → en and en → simple.

Also: pt → pt-br → pt

(In reply to comment #5)

Here we may well have the case that an item has a label in simple English but not in English.

In the MW interface we can also have messages translated in pt but not in pt-br (and then pt should be used), or translated in pt-br but not in pt (and then pt-br should be used).

(In reply to comment #9)

chain in RecentChanges (and other places), but that could be changed to the
users preferred languages. It is not clear how the preferred languages can be
turned into a unique ordered list. In the wbgetitems API-call the supplied

By "preferred languages", do you mean the ones defined in the "translate-pref-editassistlang" field
https://www.wikidata.org/wiki/Special:Preferences?uselang=qqx#mw-prefsection-editing
? (that comes from [[mw:Extension:Translate]] IIRC)

jeblad added a comment. Nov 4 2012, 2:25 AM

Yes, we know about circular references.
A number of examples are given; they are for variants of English but could equally well be for a number of other languages.
There is a set of preferred languages that isn't included in Phase I. Basically it is a list of languages the user has flagged a special interest in, so they are made visible or used as labels and so forth.

*** Bug 43321 has been marked as a duplicate of this bug. ***

Bump.

When real Wikidata started, my default interface was Serbian but I had to switch it to English because Wikidata in Serbian was unreadable and unusable. Everywhere you go, you see Q1234567 labels that are meaningless to you. Worse: you could see an entire item filled with statements like "P21 of this item is Q44148" that are even more meaningless.

I suggest that language fallback be temporarily implemented in a simplified form: display the label in the user interface language. If it doesn't exist, display the English label in angle brackets. If that doesn't exist, display nothing. This should work in 99.99% of cases, Wikidata will be more understandable to non-English readers, it will be easier for editors to enter non-English labels, and more people will use non-English interface languages. Full language fallback could be implemented later.
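The simplified rule proposed here is small enough to write down directly. The Python sketch below is only an illustration of the suggestion: display_label is a hypothetical name, and the use of plain "<" and ">" characters for the angle brackets is an assumption.

```python
from typing import Dict


def display_label(labels: Dict[str, str], ui_lang: str) -> str:
    """Simplified fallback: UI language, else English in angle brackets,
    else nothing (an empty string)."""
    if ui_lang in labels:
        return labels[ui_lang]
    if "en" in labels:
        # The brackets signal that the label is not in the UI language.
        return "<" + labels["en"] + ">"
    return ""


if __name__ == "__main__":
    print(display_label({"en": "Berlin"}, "sr"))
```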

Qgil added a comment. Apr 23 2013, 9:12 PM

Just a note to say that Liangent has applied to GSoC with a proposal related to this report. Good luck!

https://www.mediawiki.org/wiki/User:Liangent/wb-lang

Qgil added a comment. Sep 17 2013, 4:16 PM

GSoC "soft pencils down" date was yesterday and all coding must stop on 23 September. Has this project been completed?

(In reply to comment #21)

GSoC "soft pencils down" date was yesterday and all coding must stop on 23
September. Has this project been completed?

Server-side (PHP) changes are almost done now. I stopped creating new pieces a little earlier to focus on amending existing code and getting it merged these days. All client-side stuff (JavaScript) is still TODO.

*** Bug 41495 has been marked as a duplicate of this bug. ***
Qgil added a comment. Oct 22 2013, 7:36 PM

If you have open tasks or bugs left, one possibility is to list them at https://www.mediawiki.org/wiki/Google_Code-In and volunteer yourself as mentor.

We have heard from Google and free software projects participating in Code-in that students in this program have done great work finishing and polishing GSoC projects, many times mentored by the former GSoC student. The key is to be able to split the pending work into small tasks.

More information in the wiki page. If you have questions you can ask there or you can contact me directly.

*** Bug 59151 has been marked as a duplicate of this bug. ***
*** Bug 60761 has been marked as a duplicate of this bug. ***
*** Bug 37461 has been marked as a duplicate of this bug. ***
*** Bug 66333 has been marked as a duplicate of this bug. ***
Lydia_Pintscher closed this task as Resolved. Nov 27 2014, 3:58 PM
Lydia_Pintscher claimed this task.
matej_suchanek set Security to None.
Restricted Application added a project: Wikidata. Nov 20 2015, 6:00 PM