RFC: make Parser::getTargetLanguage aware of multilingual wikis
Open, High, Public

Description

Context:
Some wikis need the ability to localize page output to the user's preferred language. A good example would be Wikimedia Commons, which uses the Translate extension to localize help and policy pages, and the {{int:...}} parser function to localize file description pages. These mechanisms also cause the parser cache to be split by user language.

Currently, core lacks a mechanism that would allow extensions to know in what language a page is currently being rendered. Parser::getTargetLanguage and ParserOptions::getTargetLanguage currently return the wiki's content language in nearly all cases; they give no indication of whether the content is actually being localized, and by themselves have no impact on whether the parser cache gets split.
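To illustrate the cache-splitting point, here is a minimal sketch (all names are invented for illustration; this is not MediaWiki's actual cache-key code) of why a parser cache key must include the target language whenever the rendering depends on it, and only then:

```python
# Hypothetical sketch: a parser cache key that is split by target language
# only when the rendered output actually depends on it. Function and key
# names are invented for illustration.

def parser_cache_key(page_id, content_language, target_language=None):
    """Return a cache key; split by language only for localized output."""
    key = f"pcache:{page_id}"
    # If the page was rendered for a specific target language (e.g. via
    # {{int:...}} or the Translate extension), the key must reflect it,
    # otherwise one user's localized rendering is served to everyone.
    if target_language is not None and target_language != content_language:
        key += f":lang={target_language}"
    return key

# Unlocalized page: all users share one cache entry.
assert parser_cache_key(42, "en") == "pcache:42"
# Localized page: one entry per target language.
assert parser_cache_key(42, "en", "de") == "pcache:42:lang=de"
```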

Proposal:

  • It should be possible for a wiki to specify that pages should be shown in the user language (could be done globally for all pages, or by namespace, or triggered by a magic word on the page). When rendering a page that is multilingual, ParserOptions::getTargetLanguage is set to the user language.
  • The parser cache, the {{int:...}} function, and other functionality that may depend on the user language, like formatting code for wikidata statements, would rely on ParserOptions::getTargetLanguage to tell them what language to use.
  • Parser::getTargetLanguage should return ParserOptions::getTargetLanguage unchanged; the logic currently in Parser::getTargetLanguage should be migrated to whatever code sets the target language in the options.
  • Possibly drop Title::getDisplayLanguage and ContentHandler::getDisplayLanguage completely, or move the functionality elsewhere (maybe into the Language object).
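The selection logic proposed above could be sketched roughly as follows (a hedged illustration only: the function, the `__MULTILINGUAL__` magic word, and the per-namespace configuration shape are invented; only the File namespace ID 6 is real MediaWiki):

```python
# Hypothetical sketch of the proposed selection logic: decide whether a page
# is "multilingual" (globally, per namespace, or via a magic word) and, if
# so, render it in the user language instead of the wiki's content language.

NS_FILE = 6  # MediaWiki's File namespace ID

def resolve_target_language(namespace, page_text, user_language,
                            content_language,
                            multilingual_namespaces=(NS_FILE,),
                            all_multilingual=False):
    multilingual = (
        all_multilingual
        or namespace in multilingual_namespaces
        or "__MULTILINGUAL__" in page_text  # invented magic word
    )
    return user_language if multilingual else content_language

# File pages on a Commons-like wiki render in the user's language...
assert resolve_target_language(NS_FILE, "description", "fr", "en") == "fr"
# ...while ordinary article pages stay in the content language.
assert resolve_target_language(0, "article text", "fr", "en") == "en"
```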

See also:

daniel created this task.Oct 5 2015, 10:53 AM
daniel added a project: ArchCom-RfC.
daniel added a subscriber: daniel.
Restricted Application added subscribers: Steinsplitter, Aklapper. Oct 5 2015, 10:53 AM
cscott added a comment.Oct 5 2015, 3:25 PM

What about pages which contain text in multiple languages? Can we add "current" to the description of the target language? Just like the HTML 'lang' attribute, which affects nested content inside the element, it would be nice to set the expectation that extensions can access the "current language" of the specific point in the page where the extension is referenced. Does that make sense?

daniel added a comment.Oct 5 2015, 4:01 PM

@cscott We should be careful to distinguish between:

  1. the user's preferred language.
  2. the page content language, i.e. the language of the text actually stored on the page.
  3. the display language, i.e. the language of the text actually shown to the user.

getTargetLanguage() should probably return (3), the desired display language, which would be derived from (1) and (2) somehow:

In the trivial case, all three are the same.

For the case of the Translate extension as used on Commons, the display language would be determined from the user language.

When looking at MediaWiki:Foo/fr however, the target/display language is fr, because the interface message magic takes precedence.

For the case of transliteration, the actual display language may be a compromise derived from the page content language and the user's preferred language.

For pages that have sections in different languages, the page content language and the display language could differ in theory by section. This would probably be modeled best by having separate Content objects and separate Parser and ParserOptions objects for each such section, i.e. this would need some kind of composite content model.
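The cases above could be sketched as a small derivation function (a hedged illustration; the function name and the `uses_translate` flag are invented, and real title/language handling is of course more involved):

```python
# Hypothetical sketch of deriving the display language (3) from the user
# language (1) and the page content language (2), following the cases above.

def display_language(title, user_language, page_content_language,
                     uses_translate=False):
    # MediaWiki:Foo/fr: the interface-message subpage suffix takes
    # precedence over everything else.
    base, sep, suffix = title.rpartition("/")
    if title.startswith("MediaWiki:") and sep and suffix.isalpha():
        return suffix
    # Translate-style multilingual pages follow the user language.
    if uses_translate:
        return user_language
    # Trivial case: show the page in the language it is written in.
    return page_content_language

assert display_language("MediaWiki:Foo/fr", "de", "en") == "fr"
assert display_language("Help:Contents", "de", "en", uses_translate=True) == "de"
assert display_language("Main_Page", "de", "en") == "en"
```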

cscott added a comment.Oct 5 2015, 6:02 PM

@daniel sounds right.

For pages that have sections in different languages, the page content language and the display language could differ in theory by section. This would probably be modeled best by having separate Content objects and separate Parser and ParserOptions objects for each such section, i.e. this would need some kind of composite content model.

I'm just suggesting that when we document this API we be explicit that it's the (display) language for the current section. Even if we don't actually support this yet in practice, let's define this API to Do The Right Thing when that happens.

daniel moved this task from Inbox to Request IRC meeting on the ArchCom-RfC board.Oct 14 2015, 8:32 PM

Does Parsoid have a similar concept?

Restricted Application added a subscriber: StudiesWorld. Nov 4 2015, 9:57 PM
daniel edited the task description. Nov 10 2015, 7:08 PM

This has been scheduled for discussion on IRC #wikimedia-office on November 11, 22:00 UTC (2pm PST), see E89: RFC Meeting: Parser::getTargetLanguage / PageRecord (2015-11-11)

@Spage no, Parsoid does not (yet). It will need to add such support when we implement <translate> and/or LanguageConverter support.

Copying some discussion from https://lists.wikimedia.org/pipermail/wikitech-l/2015-November/083932.html:

I believe the title language support is for the LanguageConverter extension. They used to (ab)use the {{DISPLAYTITLE:title}} magic word in order to use the proper language variant, something like:

{{DISPLAYTITLE:-{en-us:Color; en-gb:Colour}-}}

Then support was added to avoid the need for this hack, and just Do The Right Thing. I don't know the details, but presumably Title::getDisplayLanguage is part of it.

Then Brian Wolff <bawolff@gmail.com> wrote:

(As an aside, TOC really shouldn't split parser cache imo, and that's something I'd like to fix at some point [...])

Then you'll be interested in taking a look at T114057: Refactor table of contents.

daniel edited the task description. Nov 11 2015, 8:09 PM

Some more points from the discussion at https://lists.wikimedia.org/pipermail/wikitech-l/2015-November/083932.html:

  • Extension:SIL can use the PageContentLanguage hook to override the language used to render the page.
  • Anomie notes that it's unclear how links should be tracked for renderings of different languages
  • CScott notes that Title::getDisplayLanguage is probably used by the LanguageConverter extensions to avoid hacks like {{DISPLAYTITLE:-{en-us:Color; en-gb:Colour}-}}

There are three other ways that variant information can be specified, which shouldn't be broken:

GWicke added a subscriber: GWicke.Nov 11 2015, 10:51 PM

Could you describe how you would avoid cache and storage fragmentation

  • in RESTBase HTML storage,
  • in our CDN infrastructure?

Extension:SIL can use the PageContentLanguage hook to override the language used to render the page.

Content language can also be altered by Special:PageLanguage in core and by the Translate extension, probably others.

This was discussed at the RFC meeting on IRC on November 11. Minutes from https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-11-11-21.59.html:

  • $wgContLang and ParserOptions::getUserLangObj() would be pretty much unused (DanielK_WMDE, 22:13:27)
  • for variants + Ex:Translate + Ex:Wikibase: instead of getting the desired output language from global state, it should be possible to get it from Parser and/or ParserOptions (DanielK_WMDE, 22:15:52)
  • <aude> afaik, there is inconsistency when {{int}} is used in how the cache is split vs. getTargetLanguage (DanielK_WMDE, 22:15:53)
  • use page language for localized parser function names, etc (DanielK_WMDE, 22:17:12)
  • IDEA: need to decide whether to deprecate $wgContLang and ParserOptions::getUserLangObj() (robla, 22:18:37)
  • https://lists.wikimedia.org/pipermail/wikitech-l/2015-November/083932.html (DanielK_WMDE, 22:18:38)
  • <TimStarling> link trails and prefixes fall into the same category (DanielK_WMDE, 22:18:57)
  • transcluding pages into pages with a different page language could cause confusion wrt parser function names, etc (DanielK_WMDE, 22:24:07)
  • IDEA: distinguish between requested and effective target language. use *effective* target language when calling parser functions. (DanielK_WMDE, 22:25:56)
  • for Ex:Translate, Foo would be user language, but Foo/de would be page language, which would not be content language, but overwritten by the suffix. (DanielK_WMDE, 22:35:14)
  • A request like https://zh.wikipedia.org/w/index.php?title=%E7%A7%91%E5%AD%A6&variant=zh-tw&uselang=fr should result in the effective target language being zh-tw (for content language zh). It should not become fr, and not default to zh. (DanielK_WMDE, 22:39:48)
  • https://zh.wikipedia.org/zh-cn/%E7%A7%91%E5%AD%A6 is another way of writing an explicit 'variant=zh-cn' parameter, and should also be supported on zhwiki (cscott, 22:48:13)

Personal take-away from the discussion:

  • Agreement that we should not use global state to determine the output language when generating HTML. So it has to come from Content(Handler) and ParserOptions somehow.
  • Agreement that this is a good idea in general, but we should be careful not to break things. Nobody is sure what might break. Translate, ContentTranslation, and Variants are prime candidates for breakage, but no immediate issue was identified.
  • Variant selection is done with the variant parameter, not via uselang. That should probably be consolidated.
    • Variant selection is currently not reflected by $wgLang/RequestContext::getLanguage(), but probably should be.
  • We must keep apart the following:
    • the wiki's content language (site-wide default)
    • the user's interface language (possibly overwritten via uselang)
    • a page's content language (the language the page is written in; typically, but not always, the wiki's content language)
    • the desired display language for a page request (typically, but not always, the user language)
    • the effective display language of the page
  • ParserOptions::getUserLangObj() should probably go away

From this follows:

  • ParserOptions::getTargetLanguage() should return the desired output language. The desired language will usually be the wiki's content language, unless a page is considered "multilingual".
  • A page is considered "multilingual" either by virtue of its content model, or by per-namespace configuration (e.g. the File namespace on Commons)
  • Parser::getTargetLanguage() should return the effective output language.
  • ContentHandler::getPageLanguage() should return the page's content language.
  • ContentHandler::getPageViewLanguage() should be re-purposed to determine the effective target language based on the page's content language and the desired target language.
  • How page content language and desired target language are interpolated to form the effective target language depends on the content model:
    • for Wikibase entity pages, the content language is irrelevant, so the desired language would be the effective target language.
    • for regular wikitext pages, the content language is dominant, but automatic translation/transliteration can be applied to get closer to the desired output language, if such a translation is supported.
    • for multilingual wikitext pages (e.g. on Commons), the content language is ignored, so the desired language would be the effective target language.
    • for system messages in the MediaWiki namespace, the content language is defined by the title suffix, and the desired target language is ignored.
Next steps: write code that a) sets ParserOptions::getTargetLanguage, and b) makes Parser::getTargetLanguage use ContentHandler to determine the effective language.
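The per-content-model rules above could be sketched as follows (a hedged illustration: the model names, function, and the variant check are simplified stand-ins, not MediaWiki's actual ContentHandler API):

```python
# Hypothetical sketch of combining the page content language and the desired
# (requested) target language into the effective target language, depending
# on the content model. Model names are invented for illustration.

def effective_target_language(model, page_language, desired_language,
                              supported_variants=()):
    if model in ("wikibase-item", "multilingual-wikitext"):
        # Content language is irrelevant or ignored: follow the request.
        return desired_language
    if model == "system-message":
        # MediaWiki:Foo/fr: the title suffix (page language) wins.
        return page_language
    # Regular wikitext: content language dominates, unless a variant
    # (translation/transliteration) of it matches the request.
    if desired_language in supported_variants:
        return desired_language
    return page_language

assert effective_target_language("wikibase-item", "en", "de") == "de"
assert effective_target_language("system-message", "fr", "de") == "fr"
assert effective_target_language("wikitext", "zh", "zh-tw",
                                 supported_variants=("zh-tw", "zh-cn")) == "zh-tw"
assert effective_target_language("wikitext", "en", "de") == "en"
```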

Variant selection is done with the variant parameter, not via uselang. That should probably be consolidated.

Why? Variants are about content language, not interface language.

the desired display language for a page request (typically, but not always, the user language)

This is also handled by $wgTranslatePageTranslationULS and compact interwikis, see there for todos.

ParserOptions::getTargetLanguage() should return the desired output language.

Sounds error prone, see above.

for system messages in the MediaWiki namespace, the content language is defined by the title suffix, and the desired target language is ignored.

I don't see how this is different from the general case. It's the same in Translate, where each translation page, stored at a language-code subpage, has a content language equal to said language code. The content model doesn't seem to be the deciding factor.

The Wikidata team decided to work on this until the end of the year. We'll try to propose core patches, and discuss them at the summit.

cscott added a comment.Jan 5 2016, 7:34 AM

@daniel -- do you have any slides or materials you want to present for this at the session tomorrow?

cscott added a comment. Edited Jan 6 2016, 5:56 PM

Strawman proposal, based on conversations at the session and after:

  1. Have the PHP parser track the language specified by <... lang="xxx"> tags in the source, and return the appropriate language code from getTargetLanguage. (If it is too hard to parse these, we can introduce some easier-to-parse form like {{#lang:foo}} but I'd prefer to use the markup which is already present in our content if we can.)
  2. Define a nonstandard language code to be used for "the current user interface language". Something like "x-ui", constructed so that it will never conflict with a valid HTML5 language code. This language code would be replaced on the fly in the parser with the appropriate language.
  3. Either change {{int:...}} to respect the current target language of the parser and localize the message returned, or if that breaks too much existing stuff, then introduce a new parser function (say, {{intx:...}}) which does so. (If we don't change {{int}}, then {{int}} should probably expand to <span lang="x-ui">{{intx:...}}</span> for consistency, and so that the output HTML is appropriately marked up.)

I *think* that is a complete solution to the problem. It may not be the best solution. It may not even be a complete solution. Help me improve/replace it.

Things I like about this: the <span lang="foo"> markup needed to make this work ends up in the resulting HTML. Various languages depend on proper language tagging of the output to (for example) select the proper shaper for rendering Arabic script, or to perform word breaking correctly. So tagging the output is a good idea, and doing so makes message localization "Just Work".

@tstarling It seems to me like step (1) above is possible, but you understand the PHP parser better than I do. Can you see any showstoppers? The only thing that worries me is out-of-order parsing, which would make it impossible to use a simple stack to maintain the current target language as we parse.
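The simple stack from step (1) plus the "x-ui" substitution from step (2) could be sketched like this (a hedged illustration; the class name is invented, and the real parser would drive push/pop from the markup). Note that "x-" is the private-use prefix in BCP 47, so "x-ui" can never collide with a registered language tag:

```python
# Hypothetical sketch: a stack tracks the language in effect at the current
# point of the parse, and the reserved "x-ui" code is replaced on the fly
# with the user's interface language. Class and method names are invented.

class LanguageTracker:
    def __init__(self, page_language, ui_language):
        self.ui_language = ui_language
        self.stack = [page_language]

    def push(self, lang):
        # "x-" is the BCP 47 private-use prefix, so "x-ui" is safe to reserve.
        self.stack.append(self.ui_language if lang == "x-ui" else lang)

    def pop(self):
        self.stack.pop()

    def current(self):
        # What getTargetLanguage would return at this point of the parse.
        return self.stack[-1]

t = LanguageTracker(page_language="en", ui_language="de")
assert t.current() == "en"
t.push("ar")       # entering e.g. <span lang="ar">
assert t.current() == "ar"
t.push("x-ui")     # entering {{int:...}}-style UI content
assert t.current() == "de"
t.pop(); t.pop()
assert t.current() == "en"
```

Out-of-order expansion would break this, as noted: a plain stack only works if content is processed in document order.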

cscott added a comment.Jan 6 2016, 6:39 PM

@tstarling and @daniel think that it would be too hard to properly parse html-style tags in the preprocessor (in particular, finding the "matching" close tag so that the language stack is properly maintained).

So let's adjust the strawman to use {{#lang:foo|....content....}} for now, which should expand to <span lang="foo">...parsed content...</span> in addition to setting the current parser target language. We might need to tweak this a little more to allow generating <div lang="...">...</div> as well, sigh. Please help me paint that bikeshed.
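A minimal sketch of what such a parser function's output side could look like (hedged: the function name and signature are invented, and the real implementation would also push/pop the parser's target language around expanding the content):

```python
# Hypothetical sketch of the adjusted strawman: a #lang parser function that
# wraps its already-expanded content in a lang-tagged element. A real
# implementation would also set the parser's current target language while
# the content argument is being expanded.

import html

def lang_parser_function(lang, content, block=False):
    tag = "div" if block else "span"
    return f'<{tag} lang="{html.escape(lang, quote=True)}">{content}</{tag}>'

assert lang_parser_function("fr", "Bonjour") == '<span lang="fr">Bonjour</span>'
assert (lang_parser_function("de", "<p>Hallo</p>", block=True)
        == '<div lang="de"><p>Hallo</p></div>')
```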

T114432: [RFC] Heredoc arguments for templates (aka "hygenic" or "long" arguments) could eventually be used to make it easier to include ...content... without having to deal with template-argument-escaping the content.

daniel added a comment.Feb 5 2016, 3:11 PM

Related notes from the developer summit: T119022#1916790

Rather than {{#lang}}, what about using {{#tag:span|content|lang=de}} / {{#tag:span|content|lang=x-ui}}?

Benefit: no new parser functions added, just some extra code added to #tag processing to maintain the language stack.

Disadvantage: probably not as clear to read as a dedicated {{#lang}} tag; also content comes first and the language code comes last, which may be unexpected.

hoo added a comment.Feb 15 2016, 11:21 AM

I wasn't aware of T69223… has that been considered here and also for multilingual pages?

I've been knee-deep in the PHP parser recently, so I might have a better handle on how to implement some of the ideas I presented above. Would it be worth prototyping the {{#lang}} or {{#tag:span}} options above, to see if they can actually work? Often I learn new things about the problem domain by trying to actually implement something.

In my opinion it is worth trying, but only you know whether it would take you away from other important work.

RobLa-WMF mentioned this in Unknown Object (Event).May 4 2016, 7:33 PM
Scott_WUaS edited the task description. May 25 2016, 9:44 PM
Scott_WUaS added a subscriber: Scott_WUaS.
RobLa-WMF triaged this task as "High" priority.Jun 8 2016, 7:03 PM
RobLa-WMF added a subscriber: RobLa-WMF.

Belated priority update discussed in E187: RFC Meeting: triage meeting (2016-05-25, #wikimedia-office) (see log at P3179)

Change 295549 had a related patch set uploaded (by Daniel Kinzler):
[WIP] improve semantics of Parser::getTargetLanguage.

https://gerrit.wikimedia.org/r/295549

Recapping some old discussion in E168: RFC Meeting: Support language variants in the REST API (2016-04-27, #wikimedia-office), there's the question of whether the "target language" and "user interface language" need to be distinct and/or specified separately. My strawman example is a user on zhwiki who has a target variant set to zh-hant but has the UX language (image metadata labels, {{int}} output, page UI) set to, say, de.

daniel moved this task from Inbox to To Do on the User-Daniel board.Jan 5 2017, 7:02 PM