
make Parser::getTargetLanguage aware of multilingual wikis
Open, High, Public

Description

Context:
Some wikis need the ability to localize page output to the user's preferred language. A good example would be Wikimedia Commons, which uses the Translate extension to localize help and policy pages, and the {{int:...}} parser function to localize file description pages. These mechanisms also cause the parser cache to be split by user language.

Currently, core lacks a mechanism that would allow extensions to know in what language a page is currently being rendered. Parser::getTargetLanguage and ParserOptions::getTargetLanguage currently return the wiki's content language in nearly all cases, give no indication of whether the content is actually being localized, and by themselves have no impact on whether the parser cache gets split.

Proposal:

  • It should be possible for a wiki to specify that pages should be shown in the user language (could be done globally for all pages, or by namespace, or triggered by a magic word on the page). When rendering a page that is multilingual, ParserOptions::getTargetLanguage is set to the user language.
  • The parser cache, the {{int:...}} function, and other functionality that may depend on the user language, like formatting code for wikidata statements, would rely on ParserOptions::getTargetLanguage to tell them what language to use.
  • Parser::getTargetLanguage should return ParserOptions::getTargetLanguage unchanged; the logic currently in Parser::getTargetLanguage should be migrated to whatever code sets the target language in the options.
  • Possibly drop Title::getDisplayLanguage and ContentHandler::getDisplayLanguage completely, or move the functionality elsewhere (maybe into the Language object).
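As a rough illustration of the first two bullets, the sketch below models how a target language could be chosen once per page and then folded into the parser cache key. It is written in Python rather than MediaWiki PHP, and every name in it (`ParserOptions`, `MULTILINGUAL_NAMESPACES`, `is_multilingual`, `make_options`, `cache_key`) is hypothetical, not an existing core API.

```python
from dataclasses import dataclass

# Hypothetical per-namespace configuration: namespaces whose pages are
# rendered in the user's language (e.g. the File namespace on Commons).
MULTILINGUAL_NAMESPACES = {"File"}


@dataclass(frozen=True)
class ParserOptions:
    """Stand-in for the real options object; only carries the target language."""
    target_language: str


def is_multilingual(namespace: str, has_magic_word: bool = False) -> bool:
    """A page is multilingual by namespace config or by an on-page magic word."""
    return namespace in MULTILINGUAL_NAMESPACES or has_magic_word


def make_options(namespace: str, content_lang: str, user_lang: str,
                 has_magic_word: bool = False) -> ParserOptions:
    """Decide the target language once, when the options are built."""
    if is_multilingual(namespace, has_magic_word):
        return ParserOptions(target_language=user_lang)
    return ParserOptions(target_language=content_lang)


def cache_key(page_id: int, options: ParserOptions) -> str:
    """The cache is split only by the target language stored in the options."""
    return f"pcache:{page_id}:lang={options.target_language}"
```

Under these assumptions, a File page viewed by a German-speaking user gets its own cache entry, while a page in a non-multilingual namespace shares one entry in the content language regardless of who views it.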


Event Timeline


@daniel sounds right.

For pages that have sections in different languages, the page content language and the display language could differ in theory by section. This would probably be modeled best by having separate Content objects and separate Parser and ParserOptions objects for each such section, i.e. this would need some kind of composite content model.

I'm just suggesting that when we document this API we be explicit that it's the (display) language for the current section. Even if we don't actually support this yet in practice, let's define this API to Do The Right Thing when that happens.

Does parsoid have a similar concept?

This has been scheduled for discussion on IRC #wikimedia-office on November 11, 22:00 UTC (2pm PST), see E89: RFC Meeting: Parser::getTargetLanguage / PageRecord (2015-11-11)

@Spage no, Parsoid does not (yet). It will need to add such support when we implement <translate> and/or LanguageConverter support.

Copying some discussion from https://lists.wikimedia.org/pipermail/wikitech-l/2015-November/083932.html:

I believe the title language support is for the LanguageConverter extension. They used to (ab)use the {{DISPLAYTITLE:title}} magic word in order to use the proper language variant, something like:

{{DISPLAYTITLE:-{en-us:Color; en-gb:Colour}-}}

Then support was added to avoid the need for this hack, and just Do The Right Thing. I don't know the details, but presumably Title::getDisplayLanguage is part of it.

Then Brian Wolff <bawolff@gmail.com> wrote:

(As an aside, TOC really shouldn't split parser cache imo, and that's something I'd like to fix at some point [...])

Then you'll be interested in taking a look at T114057: Refactor table of contents.

Some more points from the discussion at https://lists.wikimedia.org/pipermail/wikitech-l/2015-November/083932.html:

  • Extension:SIL can use the PageContentLanguage hook to override the language used to render the page.
  • Anomie notes that it's unclear how links should be tracked for renderings of different languages
  • CScott notes that Title::getDisplayLanguage is probably used by the LanguageConverter extension to avoid hacks like {{DISPLAYTITLE:-{en-us:Color; en-gb:Colour}-}}

There are three other ways that variant information can be specified, which shouldn't be broken:

Could you describe how you would avoid cache and storage fragmentation

  • in RESTBase HTML storage,
  • in our CDN infrastructure?

Extension:SIL can use the PageContentLanguage hook to override the language used to render the page.

Content language can also be altered by Special:PageLanguage in core and by the Translate extension, probably others.

This was discussed at the RFC meeting on IRC on November 11. Minutes from https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-11-11-21.59.html:

  • $wgContLang and ParserOptions::getUserLangObj() would be pretty much unused (DanielK_WMDE, 22:13:27)
  • for variants + Ex:Translate + Ex:Wikibase: instead of getting the desired output language from global state, it should be possible to get it from Parser and/or ParserOptions (DanielK_WMDE, 22:15:52)
  • <aude> afaik, there is inconsistency when {{int}} is used in how the cache is split vs. getTargetLanguage (DanielK_WMDE, 22:15:53)
  • use page language for localized parser function names, etc (DanielK_WMDE, 22:17:12)
  • IDEA: need to decide whether to deprecate $wgContLang and ParserOptions::getUserLangObj() (robla, 22:18:37)
  • https://lists.wikimedia.org/pipermail/wikitech-l/2015-November/083932.html (DanielK_WMDE, 22:18:38)
  • <TimStarling> link trails and prefixes fall into the same category (DanielK_WMDE, 22:18:57)
  • transcluding pages into pages with a different page language could cause confusion wrt parser function names, etc (DanielK_WMDE, 22:24:07)
  • IDEA: distinguish between requested and effective target language. use *effective* target language when calling parser functions. (DanielK_WMDE, 22:25:56)
  • for Ex:Translate, Foo would be user language, but Foo/de would be page language, which would not be content language, but overwritten by the suffix. (DanielK_WMDE, 22:35:14)
  • A request like https://zh.wikipedia.org/w/index.php?title=%E7%A7%91%E5%AD%A6&variant=zh-tw&uselang=fr should result in the effective target language being zh-tw (for content language zh). It should not become fr, and not default to zh. (DanielK_WMDE, 22:39:48)
  • https://zh.wikipedia.org/zh-cn/%E7%A7%91%E5%AD%A6 is another way of writing an explicit 'variant=zh-cn' parameter, and should also be supported on zhwiki (cscott, 22:48:13)
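The last two points of the minutes can be sketched as follows. This is a Python model, not MediaWiki code: the `VARIANTS` table is a hypothetical subset, and the function names are made up for illustration. It treats `?variant=` and a `/<variant>/` path prefix as equivalent and deliberately ignores `uselang`.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical variant table: content language -> variants it can be shown in.
VARIANTS = {"zh": {"zh-cn", "zh-tw", "zh-hk", "zh-hans", "zh-hant"}}


def requested_variant(url: str, content_lang: str):
    """Return the variant named by ?variant= or by a /<variant>/ path prefix."""
    parts = urlsplit(url)
    known = VARIANTS.get(content_lang, set())
    query = parse_qs(parts.query)
    if "variant" in query and query["variant"][0] in known:
        return query["variant"][0]
    # A path such as /zh-cn/<title> is another spelling of variant=zh-cn.
    first_segment = parts.path.strip("/").split("/")[0]
    if first_segment in known:
        return first_segment
    return None


def effective_target_language(url: str, content_lang: str) -> str:
    """uselang is ignored on purpose: it only selects the interface language."""
    return requested_variant(url, content_lang) or content_lang
```

With this model, the `variant=zh-tw&uselang=fr` request from the minutes resolves to zh-tw, and a request without any variant falls back to the content language zh.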

Personal take-away from the discussion:

  • Agreement that we should not use global state to determine the output language when generating HTML. So it has to come from Content(Handler) and ParserOptions somehow.
  • Agreement that this is a good idea in general, but we should be careful not to break things. Nobody is sure what might break. Translate, ContentTranslation, and Variants are prime candidates for breakage, but no immediate issue was identified.
  • Variant selection is done with the variant parameter, not via uselang. That should probably be consolidated.
    • Variant selection is currently not reflected by $wgLang/RequestContext::getLanguage(), but probably should be.
  • We must keep apart the following:
    • the wiki's content language (site-wide default)
    • the user's interface language (possibly overwritten via uselang)
    • a page's content language (the language a page is written in - typically, but not always, the content language)
    • the desired display language for a page request (typically, but not always, the user language)
    • the effective display language of the page
  • ParserOptions::getUserLangObj() should probably go away

From this follows:

  • ParserOptions::getTargetLanguage() should return the desired output language. The desired language will usually be the wiki's content language, unless a page is considered "multilingual".
  • A page is considered "multilingual" either by virtue of its content model, or by per-namespace configuration (e.g. the File namespace on Commons)
  • Parser::getTargetLanguage() should return the effective output language.
  • ContentHandler::getPageLanguage() should return the page's content language.
  • ContentHandler::getPageViewLanguage() should be repurposed to determine the effective target language based on the page's content language and the desired target language.
  • How page content language and desired target language are interpolated to form the effective target language depends on the content model:
    • for Wikibase entity pages, the content language is irrelevant, so the desired language would be the effective target language.
    • for regular wikitext pages, the content language is dominant, but automatic translation/transliteration can be applied to get closer to the desired output language, if such a translation is supported.
    • for multilingual wikitext pages (e.g. on commons), the content language is ignored, so the desired language would be the effective target language.
    • for system messages in the MediaWiki namespace, the content language is defined by the title suffix, and the desired target language is ignored.
Next steps: write code that a) sets ParserOptions::getTargetLanguage and b) makes Parser::getTargetLanguage use ContentHandler to determine the effective language.
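To make the per-content-model rules above concrete, here is a minimal Python sketch of how page language and desired language might be combined. The content model labels (`wikibase-entity`, `multilingual-wikitext`, `system-message`, `wikitext`) and the `convertible` table are assumptions made for the example, not identifiers from MediaWiki.

```python
def effective_language(content_model: str, page_lang: str, desired_lang: str,
                       convertible=None) -> str:
    """Combine page language and desired language into the effective one.

    convertible maps a page language to the set of languages it can be
    automatically translated or transliterated into (hypothetical table).
    """
    convertible = convertible or {}
    if content_model in ("wikibase-entity", "multilingual-wikitext"):
        # Content language is irrelevant or ignored: the desired language wins.
        return desired_lang
    if content_model == "system-message":
        # Page language comes from the title suffix; desired language is ignored.
        return page_lang
    if content_model == "wikitext":
        # Content language dominates unless an automatic conversion exists.
        if desired_lang in convertible.get(page_lang, set()):
            return desired_lang
        return page_lang
    raise ValueError(f"unknown content model: {content_model}")
```

In this model an English wikitext page stays English for a German user, while a zh page can be served as zh-tw if a conversion from zh to zh-tw is registered.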

Variant selection is done with the variant parameter, not via uselang. That should probably be consolidated.

Why? Variants are about content language, not interface language.

the desired display language for a page request (typically, but not always, the user language)

This is also handled by $wgTranslatePageTranslationULS and compact interwikis, see there for todos.

ParserOptions::getTargetLanguage() should return the desired output language.

Sounds error prone, see above.

for system messages in the MediaWiki namespace, the content language is defined by the title suffix, and the desired target language is ignored.

I don't see how this is different from the general case. It's the same in Translate, where each translation page, stored at a language code subpage, has a content language equal to said language code. Content model doesn't seem to be the matter.

The Wikidata team decided to work on this until the end of the year. We'll try to propose core patches, and discuss them at the summit.

@daniel -- do you have any slides or materials you want to present for this at the session tomorrow?

Strawman proposal, based on conversations at the session and after:

  1. Have the PHP parser track the language specified by <... lang="xxx"> tags in the source, and return the appropriate language code from getTargetLanguage. (If it is too hard to parse these, we can introduce some easier-to-parse form like {{#lang:foo}} but I'd prefer to use the markup which is already present in our content if we can.)
  2. Define a nonstandard language code to be used for "the current user interface language". Something like "x-ui", constructed so that it will never conflict with a valid HTML5 language code. This language code would be replaced on the fly in the parser with the appropriate language.
  3. Either change {{int:...}} to respect the current target language of the parser and localize the message returned, or if that breaks too much existing stuff, then introduce a new parser function (say, {{intx:...}}) which does so. (If we don't change {{int}}, then {{int}} should probably expand to <span lang="x-ui">{{intx:...}}</span> for consistency, and so that the output HTML is appropriately marked up.)

I *think* that is a complete solution to the problem. It may not be the best solution. It may not even be a complete solution. Help me improve/replace it.

Things I like about this: the <span lang="foo"> markup needed to make this work ends up in the resulting HTML. Various languages depend on proper language tagging of the output to (for example) select the proper shaper to use to render Arabic script, or to perform word breaking correctly. So tagging the output is a good idea, and doing so makes message localization "Just Work".

@tstarling It seems to me like step (1) above is possible, but you understand the PHP parser better than I do. Can you see any showstoppers? The only thing that worries me is out-of-order parsing, which would make it impossible to use a simple stack to maintain the current target language as we parse.

@tstarling and @daniel think that it would be too hard to properly parse html-style tags in the preprocessor (in particular, finding the "matching" close tag so that the language stack is properly maintained).

So let's adjust the strawman to use {{#lang:foo|....content....}} for now, which should expand to <span lang="foo">...parsed content...</span> in addition to setting the current parser target language. We might need to tweak this a little more to allow generating <div lang="...">...</div> as well, sigh. Please help me paint that bikeshed.
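A minimal sketch of the adjusted strawman, in Python rather than parser PHP: a language stack is pushed while the content of `{{#lang:...}}` is rendered, and the result is wrapped in a tagged element. The class and method names are hypothetical, and resolving `x-ui` to the user's language before writing the `lang` attribute is an assumption about how the on-the-fly replacement would work.

```python
from html import escape

UI_PLACEHOLDER = "x-ui"  # the proposed "current user interface language" code


class TargetLanguageStack:
    """Tracks the current target language while nested content is rendered."""

    def __init__(self, page_lang: str, user_lang: str):
        self._stack = [page_lang]
        self._user_lang = user_lang

    def current(self) -> str:
        return self._stack[-1]

    def render_lang(self, code: str, render_content, tag: str = "span") -> str:
        """Model of {{#lang:code|content}}; render_content receives this stack."""
        resolved = self._user_lang if code == UI_PLACEHOLDER else code
        self._stack.append(resolved)
        try:
            inner = render_content(self)
        finally:
            # Always restore the outer language, even if rendering fails.
            self._stack.pop()
        return f'<{tag} lang="{escape(resolved, quote=True)}">{inner}</{tag}>'
```

The `tag` parameter stands in for the still-open question of how to request a `<div>` instead of a `<span>`. The sketch also assumes content is rendered strictly inside-out; it does not address the out-of-order parsing concern raised above.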

T114432: [RFC] Heredoc arguments for templates (aka "hygienic" or "long" arguments) could eventually be used to make it easier to include ...content... without having to deal with template-argument-escaping the content.

Related notes from the developer summit: T119022#1916790

Rather than {{#lang}}, what about using {{#tag:span|content|lang=de}} / {{#tag:span|content|lang=x-ui}}?

Benefit: no new parser functions added, just some extra code added to #tag processing to maintain the language stack.

Disadvantage: probably not as clear to read as a dedicated {{#lang}} tag; also content comes first and the language code comes last, which may be unexpected.

I wasn't aware of T69223… has that been considered here and also for multilingual pages?

I've been knee-deep in the PHP parser recently, so I might have a better handle on how to implement some of the ideas I presented above. Would it be worth prototyping the {{#lang}} or {{#tag:span}} options above, to see if they can actually work? Often I learn new things about the problem domain by trying to actually implement something.

In my opinion it is worth trying, but only you know whether it would take time away from other important work.

RobLa-WMF mentioned this in Unknown Object (Event). May 4 2016, 7:33 PM

Change 295549 had a related patch set uploaded (by Daniel Kinzler):
[WIP] improve semantics of Parser::getTargetLanguage.

https://gerrit.wikimedia.org/r/295549

Recapping some old discussion in E168: RFC Meeting: Support language variants in the REST API (2016-04-27, #wikimedia-office), there's the question of whether the "target language" and "user interface language" need to be distinct and/or specified separately. My strawman example is a user on zhwiki who has a target variant set to zh-hant but has the UX language (image metadata labels, {{int}} output, page UI) set to, say, de.

For the last decade or so, Wikimedia Commons has relied on the MediaWiki:Lang message to fetch the user's preferred language. This message can also be found on many other multilingual projects, like Wikidata, Meta-wiki, Wikispecies, MediaWiki, foundation, etc. The current standard interface is {{int:Lang}} in templates and frame:callParserFunction( "int", "lang" ) in Lua. From the perspective of someone writing templates and Lua code that use this mechanism, I do not see much need to change the current interface. Perhaps it would be nicer for all the wikis to use the same mechanism without needing to set it up separately on each wiki, with a subpage for each language, but I would prefer to stick to the current interface ({{int:Lang}}), as changing it would require updating a lot of templates and modules on a lot of wikis.

@Jarekt the proposal is not to remove {{int:lang}}, it's about how {{int:lang}} and similar things work internally.

daniel renamed this task from RFC: make Parser::getTargetLanguage aware of multilingual wikis to make Parser::getTargetLanguage aware of multilingual wikis. Jul 22 2019, 5:02 PM
daniel removed a project: TechCom-RFC.

Dropping from RFC board, since no RFC is needed.

One strawdog proposal is something like:

{{#wrapLang|<new-lang-code>|<content>}}

(which improves with heredocs, T114432)
which, in addition to properly setting the lang and dir tags on a <div> or <span> wrapper around the content, would also reset the Parser::getTargetLanguage() when parsing the content.

Using the special string user for the <new-lang-code> would set the target language to the user's UX language (whatever that is).

To go a step further, a new parser function named something like {{#int2|<msg name>}} (sorry, please bikeshed the name) would return the given message in the current *target language*. Then you can redefine {{int}} to be equivalent to {{#wrapLang|user|{{#int2|<msg name>}}}}.

That provides a more-or-less consistent definition of Parser::getTargetLanguage() as "the language we are right now emitting parsed content in", and we can actively discourage content/extensions from emitting content in a different language without first setting the parser target language appropriately.

If we don't change {{int}}, then {{int}} should probably expand to <span lang="x-ui">{{intx:...}}</span> for consistency, and so that the output HTML is appropriately marked up.

It shouldn’t: for example, {{int:lang}}, mentioned above by @Jarekt, should not emit any HTML – it’s used within HTML attributes, so extra markup would break things badly.

One strawdog proposal is something like:

{{#wrapLang|<new-lang-code>|<content>}}

(which improves with heredocs, T114432)
which, in addition to properly setting the lang and dir tags on a <div> or <span> wrapper around the content, would also reset the Parser::getTargetLanguage() when parsing the content.

I think there should be a version that doesn’t emit any HTML: for example, if the outermost element of a template is a table, built using wikitext syntax, you cannot just replace it with the parser function. In this case, a clean solution would be wrapping the table in a parser function that only sets the parser language (using heredoc for the parser function parameter to avoid having to escape the pipes of the table syntax), and manually setting the language on the table itself.

Even for less complex cases, being able to set the element (mostly span or div, but really anything) and attributes (class and style quite often, but also others) is important. Those could fit in the parser function syntax, but they don’t happen automatically.


Maybe it should be a parser tag rather than a parser function? While parser tags usually return content that’s transparent to templates processing the result, they can return [0 => $content, 'markerType' => 'none'] to avoid this behavior. Using a parser tag would:

  • Avoid the issues heredoc tries to address. As long as no </wraplang> appears in the content, it’s safe to put whatever content we want into it, including pipes and curly braces.
  • Allow specifying arbitrary attributes using a natural syntax.
  • Like parser functions and unlike regular HTML, still be processed early (templates processing the result would see the content already parsed in the right context) and not be implicitly closed (so we don’t have to worry about the content running to the end of the page because the editor forgot to close the tag).

So my proposal would be

<wraplang tag="div" lang="als" class="my-cool-class" style="float:right">
Alemannisch <!-- or toskërishtja? should `lang` be interpreted as a MediaWiki or as a BCP-47 language code? -->
</wraplang>

About the attributes,

  • lang should be required (a language code – it needs to be decided whether MediaWiki or BCP-47);
  • tag should be optional (an HTML element name allowed by MediaWiki) – if not specified, no HTML should be output (for the table example above);
  • dir should probably be forbidden (the parser tag handles it automatically);
  • all other attributes should be
    • allowed, optional, and simply forwarded to the resulting HTML tag if tag is specified;
    • forbidden if tag isn’t specified.
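The attribute rules above can be sketched as a small validator. This is a Python model with made-up names, not an implementation of the proposed tag: `ALLOWED_TAGS` and `RTL_LANGS` are hypothetical stand-ins for MediaWiki's allowed-element list and language direction data, the content is assumed to be already parsed, and setting the parser's target language while parsing is not modeled.

```python
from html import escape

ALLOWED_TAGS = {"div", "span", "p", "table"}  # hypothetical subset
RTL_LANGS = {"ar", "he", "fa", "ur"}          # hypothetical subset


def render_wraplang(attrs: dict, parsed_content: str) -> str:
    """Apply the proposed <wraplang> attribute rules to already-parsed content."""
    attrs = dict(attrs)  # do not mutate the caller's dict
    if "lang" not in attrs:
        raise ValueError("lang is required")
    if "dir" in attrs:
        raise ValueError("dir is set automatically and may not be given")
    lang = attrs.pop("lang")
    tag = attrs.pop("tag", None)
    if tag is None:
        # No wrapper requested: emit no HTML, so extra attributes are an error.
        if attrs:
            raise ValueError("other attributes are only allowed together with tag")
        return parsed_content
    if tag not in ALLOWED_TAGS:
        raise ValueError(f"element not allowed: {tag}")
    direction = "rtl" if lang.split("-")[0] in RTL_LANGS else "ltr"
    html_attrs = {"lang": lang, "dir": direction, **attrs}
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in html_attrs.items()
    )
    return f"<{tag} {rendered}>{parsed_content}</{tag}>"
```

For brevity the sketch forwards attribute names unchecked; real code would need to sanitize them as it does for ordinary HTML attributes in wikitext.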