Page MenuHomePhabricator

Consider removing global $userLang from onPageContentLanguage hook
Closed, ResolvedPublic

Description

The value of Content->getPageLanguage() is currently not deterministic and thus means that its exposure in external APIs and page metadata exposed to client-side JS may be different based on user session and query parameters. I think it would be a significant improvement if ContentHandler did not rely on global state.

That leaves two options:

  1. Make this ContentHandler method explicitly depend on, and dependency-inject and pass down everywhere, a RequestContext when accessing the page language. This would require a lot of complexity and separation for some of the consumers where I think the logical model of the code would be best to not separate out a user-specific version of all those services.
  2. Deprecate and remove this parameter.

I'm proposing the latter.

\cc PE, @daniel As maintainer of ContentHandler; What are you thoughts on this?
\cc WikimediaIncubator, @Ebe123 As maintainer of the affected WikimediaIncubator extension. Could you describe what this is used for currently? This will help figuring out how we can best support that in a different way going forward.

The entry point for this problem seems to be ContentHandler::getPageLanguage(), […]. This method does not logically depend on the context/user language, except for where it passes $wgLang as third parameter to HookRunner::onPageContentLanguage.

This seems like a bad idea and not something that is imho reasonable to support in MediaWiki, and incompatible with the general model of how this feature works.

For one, it would break any assumption that this is persistable to the database.

I feared the Translate extension might depend on thit, but from a Codesearch query we find it is actually not recongising this parameter at all. Instead, it is LiquidThreads and WikimediaIncubator using these to make some of their wiki pages act like a special page, in that they automagically render in the UI language. This is strange since we already expose the use language to the parser and allow it to be used. Rather, the hook is additionally forcing the internal page language to be perceived as if it were the current user language. Perhaps in the short-mid term we can find a different way to support the underlying need there.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I am hoping that a resolution here can lead to a resolution of T161976 for which the fix was reverted due to T298659 and ultimately resulted in this issue. I believe if page content is available during the rendering of another page that the purported content language of the available page content should also be available during the page render (much like the content model already is).

[…] from a Codesearch query […]

Please note that in the old (static method-based) hook system you can’t count on event handlers having any particular name. Instead, you should grep for the hook name itself as well, i.e. without the on prefix. In this case, this brings no additional problematic extensions, but the CirrusSearch extension (rECIR tests/jenkins/Jenkins.php) – which recognizes this parameter, but ignores it – wasn’t caught by your query.

@Krinkle Sorry, I didn't notice that there was a question for WikimediaIncubator maintainers here (I'm the only currently active maintainer, though I wasn't yet when the task was created).

The only thing it is used for is to set the page language for info pages (such as Wp/crh) to the user's interface language. I'm not sure what the point of that is, actually. The template on the page is localized to your user language, but that doesn't depend on the page language being set (but rather just on {{int:}} syntax). So for our usecase, it would be unproblematic to remove $userLang from this hook. (It would need a change in the extension code though, but that change would be unproblematic.)

@jhsoby If it can be removed, that would make things a lot simpler for us. Thanks!

The only thing it is used for is to set the page language for info pages (such as Wp/crh) to the user's interface language. I'm not sure what the point of that is, actually. The template on the page is localized to your user language, but that doesn't depend on the page language being set (but rather just on {{int:}} syntax).

The point is that not everything can be set using {{int:}}: page directionality, number formats, genders, plurals… These are not human-readable texts that could be localized using MediaWiki messages. Out of these, page directionality is definitely used, others may or may not be (I don’t know how exactly the template works). Compare https://incubator.wikimedia.org/wiki/Wn/hu?uselang=ar and https://incubator.wikimedia.org/wiki/User:Tacsipacsi/sandbox?uselang=ar – they have the exact same wiki code, but they look differently because the page language of the former is overridden but the latter one’s isn’t.

Change 914026 had a related patch set uploaded (by Jon Harald Søby; author: Jon Harald Søby):

[mediawiki/extensions/WikimediaIncubator@master] Replace global $wgLang

https://gerrit.wikimedia.org/r/914026

Change 914026 merged by jenkins-bot:

[mediawiki/extensions/WikimediaIncubator@master] Replace global $wgLang

https://gerrit.wikimedia.org/r/914026

Krinkle triaged this task as Medium priority.

The Wikimedia Incubator part is done, so I guess only the ContentHandler part of this task remains.

Change 926620 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/PageLanguage@master] Hooks: Remove unused $userLang arg from onPageContentLanguage()

https://gerrit.wikimedia.org/r/926620

Change 926621 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/LiquidThreads@master] LqtDispatch: Change onPageContentLanguage() to not use $userLang

https://gerrit.wikimedia.org/r/926621

Change 926622 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/WikimediaIncubator@master] Hooks: Change onPageContentLanguage() to not use $userLang

https://gerrit.wikimedia.org/r/926622

Change 926622 merged by jenkins-bot:

[mediawiki/extensions/WikimediaIncubator@master] Hooks: Change onPageContentLanguage() to not use $userLang

https://gerrit.wikimedia.org/r/926622

Change 926621 merged by jenkins-bot:

[mediawiki/extensions/LiquidThreads@master] LqtDispatch: Change onPageContentLanguage() to not use $userLang

https://gerrit.wikimedia.org/r/926621

The only thing it is used for is to set the page language for info pages (such as Wp/crh) to the user's interface language. I'm not sure what the point of that is, actually. The template on the page is localized to your user language, but that doesn't depend on the page language being set (but rather just on {{int:}} syntax).

The point is that not everything can be set using {{int:}}: page directionality, number formats, genders, plurals… These are not human-readable texts that could be localized using MediaWiki messages. Out of these, page directionality is definitely used, others may or may not be (I don’t know how exactly the template works). Compare https://incubator.wikimedia.org/wiki/Wn/hu?uselang=ar and https://incubator.wikimedia.org/wiki/User:Tacsipacsi/sandbox?uselang=ar – they have the exact same wiki code, but they look differently because the page language of the former is overridden but the latter one’s isn’t.

Change 926622 merged by jenkins-bot:

[mediawiki/extensions/WikimediaIncubator@master] Hooks: Change onPageContentLanguage() to not use $userLang

https://gerrit.wikimedia.org/r/926622

It looks like I haven’t been heard. 🙁

It looks like I haven’t been heard. 🙁

Everything still works the way it used to as far as I can tell, it's just using different methods to achieve the same result. You can check in https://incubator.wikimedia.beta.wmflabs.org/wiki/Wp/en and https://incubator.wikimedia.beta.wmflabs.org/wiki/Wp/en?uselang=ar (the {{PAGELANGUAGE}} I added doesn't work that well, because it needs purges, but you can see the directionality still changes).

It indeed seems to work, but

  • Neither one of you answered. I still don’t know whether @Krinkle actually heard me, or just happened to change the code in a way that makes it seemingly work.
  • It seemingly works. However, it doesn’t solve the problem stated in the task description (namely that the page language relies on global state and thus is nondeterministic), just pushes this dependence from core to the extension.

(the {{PAGELANGUAGE}} I added doesn't work that well, because it needs purges, but you can see the directionality still changes).

It’s quite an issue, although I don’t think it’s a regression. And it actually probably won’t happen in production, because using {{int:}} splits the parser cache on user language. But relying on other mechanisms for avoiding cache pollution is not good.

@Tacsipacsi I haven't addressed your comments since I think the code changes made so far are independent of that. You're probably right that there is a better way to achieve whatever WikimediaIncubator and LiquidThreads need to achieve, without pulling in the language from the current request (I haven't really looked at what either extension does). But while we're discussing whether that's possible, we can make some changes so that only WikimediaIncubator and LiquidThreads pull in the language from request, instead of MediaWiki itself doing that and letting every extension access it. It seems to me that @Krinkle's patches are the first step towards doing that. (The next step will be deprecating the hook and proving a new one without the language parameter.)

@Tacsipacsi You're right. There's a gap here in what we described.

However, it doesn’t solve the problem stated in the task description.

As @matmarex points out, there's a subte but important distinction between an unconditional dependency, and a possible call. MediaWiki's model for Page and Content does not support a non-deterministic page language value. The language of a given title should always be the same when querying the database, whether from a batch API call, or a recent changes event, or the job queue. It is not valid for this to vary in this way, and never was. This technical debt is now isolated to the Incubator extension.

So:

  • Yes, it is still possible for Content->getPageLanguage to invoke the global state. If you have Incubator installed, and only when you're visiting a wiki and a page within that wiki where the Incubator extension decides to invoke that global state.
  • No, it is no longer a dependency for the Content->getPageLanguage method. It no longer imports $wgLang or RequestContext and no longer unconditionally and upfront invokes global state.

This means that MediaWiki as a software, can now safely avoid incidents like T302754 (assuming a wiki that doesn't install Incubator). It also means that next time something like that happens in production, the incident will be limited to (part of) Incubator, and not affect the rest of the sites.

I still don’t know whether @Krinkle […] just happened to change the code in a way that makes it seemingly work.

My patch changed Incubator to explicilty call RequestContext::getLanguage(), which is documented and supported as returning the language based on the user session and its language preference. This is not a coincidence, but also indeed still invokes global state.

Next step

I'd like to understand why Incubator can't use {{int:language}}. Per the above, we didn''t need to resolve this that for the first step. But, I am interested in helping with the next step as well.

Special pages can render interface wikitext in the user language. This is possible because 1) special pages are not stored as editable wiki pages in the database, and 2) do not offer their metadata, HTML, or wikitext content via an API, and 3) The result of parsing wikitext on special pages is not cached by MediaWiki. SpecialPage is mostly dynamic with parts composed of interface messages. The combined output is not cached, which means no ParserCache or mismatching ParserCache vary-key to worry about.

It seems that the only reason Incubator overrides the page language, is to change how the wikitext is parsed. From what I can see, Incubator is not doing this for the purposes of page metadata or APIs. That's good news as that means what we need to solve is fairly narrow. Potentially already solved.

[…] not everything can be set using {{int:}}: page directionality, number formats, genders, plurals… […]. Out of these, page directionality is definitely used […]

Commons and Meta-Wiki don't have an extension doing what Incubator does. Yet, has localised File pages and other namespaces for years without insurmountable technical limitations. This is mainly possible due to {{int:lang}}, whcih is like Incubator's {{int:language}} but defined as pages under MediaWIki:Lang instead of MediaWIki:Language. The Parser in MediaWiki is optimised to share the ParserCache between users whenever possible, and the {{int:}} function safely modifies the cache logic to keep each variant separately in the cache. This works fine from a functional perspective and is supported. Although for performance, it is important that article and article templates on big wikis do not do this!

What's interesting is that Commons also uses this mechanism to set the page direction. It uses using {{dir}} (Template:Dir) or sometimes via Lua (Module:Dir/RTL_overrides). The could be imported to Incubator and used as such. You can use it to set dir="" or style="direction: ;" or class="mw-content-ltr". Whichever you prefer!

Regarding the other parser features, I don't recall seeing grammar/plural/gender functions used in on-wiki content, not in regular articles, and also not in translated content. If there are cases where these are needed, I imagine there are workarounds for those as well. The info pages on Incubator feel like a small enough scope that no matter what kind of problem we encounter, there's likely a way to solve it within regular template and/or Lua modules. Alternatively, we can try to keep this working by finding a way to set $parserOptions->setTargetLanguage() from a hook in the WikimediaIncubator extension that only affects page views. That way it does not affect the Page metadata in other contexts.

It seems like the onArticleParserOptions() hook might work? Tagging Content-Transform-Team for awareness.

Change 935091 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/WikimediaIncubator@master] Migrate onPageContentLanguage() to onArticleParserOptions()

https://gerrit.wikimedia.org/r/935091

Krinkle changed the task status from Open to Stalled.Nov 13 2023, 5:42 PM
Krinkle moved this task from Soon to Within 2 Qs on the MediaWiki-Platform-Team board.
Krinkle moved this task from Within 2 Qs to Blocked/waiting on the MediaWiki-Platform-Team board.

Just cross-referencing T114640: make Parser::getTargetLanguage aware of multilingual wikis here, in particular T114640#9059438 which proposed a new parser function which would reset the parser's "target language", in case that would help with Incubator's use case.

Change 935091 merged by jenkins-bot:

[mediawiki/extensions/WikimediaIncubator@master] Migrate global state from onPageContentLanguage to onArticleParserOptions

https://gerrit.wikimedia.org/r/935091

matmarex changed the task status from Stalled to Open.Jan 9 2024, 9:42 PM
matmarex moved this task from Blocked/waiting to Current Sprint on the MediaWiki-Platform-Team board.

I think we could do this now. Let's try.

Change 989251 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] Pass site lang instead of user lang to 'PageContentLanguage' hook

https://gerrit.wikimedia.org/r/989251

\cc PE, @daniel As maintainer of ContentHandler; What are you thoughts on this?

There's a bunch of issues here...

For one thing, the modeling of page language and page view language and parser output language and request language is... a mess. I believe there is four things that need to be distinguished, and we are often confusing them:

  1. The default language associated with the page. That will typically be the site's content language, but may be different for pages like MediaWiki:Foo/fr and other translation subpages. We may want a hook to be able to override this, but it's a property of the page, it should not be determined by ContentHandler, and it should not depend on the user language.
  2. The language of a given piece of content (slot), overriding the default page language. This is an inherent property of the Content object, and it's what I had in mind when putting the getPageLanguage method on ContentHandler, but it was probably a mistake. I don't think we actually store the language in content. We could but we don't. We alwas rely on the page's title to determin the language.
  3. The language (and variant!) the user wants to see the content in. This impacts rendering and depends on user preference and request params and should be represented in ParserOptions. The parser/renderer determines if and how it will honour the user's wishes.
  4. The effective language of the parser output, as determiend by the parser/renderer, accessed through ContentHandler.

The current behavior of getPageLanguage is... none of these...

I don't think we actually store the language in content.

We store it in page_lang (if $wgPageLanguageUseDB is enabled; in production it is used on old Wikisource and by the Translate extension). This is handled in Title::getPageLanguage(), but that also falls back to all the other things (ContentHandler, the context language), resulting in very confusing behavior.

We store it in page_lang (if $wgPageLanguageUseDB is enabled; in production it is used on old Wikisource and by the Translate extension). This is handled in Title::getPageLanguage(), but that also falls back to all the other things (ContentHandler, the context language), resulting in very confusing behavior.

Yea, that's the confusion I was referring to. It's a property of the page, not a property of the content. ContentHandler has no business knowing or caring about it. ContentHandler could indicate what language the content is in, and we could set page_content based on that.

At the time I created ContentHandler, it seemed intuitive for Content to "have" a language associated. But we don't actually have a use case for that, it seems. I am not even sure we have a good use case for page_language. I now think the language should be derived from the title only.

Change 990428 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/extensions/GoogleTranslate@master] Remove $userLang parameter from onPageContentLanguage hook handlers

https://gerrit.wikimedia.org/r/990428

Change 990429 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/extensions/PageLanguage@master] Deprecate $userLang parameter type from onPageContentLanguage hook handlers

https://gerrit.wikimedia.org/r/990429

Change 990430 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/extensions/WikimediaIncubator@master] Deprecate $userLang parameter type from onPageContentLanguage hook handlers

https://gerrit.wikimedia.org/r/990430

Change 990431 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] Stop unstubbing and passing $wgLang to onPageContentLanguage hook handlers

https://gerrit.wikimedia.org/r/990431

Change 990430 merged by jenkins-bot:

[mediawiki/extensions/WikimediaIncubator@master] Deprecate $userLang parameter type from onPageContentLanguage hook handlers

https://gerrit.wikimedia.org/r/990430

Change 990428 merged by jenkins-bot:

[mediawiki/extensions/GoogleTranslate@master] Remove $userLang parameter from onPageContentLanguage hook handlers

https://gerrit.wikimedia.org/r/990428

Change 990429 merged by jenkins-bot:

[mediawiki/extensions/PageLanguage@master] Deprecate $userLang parameter type from onPageContentLanguage hook handlers

https://gerrit.wikimedia.org/r/990429

Change 989251 abandoned by Bartosz Dziewoński:

[mediawiki/core@master] Pass site lang instead of user lang to 'PageContentLanguage' hook

Reason:

Basically the same as https://gerrit.wikimedia.org/r/c/mediawiki/core/+/990431

https://gerrit.wikimedia.org/r/989251

Change 990431 merged by jenkins-bot:

[mediawiki/core@master] Stop unstubbing and passing $wgLang to onPageContentLanguage hook handlers

https://gerrit.wikimedia.org/r/990431

matmarex removed a project: Patch-For-Review.

Considered and done.

Note that the current implementation of hook interfaces makes it impractical to remove the parameter (we could only deprecate/remove the hook entirely and introduce a new one, which would require changes to every extension using the hook, even those that don't use the changed parameter), therefore it was left in place, but it is now unused.

Change 989251 restored by Daniel Kinzler:

[mediawiki/core@master] Pass site lang instead of user lang to 'PageContentLanguage' hook

https://gerrit.wikimedia.org/r/989251

Change 989251 abandoned by Bartosz Dziewoński:

[mediawiki/core@master] Pass site lang instead of user lang to 'PageContentLanguage' hook

Reason:

https://gerrit.wikimedia.org/r/989251

Commons and Meta-Wiki don't have an extension doing what Incubator does. Yet, has localised File pages and other namespaces for years without insurmountable technical limitations. This is mainly possible due to {{int:lang}}, whcih is like Incubator's {{int:language}} but defined as pages under MediaWIki:Lang instead of MediaWIki:Language. The Parser in MediaWiki is optimised to share the ParserCache between users whenever possible, and the {{int:}} function safely modifies the cache logic to keep each variant separately in the cache. This works fine from a functional perspective and is supported. Although for performance, it is important that article and article templates on big wikis do not do this!

What's interesting is that Commons also uses this mechanism to set the page direction. It uses using {{dir}} (Template:Dir) or sometimes via Lua (Module:Dir/RTL_overrides). The could be imported to Incubator and used as such. You can use it to set dir="" or style="direction: ;" or class="mw-content-ltr". Whichever you prefer!

Besides parser cache splitting and performance issues, there are some downsides to how Commons uses Template:Dir. In most cases, the template is called as {{dir|{{int:lang}}}} and it is used so often that it is at the top of Special:MostTranscludedPages (currently at over 122 million) resulting in large unwieldy table growth. This begs the question of why it is implemented in Template space to to begin. Why not build it in MediaWiki where It seems int does not result in a transclusion link like msg, msgnw and default template instantiation, etc. does (nothing in MediaWiki namespace seems to be in Special:MostTranscludedPages on Commons)? At the very least most of those instantiations should likely be moved to be called from the MediaWiki user interface namespace.

I bring this up here because your comments seem to suggest Incubator could borrow some solutions from Commons and I just wanted to point out some of its caveats.

This might not be as big of a problem for Incubator as I doubt it has as many articles as there are files on Commons, but perhaps anyone at Incubator considering such a solution might want to avoid the issues Commons has with this.