
Make PageRecord::getLanguage() behave consistently with Title::getPageLanguage()
Open, Medium, Public

Description

Title::getPageLanguage() uses ContentHandler::getPageLanguage() and other special page logic to determine the effective page language.
Either PageRecord::getLanguage() (or PageStore) needs to do the same, or it needs to return null if the language is not set in the database and leave it to the caller to determine the effective language.

Event Timeline

daniel triaged this task as High priority. (Mar 26 2021, 10:10 PM)

Change 677371 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] WIP: Make PageRecord::getLanguage() behave consistently with Title::getPageLanguage()

https://gerrit.wikimedia.org/r/677371

Change 677505 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] PageRecord: make language optional

https://gerrit.wikimedia.org/r/677505

Change 677371 abandoned by Daniel Kinzler:

[mediawiki/core@master] WIP: Make PageRecord::getLanguage() behave consistently with Title::getPageLanguage()

Reason:

For now, let's do Ic26f6f7690499b3dd87982e3822881fd473cfd68

https://gerrit.wikimedia.org/r/677371

The logic in Title::getPageLanguage() can't be ported to PageStoreRecord or PageStore without pulling ContentHandler logic into the new storage layer component. Worse, since ContentHandler::getPageLanguage() can be (and is) overridden by subclasses, we can't change its signature to no longer require a Title, which would mean binding PageStore to Title.

Some observations:

  • Core never writes anything into page_lang in the database. Some extensions do, but the column is entirely empty, at least on enwiki.
  • In core, the page language is the content language, except for two cases: in the MediaWiki namespace, the language is determined by a language code suffix on the title, and in the Special namespace, the page language is the user's UI language.
  • Some extensions override ContentHandler::getPageLanguage
  • Some extensions implement the PageContentLanguage hook
  • Conceptually, the content language is a property of the Content object
  • In the contexts in which Title::getPageLanguage is currently used, a Content object tends to be available.
  • Conceptually, a page's content language should not depend on the user who views it. It's the language the page was written in (which could be 'und' or 'mul' as well).
  • The effective output language however may depend on user preferences or the current request (variants).
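
To make the second observation concrete, here is a minimal sketch of the core rule as described above. It is not the actual Title::getPageLanguage() code: the function name, the parameters, and the explicit list of known language codes are assumptions made for illustration, and it ignores ContentHandler overrides and the PageContentLanguage hook entirely.

```php
<?php
// Hypothetical illustration only; not MediaWiki's implementation.
// -1 and 8 mirror the values of NS_SPECIAL and NS_MEDIAWIKI.

/**
 * @param int $namespace Namespace index of the page
 * @param string $titleText Title text without namespace prefix, e.g. "Mainpage/de"
 * @param string $contentLanguage Site content language code
 * @param string $userLanguage UI language code of the current user
 * @param string[] $knownLanguageCodes Codes accepted as a title suffix (assumption)
 * @return string Language code
 */
function sketchCorePageLanguage(
	int $namespace,
	string $titleText,
	string $contentLanguage,
	string $userLanguage,
	array $knownLanguageCodes
): string {
	// Special namespace: follow the user's UI language.
	if ( $namespace === -1 ) {
		return $userLanguage;
	}

	// MediaWiki namespace: a recognised language code suffix wins.
	if ( $namespace === 8 ) {
		$slash = strrpos( $titleText, '/' );
		if ( $slash !== false ) {
			$suffix = substr( $titleText, $slash + 1 );
			if ( in_array( $suffix, $knownLanguageCodes, true ) ) {
				return $suffix;
			}
		}
	}

	// Everything else: the content language.
	return $contentLanguage;
}
```

Note that the Special namespace branch is exactly the case that conflicts with the idea that a page's content language should not depend on the viewer.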

Proposal:
Introduce a PageContentLanguageLookup service, with a getPageContentLanguage( PageRecord $page ) method as a replacement for Title::getPageLanguage().
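
A rough sketch of what such a service could look like, assuming PageRecord::getLanguage() returns null when no language is stored. Everything here is hypothetical: the Sketch-prefixed types are stand-ins rather than MediaWiki classes, returning a plain language code string is an assumption, and a real implementation would also have to account for ContentHandler and the PageContentLanguage hook.

```php
<?php
/** Hypothetical, minimal stand-in for PageRecord. */
interface SketchPageRecord {
	/** @return string|null Stored language code, or null if none is set */
	public function getLanguage(): ?string;
}

/** Hypothetical shape of the proposed lookup service. */
interface SketchPageContentLanguageLookup {
	public function getPageContentLanguage( SketchPageRecord $page ): string;
}

/** Simplest possible implementation: stored language, else site default. */
final class SketchDefaultingLanguageLookup implements SketchPageContentLanguageLookup {
	private string $siteContentLanguage;

	public function __construct( string $siteContentLanguage ) {
		$this->siteContentLanguage = $siteContentLanguage;
	}

	public function getPageContentLanguage( SketchPageRecord $page ): string {
		// The record only reports what is stored; the fallback lives here.
		return $page->getLanguage() ?? $this->siteContentLanguage;
	}
}
```

The point of the split is that the storage layer never needs to know about ContentHandler or Title; only the lookup service decides what the effective language is.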

One patch is up for review, but it does not fully resolve this ticket.

None of the deployed extensions actually write or read the page_lang field. The only wiki I managed to find that actually has the field set is Commons, with 136 pages where page_lang is not null, and all of them are main pages.

The feature is enabled via wgPageLanguageUseDB on Wikisource (T175622) and on wikis using the Translate extension (T153209).

I guess that doesn't really mean much - even though the feature is not used a lot, it is used and we probably can't just remove it.

> Introduce a PageContentLanguageLookup service, with a getPageContentLanguage( PageRecord $page ) method as a replacement for Title::getPageLanguage().

I'm a bit worried about the proliferation of one-method service objects. We now have ParserOutputAccess, you proposed PageContentAccess, and now PageLanguageLookup. I don't quite know the answer to this concern, though.

Change 677505 merged by jenkins-bot:

[mediawiki/core@master] PageRecord: make language optional

https://gerrit.wikimedia.org/r/677505

> I'm a bit worried about proliferation of one-method service objects. We now have ParserOutputAccess, you proposed PageContentAccess, now PageLanguageLookup. I don't quite know the answer to this concern though

I suppose both methods can be implemented by the same class, and may even be in the same interface. Or we make PageContent an entity-style object that has getCurrentContent() and getPageLanguage() methods.
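
For illustration, the first option (one class, two narrow interfaces) might look roughly like this. All names are hypothetical, pages are reduced to an in-memory array keyed by page ID, and content is a plain string rather than a Content object.

```php
<?php
/** Hypothetical narrow interface for fetching current content. */
interface SketchPageContentAccess {
	public function getCurrentContent( int $pageId ): ?string;
}

/** Hypothetical narrow interface for looking up the page language. */
interface SketchPageLanguageLookup {
	public function getPageLanguage( int $pageId ): string;
}

/** One class backing both interfaces, so only one service needs wiring. */
final class SketchPageInfoService implements SketchPageContentAccess, SketchPageLanguageLookup {
	/** @var array<int,array{content?:string,lang?:string}> */
	private array $pages;
	private string $defaultLanguage;

	public function __construct( array $pages, string $defaultLanguage ) {
		$this->pages = $pages;
		$this->defaultLanguage = $defaultLanguage;
	}

	public function getCurrentContent( int $pageId ): ?string {
		return $this->pages[$pageId]['content'] ?? null;
	}

	public function getPageLanguage( int $pageId ): string {
		// Fall back to the default when no language is recorded.
		return $this->pages[$pageId]['lang'] ?? $this->defaultLanguage;
	}
}
```

Callers can still type-hint against whichever narrow interface they need, which keeps dependencies explicit without adding another service object.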

daniel lowered the priority of this task from High to Medium. (Aug 17 2021, 3:42 PM)