ProofreadPage should recognize language specification in Index
Closed, Resolved, Public

Description

On Wikisource, it's required to specify the language of a source. For example, this Balinese source (part of work on a current WMF grant) is specified as ban-bali, which means Balinese language in Balinese script.

However, as far as ProofreadPage is concerned, the language field is just one of various index fields that it does not try to interpret. This means that when processing pages in the Page namespace corresponding to an Index page, ProofreadPage does not pass along the language, and UniversalLanguageSelector cannot use it. For example, on this page (from the same source as the above Index page), the text is not marked as ban-bali and so cannot get the webfont specified in UniversalLanguageSelector (font should be Vimala). This is a problem for my work because most users do not have adequate Balinese fonts installed.

Can ProofreadPage be modified so that it recognizes one of the index fields as containing a language code, and passes it on as the language of corresponding pages in the Page namespace? It looks like we could just say that a field in the index data config with "language" in its data field will be treated this way.

Event Timeline

If so, being able to edit the Index page will be the same as being able to change page language. If we want to do this in ProofreadPage itself by default, for starters we need to have the extension assign the pagelang permission by default, presumably alongside pagequality in extension.json, and set $wgPageLanguageUseDB to true (unless we want to introduce some other method in core).

Change 618492 had a related patch set uploaded (by David Kamholz; owner: David Kamholz):
[mediawiki/extensions/ProofreadPage@master] ProofreadPage: set page language based on index field

https://gerrit.wikimedia.org/r/618492

How would this interact with the ability to set the language for a single Page: page? How about mainspace transclusions of Page pages?

Is there any way to let +pagelang be limited to ProofreadPage-managed namespaces, or would the permission affect all namespaces?

This looks like a smart and useful facility to have (I see some potential for additional uses), but I'm also slightly worried about potential for abuse.

It's still possible to override the pagelang on individual pages in the Page namespace by using Special:PageLanguage as usual. Also, if you change the Index's langcode and the pagelang of anything in a corresponding Page doesn't match, it's left alone. Example:

  1. Index is set to langcode ban
  2. Page: pages automatically receive pagelang ban as each one is created
  3. Someone changes one Page: page's pagelang to en
  4. Someone changes the Index page's langcode to jv
  5. All Page: pages with pagelang ban at the time of the Index change are updated to jv, but others such as the one with pagelang en are not updated
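The update rule in the steps above can be sketched as a small simulation. This is only an illustrative model of the behavior described in this comment, not the code in the patch; the `Index` class, its method names, and the page titles are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Index:
    """Hypothetical stand-in for an Index page and its Page: pages."""

    langcode: str
    page_langs: dict = field(default_factory=dict)  # Page: title -> pagelang

    def create_page(self, title: str) -> None:
        # New Page: pages receive the Index's current language code.
        self.page_langs[title] = self.langcode

    def set_page_lang(self, title: str, lang: str) -> None:
        # Manual override of a single page, as with Special:PageLanguage.
        self.page_langs[title] = lang

    def set_langcode(self, new_code: str) -> None:
        # Only pages whose pagelang still matches the old Index code are
        # updated; pages that were overridden manually are left alone.
        old_code = self.langcode
        for title, lang in list(self.page_langs.items()):
            if lang == old_code:
                self.page_langs[title] = new_code
        self.langcode = new_code


index = Index(langcode="ban")                 # step 1
index.create_page("Page:Example/1")           # step 2
index.create_page("Page:Example/2")
index.set_page_lang("Page:Example/2", "en")   # step 3
index.set_langcode("jv")                      # step 4
print(index.page_langs)                       # step 5
# {'Page:Example/1': 'jv', 'Page:Example/2': 'en'}
```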

Regarding the pagelang permission, I don't know if it's possible to limit it to ProofreadPage-managed namespaces. My patch simply checks current permissions -- it does not set or override them. It would depend on MediaWiki's ability to set permissions that way in LocalSettings.php.

You have a good point about transclusion. I haven't looked into that yet, but presumably the pagelang should be applied there too?

I would assume so, but @kamholz is probably much better able to answer that than me.

Based on @Tpt's comments on Gerrit, it looks like it's a lot cleaner to override getPageLanguage in the PageContentHandler to return the language based on the current Index value, and not modify page_lang in the database. Working on an updated patchset now -- there are a couple other things to address beyond that.
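To illustrate the difference in approach, here is a minimal sketch of resolving the language at read time from the Index value instead of storing a copy per page. It is written in Python for illustration only; the real change lives in the extension's PHP code. The lookup tables, the function name, and the fallback to a wiki default language are all assumptions made for this example.

```python
# Hypothetical data: language field of each Index, and the Index each
# Page: page belongs to.
INDEX_LANGUAGE = {"Index:Example.pdf": "ban-bali"}
PAGE_TO_INDEX = {"Page:Example.pdf/1": "Index:Example.pdf"}

# Assumed fallback when a page has no Index or the Index has no language.
WIKI_DEFAULT_LANGUAGE = "en"


def get_page_language(page_title: str) -> str:
    """Derive the page language from the parent Index on every call."""
    index_title = PAGE_TO_INDEX.get(page_title)
    index_lang = INDEX_LANGUAGE.get(index_title) if index_title else None
    return index_lang or WIKI_DEFAULT_LANGUAGE


print(get_page_language("Page:Example.pdf/1"))

# Changing the Index value takes effect immediately; no per-page rows
# have to be rewritten.
INDEX_LANGUAGE["Index:Example.pdf"] = "jv"
print(get_page_language("Page:Example.pdf/1"))
```

Because nothing is copied into per-page storage in this model, there is no stale state to reconcile when the Index language changes.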

New patchset is uploaded now. There is currently no validation of the Index page's language code, but that is arguably the preferable behavior, because Wikisource editors can add sources in whatever language they want for whatever reason they want, and thousands of languages will not be in the list known to LanguageNameUtils::getLanguageNames. To allow this freedom I think the Index field should not be further validated. It isn't currently possible to set the page language to a language not in the Names.php list (I guess for good reason? don't know), but at least it can be recorded accurately in the Index this way.

I don't think the language field should be completely unvalidated. In its current state, a user can (hypothetically) enter bn, Bengali, bengali, bangla, or Bangla (all of which refer to the Bengali language), but only one of them (bn) will result in the desired change of the page language.

I think an OOUI select widget which presents the user with a list of available languages while also allowing arbitrary user input could be a possible way to go. That would work in preventing users from making mistakes such as the above while also allowing users to specify unknown/unsupported languages if they need to.

Sure, that could work. Just keep in mind that even for the select widget, this is one of those features that sounds much easier than it is to implement well. If the desired scope is to support the languages in Names.php with a few variant names for each one, using an exact string match (other than case-folding and perhaps removing whitespace/punctuation), then it's feasible. And of course, even if many languages aren't in the select list or don't match properly, the widget may still be useful (but "false negatives" could also give the impression that languages aren't supported when they in fact are). Anyway, I've spent years working on this problem with the goal of supporting selection among all 7,000 world languages. There are certainly ways to approach it, but it isn't easy! But if the goal is limited to supporting valid page languages, then it would be a lot easier to provide some clues without having to worry about being exhaustive.
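As a rough sketch of the limited scope described above (exact match after case-folding and stripping whitespace/punctuation), something like the following would map a few variant names to a code and report everything else as unmatched, leaving the user's free text in place. The alias table is a tiny hypothetical subset and the function names are made up for illustration.

```python
import re
from typing import Optional

# Hypothetical alias table: normalized name or code -> language code.
ALIASES = {
    "bn": "bn",
    "bengali": "bn",
    "bangla": "bn",
    "ban": "ban",
    "balinese": "ban",
}


def normalize(text: str) -> str:
    """Case-fold and drop whitespace, punctuation, and underscores."""
    return re.sub(r"[\W_]+", "", text.casefold())


def match_language(user_input: str) -> Optional[str]:
    """Return a known code for the input, or None if nothing matches."""
    return ALIASES.get(normalize(user_input))


for raw in ["bn", "Bengali", " BANGLA ", "Klingon"]:
    print(repr(raw), "->", match_language(raw))
```

A `None` result here only means "not in this table", which is exactly the false-negative risk mentioned above: the language may still be perfectly valid.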

This (setting the page language for individual pages) is the exact reason I made this Pywikibot script (which I've neglected to follow up on): https://gerrit.wikimedia.org/r/c/pywikibot/core/+/478611
But I think the patch committed here is a much, much better approach. :-)

Change 618492 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@master] ProofreadPage: set page language based on index field

https://gerrit.wikimedia.org/r/618492

kamholz claimed this task.