
ProofreadPage page body template
Closed, Declined (Public)

Description

In the Balinese palm-leaf manuscript project I'm currently working on for Wikisource, the Balinese community wants each page of text in Balinese script to contain a Latin transliteration below it. (Think of it like a scholarly edition.) I'm preparing a patch to core that will add a parser function to generate transliterations, so for example {{#transliterate:ban-bali|ban-latn|ᬯᬬᬦ᭄}} will return wayan. What I'd like is for the transliteration to appear similarly to this page on Palmleaf.org. As far as I can tell, doing this properly requires modifying ProofreadPage.

What I propose is to add support for a template MediaWiki:Proofreadpage_body_template. Then PageContent::getParserOutput will check whether the template has been defined and, if so, wrap the page body with it in the output. It will also pass in a lang parameter to permit conditional output depending on the language. This will allow Palmleaf.org style output with a template like this:

{{#ifeq: {{{lang}}} | ban-bali | {{{1}}}
==== Auto-transliteration ====
{{#transliterate: ban-bali | ban-latn | {{{1}}} }} | {{{1}}} }}

What do people think of this idea? I already have a patch ready that seems to work. It's only a few lines. There would be alternative ways to do it, but I like the template method because it leaves the formatting entirely up to the wiki administrator. For example, you could do <pre>{{{1}}}</pre> instead, display the two versions side by side, etc.

Event Timeline

Change 619877 had a related patch set uploaded (by David Kamholz; owner: David Kamholz):
[mediawiki/extensions/ProofreadPage@master] Check for template MediaWiki:Proofreadpage_body_template and use it to wrap Page body content.
- first template parameter is raw content
- second template parameter is raw content with <br> marking newlines
- lang parameter is language code of page

https://gerrit.wikimedia.org/r/619877

@kamholz I don't understand what it is you're proposing to do here, nor see how it will have applicability outside just Balinese content. From whence comes #transliterate and what does it do? Why hard-code <br> inside ProofreadPage and provide two copies of the text? Why can this not be done with a normal template?

The only really novel part I see in your task description is automatic transliteration of Balinese into Latin script, and that's not what this patch is about. Do you envision that it is possible to support automatic transliteration of other non-Latin scripts or would this be solely applicable to Balinese?

I can kinda see the use for a PRP body template that gets passed various useful parameters (language being one of them, but of limited utility on the single-language Wikisourcen) and the raw page content. It'd provide a general hook that can be used for various things that can sometimes be useful. But it also adds complexity, and I don't think I've ever run into a situation where I've wished for such a facility.

> @kamholz I don't understand what it is you're proposing to do here, nor see how it will have applicability outside just Balinese content. From whence comes #transliterate and what does it do? Why hard-code <br> inside ProofreadPage and provide two copies of the text? Why can this not be done with a normal template?

The idea is that this template should apply to every Page namespace page on a wiki. It can be done in a normal template, but then you'd have to somehow ensure that every single Page namespace page is wrapped with that template in order to achieve uniformity -- is that really reasonable?

#transliterate will be implemented in a separate patch as part of a Balinese LanguageConverter. I'm still working out the implementation and discussing details with @cscott.

I agree that the <br> stuff is not necessary and there are better alternatives for what I was trying to do there.

> The only really novel part I see in your task description is automatic transliteration of Balinese into Latin script, and that's not what this patch is about. Do you envision that it is possible to support automatic transliteration of other non-Latin scripts or would this be solely applicable to Balinese?

Yes, it's definitely possible for most non-Latin scripts. ICU comes with a ton of built-in rule-based transliterators of this sort, and the PHP intl extension has bindings for them.

> I can kinda see the use for a PRP body template that gets passed various useful parameters (language being one of them, but of limited utility on the single-language Wikisourcen) and the raw page content.

As you may have guessed, the lang parameter is needed here to ensure that automatic transliteration on Multilingual Wikisource is only done for Balinese. What other parameters would be useful?

> It'd provide a general hook that can be used for various things that can sometimes be useful. But it also adds complexity, and I don't think I've ever run into a situation where I've wished for such a facility.

OK, but it's also not the case that any Wikisource language community has previously required Latin transliteration to accompany each page of non-Latin text. That is essential for the goals of the Balinese project, so I need to find some way to do it.

If we don't use the method in this patch, it would be theoretically possible to implement it with a gadget that generates transliterations via Ajax in the browser by passing wikitext to api.php and using #transliterate. But that requires three steps: (1) fetch the original wikitext, (2) send a parse request wrapping the wikitext with #transliterate, (3) insert the result into the DOM in the right place. For a single page in the Page namespace this is perhaps tolerable, but for transclusion on the main namespace page it probably isn't since it could be hundreds of transcluded pages.
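For concreteness, the three steps could be sketched roughly as follows. This is only an illustration under some assumptions: #transliterate is the parser function proposed in this task and does not exist yet, the function names are made up, and `api` stands for whatever wrapper the gadget uses to call api.php (in a real gadget that would probably be mw.Api).

```javascript
// Step 1: api.php parameters to fetch the raw wikitext of a page.
function wikitextParams(title) {
  return { action: 'parse', page: title, prop: 'wikitext', format: 'json', formatversion: 2 };
}

// Step 2: api.php parameters to parse the wikitext wrapped in the
// proposed (not yet existing) #transliterate parser function.
function transliterateParams(wikitext, from, to) {
  return {
    action: 'parse',
    text: '{{#transliterate:' + from + '|' + to + '|' + wikitext + '}}',
    contentmodel: 'wikitext',
    prop: 'text',
    format: 'json',
    formatversion: 2
  };
}

// Steps 1-3 combined. `api` is a hypothetical function that takes a params
// object and returns a promise of the decoded JSON response; `insert` is a
// hypothetical callback that places the resulting HTML into the DOM.
async function transliteratePage(api, title, insert) {
  const source = await api(wikitextParams(title));
  const parsed = await api(
    transliterateParams(source.parse.wikitext, 'ban-bali', 'ban-latn')
  );
  insert(parsed.parse.text); // Step 3
  return parsed.parse.text;
}
```

That is two API round trips per page, so a main namespace page transcluding hundreds of Page namespace pages would need hundreds of request pairs unless they were batched somehow.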

Here is an example of how it looks on Palmleaf.org currently. It will not look exactly like this on Wikisource of course. Sections like Leaf 1, Leaf 2, Leaf 3 will correspond to pages in the Page namespace. The content prior to the "auto-transliteration" heading is what editors will type, and the transliteration will be added below. It should be interleaved like this, page by page, so that readers don't get lost. Given that, it makes sense to me to make it part of the parser output for each page. This makes transclusion work without further effort, and it means that editors can preview the output while editing the page (which, at least in the case of the Balinese work, definitely helps their proofreading efforts).

Looking into this a bit further, I'm starting to agree with @Xover that this is not a very good way of achieving what I want. Among the issues:

  • I want the output of #transliterate to be variable depending on the user's chosen transliteration variant. I don't believe there is a practical way to do this with the current parser. Storing the variant selection under the variant-ban user option will not affect the parser cache key at all. An alternative is making the variant appear as a user language option (not optimal since it's really only relevant for transliteration, not newly composed Balinese in Latin script) and having the user select it as their userlang, but again this has no impact on the parser cache key since the main page body is parsed with ParserOptions assuming an anonymous user. In short, the output of #transliterate is best generated separately from the page output and inserted using a gadget. This is still annoying for the reasons given above, but I'll just have to deal with it -- the only alternative seems to be modifying the parser in ways that are probably not appropriate.
  • Applying the template in getParserOutput only affects the output when viewing the page in the Page: namespace, not when transcluding it via <pages> or {{:Page:Foo.pdf/1}}. To affect both would require modifying serializeContentInWikitext and unserializeContentInWikitext in PageContentHandler. This might work, but at least one major problem remains: the body content would be passed to the template inline, like "{{:MediaWiki:Proofreadpage_body_template |$bodyText}}", meaning that any literal | character in the body's wikitext (not used in a table or for template parameters) would unexpectedly split the body into several template parameters. It's not reasonable to expect authors of pages in the Page: namespace to always use {{!}} instead of |, since there's no clue why that should be necessary and it doesn't match how anything else works in MediaWiki.
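To illustrate the second issue, here is a deliberately simplified model of how the arguments of a template call get split on top-level | characters. It is not MediaWiki's parser: it only tracks {{...}} and [[...]] nesting, and all function names are hypothetical.

```javascript
// Split the inside of a template call on "|" characters that are not
// nested inside {{...}} or [[...]]. Simplified model, not the real parser.
function splitTemplateArgs(inner) {
  const parts = [];
  let depth = 0;
  let current = '';
  for (let i = 0; i < inner.length; i++) {
    const two = inner.slice(i, i + 2);
    if (two === '{{' || two === '[[') {
      depth++;
      current += two;
      i++; // skip the second character of the opener
    } else if (two === '}}' || two === ']]') {
      depth = Math.max(0, depth - 1);
      current += two;
      i++; // skip the second character of the closer
    } else if (inner[i] === '|' && depth === 0) {
      parts.push(current);
      current = '';
    } else {
      current += inner[i];
    }
  }
  parts.push(current);
  return parts;
}

// Hypothetical helper mirroring the inline wrapping quoted above.
function wrapBodyInline(bodyText) {
  return '{{:MediaWiki:Proofreadpage_body_template |' + bodyText + '}}';
}

// Return the arguments the body template would receive for a given body.
function argsSeenByTemplate(bodyText) {
  const call = wrapBodyInline(bodyText);
  const inner = call.slice(2, -2); // strip the outer {{ and }}
  return splitTemplateArgs(inner).slice(1); // drop the template name
}
```

In this model, argsSeenByTemplate('a | b') yields two arguments ('a ' and ' b'), so {{{1}}} would only contain the text before the pipe, while 'a {{!}} b' and '[[Foo|bar]] text' each stay a single argument.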

Given these issues, it looks like I need to find alternative ways to achieve the same functionality.

Change 619877 abandoned by David Kamholz:
[mediawiki/extensions/ProofreadPage@master] Add support for page body template

Reason:
See discussion on Phabricator

https://gerrit.wikimedia.org/r/619877