Parsoid/legacy parser {{Pre}} template rendering difference
Open, High · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

Check
https://en.wikipedia.org/w/index.php?title=Uniform_Resource_Identifier&useparsoid=1#Example_URIs
vs
https://en.wikipedia.org/w/index.php?title=Uniform_Resource_Identifier&useparsoid=0#Example_URIs

What happens?:

The Parsoid version shows all the <span> markup instead of colored text like in the legacy parser version.

What should have happened instead?:

It's not clear to me that this markup is actually valid, but since it does work with the legacy parser, there's probably something to be done there.

Note
Smaller reproducer:

$ echo "{{Pre|{{color|rgb(0,76,178)|userinfo}}}}" | php ./bin/parse.php
<style data-mw-deduplicate="TemplateStyles:r1057110237" typeof="mw:Extension/templatestyles mw:Transclusion" about="#mwt1" data-parsoid='{"pi":[[{"k":"1"}]],"dsr":[0,40,null,null]}' data-mw='{"parts":[{"template":{"target":{"wt":"Pre","href":"./Template:Pre"},"params":{"1":{"wt":"{{color|rgb(0,76,178)|userinfo}}"}},"i":0}}]}'>.mw-parser-output .pre-borderless{border:none}</style><pre class="pre" typeof="mw:Extension/pre" about="#mwt1" data-parsoid='{"stx":"html","src":"&lt;pre class=\"pre \" >&lt;span style=\"color:rgb(0,76,178)\">userinfo&lt;/span>&lt;/pre>"}' data-mw='{"name":"pre","attrs":{"class":"pre"},"body":{"extsrc":"&lt;span style=\"color:rgb(0,76,178)\">userinfo&lt;/span>"}}'>&lt;span style="color:rgb(0,76,178)">userinfo&lt;/span></pre>

Event Timeline

That is because template expansion returns this.

$ echo "{{Pre|{{color|rgb(0,76,178)|userinfo}}}}" | php ./bin/parse.php --dump tplsrc
[dump] ============================ template source ============================
TEMPLATE:Template:PreTRANSCLUSION:"{{Pre|{{color|rgb(0,76,178)|userinfo}}}}"
--------------------------------------------------------------------------------
<templatestyles src="Pre/styles.css"/><pre class="pre " ><span style="color:rgb(0,76,178)">userinfo</span></pre>

I think this is somehow relying on different behavior in Parser.php when HTML is being generated (vs. templates are being expanded). It might rely on arguments being processed before the template being expanded in this case? To be investigated (by whoever takes this up).

I think it's actually the following bit in Template:Pre:

<pre<includeonly></includeonly> class="pre ...

I bet the <includeonly> there (included to deliberately work around /something or other/ in the legacy parser, I'm sure) is tripping Parsoid up, just because <includeonly> is handled slightly differently from other transclusions. If this were <pre{{1x}} class= I bet it would work fine in Parsoid. (It might break legacy in that case, for whatever reason the <includeonly> was originally added.)

> I bet the <includeonly> there (included to deliberately work around /something or other/ in the legacy parser, I'm sure)

The <includeonly> placed directly after the <pre prevents the regexp $elementsRegex in buildDomTreeArrayFromText of the legacy preprocessor from matching the element as an extension tag, since the tag name needs to be followed by a space or a closing bracket. When the <includeonly> is dropped, however, the <pre is left to recombine with the rest of the string, class="pre ...>, to form a valid HTML5 pre tag. So, as @cscott suspects, it is a workaround to avoid the semantics of the pre extension tag and allow wikitext syntax inside it to be parsed. For example, the heading in

{{Pre|<span>hello</span>

== hi ==
}}

is rendered as

<pre><span>hello</span>

<h2><span class="mw-headline" id="hi">hi</span></h2>
</pre>
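
The matching rule described above can be illustrated with a minimal Python sketch. The pattern below is a simplified stand-in for $elementsRegex, not the real one, and the tag list and the helper names first_pass_tags and strip_includeonly are hypothetical, chosen only for this illustration.

```python
import re

# Assumption: a simplified stand-in for the legacy preprocessor's
# $elementsRegex. An extension tag name only counts when it is followed
# by whitespace, ">" or "/>".
EXT_TAGS = ["pre", "nowiki", "includeonly"]
ELEMENTS_RE = re.compile(r"<(%s)(?=\s|/?>)" % "|".join(EXT_TAGS), re.IGNORECASE)


def first_pass_tags(wikitext):
    """Return the extension-tag names the simplified regex recognises."""
    return [m.group(1).lower() for m in ELEMENTS_RE.finditer(wikitext)]


def strip_includeonly(wikitext):
    """Drop the <includeonly> tags themselves, as happens on transclusion."""
    return re.sub(r"</?includeonly>", "", wikitext)


template_src = '<pre<includeonly></includeonly> class="pre ">{{{1}}}</pre>'

# "<pre" is followed by "<", so it is not seen as an extension tag here.
tags = first_pass_tags(template_src)

# Once the <includeonly> pair is gone, the text reads as an ordinary
# "<pre class=...>" tag, which is what Parsoid receives after expansion.
expanded = strip_includeonly(template_src)
tags_after_expansion = first_pass_tags(expanded)
```

In this sketch the first pass sees only includeonly, while a second pass over the expanded text would see pre, which mirrors why Parsoid treats the expanded output as the pre extension tag.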

> That is because template expansion returns this.

As @ssastry points out, Parsoid gets the post-template-expansion text (after the <includeonly> is stripped) and interprets the pre as an extension tag.

All this is pretty well understood and by design, as described in the documentation of the template:
https://en.wikipedia.org/wiki/Template:Pre

HTML and wikimarkup aren't disabled as in <pre>...</pre> and are rendered as usual (thus if a parameter contains any wikimarkup, enclose it in <nowiki>...</nowiki>); however, multiple spaces are preserved.

In order to get this to work in Parsoid, we could maybe introduce an attribute on the pre extension <pre parsewikitext="1"> (choose a better name) that gives the same semantics and then update the template to remove the hack.

Doing an insource:/"<pre<"/ search on enwiki shows a few other uses of the pattern.

> In order to get this to work in Parsoid, we could maybe introduce an attribute on the pre extension <pre parsewikitext="1"> (choose a better name) that gives the same semantics and then update the template to remove the hack.

I like this proposal.

Change 992274 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] [WIP] Add attribute to pre extension to parse wikitext

https://gerrit.wikimedia.org/r/992274

Change 993051 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/core@master] [WIP] Add attribute to pre extension to parse wikitext

https://gerrit.wikimedia.org/r/993051

Discussed this during Tech forum today. tl;dr is I'm fine with an attribute for the <pre> extension tag to control this behavior, although I'd prefer that it's not called html because I don't want to encourage the "wikitext is a superset of HTML" confusion. The <pre> extension tag, even with this functionality enabled, still has quite a number of differences with its HTML counterpart. I'm fine with parsewikitext or just parsed or just wikitext or whatever, just leave "html" out of it. :)

A related issue is that as used on-wiki, the <pre> extension ostensibly requires access to the parent frame:

[[Template:Demonstration]]
<pre parsed=true>
This is my argument: {{{1}}}
</pre>

As presently implemented, the contents of the <pre> extension (in Parsoid land at least) are "raw text", and if we try to parse this as wikitext we can't properly expand the {{{1}}} because we don't have access to the parent frame.

*However* this distinction between "expanded wikitext" and "raw text" arguments is deeper than this, and we /already have/ a mechanism to pass the body contents of the extension tag "as expanded wikitext", to wit:

[[Template:Demonstration]]
{{#tag:pre|This is my argument: {{{1}}}|parsed=true}}

This works as intended: the {{{1}}} is expanded in the frame of [[Template:Demonstration]] before the argument is passed to the implementation of the <pre> extension tag.
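
The difference between the two call forms can be sketched in a few lines of Python. This is a toy model under stated assumptions, not MediaWiki or Parsoid code: expand_args, pre_extension, render_tag_syntax and render_tag_function are hypothetical names, and a "frame" is modelled as a plain dict of template arguments.

```python
import re


def expand_args(text, frame):
    """Toy expansion: replace {{{name}}} with the value from the frame."""
    return re.sub(
        r"\{\{\{(\w+)\}\}\}",
        lambda m: frame.get(m.group(1), m.group(0)),
        text,
    )


def pre_extension(body):
    """Toy <pre> handler: it only sees the string it is handed, no frame."""
    return "<pre>" + body + "</pre>"


def render_tag_syntax(body, frame):
    # <pre>...</pre>: the body reaches the extension untouched ("raw text"),
    # so the frame is never consulted.
    return pre_extension(body)


def render_tag_function(body, frame):
    # {{#tag:pre|...}}: the body is expanded in the calling template's
    # frame before the extension receives it ("expanded wikitext").
    return pre_extension(expand_args(body, frame))


frame = {"1": "hello"}
body = "This is my argument: {{{1}}}"

raw_result = render_tag_syntax(body, frame)
expanded_result = render_tag_function(body, frame)
```

In this model the extension itself never needs the parent frame; whether {{{1}}} gets expanded depends only on which syntax the caller used.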

Veering a little bit off track, I'll point to T268144#7704327 and the general extension tag/parser function uniformity issues (T204370 will stand in for that discussion). Part of the idea is that *any* argument ought to be able to be passed/fetched either "as raw text" or "as expanded wikitext", which roughly corresponds to "lazy" or "eager" evaluation of the arguments in the traditional programming languages sense. We showed above that the "body" argument for an extension tag can be passed in either form, depending on whether the html-ish <tag> or template-ish {{#tag:...}} syntax is used. It would be desirable to be able to do the same for *any* argument to a transclusion, and perhaps this can be part of the semantics of {T114432: [RFC] Heredoc arguments for templates (aka "hygienic" or "long" arguments)}. That is, we already can pass an "expanded wikitext" argument like:

{{Foo|arg={{{1}}}|bar}}

but if we wanted to pass the argument instead "as raw text" you might write it as

{{Foo|arg=<<<
some | raw | text | ignore | markup
>>>|bar}}

This is a little bit at odds with one of the motivating examples for heredocs (from T114432):

{{cite|id=“32412”|<<<
First person plural pronouns in Isthmus-Mecayapan Nahuat:

:''nejamēn'' ({{IPA|[nehameːn]}}) "We, but not you" (= me & them)
:''tejamēn'' ({{IPA|[tehameːn]}}) "We along with you" (= me & you & them)
>>>}}

In this example we very much wanted the "raw text" interpretation of the = characters, but we *did* want to eventually evaluate the wikitext and expand it.

The basic idea for a solution is that the parameters are passed in as a variant type, from which the implementation can "demand" the appropriate type using an asFoo() method. In this case, the Cite extension would take the body argument and call body.asParsedWikitext() on it. When provided with a variant type containing raw text, the asParsedWikitext() method would parse it. When provided with a variant type containing "expanded wikitext", it would skip the initial preprocessor/expand-templates stage (to avoid double expansion) and parse it from there.

Similarly, .asExpandedWikitext() on the argument provides the usual value for compatibility with existing parser function (etc.) implementations, regardless of whether it was passed as raw text or already as expanded wikitext. .asHtml() is appropriate if the output is going to be spliced into HTML output, and works regardless of whether the argument was provided as wikitext, as raw text, or as HTML (from the strip state; see T257606#9216471).
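
A rough Python sketch of the variant-type idea follows. It is only an illustration of the proposal's shape: the Argument class, the Kind enum, and the toy expand_templates and parse_expanded helpers are all hypothetical, and the method names are snake_case renderings of the asFoo() names used above. What asParsedWikitext() would really return is not specified in this discussion, so the sketch simply returns an HTML string.

```python
import html
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    RAW_TEXT = "raw"
    EXPANDED_WIKITEXT = "expanded"
    HTML = "html"


def expand_templates(text, frame):
    """Toy preprocessor: substitutes {{{name}}} from the frame, nothing else."""
    for name, value in frame.items():
        text = text.replace("{{{%s}}}" % name, value)
    return text


def parse_expanded(wikitext):
    """Toy parser: escapes the text and wraps it in a paragraph."""
    return "<p>" + html.escape(wikitext, quote=False) + "</p>"


@dataclass(frozen=True)
class Argument:
    kind: Kind
    value: str
    # Assumption: raw-text arguments carry the frame they came from.
    frame: dict = field(default_factory=dict)

    def as_expanded_wikitext(self):
        if self.kind is Kind.RAW_TEXT:
            return expand_templates(self.value, self.frame)
        if self.kind is Kind.EXPANDED_WIKITEXT:
            return self.value  # already expanded; do not expand twice
        raise TypeError("HTML cannot be turned back into wikitext")

    def as_parsed_wikitext(self):
        # Raw text is expanded first; expanded wikitext skips that stage.
        return parse_expanded(self.as_expanded_wikitext())

    def as_html(self):
        if self.kind is Kind.HTML:
            return self.value
        return self.as_parsed_wikitext()


raw = Argument(Kind.RAW_TEXT, "arg: {{{1}}}", {"1": "x"})
already = Argument(Kind.EXPANDED_WIKITEXT, "arg: {{{1}}}", {"1": "x"})
ready = Argument(Kind.HTML, "<b>x</b>")
```

The point of the design is that the callee chooses the representation it needs, so the caller's choice of syntax (raw or expanded) no longer dictates how the extension must be written.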