Page MenuHomePhabricator

unpackDOMFragments doesn't handle all mandatory content model constraints (those that are enforced by the HTML5 tree builder on a HTML5 parse)
Open, MediumPublic

Description

[subbu@earth parsoid] echo '{{1x|[[Foo|<code><syntaxhighlight lang="php">$x</syntaxhighlight></code>]]}}' | parse.js --normalize
<p><a href="Foo" title="Foo"><code>
<div>
<div dir="ltr"><pre><span>$x</span></pre></div>
</div>
</code></a></p>

In this case, the syntaxhighlight extension produces a <div> which cannot be embedded inside a p-tag. Parsoid wraps that output in a DOM fragment and it is later unpacked into its container. But, that unpacking only handles a few mandatory content model constraints. P-in-P and DIV-in-P and other constraint that the tree builder enforces aren't being handled correctly.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Something messed up in template encapsulation clearly:

[subbu@earth parsoid] echo '{{wg|ResourceModuleSkinStyles}}' | parse.js --wt2wt --prefix mediawikiwiki
{{wg|ResourceModuleSkinStyles}}<syntaxhighlight lang="php" enclose="none">$wgResourceModuleSkinStyles</syntaxhighlight>

Parsoid is generating broken HTML output. A <p> tag cannot be nested inside another <p> tag. On parse (inside VE, or inside Parsoid during html -> wt), that p-in-p gets split up which breaks the continuity of template markup.

[subbu@earth parsoid] echo '{{wg|ResourceModuleSkinStyles}}' | parse.js --normalize --prefix mediawikiwiki
[warn/api/etimedout][mediawikiwiki/Main Page] Failed API request, {"error":{"code":"ETIMEDOUT"},"retries-remaining":1}

<p><a href="Special:MyLanguage/Manual:$wgResourceModuleSkinStyles" title="Special:MyLanguage/Manual:$wgResourceModuleSkinStyles"><code>
<div>
<p><code dir="ltr"><span>$wgResourceModuleSkinStyles</span></code></p>
</div>
</code></a></p>
[subbu@earth parsoid] echo '{{1x|[[Foo|<code><syntaxhighlight lang="php">$x</syntaxhighlight></code>]]}}' | parse.js --wt2wt
{{1x|[[Foo|<code><syntaxhighlight lang="php">$x</syntaxhighlight></code>]]}}<syntaxhighlight lang="php">$x</syntaxhighlight>

So, I think like unpackDOMFragments.js code is implicated (which I suspected and which is kind of a known issue) because that code doesn't fully handle all HTML5 content model constraints ... so in this case, when a DOM fragment containing a <div> or <p> is composed into a DOM node that is in turn nested inside another p-tag, we don't fix up the DOM appropriately. I think T141226 is a similar issue except for A-in-A, I think. We do have code to handle some common content model violations (A-in-A), but, that code needs a more generic refresh.

ssastry renamed this task from Specific template on mediawiki.org with <syntaxhighlight> inside it gets messed up when editing in VE to unpackDOMFragments doesn't handle all mandatory content model constraints (those that are enforced by the HTML5 tree builder on a HTML5 parse).May 11 2017, 10:47 PM
ssastry triaged this task as Medium priority.
ssastry added a project: Parsoid-DOM.
ssastry updated the task description. (Show Details)

Change 371726 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Stop shuttling link content through tree building as dom fragment tokens

https://gerrit.wikimedia.org/r/371726

The patch above fixes this specific case by not packing link content in dom fragment tokens and letting the treebuilder do its thing. The general limitations of unpacker's hasBadNesting() is well documented though. We probably want to fix that in concert with template balancing, where a more complete solution is required to enforce constraints.

Change 371726 abandoned by Arlolra:
Stop shuttling link content through tree building as dom fragment tokens

https://gerrit.wikimedia.org/r/371726

Might be worth adding a "serialize-then-parse and check you get the same tree" assertion at some point to try to catch these cases more aggressively. @ssastry mentioned that we sort of do this already with our RT testing framework. In theory we could do this in parser tests as well, but I was a little disappointed to see that we didn't already have a parser test case for [[Foo|<div></div>]] so adding that assertion to the parser test framework might not buy us any extra coverage or protection.