
HTML5TreeBuilder pass adds <meta> tags as children of <style> which won't show up in the DOM with a HTML5-aware DOM library
Open, Low, Public

Description

The HTML5TreeBuilder pass adds meta tags (strictly, comments that are converted to meta tags immediately after DOM building) so that we can identify tree builder DOM fixups (for example, auto-inserted start/end tags). However, in some cases these meta tags violate the content model of the parent node. The meta tags can then get lost, which leads to incorrect inferences about tree builder DOM fixups.

For example, meta tags cannot be added as children of <style> tags (introduced by the templatestyles extension). So, we need a content-model aware solution here.

Original bug / description below

See the DSR differences below for the templatestyles tag. Parsoid/JS infers a null starting DSR value where Parsoid/PHP correctly infers a starting DSR value of 0.

[subbu@earth:~/work/wmf/parsoid] echo "<templatestyles src='Template:Quote/styles.css'/>" | node bin/parse.js --body_only
<style data-mw-deduplicate="TemplateStyles:r886047036" typeof="mw:Extension/templatestyles" about="#mwt3" data-parsoid='{"src":"&lt;templatestyles src=&apos;Template:Quote/styles.css&apos;/>","dsr":[null,49,0,0]}' data-mw='{"name":"templatestyles","attrs":{"src":"Template:Quote/styles.css"}}'>.mw-parser-output .templatequote{overflow:hidden;margin:1em 0;padding:0 40px}.mw-parser-output .templatequote .templatequotecite{line-height:1.5em;text-align:left;padding-left:1.6em;margin-top:0}</style>

[subbu@earth:~/work/wmf/parsoid] echo "<templatestyles src='Template:Quote/styles.css'/>" | php bin/parse.php --body_only
<style data-mw-deduplicate="TemplateStyles:r886047036" typeof="mw:Extension/templatestyles" about="#mwt3" data-parsoid='{"dsr":[0,49,49,0]}' data-mw='{"name":"templatestyles","attrs":{"src":"Template:Quote/styles.css"}}'>.mw-parser-output .templatequote{overflow:hidden;margin:1em 0;padding:0 40px}.mw-parser-output .templatequote .templatequotecite{line-height:1.5em;text-align:left;padding-left:1.6em;margin-top:0}</style>

The Parsoid/JS behavior is baffling at first sight. So, here is more info with the --debug and --dump flags, followed by an explanation:

First the Parsoid/JS output:

[subbu@earth:~/work/wmf/parsoid] echo "<templatestyles src='Template:Quote/styles.css'/>" | node bin/parse.js --body_only --debug html --dump dom:pre-process-fixups,dom:post-process-fixups > /dev/null
0-[HTML]       | {"type":"TagTk","name":"style","attribs":[{"k":"data-mw-deduplicate","v":"TemplateStyles:r886047036"},{"k":"data-mw","v":"{\"name\":\"templatestyles\",\"attrs\":{\"src\":\"Template:Quote/styles.css\"}}"},{"k":"typeof","v":"mw:DOMFragment"}],"dataAttribs":{"tsr":[0,49],"src":"<templatestyles src='Template:Quote/styles.css'/>","tmp":{"setDSR":true,"tagId":1},"html":"mwf1","extTagOffsets":[0,49,49,0]}}
0-[HTML-DBG]   | Inserting shadow meta for style
0-[HTML]       | {"type":"EndTagTk","name":"style","attribs":[],"dataAttribs":{"tsr":[49,49],"tmp":{}}}
0-[HTML-DBG]   | Inserting shadow meta for style
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[49,50],"tmp":{}}}
0-[HTML]       | {"type":"EOFTk"}
----- DOM: pre-process-fixups -----
<body data-parsoid='{"tmp":{}}'><style data-mw-deduplicate="TemplateStyles:r886047036" typeof="mw:DOMFragment" data-parsoid='{"tsr":[0,49],"src":"&lt;templatestyles src=&apos;Template:Quote/styles.css&apos;/>","tmp":{"setDSR":true,"tagId":1},"html":"mwf1","extTagOffsets":[0,49,49,0]}' data-mw='{"name":"templatestyles","attrs":{"src":"Template:Quote/styles.css"}}'></style><meta typeof="mw:EndTag" data-etag="style" data-parsoid='{"tsr":[49,49],"tmp":{}}'/>
</body>

----- DOM: post-process-fixups -----
<body data-parsoid='{"tmp":{}}'><style data-mw-deduplicate="TemplateStyles:r886047036" typeof="mw:DOMFragment" data-parsoid='{"tsr":[0,49],"src":"&lt;templatestyles src=&apos;Template:Quote/styles.css&apos;/>","tmp":{"setDSR":true,"tagId":1},"html":"mwf1","extTagOffsets":[0,49,49,0],"autoInsertedStart":true}' data-mw='{"name":"templatestyles","attrs":{"src":"Template:Quote/styles.css"}}'></style><meta typeof="mw:EndTag" data-etag="style" data-parsoid='{"tsr":[49,49],"tmp":{}}'/>
</body>

Note a few things here:

  • HTML5TreeBuilder wt2html pass says it is inserting the start meta tag for the <style> tag
  • But, the DOM dump before the process-fixups pass doesn't show us that meta tag anywhere
  • This leads the process-fixups pass to mark the <style> tag as an auto-inserted tag (!!)
  • This effectively causes the DSR pass to throw up its hands on the <style> tag's starting offset (which makes sense from its POV).
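
The chain of events in the list above can be sketched as a toy model. This is a minimal sketch, not Parsoid's code: `Node`, `mark_auto_inserted_start`, and `dsr_start` are hypothetical names, and the `mw:StartTag` typeof is assumed by analogy with the `mw:EndTag` meta visible in the dump.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical, minimal stand-in for a DOM element."""
    name: str
    typeof: str = ""
    data_parsoid: dict = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def mark_auto_inserted_start(node: Node) -> None:
    """Assumed fixup rule: no start marker as first child means the start tag was auto-inserted."""
    first = node.children[0] if node.children else None
    if first is not None and first.name == "meta" and first.typeof == "mw:StartTag":
        node.children.pop(0)  # marker found and consumed
    else:
        node.data_parsoid["autoInsertedStart"] = True


def dsr_start(node: Node) -> Optional[int]:
    """Assumed DSR rule: only trust tsr when the start tag came from the source."""
    if node.data_parsoid.get("autoInsertedStart"):
        return None
    return node.data_parsoid["tsr"][0]


# Marker dropped by the DOM library (the Parsoid/JS situation described above).
lost = Node("style", data_parsoid={"tsr": [0, 49]})
# Marker still present as first child.
kept = Node("style", data_parsoid={"tsr": [0, 49]},
            children=[Node("meta", typeof="mw:StartTag")])

for n in (lost, kept):
    mark_auto_inserted_start(n)

print(dsr_start(lost), dsr_start(kept))  # None 0
```

The `None` result for the node whose marker was dropped corresponds to the null starting offset in the Parsoid/JS output.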

Next, the Parsoid/PHP output:

[subbu@earth:~/work/wmf/parsoid] echo "<templatestyles src='Template:Quote/styles.css'/>" | php bin/parse.php --body_only --debug html --dump dom:pre-process-fixups,dom:post-process-fixups > /dev/null
0-[HTML]       | {"type":"TagTk","name":"style","attribs":[{"k":"data-mw-deduplicate","v":"TemplateStyles:r886047036"},{"k":"data-mw","v":"{\"name\":\"templatestyles\",\"attrs\":{\"src\":\"Template:Quote/styles.css\"}}"},{"k":"typeof","v":"mw:DOMFragment"}],"dataAttribs":{"tsr":[0,49],"src":"<templatestyles src='Template:Quote/styles.css'/>","tmp":{"setDSR":true,"tagId":1},"html":"mwf1","extTagOffsets":[0,49,49,0]}}
0-[HTML-DBG]   | Inserting shadow meta for style
0-[HTML]       | {"type":"EndTagTk","name":"style","attribs":[],"dataAttribs":{"tsr":[49,49],"tmp":{}}}
0-[HTML-DBG]   | Inserting shadow meta for style
0-[HTML]       | {"type":"NlTk","dataAttribs":{"tsr":[49,50],"tmp":{}}}
0-[HTML]       | {"type":"EOFTk"}
----- DOM: pre-process-fixups -----
<body data-parsoid='{"tmp":{}}'><style data-mw-deduplicate="TemplateStyles:r886047036" typeof="mw:DOMFragment" data-parsoid='{"tsr":[0,49],"src":"&lt;templatestyles src=&apos;Template:Quote/styles.css&apos;/>","tmp":{"setDSR":true,"tagId":1},"html":"mwf1","extTagOffsets":[0,49,49,0]}' data-mw='{"name":"templatestyles","attrs":{"src":"Template:Quote/styles.css"}}'></style><meta typeof="mw:EndTag" data-etag="style" data-parsoid='{"tsr":[49,49],"tmp":{}}'/>
</body>

----- DOM: post-process-fixups -----
<body data-parsoid='{"tmp":{}}'><style data-mw-deduplicate="TemplateStyles:r886047036" typeof="mw:DOMFragment" data-parsoid='{"tsr":[0,49],"src":"&lt;templatestyles src=&apos;Template:Quote/styles.css&apos;/>","tmp":{"setDSR":true,"tagId":1},"html":"mwf1","extTagOffsets":[0,49,49,0]}' data-mw='{"name":"templatestyles","attrs":{"src":"Template:Quote/styles.css"}}'></style><meta typeof="mw:EndTag" data-etag="style" data-parsoid='{"tsr":[49,49],"tmp":{}}'/>
</body>

Note a few things here:

  • HTML5TreeBuilder wt2html pass says it is inserting the start meta tag for the <style> tag
  • But, the DOM dump before the process-fixups pass doesn't show us that meta tag anywhere
  • But, the process-fixups pass does NOT say the <style> tag is auto-inserted, which is of course right. But how did it do that, and why does this differ from Parsoid/JS?

The HTML5 content model for the <style> tag doesn't allow element nodes as children. So, the <meta> tag that the HTML5 tree builder inserts as the first child of the <style> tag should ideally be lost. With Parsoid/JS, which uses Domino, it is lost, and the rest follows. But Parsoid/PHP uses libxml, which doesn't know about the HTML5 content model and happily leaves the meta tag behind. So, the meta tag is found as expected, and everything works properly (only because the use of libxml cancels out the Parsoid bug).

For the observant reader who might be wondering: the Parsoid/PHP DOM dump doesn't show the starting meta tag in <style> because we use the XMLSerializer to dump the DOM. XMLSerializer knows about style's content model and dumps the node value for it, which doesn't include the meta tag content.
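
The content-model point can be seen in isolation with Python's standard `html.parser`, which treats the content of <style> as raw text. This is only an illustration of the content model at the parsing level: it involves neither Domino, libxml, nor Parsoid's tree builder, and `ElementLister` and `scan` are names made up for this sketch.

```python
from html.parser import HTMLParser


class ElementLister(HTMLParser):
    """Records the start tags and the text that the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def scan(html):
    """Return (list of start tags seen, concatenated text content)."""
    parser = ElementLister()
    parser.feed(html)
    parser.close()
    return parser.start_tags, "".join(parser.text)


# Inside <style>, the would-be meta tag is just text, not an element.
in_style = scan('<style><meta typeof="mw:StartTag"/>.a{color:red}</style>')
# Inside <div>, the same markup is reported as a real element.
in_div = scan('<div><meta typeof="mw:StartTag"/>text</div>')

print(in_style[0])  # ['style']
print(in_div[0])    # ['div', 'meta']
```

A DOM library that enforces this content model has no place to keep an element child of <style>, which is why a marker placed there cannot be relied upon.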

Now, if we replace libxml with an HTML5-aware DOM library, this Parsoid/JS bug will resurface in Parsoid/PHP. The underlying bug here is that the Parsoid HTML5TreeBuilder's meta-tag insertion code is broken. Ideally, we would stop using meta tags for this and use attributes, or register event callbacks into Remex's tree building and explore whether we can identify the auto-inserted status of tags that way.
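
To illustrate why the attribute route sidesteps the problem, here is a minimal sketch on a dict-based toy DOM. Everything in it is an assumption made for illustration: the attribute name `data-start-from-source`, the helper names, and the `RAW_TEXT_PARENTS` set are hypothetical and are not part of Parsoid or Remex.

```python
# Assumption: elements whose content model forbids element children.
RAW_TEXT_PARENTS = {"style", "script"}


def make_element(name, source_start_tag):
    """Build a toy element; the flag is stored on the element itself."""
    el = {"name": name, "attrs": {}, "children": []}
    if source_start_tag:
        el["attrs"]["data-start-from-source"] = "1"  # hypothetical attribute
    return el


def enforce_content_model(el):
    """Mimic an HTML5-aware DOM: drop element children of raw-text parents."""
    if el["name"] in RAW_TEXT_PARENTS:
        el["children"] = [c for c in el["children"] if isinstance(c, str)]
    return el


def is_auto_inserted_start(el):
    """The start tag counts as auto-inserted only if the flag is absent."""
    return "data-start-from-source" not in el["attrs"]


style = make_element("style", source_start_tag=True)
# A child marker (the current approach) next to the CSS text.
style["children"] = [{"name": "meta", "attrs": {}, "children": []}, ".a{}"]
enforce_content_model(style)

print(style["children"])              # ['.a{}']
print(is_auto_inserted_start(style))  # False
```

The child marker is dropped by the content-model check, but the attribute survives because it does not depend on what the element is allowed to contain.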

So, anyway, the net result is that, because Parsoid/JS is unable to compute a DSR value for templatestyles in certain scenarios, selser cannot kick in for this node and Parsoid/JS introduces dirty diffs. Specifically, during rt-testing, for the cuwiki:Главьна_страница page with revid 76998, Parsoid/JS introduces a dirty selser diff, but Parsoid/PHP does not. This is a narrow edge case that arises when a templatestyles tag is found at the start of the page, as happens with this revid.

In other cases, the forward propagation pass of DSR computation updates the starting offset correctly. See below:

[subbu@earth:~/work/wmf/parsoid] echo "[[x]]<templatestyles src='Template:Quote/styles.css'/>" | node bin/parse.js --body_only 
<p data-parsoid='{"dsr":[0,54,0,0]}'><a rel="mw:WikiLink" href="./X" title="X" data-parsoid='{"stx":"simple","a":{"href":"./X"},"sa":{"href":"x"},"dsr":[0,5,2,2]}'>x</a><style data-mw-deduplicate="TemplateStyles:r886047036" typeof="mw:Extension/templatestyles" about="#mwt3" data-parsoid='{"dsr":[5,54,0,0]}' data-mw='{"name":"templatestyles","attrs":{"src":"Template:Quote/styles.css"}}'>.mw-parser-output .templatequote{overflow:hidden;margin:1em 0;padding:0 40px}.mw-parser-output .templatequote .templatequotecite{line-height:1.5em;text-align:left;padding-left:1.6em;margin-top:0}</style></p>
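
The offsets in this output can be reproduced with a much-simplified model of forward propagation, in which a missing start offset is taken from the previous sibling's end offset. This is a sketch under that single assumption, not the actual DSR algorithm; `forward_propagate` is a hypothetical name.

```python
from typing import List, Optional, Tuple

Dsr = Tuple[Optional[int], Optional[int]]  # (start, end) source offsets


def forward_propagate(siblings: List[Dsr]) -> List[Dsr]:
    """Fill a missing start offset from the previous sibling's end offset."""
    out: List[Dsr] = []
    prev_end: Optional[int] = None
    for start, end in siblings:
        if start is None and prev_end is not None:
            start = prev_end
        out.append((start, end))
        prev_end = end
    return out


link = "[[x]]"
ext = "<templatestyles src='Template:Quote/styles.css'/>"

# Extension tag at the start of the page: nothing to propagate from.
alone = forward_propagate([(None, len(ext))])
# Extension tag after a wikilink: the link's end offset fills the gap.
after_link = forward_propagate([(0, len(link)), (None, len(link) + len(ext))])

print(alone)       # [(None, 49)]
print(after_link)  # [(0, 5), (5, 54)]
```

The first result matches the [null,49,...] DSR from the original report, and the second matches the [5,54,...] DSR on the <style> tag above.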

Event Timeline

ssastry created this task.
ssastry renamed this task from "Fortuitous bug-fix in Parsoid/PHP (due to PHP DOM & Domino DOM differences)" to "HTML5TreeBuilder pass adds <meta> tags as children of <style> which won't show up in the DOM with a HTML5-aware DOM library". (Apr 17 2020, 7:07 PM)
ssastry updated the task description.