Page MenuHomePhabricator

RemexHTML should be able to parse HTML into an existing DOM node
Open, LowPublic

Description

Currently the way to parse a HTML fragment with Remex is along the lines of

$domBuilder = new DOMBuilder();
$treeBuilder = new TreeBuilder( $domBuilder );
$dispatcher = new Dispatcher( $treeBuilder );
$tokenizer = new Tokenizer( $dispatcher, $html, [] );
$tokenizer->execute( [
    'fragmentNamespace' => HTMLData::NS_HTML,
    'fragmentName' => 'div',
] );
$wrapper = $domBuilder->getFragment();
foreach ( $wrapper->childNodes as $node ) {
    // do something with the resulting DOM forest
}

When used for innerHTML-style funcionality, that means Remex will create a document, build the DOM tree within it, then we have to import the nodes into the document where the inner HTML replacement is being done. ID indexes get lost during importing (although right now Remex doesn't support them anyway; that's T217696). It would be simpler and less error-prone if Remex could work within a given document (either with a detached fragment wrapper node, or using a specified node in the document for that).

Event Timeline

Aside: AIUI fragmentName should determine the type of the element returned by $domBuilder->getFragment(), but in practice it always seems to be a HTML element. That probably leads to subtle bugs of its own when the HTML is invalid within the context of the parent element (e.g. someone wants to set the inner HTML of a <table> tag and the result needs reparenting).

Change 617282 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] One document to rule them all

https://gerrit.wikimedia.org/r/617282

In theory TreeBuilder should work fine w/ various parent contexts, that's how the parent WHATWG/W3C HTML parsing spec works. Haven't taken a good look at Remex yet to see how hard that is to do 'properly'...

Change 622425 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Remove special case for the html extension

https://gerrit.wikimedia.org/r/622425

Change 622425 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Remove special case for the html extension when unpacking

https://gerrit.wikimedia.org/r/622425

Change 625641 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a8

https://gerrit.wikimedia.org/r/625641

Change 625641 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a8

https://gerrit.wikimedia.org/r/625641

Change 617282 merged by jenkins-bot:
[mediawiki/services/parsoid@master] One document to rule them all

https://gerrit.wikimedia.org/r/617282

It would be simpler and less error-prone if Remex could work within a given document (either with a detached fragment wrapper node, or using a specified node in the document for that).

The spec seems to suggest creating a new document when parsing an HTML fragment,
https://html.spec.whatwg.org/#html-fragment-parsing-algorithm

The spec seems to suggest creating a new document when parsing an HTML fragment,

That's an internal detail. The DOM Parsing spec says the created fragment should be part of the context document, not a new document (Let fragment be a new DocumentFragment whose node document is context element's node document.) So the current behavior (or the one at the time of filing the bug, anyway; I have not checked recently) is clearly incorrect.

That's a bit different from what the task asks for (providing an option to parse into a node instead of a DOMDocumentFragment, given that will be the end goal for 99% of use cases) but not having to do an import (which in PHP's not-quite-compliant implementation is an extra source of fragility) would already be an improvement.

Whether or not strictly following the standard in the parsing steps and actually creating a new document is worth the presumable performance hit is also an internal detail, but worth considering, IMO.

Arlolra subscribed.

Change 635100 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a12

https://gerrit.wikimedia.org/r/635100

Change 635100 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a12

https://gerrit.wikimedia.org/r/635100

Change 662672 had a related patch set uploaded (by Paladox; owner: Arlolra):
[mediawiki/services/parsoid@REL1_35] Remove special case for the html extension when unpacking

https://gerrit.wikimedia.org/r/662672

Change 662672 merged by jenkins-bot:
[mediawiki/services/parsoid@REL1_35] Remove special case for the html extension when unpacking

https://gerrit.wikimedia.org/r/662672

Change 677986 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@REL1_35] Update Parsoid to 0.12.2

https://gerrit.wikimedia.org/r/677986

Change 677986 merged by C. Scott Ananian:

[mediawiki/vendor@REL1_35] Update Parsoid to 0.12.2

https://gerrit.wikimedia.org/r/677986

For prioritization: what is the specific use case?

I don't remember the specifics but I'm guessing it came up either during the Parsoid migration or (more likely) when discussing T255586: Replace HTMLFormatter by Remex. But in general I just feel it would improve the usability of Remex for small DOM transformation tasks (e.g. the kind of thing the first-sentence extraction logic does in the Page Content Services API - that was written as node.js because we did not have good HTML5 handling in PHP at the time, IMO we should incentivise using the MediaWiki REST API for similar things in the future).