
RemexHTML should be able to parse HTML into an existing DOM node
Open, Needs Triage, Public

Description

Currently, parsing an HTML fragment with Remex looks roughly like this:

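// Namespaces as of this task (2019); newer releases use Wikimedia\RemexHtml\...
use RemexHtml\DOM\DOMBuilder;
use RemexHtml\HTMLData;
use RemexHtml\Tokenizer\Tokenizer;
use RemexHtml\TreeBuilder\Dispatcher;
use RemexHtml\TreeBuilder\TreeBuilder;

// Pipeline: tokenizer -> dispatcher -> tree builder -> DOM builder
$domBuilder = new DOMBuilder();
$treeBuilder = new TreeBuilder( $domBuilder );
$dispatcher = new Dispatcher( $treeBuilder );
$tokenizer = new Tokenizer( $dispatcher, $html, [] );
// Parse $html as a fragment, with a <div> as the context element
$tokenizer->execute( [
    'fragmentNamespace' => HTMLData::NS_HTML,
    'fragmentName' => 'div',
] );
// The parsed nodes are the children of the returned wrapper node
$wrapper = $domBuilder->getFragment();
foreach ( $wrapper->childNodes as $node ) {
    // do something with the resulting DOM forest
}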

When used for innerHTML-style functionality, this means Remex creates its own document and builds the DOM tree within it, and the caller then has to import the nodes into the document where the inner HTML is being replaced. ID indexes get lost during importing (although right now Remex doesn't support them anyway; that's T217696). It would be simpler and less error-prone if Remex could work within a given document, either with a detached fragment wrapper node or with a specified node in that document.
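For illustration, the import step described above might look roughly like the sketch below. It uses only PHP's built-in DOM extension: the wrapper node is built by hand as a stand-in for what `$domBuilder->getFragment()` would return, and `replaceChildrenFromForeignWrapper()` is a hypothetical helper name, not part of Remex.

```php
<?php
/**
 * Hypothetical helper (not a Remex API): replace the children of $target
 * with deep copies of the children of $wrapper, which belongs to a
 * different document.
 */
function replaceChildrenFromForeignWrapper( DOMNode $target, DOMNode $wrapper ): void {
    // Drop the existing content of the target element.
    while ( $target->firstChild !== null ) {
        $target->removeChild( $target->firstChild );
    }
    foreach ( $wrapper->childNodes as $node ) {
        // importNode() makes a deep copy owned by the target document;
        // the original node stays in its own document.
        $copy = $target->ownerDocument->importNode( $node, true );
        $target->appendChild( $copy );
    }
}

// Stand-in for the document Remex builds internally. In real use, $wrapper
// would be the node returned by $domBuilder->getFragment().
$parsedDoc = new DOMDocument();
$wrapper = $parsedDoc->createElement( 'div' );
$parsedDoc->appendChild( $wrapper );
$wrapper->appendChild( $parsedDoc->createElement( 'b', 'bold' ) );
$wrapper->appendChild( $parsedDoc->createTextNode( ' text' ) );

// The document whose element is getting new inner HTML.
$targetDoc = new DOMDocument();
$target = $targetDoc->createElement( 'p' );
$targetDoc->appendChild( $target );
$target->appendChild( $targetDoc->createTextNode( 'old content' ) );

replaceChildrenFromForeignWrapper( $target, $wrapper );
echo $targetDoc->saveHTML( $target ), "\n";
```

Every node is copied once here, and anything tied to the source document does not carry over to the copies. Parsing directly into the target document would avoid this step.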

Event Timeline

Tgr created this task. Mar 5 2019, 8:29 PM
Tgr updated the task description. Mar 5 2019, 8:30 PM
Tgr added a comment. Mar 5 2019, 8:32 PM

Aside: AIUI fragmentName should determine the type of the element returned by $domBuilder->getFragment(), but in practice it always seems to be an HTML element. That probably leads to subtle bugs of its own when the HTML is invalid within the context of the parent element (e.g. someone wants to set the inner HTML of a <table> tag and the result needs reparenting).