RemexHTML should be able to parse HTML into an existing DOM node
Open, Low, Public

Assigned To

None

Authored By

	Tgr
	Mar 5 2019, 8:29 PM

Description

Currently the way to parse an HTML fragment with Remex is along the lines of:

$domBuilder = new DOMBuilder();
$treeBuilder = new TreeBuilder( $domBuilder );
$dispatcher = new Dispatcher( $treeBuilder );
$tokenizer = new Tokenizer( $dispatcher, $html, [] );
$tokenizer->execute( [
    'fragmentNamespace' => HTMLData::NS_HTML,
    'fragmentName' => 'div',
] );
$wrapper = $domBuilder->getFragment();
foreach ( $wrapper->childNodes as $node ) {
    // do something with the resulting DOM forest
}

When used for innerHTML-style functionality, this means Remex creates its own document and builds the DOM tree within it, and we then have to import the nodes into the document where the inner HTML replacement is being done. ID indexes get lost during importing (although right now Remex doesn't support them anyway; that's T217696). It would be simpler and less error-prone if Remex could work within a given document (either with a detached fragment wrapper node, or using a specified node in the document for that).
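
For illustration, the import step described above might look roughly like the sketch below. It uses only PHP's built-in DOM extension; `replaceInnerNodes()` is a hypothetical helper (not part of Remex), and the hand-built `$wrapper` merely stands in for whatever `$domBuilder->getFragment()` would return.

```php
<?php
// Hypothetical helper, not part of Remex: replace the children of $target
// with copies of the children of $wrapper, which lives in another document.
function replaceInnerNodes( DOMElement $target, DOMNode $wrapper ): void {
    $doc = $target->ownerDocument;
    // Drop the existing content of the target element.
    while ( $target->firstChild !== null ) {
        $target->removeChild( $target->firstChild );
    }
    // importNode() copies each node into the target document; the originals
    // stay in $wrapper, so iterating the live childNodes list is safe here.
    foreach ( $wrapper->childNodes as $child ) {
        $target->appendChild( $doc->importNode( $child, true ) );
    }
}

// Stand-in for the separate document that the parser creates.
$source = new DOMDocument();
$wrapper = $source->createElement( 'div' );
$source->appendChild( $wrapper );
$wrapper->appendChild( $source->createElement( 'b', 'bold' ) );
$wrapper->appendChild( $source->createTextNode( ' text' ) );

// The document where the inner HTML replacement is being done.
$targetDoc = new DOMDocument();
$target = $targetDoc->createElement( 'p', 'old' );
$targetDoc->appendChild( $target );

replaceInnerNodes( $target, $wrapper );
echo $targetDoc->saveHTML( $target ), "\n";
```

Every parsed node gets copied once more here, which is the extra step the task proposes to avoid.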

Details

Subject	Repo	Branch	Lines +/-
Update Parsoid to 0.12.2	mediawiki/vendor	REL1_35	+66 -125
Remove special case for the html extension when unpacking	mediawiki/services/parsoid	REL1_35	+20 -45
Bump wikimedia/parsoid to 0.13.0-a12	mediawiki/vendor	master	+1 K -974
One document to rule them all	mediawiki/services/parsoid	master	+720 -523
Bump wikimedia/parsoid to 0.13.0-a8	mediawiki/vendor	master	+517 -331
Remove special case for the html extension when unpacking	mediawiki/services/parsoid	master	+20 -45


Related Objects

Mentioned In: T179082: Use one ownerDocument for the entire parse
T217850: Remex could use some helper/utility classes
T215000: Fill gaps in PHP DOM's functionality
Mentioned Here: T255586: Replace HTMLFormatter by Remex
T217696: Remex doesn't set ID attributes

Event Timeline

Tgr created this task. Mar 5 2019, 8:29 PM

Restricted Application added a subscriber: Aklapper. Mar 5 2019, 8:29 PM

Tgr updated the task description. Mar 5 2019, 8:30 PM

Aside: AIUI fragmentName should determine the type of the element returned by $domBuilder->getFragment(), but in practice it always seems to be an HTML element. That probably leads to subtle bugs of its own when the HTML is invalid within the context of the parent element (e.g. someone wants to set the inner HTML of a <table> tag and the result needs reparenting).

Tgr mentioned this in T215000: Fill gaps in PHP DOM's functionality. Mar 5 2019, 8:35 PM

Tgr mentioned this in T217850: Remex could use some helper/utility classes. Mar 14 2019, 12:41 AM

Tgr mentioned this in T179082: Use one ownerDocument for the entire parse. Jun 24 2019, 6:37 PM

Arlolra claimed this task. Jul 28 2020, 5:35 PM

Change 617282 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] [WIP] One document to rule them all

https://gerrit.wikimedia.org/r/617282

gerritbot added a project: Patch-For-Review. Jul 29 2020, 11:06 PM

In theory TreeBuilder should work fine w/ various parent contexts, that's how the parent WHATWG/W3C HTML parsing spec works. Haven't taken a good look at Remex yet to see how hard that is to do 'properly'...

https://github.com/fgnass/domino/issues/73

Change 622425 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Remove special case for the html extension

https://gerrit.wikimedia.org/r/622425

Change 622425 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Remove special case for the html extension when unpacking

https://gerrit.wikimedia.org/r/622425

Change 625641 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a8

https://gerrit.wikimedia.org/r/625641

Change 625641 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a8

https://gerrit.wikimedia.org/r/625641

Change 617282 merged by jenkins-bot:
[mediawiki/services/parsoid@master] One document to rule them all

https://gerrit.wikimedia.org/r/617282

Maintenance_bot removed a project: Patch-For-Review. Sep 29 2020, 11:10 PM

It would be simpler and less error-prone if Remex could work within a given document (either with a detached fragment wrapper node, or using a specified node in the document for that).

The spec seems to suggest creating a new document when parsing an HTML fragment,
https://html.spec.whatwg.org/#html-fragment-parsing-algorithm

In T217705#6506978, @Arlolra wrote:

The spec seems to suggest creating a new document when parsing an HTML fragment,

That's an internal detail. The DOM Parsing spec says the created fragment should be part of the context document, not a new document ("Let fragment be a new DocumentFragment whose node document is context element's node document."). So the current behavior (or the one at the time of filing the bug, anyway; I have not checked recently) is clearly incorrect.

That's a bit different from what the task asks for (providing an option to parse into a node instead of a DOMDocumentFragment, given that will be the end goal for 99% of use cases) but not having to do an import (which in PHP's not-quite-compliant implementation is an extra source of fragility) would already be an improvement.

Whether or not strictly following the standard in the parsing steps and actually creating a new document is worth the presumable performance hit is also an internal detail, but worth considering, IMO.
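
The DOM Parsing wording quoted above can be illustrated with PHP's built-in DOM extension. This is only a sketch of the idea, not Remex code: `createFragmentFor()` is a hypothetical helper, and the `span` is created by hand where a parser working inside the context document would create its nodes.

```php
<?php
// Hypothetical helper: a fragment whose node document is the context
// element's node document, as in the DOM Parsing wording quoted above.
function createFragmentFor( DOMElement $context ): DOMDocumentFragment {
    return $context->ownerDocument->createDocumentFragment();
}

$doc = new DOMDocument();
$context = $doc->createElement( 'div' );
$doc->appendChild( $context );

$fragment = createFragmentFor( $context );
// Assumption: the parser creates its nodes in $doc rather than in a new document.
$fragment->appendChild( $doc->createElement( 'span', 'hi' ) );
$sameDocument = $fragment->ownerDocument->isSameNode( $doc );

// No importNode() call is needed: appending the fragment moves its children
// straight into the context element.
$context->appendChild( $fragment );
echo $doc->saveHTML( $context ), "\n";
```

Since the nodes already belong to the context document, the fragile import step does not arise.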

Arlolra removed Arlolra as the assignee of this task. Sep 30 2020, 10:12 PM

Arlolra subscribed.

Change 635100 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a12

https://gerrit.wikimedia.org/r/635100

gerritbot added a project: Patch-For-Review. Oct 19 2020, 10:52 PM

Change 635100 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a12

https://gerrit.wikimedia.org/r/635100

Maintenance_bot removed a project: Patch-For-Review. Oct 20 2020, 2:10 AM

Change 662672 had a related patch set uploaded (by Paladox; owner: Arlolra):
[mediawiki/services/parsoid@REL1_35] Remove special case for the html extension when unpacking

https://gerrit.wikimedia.org/r/662672

gerritbot added a project: Patch-For-Review. Feb 8 2021, 4:43 PM

Change 662672 merged by jenkins-bot:
[mediawiki/services/parsoid@REL1_35] Remove special case for the html extension when unpacking

https://gerrit.wikimedia.org/r/662672

Change 677986 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@REL1_35] Update Parsoid to 0.12.2

https://gerrit.wikimedia.org/r/677986

Change 677986 merged by C. Scott Ananian:

[mediawiki/vendor@REL1_35] Update Parsoid to 0.12.2

https://gerrit.wikimedia.org/r/677986

tstarling moved this task from Inbox to Actually In RemexHtml on the RemexHtml board. Dec 22 2022, 11:40 PM

For prioritization: what is the specific use case?

I don't remember the specifics, but I'm guessing it came up either during the Parsoid migration or (more likely) when discussing T255586: Replace HTMLFormatter by Remex. In general, though, I just feel it would improve the usability of Remex for small DOM transformation tasks (e.g. the kind of thing the first-sentence extraction logic does in the Page Content Services API; that was written in Node.js because we did not have good HTML5 handling in PHP at the time, and IMO we should incentivise using the MediaWiki REST API for similar things in the future).

tstarling triaged this task as Low priority. Dec 23 2022, 6:11 AM
