Page MenuHomePhabricator

Fill gaps in PHP DOM's functionality
Closed, ResolvedPublic

Description

PHP DOM is based on DOM Level 1 Core and DOM Level 2 XML, and doesn't support some DOM HTML functionality that Parsoid relies on. Specifically, here is a list of gaps (non-exhaustive):

  • querySelector(..) and querySelectorAll(...)
  • body property on the DOMDocument
  • innerHTML, outerHTML setters / getters on DOMElement
  • classList on DOMElement

We need to provide home-grown utilities for these OR extend PHP's DOM implementation to support this.

Notes from our meeting on Jan 27, 2019

  • PHP DOM implementation: security fixes, spec compliance (2015), and other fixes ... so, being maintained
  • Five possible strategies to filling gaps:
    1. Update the DOM extension in PHP core to follow modern DOM standards
    2. Wrap a more modern C library (since libxml2 seems to be frozen in time)
      • In Sep 2018 PHP bumped libxml to 2.7.6, but it looks like libxml2 is in bugfix-only mode
    3. Write a pure-PHP implementation of the DOM (like domino and/or Remex) -- not happening but listed here for purposes of completeness
    4. Subclass the built-in DOM libraries's objects, overwrite with new functionality as needed -- most likely
      • This could be a composer library which others could help maintain
      • If/when functionality is added to core library, the subclass methods could be removed
      • Open Q: document.createElement() etc need to be overridden to create subclass
      • This is most pragmatic option, we'll probably do this (at least initially)
    5. Better yet: find *someone else* who is doing one of the above, and use their work!

Related Objects

Mentioned In
T204945: Deprecate one of the Preprocessor implementations for 1.34
T218183: Audit uses of PHP DOM in Wikimedia software
T217867: Port domino (or another spec-compliant DOM library) to PHP
rMLZE459622063d97: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
rMLZE0bb6839c23e3: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZE8d22c0ce9397: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZE5552f635c0e8: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
T217766: Flow\Exception\WikitextException: ParseEntityRef: no name
rMLZE63a19879b1f8: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZE829003598c0a: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
rMLZEd31425336579: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZE238521ed469c: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
rMLZE30566f9be355: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZE63e3448acc0d: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
rMLZEd811b6b05f92: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZEe756a31b2582: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
rMLZE416c9af2945f: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZEe7c949cb8781: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
T217708: Remex should offer an option to not set namespaceURI
T217700: Remex doesn't uppercase tag/node names
T217696: Remex doesn't set ID attributes
T217360: Replace libxml/xpath in HtmlFormatter with Remex/zest
Mentioned Here
T218183: Audit uses of PHP DOM in Wikimedia software
T217766: Flow\Exception\WikitextException: ParseEntityRef: no name
T217708: Remex should offer an option to not set namespaceURI
T217696: Remex doesn't set ID attributes
T217705: RemexHTML should be able to parse HTML into an existing DOM node
T204608: Use a bag-on-the-side implementation, rather than an internal .dataobject for node data

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Doesn't T204608#4925526 affect this?

Doesn't T204608#4925526 affect this?

That comment implies that if you let the DOMDocument go out of scope, but retain a node reference, maybe $node->ownerDocument would be a new object, missing the custom property. But presumably we would keep a reference to the DOMDocument somewhere, so this would not happen. So I think it would be fine, as long as we are only extending the DOMDocument and not the nodes.

I am not sure we always keep around a reference to the document and this now requires that we keep it around always which is potentially a fragile scenario. A lof the time we are passing around body and other nodes knowing that we can always recover the owner document from any individual node. I think a safer solution would be to implement $doc->body() that uses a cached property body that you propose above. And, all code should use body() instead of body.

Anomie added a comment.Feb 7 2019, 8:59 PM
  1. Update the DOM extension in PHP core to follow modern DOM standards
  2. Wrap a more modern C library (since libxml2 seems to be frozen in time)
    • In Sep 2018 PHP bumped libxml to 2.7.6, but it looks like libxml2 is in bugfix-only mode

Long term this would be nice to do, and I wouldn't mind helping do it. But I don't think we want to block Parsoid-PHP on it.

  1. Subclass the built-in DOM libraries's objects, overwrite with new functionality as needed -- most likely
    • This could be a composer library which others could help maintain
    • If/when functionality is added to core library, the subclass methods could be removed
    • Open Q: document.createElement() etc need to be overridden to create subclass
    • This is most pragmatic option, we'll probably do this (at least initially)

Another option would be to do it as a "bag on the side", with static methods like PHPDomUtils::getBody( DOMDocument $doc ), PHPDomUtils::getOuterHtml( DOMNode $node ), PHPDomUtils::setOuterHtml( DOMNode $node, string $html ), and so on.

That'd be a bit less convenient to work with, but also less likely to give us trouble over inheritance and constructors. For example,

  • Is OurDOMElement a subclass of OurDOMNode or DOMElement? Or do we have a completely separate class hierarchy that wraps the PHP DOM classes? Or a separate hierarchy of interfaces, that are implemented by subclasses of the PHP DOM classes (which aren't supposed to be referred to directly, only as the interfaces)?
  • How do we manage to construct an OurDOMNode since we don't have access to the internal libxml state to properly initialize DOMNode? Or do we ignore everything from the base class and actually wrap another DOMNode, with every method and property redirected to the wrapped copy instead of the base class copy? Or, again, do we just have a completely separate class hierarchy?

We have shaved a great many yaks to get this far, and what is one more .. but, that said, I am very keen to avoid expanding the scope of the PHP port and elongate the critical path.

So, for now, I suggest we write helpers that provide the missing functionality. I am open to how that is done. But, once the port is done, someone can embark on the project to actually enhance PHP's abilities in this domain.

Tgr added a comment.Feb 8 2019, 9:03 PM

Subclassing probably would not work, since new DOM objects are instantiated by the original code, and there is no reliable and performant way of creating an OurDOMXXX instance from a DOMXXX instance. So it would have to be a wrapper/decorator approach (where the wrappers probably have the same hierarchy as the originals - making the wrappers subclasses of the originals would not gain us anything, the hierarchy would be subtly broken so they would not be interchangeable with the originals at all).

The proposed static methods are also a sort of wrapper, but only at the point of extraction, so less code. I guess it comes down to how often we use not natively supported methods - if not terribly often, the utils class seems like a nicer approach.

I am not sure we always keep around a reference to the document and this now requires that we keep it around always which is potentially a fragile scenario. A lof the time we are passing around body and other nodes knowing that we can always recover the owner document from any individual node. I think a safer solution would be to implement $doc->body() that uses a cached property body that you propose above. And, all code should use body() instead of body.

See this comment ... so, I think Tim's idea would work for solving that problem as well as for the .body property itself. Since $env is alive for the lifetime of the request, the PHP wrapper for the document root as well as the body can be kept alive by adding two properties in $env. I didn't fully grasp the import of Tim's suggestion till I realized I had essentially made the same suggestion on that gerrit patch. :-)

ssastry reassigned this task from cscott to Tgr.Feb 15 2019, 8:11 PM
ssastry added a subscriber: cscott.

Ok, I'm going to reassign this to @Tgr but @cscott is currently trying to generate an inventory of DOM methods that parsoid uses so we have a complete picture of all the gaps.

cscott added a comment.EditedFeb 15 2019, 10:28 PM

Here's an inventory of all domino top-level methods (that is, not those called internally w/in the domino implementation) invoked by Parsoid when running npm run parserTests. The first column is the number of dynamic invocations, which isn't so interesting. What would be more interesting is the number of different callsites within Parsoid; I'll try to generate that next.

I've added a third column which lists the equivalent PHP function, if any.

904578	Attr#name[get]	DOMAttr->name
1479526	Attr#value[get]	DOMAttr->value
680801	AttributesArray#item	-
517723	AttributesArray#length[get]	-
83	CharacterData#nextElementSibling[get]	-
172	CharacterData#previousElementSibling[get]	-
7	CharacterData#replaceWith	-
68618	Comment#data[get]	DOMComment->data
85524	Comment#nodeName[get]	DOMComment->nodeName
7990	Comment#nodeValue[get]	DOMComment->nodeValue
831	Comment#textContent[get]	DOMComment->textContent
1377098	ContainerNode#childNodes[get]	->childNodes
2654478	ContainerNode#firstChild[get]	->firstChild
334933	ContainerNode#hasChildNodes	DOMNode::hasChildNodes()
286212	ContainerNode#lastChild[get]	DOMNode->lastChild
25500	DOMTokenList#add	-
14375	DOMTokenList#contains	-
8028	DOMTokenList#length[get]	-
8	DOMTokenList#remove	-
378489	Document#body[get]	-
6761	Document#createComment	DOMDocument::createComment
627	Document#createDocumentFragment	DOMDocument::createDocumentFragment
132924	Document#createElement	DOMDocument::createElement
68069	Document#createTextNode	DOMDocument::createTextNode
5526	Document#getElementById	DOMDocument::getElementById (but probably broken)
39226	Document#head[get]	-
1625	Document#importNode	DOMDocument::importNode
23860	Document#nodeName[get]	DOMDocument->nodeName
9	Document#querySelector	-
4896	Document#querySelectorAll	-
3564	Document#title[set]	-
100	DocumentFragment#nodeName[get]	DOMDocumentFragment->nodeName
517783	Element#attributes[get]	-
43992	Element#classList[get]	-
1175	Element#clone	- (but DOMNode::cloneNode exists)
11374982	Element#getAttribute	DOMElement::getAttribute
27437	Element#hasAttribute	DOMElement::hasAttribute
1993	Element#lastElementChild[get]	-
125	Element#nextElementSibling[get]	-
12463819	Element#nodeName[get]	DOMElement->nodeName
1	Element#nodeValue[get]	DOMElement->nodeValue
218223	Element#outerHTML[get]	-
2931	Element#previousElementSibling[get]	-
10250	Element#querySelector	-
7241	Element#querySelectorAll	-
208	Element#remove	-
1013675	Element#removeAttribute	DOMElement::removeAttribute
271457	Element#setAttribute	DOMElement::setAttribute
253857	Element#tagName[get]	-
39251	Element#textContent[get]	DOMElement->textContent
11	Element#textContent[set]	DOMElement->textContent
48	FilteredElementList#length[get]	-
480	HTMLElement#innerHTML[get]	-
684	HTMLElement#innerHTML[set]	-
4896	HTMLParser.document	-
108989	HTMLParser.insertToken	-
90523	Leaf#childNodes[get]	DOMNode->childNodes
490041	Leaf#firstChild[get]	DOMNode->firstChild
16235	Leaf#hasChildNodes	DOMNode::hasChildNodes()
36427	Node#COMMENT_NODE[get]	unk
1062	Node#DOCUMENT_NODE[get]	unk
386862	Node#ELEMENT_NODE[get]	unk
191251	Node#TEXT_NODE[get]	unk
64121	Node#appendChild	DOMNode::appendChild
4652	Node#cloneNode	DOMNode::cloneNode
109101	Node#insertBefore	DOMNode::insertBefore
5657206	Node#nextSibling[get]	DOMNode->nextSibling
52988	Node#normalize	DOMNode::normalize (but check performance)
7717	Node#parentElement[get]	-
1846921	Node#previousSibling[get]	DOMNode->previousSibling
44631	Node#removeChild	DOMNode::removeChild
54041	Node#replaceChild	DOMNode::replaceChild
240206	Text#data[get]	DOMText->data
47662	Text#data[set]	DOMText->data
1909670	Text#nodeName[get]	DOMText->nodeName
1021897	Text#nodeValue[get]	DOMText->nodeValue
49	Text#nodeValue[set]	DOMText->nodeValue
40087	Text#textContent[get]	DOMText->textContent
26	Text#textContent[set]	DOMText->textContent
121906	domino.createDocument	-
1	domino.createWindow	-
13356	impl.HTMLParser	-
Tgr added a comment.EditedFeb 19 2019, 7:07 AM
  • Attr -> DOMAttr, seems equivalent
  • AttributesArray (NamedNodeMap?) -> DOMNamedNodeMap, seems equivalent
  • CharacterData -> DOMCharacterData - previousElementSibling, nextElementSibling missing
  • Comment -> DOMComment, seems equivalent
  • ContainerNode -> seems like this is just a part of Node that Domino has split into a separate file.
  • DOMTokenList -> missing
  • Document -> DOMDocument - body, head, title, querySelector(), querySelectorAll() missing
  • DocumentFragment -> DOMDocumentFragment
  • Element -> DOMElement - classList, clone(), lastElementChild, nextElementSibling, previousElementSibling, outerHTML, querySelector(), querySelectorAll(), remove() missing
  • FilteredElementList (NodeList?) -> DOMNodeList, seems equivalent (implementation details aside)
  • HTMLElement -> missing
  • HTMLParser -> not part of the DOM spec. I imagine the equivalent of this is Remex? Not sure if Remex has an equivalent of insertToken().
  • Leaf -> again not part of the DOM spec, just a code organization helper. The methods are all on Node.
  • Node -> DOMNode - parentElement missing
  • Text -> DOMText, seems equivalent
  • createWindow / Window -> not part of DOM spec. Seems like a hack that's not really meaningful in a non-JS language?

In short, what needs to be filled:

  • all the *Element* properties. Just iterate the non-element version of the property and return the first node that's an Element.
  • DOMTokenList, Element.classList - basically just a nice wrapper around attribute manipulation. Probably takes a bit of effort to make it match the spec exactly.
  • body / head / title on Document - trivial
  • querySelector() / querySelectorAll() on Document and Element. Hopefully we can convert these into xpath queries; need to decide whether we want something generic or a limited subset of CSS selectors based on how these are actually used in Parsoid.
  • Element.clone() - does not seem to be part of the spec, is this just an alias for cloneNode()?
  • Element.remove() - trivial, just call removeChild() on the parent
  • HTMLElement, innerHTML, outerHTML - basically just run XMLSerializer on the node. (Might need intelligent caching, maybe?)

Here's an inventory of all domino top-level methods (that is, not those called internally w/in the domino implementation) invoked by Parsoid when running npm run parserTests. The first column is the number of dynamic invocations, which isn't so interesting. What would be more interesting is the number of different callsites within Parsoid; I'll try to generate that next.
I've added a third column which lists the equivalent PHP function, if any.

25500	DOMTokenList#add	-
14375	DOMTokenList#contains	-
8028	DOMTokenList#length[get]	-
8	DOMTokenList#remove	-

I don't see this used in the Parsoid codebase. Is this internal domino use or is used to support other DOM functionality?

@subbu it's hard to grep for add! These are generally manipulations of Element#classList. See git grep classList which probably pulls up most of them.

@subbu it's hard to grep for add! These are generally manipulations of Element#classList. See git grep classList which probably pulls up most of them.

Ah, I was doing a git grep DOMTokenList ... which was silly .. in retrospect. :-)

ContainerNode -> seems like this is just a part of Node that Domino has split into a separate file.
Leaf -> again not part of the DOM spec, just a code organization helper. The methods are all on Node.

It's actually a refactoring of the DOM spec to capture commonality. It's an artifact of the fact that PHP's DOM implementation hasn't been updated in a long time.

createWindow / Window -> not part of DOM spec. Seems like a hack that's not really meaningful in a non-JS language?

This is actually from this snippet in Parsoid:

// Sanity check for dom behavior: we are
// relying on DOM level 4 getAttribute. In level 4, getAttribute on a
// non-existing key returns null instead of the empty string.
var testDom = domino.createWindow('<h1>Hello world</h1>').document;
if (testDom.body.getAttribute('somerandomstring') === '') {
	throw 'Your DOM version appears to be out of date! \n' +
			'Please run npm update in the js directory.';
}

This points to a larger issue -- I'm guessing PHP's DOM is pre-4.

  • Element.clone() - does not seem to be part of the spec, is this just an alias for cloneNode()?

Probably a Parsoid bug, I'll have to look up where this is being called. Domino probably should be using an underscore prefix, since you're right, this doesn't seem to be a spec method. (Maybe it was in the past and is now deprecated? More research needed.)

  • Element.remove() - trivial, just call removeChild() on the parent

Yes, although funnily enough it's usually implemented the other way (removeChild invokes remove).

  • HTMLElement, innerHTML, outerHTML - basically just run XMLSerializer on the node. (Might need intelligent caching, maybe?)

That handles the getter; the setter is the hard part.

This is actually from this snippet in Parsoid:

// Sanity check for dom behavior: we are
// relying on DOM level 4 getAttribute. In level 4, getAttribute on a
// non-existing key returns null instead of the empty string.
var testDom = domino.createWindow('<h1>Hello world</h1>').document;
if (testDom.body.getAttribute('somerandomstring') === '') {
	throw 'Your DOM version appears to be out of date! \n' +
			'Please run npm update in the js directory.';
}

This points to a larger issue -- I'm guessing PHP's DOM is pre-4.

Yes, it claims to be DOM Core Level 1 and DOM XML Level 2, with no DOM HTML at all.

$i = new DOMImplementation;
$i->hasFeature( 'Core', '1.0' ); // => bool(true)
$i->hasFeature( 'Core', '2.0' ); // => bool(false)
$i->hasFeature( 'XML', '1.0' );  // => bool(true)
$i->hasFeature( 'XML', '2.0' );  // => bool(true)
$i->hasFeature( 'XML', '3.0' );  // => bool(false)
$i->hasFeature( 'HTML', '1.0' ); // => bool(false)

Also,

$n = DOMDocument::loadXML( '<root foo="!!!" bar=""></root>' )->documentElement;

$n->hasAttribute( 'foo' ); // => bool(true)
$n->getAttribute( 'foo' ); // => string(3) "!!!"

$n->hasAttribute( 'bar' ); // => bool(true)
$n->getAttribute( 'bar' ); // => string(0) ""

$n->hasAttribute( 'baz' ); // => bool(false)
$n->getAttribute( 'baz' ); // => string(0) ""

So we probably need to thunk all calls to getAttribute() (where the value gets stored somewhere) and/or rewrite any tests against null to use hasAttribute() instead. (sigh)

Tgr added a comment.Feb 19 2019, 7:40 PM
  • HTMLElement, innerHTML, outerHTML - basically just run XMLSerializer on the node. (Might need intelligent caching, maybe?)

That handles the getter; the setter is the hard part.

The setter would be RemexHTML; still not problematic as long as we are not concerned about performance.

So we probably need to thunk all calls to getAttribute() (where the value gets stored somewhere) and/or rewrite any tests against null to use hasAttribute() instead. (sigh)

Or just handle getAttribute like the other missing methods and always call it via a wrapper.

Tgr added a comment.EditedFeb 19 2019, 10:13 PM

Within the domino and parsoid packages, these selectors are used with querySelector[All]:

  • simple CSS selector, including a combination of
    • element name (or star)
    • class name
    • id
    • attribute name with or without value (with quotes, possibly with case insensitive flag), = / ~=)
    • :empty
  • descendant selector
  • child selector
  • selector group (only with a list of simple element selectors)

PHP DOM uses XPath 1.0. The above selectors can be converted to it (roughly) as follows (lowercase letters are identifiers, uppercase ones are simple selector sequences):

  • * -> *
  • a -> a
  • .a -> like [class~="a"], see below
  • #a -> [@id="a"] or id("a")
  • [a] -> [@a]
  • [a="b"] -> [@a="b"]
  • [a="b"i] -> [lower-case(@a)="b"]
  • [a~="b"] -> [contains(concat(' ', normalize-space(@a), ' '), " b ")]
  • :empty -> [not(*) and not(normalize-space(text()))] maybe?
  • A B -> A//B
  • A > B -> A/B
  • A,B -> A|B

For parsing, we can use @Anomie's css-sanitizer library. If performance is more of a concern than robustness, the above subset is probably not super hard to parse with a regular expression or manual tokenization, either.

Tgr added a comment.Feb 20 2019, 2:58 AM

The syntax I'm playing with is along the lines of DomCompat::wrap( $node )->remove(). (Properties would have to be replaced by getters/setters since property getters/setters are a mess in PHP: DomCompat::wrap( $node )->getBody() etc.) That makes the syntax easier to remember/read than static methods.

Alternatively there could be a separate DomCompat::wrapElement, DomCompat::wrapDocument etc - that makes writing the code slightly more annoying (you need to remember what type of element you are dealing with) but more type safe.

The syntax I'm playing with is along the lines of DomCompat::wrap( $node )->remove(). (Properties would have to be replaced by getters/setters since property getters/setters are a mess in PHP: DomCompat::wrap( $node )->getBody() etc.) That makes the syntax easier to remember/read than static methods.

But, implementation wise ...would DomCompat::wrap create a new object each time?

Tgr added a comment.Feb 20 2019, 6:16 PM

It would. Object creation is pretty cheap in PHP though.

It would. Object creation is pretty cheap in PHP though.

The only potential concern there would be the use of getAttribute if it goes through DomCompat ... because right now, it is effectively called for every single node and after nodeName was the most commonly called DOM method. S

11374982 Element#getAttribute DOMElement::getAttribute

But, while porting, we could use hasAttribute & getAttribute pair everywhere.

Change 491892 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/services/parsoid@master] [WIP] Add DOMCompat classes to support the DOM methods missing from PHP DOM

https://gerrit.wikimedia.org/r/491892

  • attribute name with or without value (with quotes, possibly with case insensitive flag), = / ~=)

If Parsoid needs case-insensitive attribute selectors, and we use css-sanitizer for parsing as you did in your patch, then we'll need to update css-sanitizer to support Selectors Level 4 (it currently only supports Level 3).

We should probably do that anyway at some point, since https://caniuse.com/#feat=css-case-insensitive shows decent browser support. This would just raise the priority of doing that.

  • attribute name with or without value (with quotes, possibly with case insensitive flag), = / ~=)

If Parsoid needs case-insensitive attribute selectors, and we use css-sanitizer for parsing as you did in your patch, then we'll need to update css-sanitizer to support Selectors Level 4 (it currently only supports Level 3).
We should probably do that anyway at some point, since https://caniuse.com/#feat=css-case-insensitive shows decent browser support. This would just raise the priority of doing that.

My hunch is that we don't think we need case-insensitive attribute selectors. We should be able to use the right case ... we should audit our querySelector* uses to confirm.

Tgr added a comment.EditedMar 1 2019, 6:35 AM

New sadness: https://3v4l.org/UQYTG
Apparently the id => element map is not always updated when the DOM changes. Seems easy to force, at least.

Tgr added a comment.Mar 1 2019, 7:14 AM

...actually not, since it also applies to descendants of the removed node (and they'd have to be fixed individually). Filed as https://bugs.php.net/bug.php?id=77686

New sadness: https://3v4l.org/UQYTG
Apparently the id => element map is not always updated when the DOM changes.

Ugh!!

...actually not, since it also applies to descendants of the removed node (and they'd have to be fixed individually). Filed as https://bugs.php.net/bug.php?id=77686

So, unless the PHP DOM implementation fixes its bugs and updates its functionality, it looks like eventually we do want to identify other suitable DOM options.

But, for now, we need to figure out what our workaround here is, and if we are going to suffer from this.

Here are our uses of getElementById. DOMDiff runs on a fresh DOM. So, all good there. So, need to audit code in Cite and DOMDataUtils. And add a big warning note in the porting document on wiki, something all future code reviews (post port) will have to factor in.

[subbu@earth:~/work/wmf/parsoid/lib] git grep getElementById
ext/Cite/index.js:                      var bodyElt = node.ownerDocument.getElementById(dataMw.body.id);
ext/Cite/index.js:                              bodyElt = env.page.editedDoc.getElementById(dataMw.body.id);
ext/Cite/index.js:      var refNode = ref.ownerDocument.getElementById(linkBackId);
html2wt/DOMDiff.js:                             var htmlA = nodeA.ownerDocument.getElementById(vA)
html2wt/DOMDiff.js:                                     || this.domA.getElementById(vA);
html2wt/DOMDiff.js:                             var htmlB = nodeB.ownerDocument.getElementById(vB)
html2wt/DOMDiff.js:                                     || this.domB.getElementById(vB);
utils/DOMDataUtils.js:                  } while (document.getElementById(uid));
utils/DOMDataUtils.js:          var dpScriptElt = doc.getElementById('mw-pagebundle');
cscott added a comment.Mar 1 2019, 5:08 PM

@subbu -- we should probably change all the static uses to querySelector(#foo`), where that's possible (like the mw-pagebundle case). The querySelector library (whether that's xpath or zest) will take care of the id indexing, whatever form that happens to take.

But the places where we do getElementById(id) are trickier; it feels wrong to add a leading # and then strip it off again, especially since (in theory) you should also be passing the id through an escape function just in case there are weird alphanumerics there (there aren't, in HTML we generate, but *in theory* VE or some other client could give us back weird stuff so we shouldn't break DOMDiff in that case).

I guess use a wrapper for now, which falls back to querySelector() and it will probably be unindexed in the initial port (ugh) but we'll fix that ASAP (either by switching to a proper DOM or by getting a fix applied upstream to PHP).

Tgr added a comment.Mar 1 2019, 5:36 PM

I guess the options are:

  1. maybe it gets fixed upstream and backported to 7.2
  2. maybe there is a better workaround than setIdAttribute for forcing a refresh if the ID index of a subtree
  3. find all elements with an ID in the subtree (probably using XPath for performance) and reset them
  4. use XPath for ID lookups (which apparently does not suffer from this bug but might also be slower)
  5. use our own tree traversal and indexing
  6. wrap getElementById to walk up the parents and see if it is part of the document

6 seems the least painful, as long as getElementById is not broken in the other direction (where it doesn't return elements it should). Seems like ID handling is broken in more ways than one, though: https://3v4l.org/pi599

So yeah, what @cscott said.

But the places where we do getElementById(id) are trickier; it feels wrong to add a leading # and then strip it off again, especially since (in theory) you should also be passing the id through an escape function just in case there are weird alphanumerics there (there aren't, in HTML we generate, but *in theory* VE or some other client could give us back weird stuff so we shouldn't break DOMDiff in that case).

HTML5 allows pretty much everything in an ID. It must not be empty and must not contain spaces; in selectors, non-alphanumeric characters (more or less) must be turned into unicode escapes. XPath escaping is ugly but not complicated either.

I guess the options are:

wrap getElementById to walk up the parents and see if it is part of the document

6 seems the least painful, as long as getElementById is not broken in the other direction (where it doesn't return elements it should). Seems like ID handling is broken in more ways than one, though: https://3v4l.org/pi599

This seems the best option to me as well ... as we discussed on IRC, except for one potential place, all of our other uses are "safe" in that we aren't accessing it on a mutated DOM. But, we can just be safe than sorry and for the one-off access case, this wrapper and overhead is something we can handle.

For the storePageBundle scenario where we are repeatedly using getElementById, we can replace that code with a tree walk to process all nodes in one shot instead of fetching nodes by id one at a time.

Tgr added a subscriber: ori.Mar 4 2019, 6:38 AM

@ori dug out that there's deindexing when the object gets destroyed, just not when it gets detached, so this bug can only happen when something retains a reference to a detached element - most like in the same scope as the remove/removeChild call, where the variable used for passing in the removed element still exists. Still, probably worth using a wrapper.

I imagine we'll need a PHPCS rule or something similar for disallowing calls which we require a wrapper for.

@Tgr could you add this information to the PHP bug?

Tgr added a comment.Mar 4 2019, 7:24 AM

I've mentioned it there as well.

@Tgr, @Smalyshev, @ori but you cannot guarantee that all references to the deleted node will be removed before you use findElementById. So, this still is not a solution.

<?php

function del_it_conditionally($node, $condition) {
   if ($condition) {
      $node->parentNode->removeChild($node);
      unset($node);
   }
}

$doc = new DOMDocument;
$doc->loadHTML('<html><body id="x"><div id="y"></div></body></html>');
$body = $doc->getElementById('x');
$div = $doc->getElementById('y');
del_it_conditionally($div, false);
var_dump( $doc->getElementById('y') ); // <div id="y">
del_it_conditionally($div, true);
var_dump( $doc->getElementById('y') ); // <div id="y">

In that example above, you cannot unset ($div) in the caller. And in the general case, you won't know where the ref has escaped to.

@Tgr, @Smalyshev, @ori but you cannot guarantee that all references to the deleted node will be removed before you use findElementById. So, this still is not a solution.

So, this bug needs fixing on the PHP/libxml side, but yes, this explains why this hasn't really been encountered by projects this long ... since it is likely that in the common scenario as @Tgr notes, there won't be references held to the node being deleted very long. But, this could potentially lead to harder to detect bugs in some codebases.

Anyway for now, focusing on our usecase, we should use getElementById in a wrapper which could simply proxy it over without any workarounds but where we document this issue.

Tgr added a comment.Mar 5 2019, 3:30 AM

Using Remex for innerHTML tree building exposes some weird behavior:

$ psysh
Psy Shell v0.9.9 (PHP 7.2.13-1+0~20181207100540.13+stretch~1.gbpf57305 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
>>> $html = '<html><body><div>1</div><div>2</div>3</body></html>';
>>> $innerHtml = '<div>4</div><!-- 5 -->6';
>>> $doc = new DOMDocument();
>>> $doc->loadHTML( $html );
>>> $body = $doc->getElementsByTagName( 'body' )->item( 0 );
>>>
>>> $domBuilder = new RemexHtml\DOM\DOMBuilder();
>>> $treeBuilder = new RemexHtml\TreeBuilder\TreeBuilder( $domBuilder );
>>> $dispatcher = new RemexHtml\TreeBuilder\Dispatcher( $treeBuilder );
>>> $tokenizer = new RemexHtml\Tokenizer\Tokenizer( $dispatcher, $innerHtml, [] );
>>> $tokenizer->execute( [ 'fragmentNamespace' => RemexHtml\HTMLData::NS_HTML, 'fragmentName' => $body->tagName ] );
>>> $fragmentWrapper = $domBuilder->getFragment();
>>> $importedFragmentWrapper = $body->ownerDocument->importNode( $domBuilder->getFragment(), true );
>>>
>>> $fragmentWrapper->ownerDocument->saveHTML( $fragmentWrapper );
=> "<html xmlns="http://www.w3.org/1999/xhtml"><div>4</div><!-- 5 -->6</html>"
>>> $importedFragmentWrapper->ownerDocument->saveHTML( $importedFragmentWrapper );
=> "<html xmlns="http://www.w3.org/1999/xhtml"><div>4</div><!-- 5 -->6</html>"
>>> $importedFirstChild = $doc->importNode( $fragmentWrapper->firstChild, true );
>>> $doc->saveHTML( $importedFirstChild );
=> "<div xmlns="http://www.w3.org/1999/xhtml">4</div>"
>>> Parsoid\Wt2Html\XMLSerializer::serialize( $importedFragmentWrapper, [ 'addDoctype' => false ] )['html']
=> "<html><div>4</div><!-- 5 -->6</html>"

First, why is the wrapper a <html> when I explicitly passed in body as a tagname? Maybe I'm using Remex the wrong way (the complete lack of documentation doesn't help) but as far as I can tell from the code it should build a fragment where the top-level element is of the type given in fragmentName, but that doesn't seem to work.

Second, Remex returns an element with an extra xmlns attribute, which behaves non-intuitively - if I import the wrapper and get its first child, the namespace is not inherited, if I get the child first and import that, it is. This is probably not too much of an issue, since XMLSerializer ignores these namespaces anyway.

Tgr added a comment.Mar 5 2019, 8:22 AM

The performance difference can be pretty large:

$ psysh
Psy Shell v0.9.9 (PHP 7.2.13-1+0~20181207100540.13+stretch~1.gbpf57305 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
>>> ($html = file_get_contents( 'obama.html' )) || true 
>>> $doc = new DOMDocument;
>>> $doc->loadHTML($html);
PHP Warning:  DOMDocument::loadHTML(): Tag section invalid in Entity, line: 2 in Psy Shell code on line 1

>>> timeit -n1000 $doc->getElementById( 'cite_ref-290' );
Command took 0.000005 seconds on average (0.000003 median; 0.005050 total) to complete.
>>> timeit -n1000 ( new DOMXPath( $doc ) )->query( './/*[@id="cite_ref-290"]', $doc->documentElement )->item( 0 )
Command took 0.015575 seconds on average (0.014871 median; 15.574900 total) to complete.
>>> timeit -n10 Wikimedia\Zest\Zest::find( '*#cite_ref-290', $doc )[0]
Command took 13.418583 seconds on average (13.511294 median; 134.185828 total) to complete.

>>> timeit -n100 ( new DOMXPath( $doc ) )->query( Parsoid\Utils\DOMCompat\SelectorToXPath::convert( 'sup' ), $doc->documentElement )
Command took 0.003930 seconds on average (0.003789 median; 0.393032 total) to complete.
>>> timeit -n10 Wikimedia\Zest\Zest::find( 'sup', $doc )
Command took 0.038177 seconds on average (0.035128 median; 0.381772 total) to complete.

>>> timeit -n100 ( new DOMXPath( $doc ) )->query( Parsoid\Utils\DOMCompat\SelectorToXPath::convert( 'section > figure[typeof="mw:Image/Thumb"] a[rel="mw:WikiLink"]' ), $doc->documentElement )
Command took 0.009089 seconds on average (0.007681 median; 0.908874 total) to complete.
>>> timeit -n10 Wikimedia\Zest\Zest::find( 'section > figure[typeof="mw:Image/Thumb"] a[rel="mw:WikiLink"]', $doc )
Command took 4.690819 seconds on average (4.526054 median; 46.908187 total) to complete.
Tgr added a comment.Mar 5 2019, 9:09 AM
>>> timeit -n1000 $doc->getElementById( 'cite_ref-290' );
Command took 0.000004 seconds on average (0.000003 median; 0.004457 total) to complete.
>>> timeit -n1000 Parsoid\Utils\DOMCompat::getElementById( $doc, 'cite_ref-290' )
Command took 0.000010 seconds on average (0.000008 median; 0.009842 total) to complete.

So that's not too bad (granted the DOM tree is shallow, but Wikipedia articles usually are). But as mentioned above it does not protect against the opposite failure (when ids in a freshly appended subtree are not indexed yet), and I don't think there's an easy workaround for that.

Anomie added a comment.Mar 5 2019, 4:34 PM

But as mentioned above it does not protect against the opposite failure (when ids in a freshly appended subtree are not indexed yet), and I don't think there's an easy workaround for that.

Where is that mentioned above? The only thing I see mentioned is that ->setIdAttribute() seems to screw up the index, but do we even intend to use that method?

cscott added a comment.Mar 5 2019, 4:47 PM

Oh, Zest doesn't do any id indexing at all yet, because I couldn't figure out what the right way to feature-check that it was appropriate was. (I don't want to do the getElementsById optimization if the document doesn't have an id attribute registered.) I can fix that, but basically id-based matches are the worst case for zest right now. Changing the selectors to do class-based matching would be a fairer comparison, since (AFAIK) both xpath and zest would have to do a full-tree traversal for that.

(Zest has an optimization based on the element-by-tag-name index, as well. So that's something to watch when benchmarking.)

cscott added a comment.Mar 5 2019, 5:26 PM

Here's another tiny benchmark, which points out the importance of indexes:

$ psysh
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2311}
>>> ($html = file_get_contents( 'obama.html' )) || true;
=> true
>>> ($doc = new DOMDocument) || true;
=> true
>>> $doc->loadHTML($html);
PHP Warning:  DOMDocument::loadHTML(): Tag figure-inline invalid in Entity, line: 9 in /home/cananian/.composereval()'d code on line 1
>>> timeit -n100 count(\Wikimedia\Zest\Zest::find('sup.mw-ref', $doc));
=> 540
Command took 0.030899 seconds on average (0.030505 median; 3.089879 total) to complete.
>>> timeit -n100 ( new DOMXPath( $doc ) )->query( './/sup[@class="mw-ref"]', $doc->documentElement);
=> DOMNodeList {#3038
     +length: 540,
   }
Command took 0.003295 seconds on average (0.003119 median; 0.329492 total) to complete.

But if you don't have an index (tag name or id):

>>> timeit -n10 count(\Wikimedia\Zest\Zest::find('.mw-ref', $doc));
=> 540
Command took 9.394339 seconds on average (9.081321 median; 93.943387 total) to complete.
>>> timeit -n10 ( new DOMXPath( $doc ) )->query( './/*[@class="mw-ref"]', $doc->documentElement);
=> DOMNodeList {#3039
     +length: 540,
   }
Command took 0.014997 seconds on average (0.013384 median; 0.149972 total) to complete.

Checking a different attribute to make sure this isn't another sneaky index (browsers do index by class name):

>>> timeit -n100 ( new DOMXPath( $doc ) )->query( './/*[@data-parsoid]', $doc->documentElement);
=> DOMNodeList {#3023
     +length: 7237,
   }
Command took 0.030151 seconds on average (0.030110 median; 3.015094 total) to complete.
>>> timeit -n3 count(\Wikimedia\Zest\Zest::find('*[data-parsoid]', $doc));=> 7237
Command took 7.589179 seconds on average (7.412123 median; 22.767538 total) to complete.

But it seems like it would be hard to go faster in PHP:

>>> timeit -n4 $cnt=0; foreach ($doc->getElementsByTagName('*') as $e) { if ($e->hasAttribute('data-parsoid')) { $cnt++; } }
Command took 9.847027 seconds on average (9.253243 median; 39.388110 total) to complete.
>>> $cnt;
=> 7237

Something is really slow here; it shouldn't take *that* long to look at 15,572 objects and do a query:

>>> ($el = $doc->getElementsByTagName('*')->item(42)) || true;
=> true
>>> timeit -n 15572 if ($el->hasAttribute('data-parsoid')) { $cnt++; }
Command took 0.000001 seconds on average (0.000001 median; 0.021440 total) to complete.
>>> timeit -n2 foreach ($doc->getElementsByTagName('*') as $e) { true; }
Command took 4.777420 seconds on average (4.730395 median; 9.554840 total) to complete.

So maybe we shouldn't use getElementsByTagName!

>>> timeit -n2 foreach (( new DOMXPath( $doc ) )->query( './/*', $doc->documentElement) as $e) { true; }
Command took 0.004565 seconds on average (0.004408 median; 0.009131 total) to complete.
>>> timeit -n2 $cnt=0; foreach (( new DOMXPath( $doc ) )->query( './/*', $doc->documentElement) as $e) { if ($e->hasAttribute('data-parsoid')) { $cnt++; } }
Command took 0.015675 seconds on average (0.011512 median; 0.031349 total) to complete.
>>> $cnt;
=> 7237
>>>

Sigh. Another method in PHP's DOM implementation which is, if not broken, at least unusably slow. At least we don't use this one directly in Parsoid...

cscott added a comment.Mar 5 2019, 6:18 PM

Ok, latest version is now faster than XPath:

$ psysh
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2}
>>> ($html = file_get_contents( 'obama.html' )) || true;
=> true
>>> ($doc = new DOMDocument) || true;
=> true
>>> $doc->loadHTML($html);
PHP Warning:  DOMDocument::loadHTML(): Tag figure-inline invalid in Entity, line: 9 in /home/cananian/Projects/Wikimedia/zest.phpeval()'d code on line 1
>>> timeit -n100 count(\Wikimedia\Zest\Zest::find('*[data-parsoid]', $doc));
=> 7237
Command took 0.024003 seconds on average (0.023284 median; 2.400329 total) to complete.
>>> timeit -n100 ( new DOMXPath( $doc ) )->query( './/*[@data-parsoid]', $doc->documentElement);
=> DOMNodeList {#8402
     +length: 7237,
   }
Command took 0.028514 seconds on average (0.027853 median; 2.851372 total) to complete.

All credit to @Tgr for doing the benchmark and finding this performance issue, and to peter at softcoded dot com whose comment on http://php.net/manual/en/class.domxpath.php clued me in to a namespace issue that was flummoxing me.

I'm going to go ahead and implement getElementsById on top of XPath as well, which should give us fast performance on ID comparisons as well (assuming XPath is using the index... I should double-check that).

Tgr added a comment.Mar 5 2019, 8:10 PM

So, apart from code review, what's left here is one of:

Tgr added a comment.Mar 5 2019, 8:35 PM

But as mentioned above it does not protect against the opposite failure (when ids in a freshly appended subtree are not indexed yet), and I don't think there's an easy workaround for that.

Where is that mentioned above? The only thing I see mentioned is that ->setIdAttribute() seems to screw up the index, but do we even intend to use that method?

In T215000#4994273, but in confusing way. There are two issues here:

Also the fact that calling setIdAttribute can unindex something makes me uncomfortable, given that we'll have to use it for T217696: Remex doesn't set ID attributes.

cscott added a comment.Mar 5 2019, 8:59 PM

There's also an issue with namespaces, given that Remex creates nodes with a namespace set:

$ psysh 
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2}
>>> ($html = file_get_contents( 'obama.html' )) || true;
=> true
>>> ($doc = new DOMDocument) || true;
=> true
>>> $doc->loadHTML($html);
PHP Warning:  DOMDocument::loadHTML(): Tag figure-inline invalid in Entity, line: 9 in /home/cananian/Projects/Wikimedia/zest.phpeval()'d code on line 1
>>> require('./tests/ZestTest.php')                                             => 1
>>> $doc2 = \Wikimedia\Zest\Tests\ZestTest::loadHtml("./obama.html"); /* uses remex */
>>> count( ( new DOMXPath( $doc ) )->query( './/p', $doc) );
=> 148
>>> count( ( new DOMXPath( $doc2 ) )->query( './/p', $doc2) );
=> 0

What's going on here is that remex set the namespaceURI, and XPath is sensitive to this:

>>> $doc->documentElement->namespaceURI;
=> null
>>> $doc2->documentElement->namespaceURI;
=> "http://www.w3.org/1999/xhtml"
>>> $xpath = new DOMXPath($doc2);
>>> $xpath->registerNamespace( 'ns', $doc2->documentElement->namespaceURI);
=> true
>>> count( $xpath->query( './/p', $doc2) );
=> 0
>>> count( $xpath->query( './/ns:p', $doc2) );
=> 148

Unfortunately, this has later ramifications if you use the Document#createElement() method (as Parsoid does):

>>> $el = $doc->createElement('p');
>>> $doc->documentElement->lastChild->appendChild($el);
>>> count( ( new DOMXPath( $doc ) )->query( './/p', $doc) );
=> 149
>>> $el = $doc2->createElement('p'); true;
=> true
>>> $doc2->documentElement->lastChild->appendChild($el); true;
=> true
>>> count( $xpath->query( './/ns:p', $doc2) );
=> 148

The new element was created without a namespace, so it doesn't show up in the queries that enumerate all the other elements.

FWIW, the "correct" (spec-compliant) behavior is for Document#createElement to inherit the namespaceURI of the parent:

> document.documentElement.namespaceURI
"http://www.w3.org/1999/xhtml"
> document.createElement('p').namespaceURI
"http://www.w3.org/1999/xhtml"

So Remex is correct here; it's the PHP DOM implementation of createElement which is broken.

Tgr added a comment.Mar 5 2019, 9:03 PM

So in practice we'll need a createElement wrapper? (And maybe a DOMDocument constructor wrapper to make sure they always have the right namespace, although I imagine manual document creation will only happen in tests)?

I'm suggesting T217708: Remex should offer an option to not set namespaceURI, which is a hacky workaround but means that we wouldn't have to further wrap createElement & friends.

cscott added a comment.Mar 7 2019, 3:58 PM

Another fun bug -- if you use DOMDocument::loadHTML(), the top-level node (DOMDocument) has nodeType of 13 (which is not defined in any spec ever, not even DOM level 1) instead of 9 (which is what the spec says it should be).

It's still an instance of DOMDocument, though! I don't know why they managed to make DOMDocument return the wrong nodeType in this one particular case. <bangs head on desk>

PHP defines node type 13 as XML_HTML_DOCUMENT_NODE which, again, is not a real thing.

Tgr added a comment.Mar 7 2019, 7:58 PM

Maybe it is some misguided attempt to account for the difference between Document in the DOM spec and Document in the HTML spec? (Not that DOMDocument implements anything out of the latter interface...)
Should be worth an upstream report.

cscott added a comment.Mar 7 2019, 8:26 PM

Seems to come from libxml: https://gitlab.gnome.org/GNOME/libxml2/blob/master/HTMLparser.c#L2323

That said, PHP could override nodeType in order to provide a spec-compliant view, or hack a proper node type in. I guess it's more likely to be fixed in PHP than in libxml?

Catrope added a subscriber: Catrope.EditedMar 7 2019, 9:25 PM

Based on T217766#5006890:

> $doc = new DOMDocument();
> $doc->loadHTML( '<p>Hello</p>' );
> $doc->getElementsByTagName('p')->item(0)->setAttribute('data-foo', '"<!-- foo&bar -->');
> echo $doc->saveHTML($doc->getElementsByTagName('body')->item(0));
<body><p data-foo='"<!-- foo&bar -->'>Hello</p></body>

which is invalid (because of the unencoded amperssand) and throws an error when reparsed. This only happens if the attribute contains something that looks like a comment, if you change <!-- to <!- it doesn't happen.

It appears that using $doc->saveXML( $node, LIBXML_NOEMPTYTAG ) instead of $doc->saveHTML( $node ) works around this without other negative effects, other than that smart quoting doesn't work (i.e. double quotes will always be used, even if using single quotes would be more efficient).

cscott added a comment.EditedMar 7 2019, 9:35 PM

Weird. Luckily (?) for Parsoid, this seems to be a bug in saveHTML(), which I don't think we're actually using? setAttribute/getAttribute handle the "true" (unencoded) value of the attribute fine. PHP:

$ psysh
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> $doc = new DOMDocument();
=> DOMDocument {#2308}
>>> $doc->loadHTML('<p>Hello</p>');
=> true
>>> $el = $doc->getElementsByTagName('p')->item(0);
=> DOMElement {#2325}
>>> $el->setAttribute('data-foo', '"<!-- foo&bar -->');
=> DOMAttr {#2311}
>>> $el->getAttribute('data-foo');
=> ""<!-- foo&bar -->"
>>> $doc->saveHTML($el);
=> "<p data-foo='"<!-- foo&bar -->'>Hello</p>"

vs domino (which should match browsers and the spec):

$ node
> var domino = require('./');
undefined
> var doc = domino.createDocument('<p>Hello</p>');
undefined
> el = doc.querySelector('p');
HTMLParagraphElement {}
> el.setAttribute('data-foo', '"<!-- foo&bar -->');
undefined
> el.getAttribute('data-foo');
'"<!-- foo&bar -->'
> el.outerHTML;
'<p data-foo="&quot;<!-- foo&amp;bar -->">Hello</p>'

I think our DOMCompatibility layer already includes our own XMLSerializer so we won't have to rely on PHP's saveHTML craziness?

What does Parsoid use to serialize DOMDocuments that doesn't use saveHTML() and/or doesn't suffer from this bug?

cscott added a comment.Mar 8 2019, 8:49 PM

@Catrope: https://github.com/wikimedia/parsoid/blob/34964e38e431238ecada0f052b4b81a5a19db84d/src/Wt2Html/XMLSerializer.php but @ssastry says it's not terribly fast right now. We should look into that, maybe there are performance tweaks that are possible.

Tgr added a comment.Mar 13 2019, 6:30 AM

Filed T218183: Audit uses of PHP DOM in Wikimedia software about seeing where else these fixes could be useful.

Tgr added a comment.Mar 13 2019, 6:35 AM

Symfony has a pretty mature looking CSS selector to XPath library. (I probably should have checked that before writing the same logic in 491892...) Should be benchmarked against Zest at some point.

I like this section from Symfony's page, which describes why it's nicer for long-term maintenance to use CSS selectors in the code:

When you're parsing an HTML or an XML document, by far the most powerful method is XPath.
XPath expressions are incredibly flexible, so there is almost always an XPath expression that will find the element you need. Unfortunately, they can also become very complicated, and the learning curve is steep. Even common operations (such as finding an element with a particular class) can require long and unwieldy expressions.
Many developers -- particularly web developers -- are more comfortable using CSS selectors to find elements. As well as working in stylesheets, CSS selectors are used in JavaScript with the querySelectorAll() function and in popular JavaScript libraries such as jQuery, Prototype and MooTools.
CSS selectors are less powerful than XPath, but far easier to write, read and understand. Since they are less powerful, almost all CSS selectors can be converted to an XPath equivalent. This XPath expression can then be used with other functions and classes that use XPath to find elements in a document.

They seem to claim pretty complete implementation of CSS; the parts they omit are also omitted in domino/zest for the most part (because there is no layout engine or UX associated with the dom). On the other hand, it's nice to have a ported Zest in my back pocket to hack into pieces if it turns out that XPath is hard to get working on an alternative DOM implementation.

We don't use this (I don't think) but be aware that if you create an attribute named 'xmlns' in PHP's DOM everything breaks: https://marc.info/?l=php-internals&m=155249142123136&w=2

Change 491892 had a related patch set uploaded (by C. Scott Ananian; owner: Gergő Tisza):
[mediawiki/services/parsoid@master] Add DOMCompat classes to support the DOM methods missing from PHP DOM

https://gerrit.wikimedia.org/r/491892

Change 491892 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Add DOMCompat classes to support the DOM methods missing from PHP DOM

https://gerrit.wikimedia.org/r/491892

ssastry closed this task as Resolved.Mar 23 2019, 2:20 PM

This functionality has now been merged into the Parsoid codebase and we are starting to use it. Thanks @Tgr and @cscott. Exposing this to user outside Parsoid would be a separate task.

cscott added a comment.EditedJun 27 2019, 8:54 PM

Another difference we found: PHP's Node::normalize() doesn't actually remove zero-length text nodes, like the spec says it should (DOM level 2: https://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-normalize ; latest DOM spec: https://dom.spec.whatwg.org/#dom-node-normalize ).

Reported upstream: https://bugs.php.net/bug.php?id=78221

Change 524385 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Work around PHP DOM's failure to normalize DOM during DOM build

https://gerrit.wikimedia.org/r/524385

Change 524385 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Work around PHP DOM's failure to normalize DOM during DOM build
https://gerrit.wikimedia.org/r/524385

Looks like this is a Remex bug, not a PHP DOM bug.