
Fill gaps in PHP DOM's functionality
Closed, Resolved · Public

Description

PHP DOM is based on DOM Level 1 Core and DOM Level 2 XML, and doesn't support some DOM HTML functionality that Parsoid relies on. Specifically, here is a list of gaps (non-exhaustive):

  • querySelector(...) and querySelectorAll(...)
  • body property on the DOMDocument
  • innerHTML, outerHTML setters / getters on DOMElement
  • classList on DOMElement

We need to provide home-grown utilities for these OR extend PHP's DOM implementation to support this.
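
To make the "home-grown utilities" option concrete, here is a minimal sketch (class and method names are illustrative, not the eventual API) of static helpers that operate on plain DOMDocument / DOMElement objects:

<?php

// Sketch only: static helpers over plain PHP DOM objects.
class DOMCompatSketch {
    // Rough equivalent of document.body.
    public static function getBody( DOMDocument $doc ): ?DOMElement {
        $html = $doc->documentElement;
        if ( $html === null ) {
            return null;
        }
        foreach ( $html->childNodes as $child ) {
            if ( $child instanceof DOMElement && strtolower( $child->nodeName ) === 'body' ) {
                return $child;
            }
        }
        return null;
    }

    // Rough equivalent of element.classList.contains().
    public static function hasClass( DOMElement $el, string $class ): bool {
        $classes = preg_split( '/\s+/', $el->getAttribute( 'class' ), -1, PREG_SPLIT_NO_EMPTY );
        return in_array( $class, $classes, true );
    }
}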

Notes from our meeting on Jan 27, 2019

  • PHP DOM implementation: security fixes, spec compliance (2015), and other fixes ... so, being maintained
  • Five possible strategies for filling the gaps:
    1. Update the DOM extension in PHP core to follow modern DOM standards
    2. Wrap a more modern C library (since libxml2 seems to be frozen in time)
      • In Sep 2018 PHP bumped libxml to 2.7.6, but it looks like libxml2 is in bugfix-only mode
    3. Write a pure-PHP implementation of the DOM (like domino and/or Remex) -- not happening but listed here for purposes of completeness
    4. Subclass the built-in DOM library's objects, overriding with new functionality as needed -- most likely
      • This could be a composer library which others could help maintain
      • If/when functionality is added to core library, the subclass methods could be removed
      • Open Q: document.createElement() etc. need to be overridden to create subclass instances (see the registerNodeClass sketch after this list)
      • This is most pragmatic option, we'll probably do this (at least initially)
    5. Better yet: find *someone else* who is doing one of the above, and use their work!
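
On the open question for strategy 4: PHP's DOMDocument::registerNodeClass() makes the built-in factory methods (createElement() etc.) return instances of a registered subclass, so they wouldn't all have to be overridden by hand. A rough sketch, with an illustrative class name:

<?php

// Sketch only: register a DOMElement subclass so createElement() returns it.
class WrappedElement extends DOMElement {
    public function hasClass( string $class ): bool {
        $classes = preg_split( '/\s+/', $this->getAttribute( 'class' ), -1, PREG_SPLIT_NO_EMPTY );
        return in_array( $class, $classes, true );
    }
}

$doc = new DOMDocument();
$doc->registerNodeClass( DOMElement::class, WrappedElement::class );

$el = $doc->createElement( 'span' );   // instance of WrappedElement
$el->setAttribute( 'class', 'foo bar' );
var_dump( $el->hasClass( 'bar' ) );    // bool(true)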

Related Objects

Mentioned In
T317255: PHP's getElementsByTagName is slow prior to PHP 8.3
T317070: MobileFormatter has quadratic performance
T258964: HtmlFormatter library: Code stewardship review
T235231: Parsoid/JS video tag has a "seek" parameter in the URL that Parsoid/PHP video tag output doesn't
T237556: Detect html2wt reqs issued to Parsoid/PHP with data-parsoid blobs generated by Parsoid/JS and issue a HTTP 421
T236846: TemplateHandler.php: trim() expects parameter 1 to be string, boolean given
T235217: Parsoid should use protocol-relative URLs for media
T235656: Ref fragments remain unexpanded in Image:Frameless, mw:ExpandedAttrs, mw:LanguageVariant nodes
T235295: MathML tags are missing xmlns attribute
T232390: Remex does not use \DOMElement::setIdAttribute('id') by default
T232180: Zest.php: Pagebundle routes timeout for some pages
T230861: PHP 7.2 is very slow on an allocation-intensive benchmark
T204945: Deprecate one of the Preprocessor implementations for 1.34
T218183: Audit uses of PHP DOM in Wikimedia software
T217867: Port domino (or another spec-compliant DOM library) to PHP
rMLZE0bb6839c23e3: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZE459622063d97: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
rMLZE8d22c0ce9397: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZE5552f635c0e8: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
T217766: Flow\Exception\WikitextException: ParseEntityRef: no name
rMLZE63a19879b1f8: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZE829003598c0a: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
rMLZEd31425336579: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZE238521ed469c: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
rMLZE30566f9be355: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZE63e3448acc0d: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
rMLZEd811b6b05f92: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZEe756a31b2582: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
rMLZE416c9af2945f: Optimize ID attribute search, if the DOM implementation has indexed it
rMLZEe7c949cb8781: Speed up Zest by 100x by not using DOMDocument#getElementsByTagName()
T217708: Remex should offer an option to not set namespaceURI
T217700: Remex doesn't uppercase tag/node names
T217696: Remex doesn't set ID attributes
T217360: Replace libxml/xpath in HtmlFormatter with Remex/zest
Mentioned Here
T235217: Parsoid should use protocol-relative URLs for media
T235231: Parsoid/JS video tag has a "seek" parameter in the URL that Parsoid/PHP video tag output doesn't
T235656: Ref fragments remain unexpanded in Image:Frameless, mw:ExpandedAttrs, mw:LanguageVariant nodes
T236846: TemplateHandler.php: trim() expects parameter 1 to be string, boolean given
T237556: Detect html2wt reqs issued to Parsoid/PHP with data-parsoid blobs generated by Parsoid/JS and issue a HTTP 421
rGPAR6a0a7085a258: Remove seek stripping from html diffing script
T235295: MathML tags are missing xmlns attribute
T218183: Audit uses of PHP DOM in Wikimedia software
T217766: Flow\Exception\WikitextException: ParseEntityRef: no name
T217708: Remex should offer an option to not set namespaceURI
T217696: Remex doesn't set ID attributes
T217705: RemexHTML should be able to parse HTML into an existing DOM node
T204608: Use a bag-on-the-side implementation, rather than an internal .dataobject for node data

Event Timeline

There are a very large number of changes, so older changes are hidden.
> • attribute name with or without value (with quotes, possibly with case insensitive flag), = / ~=)

If Parsoid needs case-insensitive attribute selectors, and we use css-sanitizer for parsing as you did in your patch, then we'll need to update css-sanitizer to support Selectors Level 4 (it currently only supports Level 3).

We should probably do that anyway at some point, since https://caniuse.com/#feat=css-case-insensitive shows decent browser support. This would just raise the priority of doing that.

My hunch is that we don't need case-insensitive attribute selectors. We should be able to use the right case ... we should audit our querySelector* uses to confirm.

New sadness: https://3v4l.org/UQYTG
Apparently the id => element map is not always updated when the DOM changes. Seems easy to force, at least.

...actually not, since it also applies to descendants of the removed node (and they'd have to be fixed individually). Filed as https://bugs.php.net/bug.php?id=77686

> New sadness: https://3v4l.org/UQYTG
> Apparently the id => element map is not always updated when the DOM changes.

Ugh!!

> ...actually not, since it also applies to descendants of the removed node (and they'd have to be fixed individually). Filed as https://bugs.php.net/bug.php?id=77686

So, unless the PHP DOM implementation fixes its bugs and updates its functionality, it looks like eventually we do want to identify other suitable DOM options.

But, for now, we need to figure out what our workaround here is, and if we are going to suffer from this.

Here are our uses of getElementById. DOMDiff runs on a fresh DOM, so all good there. We need to audit the code in Cite and DOMDataUtils, and add a big warning note in the porting document on wiki, something all future code reviews (post-port) will have to factor in.

[subbu@earth:~/work/wmf/parsoid/lib] git grep getElementById
ext/Cite/index.js:                      var bodyElt = node.ownerDocument.getElementById(dataMw.body.id);
ext/Cite/index.js:                              bodyElt = env.page.editedDoc.getElementById(dataMw.body.id);
ext/Cite/index.js:      var refNode = ref.ownerDocument.getElementById(linkBackId);
html2wt/DOMDiff.js:                             var htmlA = nodeA.ownerDocument.getElementById(vA)
html2wt/DOMDiff.js:                                     || this.domA.getElementById(vA);
html2wt/DOMDiff.js:                             var htmlB = nodeB.ownerDocument.getElementById(vB)
html2wt/DOMDiff.js:                                     || this.domB.getElementById(vB);
utils/DOMDataUtils.js:                  } while (document.getElementById(uid));
utils/DOMDataUtils.js:          var dpScriptElt = doc.getElementById('mw-pagebundle');

@subbu -- we should probably change all the static uses to querySelector('#foo'), where that's possible (like the mw-pagebundle case). The querySelector library (whether that's xpath or zest) will take care of the id indexing, whatever form that happens to take.

But the places where we do getElementById(id) are trickier; it feels wrong to add a leading # and then strip it off again, especially since (in theory) you should also be passing the id through an escape function just in case there are weird alphanumerics there (there aren't, in HTML we generate, but *in theory* VE or some other client could give us back weird stuff so we shouldn't break DOMDiff in that case).

I guess use a wrapper for now, which falls back to querySelector() and it will probably be unindexed in the initial port (ugh) but we'll fix that ASAP (either by switching to a proper DOM or by getting a fix applied upstream to PHP).

I guess the options are:

  1. maybe it gets fixed upstream and backported to 7.2
  2. maybe there is a better workaround than setIdAttribute for forcing a refresh of the ID index of a subtree
  3. find all elements with an ID in the subtree (probably using XPath for performance) and reset them
  4. use XPath for ID lookups (which apparently does not suffer from this bug but might also be slower)
  5. use our own tree traversal and indexing
  6. wrap getElementById to walk up the parents and see if it is part of the document

6 seems the least painful, as long as getElementById is not broken in the other direction (where it doesn't return elements it should). Seems like ID handling is broken in more ways than one, though: https://3v4l.org/pi599

So yeah, what @cscott said.
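
A minimal sketch of option 6 from the list above (not the actual DOMCompat implementation): wrap getElementById() and only trust the result if the node is still attached to the document. As noted later in this thread, this does not help with the opposite failure, where ids in a freshly appended subtree are not indexed yet.

<?php

// Sketch only: distrust PHP's stale id index by re-checking attachment.
function getElementByIdSafe( DOMDocument $doc, string $id ): ?DOMElement {
    $candidate = $doc->getElementById( $id );
    for ( $node = $candidate; $node !== null; $node = $node->parentNode ) {
        if ( $node->isSameNode( $doc ) ) {
            return $candidate; // still connected to the document
        }
    }
    return null; // not found, or found via a stale index entry for a detached node
}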

> But the places where we do getElementById(id) are trickier; it feels wrong to add a leading # and then strip it off again, especially since (in theory) you should also be passing the id through an escape function just in case there are weird alphanumerics there (there aren't, in HTML we generate, but *in theory* VE or some other client could give us back weird stuff so we shouldn't break DOMDiff in that case).

HTML5 allows pretty much everything in an ID. It must not be empty and must not contain spaces; in selectors, non-alphanumeric characters (more or less) must be turned into unicode escapes. XPath escaping is ugly but not complicated either.

> I guess the options are:

> wrap getElementById to walk up the parents and see if it is part of the document

> 6 seems the least painful, as long as getElementById is not broken in the other direction (where it doesn't return elements it should). Seems like ID handling is broken in more ways than one, though: https://3v4l.org/pi599

This seems the best option to me as well ... as we discussed on IRC, except for one potential place, all of our other uses are "safe" in that we aren't accessing it on a mutated DOM. But we can be safe rather than sorry, and for the one-off access case, this wrapper and its overhead is something we can handle.

For the storePageBundle scenario where we are repeatedly using getElementById, we can replace that code with a tree walk to process all nodes in one shot instead of fetching nodes by id one at a time.
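
A sketch of the one-shot tree walk idea (hypothetical helper name): collect every element with an id in a single traversal instead of fetching nodes by id one at a time.

<?php

// Sketch only: build an id => element map in one pass over the subtree.
function collectElementsById( DOMNode $root ): array {
    $map = [];
    $stack = [ $root ];
    while ( $stack !== [] ) {
        $node = array_pop( $stack );
        if ( $node instanceof DOMElement && $node->hasAttribute( 'id' ) ) {
            $map[ $node->getAttribute( 'id' ) ] = $node;
        }
        for ( $child = $node->firstChild; $child !== null; $child = $child->nextSibling ) {
            $stack[] = $child;
        }
    }
    return $map;
}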

@ori dug out that there's deindexing when the object gets destroyed, just not when it gets detached, so this bug can only happen when something retains a reference to a detached element - most likely in the same scope as the remove/removeChild call, where the variable used for passing in the removed element still exists. Still, probably worth using a wrapper.

I imagine we'll need a PHPCS rule or something similar for disallowing calls for which we require a wrapper.

@Tgr could you add this information to the PHP bug?

I've mentioned it there as well.

@Tgr, @Smalyshev, @ori but you cannot guarantee that all references to the deleted node will be removed before you use findElementById. So, this still is not a solution.

<?php

function del_it_conditionally($node, $condition) {
   if ($condition) {
      $node->parentNode->removeChild($node);
      unset($node);
   }
}

$doc = new DOMDocument;
$doc->loadHTML('<html><body id="x"><div id="y"></div></body></html>');
$body = $doc->getElementById('x');
$div = $doc->getElementById('y');
del_it_conditionally($div, false);
var_dump( $doc->getElementById('y') ); // <div id="y"> -- not removed, correct
del_it_conditionally($div, true);
var_dump( $doc->getElementById('y') ); // <div id="y"> -- removed, but still returned because the caller's $div reference keeps the node (and its index entry) alive

In that example above, you cannot unset ($div) in the caller. And in the general case, you won't know where the ref has escaped to.

> @Tgr, @Smalyshev, @ori but you cannot guarantee that all references to the deleted node will be removed before you use findElementById. So, this still is not a solution.

So, this bug needs fixing on the PHP/libxml side, but yes, this explains why this hasn't really been encountered by projects for this long ... since, in the common scenario as @Tgr notes, it is unlikely that references to the node being deleted will be held for very long. But this could potentially lead to harder-to-detect bugs in some codebases.

Anyway, for now, focusing on our use case, we should call getElementById via a wrapper which could simply proxy it through without any workarounds, but where we document this issue.

Using Remex for innerHTML tree building exposes some weird behavior:

$ psysh
Psy Shell v0.9.9 (PHP 7.2.13-1+0~20181207100540.13+stretch~1.gbpf57305 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
>>> $html = '<html><body><div>1</div><div>2</div>3</body></html>';
>>> $innerHtml = '<div>4</div><!-- 5 -->6';
>>> $doc = new DOMDocument();
>>> $doc->loadHTML( $html );
>>> $body = $doc->getElementsByTagName( 'body' )->item( 0 );
>>>
>>> $domBuilder = new RemexHtml\DOM\DOMBuilder();
>>> $treeBuilder = new RemexHtml\TreeBuilder\TreeBuilder( $domBuilder );
>>> $dispatcher = new RemexHtml\TreeBuilder\Dispatcher( $treeBuilder );
>>> $tokenizer = new RemexHtml\Tokenizer\Tokenizer( $dispatcher, $innerHtml, [] );
>>> $tokenizer->execute( [ 'fragmentNamespace' => RemexHtml\HTMLData::NS_HTML, 'fragmentName' => $body->tagName ] );
>>> $fragmentWrapper = $domBuilder->getFragment();
>>> $importedFragmentWrapper = $body->ownerDocument->importNode( $domBuilder->getFragment(), true );
>>>
>>> $fragmentWrapper->ownerDocument->saveHTML( $fragmentWrapper );
=> "<html xmlns="http://www.w3.org/1999/xhtml"><div>4</div><!-- 5 -->6</html>"
>>> $importedFragmentWrapper->ownerDocument->saveHTML( $importedFragmentWrapper );
=> "<html xmlns="http://www.w3.org/1999/xhtml"><div>4</div><!-- 5 -->6</html>"
>>> $importedFirstChild = $doc->importNode( $fragmentWrapper->firstChild, true );
>>> $doc->saveHTML( $importedFirstChild );
=> "<div xmlns="http://www.w3.org/1999/xhtml">4</div>"
>>> Parsoid\Wt2Html\XMLSerializer::serialize( $importedFragmentWrapper, [ 'addDoctype' => false ] )['html']
=> "<html><div>4</div><!-- 5 -->6</html>"

First, why is the wrapper a <html> when I explicitly passed in body as a tagname? Maybe I'm using Remex the wrong way (the complete lack of documentation doesn't help), but as far as I can tell from the code it should build a fragment whose top-level element is of the type given in fragmentName; that doesn't seem to work here.

Second, Remex returns an element with an extra xmlns attribute, which behaves non-intuitively: if I import the wrapper and then get its first child, the namespace is not inherited; if I get the child first and import that, it is. This is probably not too much of an issue, since XMLSerializer ignores these namespaces anyway.

The performance difference can be pretty large:

$ psysh
Psy Shell v0.9.9 (PHP 7.2.13-1+0~20181207100540.13+stretch~1.gbpf57305 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
>>> ($html = file_get_contents( 'obama.html' )) || true 
>>> $doc = new DOMDocument;
>>> $doc->loadHTML($html);
PHP Warning:  DOMDocument::loadHTML(): Tag section invalid in Entity, line: 2 in Psy Shell code on line 1

>>> timeit -n1000 $doc->getElementById( 'cite_ref-290' );
Command took 0.000005 seconds on average (0.000003 median; 0.005050 total) to complete.
>>> timeit -n1000 ( new DOMXPath( $doc ) )->query( './/*[@id="cite_ref-290"]', $doc->documentElement )->item( 0 )
Command took 0.015575 seconds on average (0.014871 median; 15.574900 total) to complete.
>>> timeit -n10 Wikimedia\Zest\Zest::find( '*#cite_ref-290', $doc )[0]
Command took 13.418583 seconds on average (13.511294 median; 134.185828 total) to complete.

>>> timeit -n100 ( new DOMXPath( $doc ) )->query( Parsoid\Utils\DOMCompat\SelectorToXPath::convert( 'sup' ), $doc->documentElement )
Command took 0.003930 seconds on average (0.003789 median; 0.393032 total) to complete.
>>> timeit -n10 Wikimedia\Zest\Zest::find( 'sup', $doc )
Command took 0.038177 seconds on average (0.035128 median; 0.381772 total) to complete.

>>> timeit -n100 ( new DOMXPath( $doc ) )->query( Parsoid\Utils\DOMCompat\SelectorToXPath::convert( 'section > figure[typeof="mw:Image/Thumb"] a[rel="mw:WikiLink"]' ), $doc->documentElement )
Command took 0.009089 seconds on average (0.007681 median; 0.908874 total) to complete.
>>> timeit -n10 Wikimedia\Zest\Zest::find( 'section > figure[typeof="mw:Image/Thumb"] a[rel="mw:WikiLink"]', $doc )
Command took 4.690819 seconds on average (4.526054 median; 46.908187 total) to complete.
>>> timeit -n1000 $doc->getElementById( 'cite_ref-290' );
Command took 0.000004 seconds on average (0.000003 median; 0.004457 total) to complete.
>>> timeit -n1000 Parsoid\Utils\DOMCompat::getElementById( $doc, 'cite_ref-290' )
Command took 0.000010 seconds on average (0.000008 median; 0.009842 total) to complete.

So that's not too bad (granted the DOM tree is shallow, but Wikipedia articles usually are). But as mentioned above it does not protect against the opposite failure (when ids in a freshly appended subtree are not indexed yet), and I don't think there's an easy workaround for that.

> But as mentioned above it does not protect against the opposite failure (when ids in a freshly appended subtree are not indexed yet), and I don't think there's an easy workaround for that.

Where is that mentioned above? The only thing I see mentioned is that ->setIdAttribute() seems to screw up the index, but do we even intend to use that method?

Oh, Zest doesn't do any id indexing at all yet, because I couldn't figure out the right way to feature-check whether it was appropriate. (I don't want to do the getElementById optimization if the document doesn't have an id attribute registered.) I can fix that, but basically id-based matches are the worst case for zest right now. Changing the selectors to do class-based matching would be a fairer comparison, since (AFAIK) both xpath and zest would have to do a full-tree traversal for that.

(Zest has an optimization based on the element-by-tag-name index, as well. So that's something to watch when benchmarking.)

Here's another tiny benchmark, which points out the importance of indexes:

$ psysh
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2311}
>>> ($html = file_get_contents( 'obama.html' )) || true;
=> true
>>> ($doc = new DOMDocument) || true;
=> true
>>> $doc->loadHTML($html);
PHP Warning:  DOMDocument::loadHTML(): Tag figure-inline invalid in Entity, line: 9 in /home/cananian/.composereval()'d code on line 1
>>> timeit -n100 count(\Wikimedia\Zest\Zest::find('sup.mw-ref', $doc));
=> 540
Command took 0.030899 seconds on average (0.030505 median; 3.089879 total) to complete.
>>> timeit -n100 ( new DOMXPath( $doc ) )->query( './/sup[@class="mw-ref"]', $doc->documentElement);
=> DOMNodeList {#3038
     +length: 540,
   }
Command took 0.003295 seconds on average (0.003119 median; 0.329492 total) to complete.

But if you don't have an index (tag name or id):

>>> timeit -n10 count(\Wikimedia\Zest\Zest::find('.mw-ref', $doc));
=> 540
Command took 9.394339 seconds on average (9.081321 median; 93.943387 total) to complete.
>>> timeit -n10 ( new DOMXPath( $doc ) )->query( './/*[@class="mw-ref"]', $doc->documentElement);
=> DOMNodeList {#3039
     +length: 540,
   }
Command took 0.014997 seconds on average (0.013384 median; 0.149972 total) to complete.

Checking a different attribute to make sure this isn't another sneaky index (browsers do index by class name):

>>> timeit -n100 ( new DOMXPath( $doc ) )->query( './/*[@data-parsoid]', $doc->documentElement);
=> DOMNodeList {#3023
     +length: 7237,
   }
Command took 0.030151 seconds on average (0.030110 median; 3.015094 total) to complete.
>>> timeit -n3 count(\Wikimedia\Zest\Zest::find('*[data-parsoid]', $doc));
=> 7237
Command took 7.589179 seconds on average (7.412123 median; 22.767538 total) to complete.

But it seems like it would be hard to go faster in PHP:

>>> timeit -n4 $cnt=0; foreach ($doc->getElementsByTagName('*') as $e) { if ($e->hasAttribute('data-parsoid')) { $cnt++; } }
Command took 9.847027 seconds on average (9.253243 median; 39.388110 total) to complete.
>>> $cnt;
=> 7237

Something is really slow here; it shouldn't take *that* long to look at 15,572 objects and do a query:

>>> ($el = $doc->getElementsByTagName('*')->item(42)) || true;
=> true
>>> timeit -n 15572 if ($el->hasAttribute('data-parsoid')) { $cnt++; }
Command took 0.000001 seconds on average (0.000001 median; 0.021440 total) to complete.
>>> timeit -n2 foreach ($doc->getElementsByTagName('*') as $e) { true; }
Command took 4.777420 seconds on average (4.730395 median; 9.554840 total) to complete.

So maybe we shouldn't use getElementsByTagName!

>>> timeit -n2 foreach (( new DOMXPath( $doc ) )->query( './/*', $doc->documentElement) as $e) { true; }
Command took 0.004565 seconds on average (0.004408 median; 0.009131 total) to complete.
>>> timeit -n2 $cnt=0; foreach (( new DOMXPath( $doc ) )->query( './/*', $doc->documentElement) as $e) { if ($e->hasAttribute('data-parsoid')) { $cnt++; } }
Command took 0.015675 seconds on average (0.011512 median; 0.031349 total) to complete.
>>> $cnt;
=> 7237
>>>

Sigh. Another method in PHP's DOM implementation which is, if not broken, at least unusably slow. At least we don't use this one directly in Parsoid...

Ok, latest version is now faster than XPath:

$ psysh
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2}
>>> ($html = file_get_contents( 'obama.html' )) || true;
=> true
>>> ($doc = new DOMDocument) || true;
=> true
>>> $doc->loadHTML($html);
PHP Warning:  DOMDocument::loadHTML(): Tag figure-inline invalid in Entity, line: 9 in /home/cananian/Projects/Wikimedia/zest.phpeval()'d code on line 1
>>> timeit -n100 count(\Wikimedia\Zest\Zest::find('*[data-parsoid]', $doc));
=> 7237
Command took 0.024003 seconds on average (0.023284 median; 2.400329 total) to complete.
>>> timeit -n100 ( new DOMXPath( $doc ) )->query( './/*[@data-parsoid]', $doc->documentElement);
=> DOMNodeList {#8402
     +length: 7237,
   }
Command took 0.028514 seconds on average (0.027853 median; 2.851372 total) to complete.

All credit to @Tgr for doing the benchmark and finding this performance issue, and to peter at softcoded dot com whose comment on http://php.net/manual/en/class.domxpath.php clued me in to a namespace issue that was flummoxing me.

I'm going to go ahead and implement getElementById on top of XPath as well, which should give us fast ID lookups too (assuming XPath is using the index... I should double-check that).

So, apart from code review, what's left here is one of:

>> But as mentioned above it does not protect against the opposite failure (when ids in a freshly appended subtree are not indexed yet), and I don't think there's an easy workaround for that.

> Where is that mentioned above? The only thing I see mentioned is that ->setIdAttribute() seems to screw up the index, but do we even intend to use that method?

In T215000#4994273, but in a confusing way. There are two issues here:

Also the fact that calling setIdAttribute can unindex something makes me uncomfortable, given that we'll have to use it for T217696: Remex doesn't set ID attributes.

There's also an issue with namespaces, given that Remex creates nodes with a namespace set:

$ psysh 
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2}
>>> ($html = file_get_contents( 'obama.html' )) || true;
=> true
>>> ($doc = new DOMDocument) || true;
=> true
>>> $doc->loadHTML($html);
PHP Warning:  DOMDocument::loadHTML(): Tag figure-inline invalid in Entity, line: 9 in /home/cananian/Projects/Wikimedia/zest.phpeval()'d code on line 1
>>> require('./tests/ZestTest.php')
=> 1
>>> $doc2 = \Wikimedia\Zest\Tests\ZestTest::loadHtml("./obama.html"); /* uses remex */
>>> count( ( new DOMXPath( $doc ) )->query( './/p', $doc) );
=> 148
>>> count( ( new DOMXPath( $doc2 ) )->query( './/p', $doc2) );
=> 0

What's going on here is that remex set the namespaceURI, and XPath is sensitive to this:

>>> $doc->documentElement->namespaceURI;
=> null
>>> $doc2->documentElement->namespaceURI;
=> "http://www.w3.org/1999/xhtml"
>>> $xpath = new DOMXPath($doc2);
>>> $xpath->registerNamespace( 'ns', $doc2->documentElement->namespaceURI);
=> true
>>> count( $xpath->query( './/p', $doc2) );
=> 0
>>> count( $xpath->query( './/ns:p', $doc2) );
=> 148

Unfortunately, this has later ramifications if you use the Document#createElement() method (as Parsoid does):

>>> $el = $doc->createElement('p');
>>> $doc->documentElement->lastChild->appendChild($el);
>>> count( ( new DOMXPath( $doc ) )->query( './/p', $doc) );
=> 149
>>> $el = $doc2->createElement('p'); true;
=> true
>>> $doc2->documentElement->lastChild->appendChild($el); true;
=> true
>>> count( $xpath->query( './/ns:p', $doc2) );
=> 148

The new element was created without a namespace, so it doesn't show up in the queries that enumerate all the other elements.

FWIW, the "correct" (spec-compliant) behavior is for Document#createElement to inherit the namespaceURI of the parent:

> document.documentElement.namespaceURI
"http://www.w3.org/1999/xhtml"
> document.createElement('p').namespaceURI
"http://www.w3.org/1999/xhtml"

So Remex is correct here; it's the PHP DOM implementation of createElement which is broken.

So in practice we'll need a createElement wrapper? (And maybe a DOMDocument constructor wrapper to make sure documents always have the right namespace, although I imagine manual document creation will only happen in tests.)

I'm suggesting T217708: Remex should offer an option to not set namespaceURI, which is a hacky workaround but means that we wouldn't have to further wrap createElement & friends.
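
For reference, a minimal sketch of what the createElement wrapper floated above might look like (hypothetical name; the Remex option suggested here was preferred instead): create new elements in the document element's namespace so they show up in the same XPath queries as the parsed content.

<?php

// Sketch only: keep createElement() consistent with the document's namespace.
function createElementCompat( DOMDocument $doc, string $name ) {
    $ns = $doc->documentElement !== null ? $doc->documentElement->namespaceURI : null;
    return $ns === null
        ? $doc->createElement( $name )
        : $doc->createElementNS( $ns, $name );
}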

Another fun bug -- if you use DOMDocument::loadHTML(), the top-level node (DOMDocument) has nodeType of 13 (which is not defined in any spec ever, not even DOM level 1) instead of 9 (which is what the spec says it should be).

It's still an instance of DOMDocument, though! I don't know why they managed to make DOMDocument return the wrong nodeType in this one particular case. <bangs head on desk>

PHP defines node type 13 as XML_HTML_DOCUMENT_NODE which, again, is not a real thing.
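
A minimal reproduction of the behavior described above (the nodeType value is as reported in this thread):

<?php

$doc = new DOMDocument();
$doc->loadHTML( '<p>Hello</p>' );
var_dump( $doc->nodeType );               // int(13), PHP's XML_HTML_DOCUMENT_NODE
var_dump( XML_DOCUMENT_NODE );            // int(9), what the DOM spec expects for a document
var_dump( $doc instanceof DOMDocument );  // bool(true), despite the odd nodeType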

Maybe it is some misguided attempt to account for the difference between [[https://dom.spec.whatwg.org/#interface-document|Document]] in the DOM spec and [[https://html.spec.whatwg.org/multipage/dom.html#document|Document]] in the HTML spec? (Not that DOMDocument implements anything out of the latter interface...)
Should be worth an upstream report.

Seems to come from libxml: https://gitlab.gnome.org/GNOME/libxml2/blob/master/HTMLparser.c#L2323

That said, PHP could override nodeType in order to provide a spec-compliant view, or hack a proper node type in. I guess it's more likely to be fixed in PHP than in libxml?

Based on T217766#5006890:

> $doc = new DOMDocument();
> $doc->loadHTML( '<p>Hello</p>' );
> $doc->getElementsByTagName('p')->item(0)->setAttribute('data-foo', '"<!-- foo&bar -->');
> echo $doc->saveHTML($doc->getElementsByTagName('body')->item(0));
<body><p data-foo='"<!-- foo&bar -->'>Hello</p></body>

which is invalid (because of the unencoded ampersand) and throws an error when reparsed. This only happens if the attribute contains something that looks like a comment; if you change <!-- to <!- it doesn't happen.

It appears that using $doc->saveXML( $node, LIBXML_NOEMPTYTAG ) instead of $doc->saveHTML( $node ) works around this without other negative effects, other than that smart quoting doesn't work (i.e. double quotes will always be used, even if using single quotes would be more efficient).
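
A short sketch of that saveXML() workaround, using the same reproduction as above:

<?php

$doc = new DOMDocument();
$doc->loadHTML( '<p>Hello</p>' );
$p = $doc->getElementsByTagName( 'p' )->item( 0 );
$p->setAttribute( 'data-foo', '"<!-- foo&bar -->' );

// saveHTML() emits the raw attribute value (invalid HTML); saveXML() escapes it,
// at the cost of always using double quotes.
echo $doc->saveXML( $p, LIBXML_NOEMPTYTAG ), "\n";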

Weird. Luckily (?) for Parsoid, this seems to be a bug in saveHTML(), which I don't think we're actually using? setAttribute/getAttribute handle the "true" (unencoded) value of the attribute fine. PHP:

$ psysh
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> $doc = new DOMDocument();
=> DOMDocument {#2308}
>>> $doc->loadHTML('<p>Hello</p>');
=> true
>>> $el = $doc->getElementsByTagName('p')->item(0);
=> DOMElement {#2325}
>>> $el->setAttribute('data-foo', '"<!-- foo&bar -->');
=> DOMAttr {#2311}
>>> $el->getAttribute('data-foo');
=> ""<!-- foo&bar -->"
>>> $doc->saveHTML($el);
=> "<p data-foo='"<!-- foo&bar -->'>Hello</p>"

vs domino (which should match browsers and the spec):

$ node
> var domino = require('./');
undefined
> var doc = domino.createDocument('<p>Hello</p>');
undefined
> el = doc.querySelector('p');
HTMLParagraphElement {}
> el.setAttribute('data-foo', '"<!-- foo&bar -->');
undefined
> el.getAttribute('data-foo');
'"<!-- foo&bar -->'
> el.outerHTML;
'<p data-foo="&quot;<!-- foo&amp;bar -->">Hello</p>'

I think our DOMCompatibility layer already includes our own XMLSerializer so we won't have to rely on PHP's saveHTML craziness?

What does Parsoid use to serialize DOMDocuments that doesn't use saveHTML() and/or doesn't suffer from this bug?

@Catrope: https://github.com/wikimedia/parsoid/blob/34964e38e431238ecada0f052b4b81a5a19db84d/src/Wt2Html/XMLSerializer.php but @ssastry says it's not terribly fast right now. We should look into that, maybe there are performance tweaks that are possible.

Symfony has a pretty mature looking CSS selector to XPath library. (I probably should have checked that before writing the same logic in 491892...) Should be benchmarked against Zest at some point.
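
For example (the converter class and method are Symfony's CssSelector API; the XPath output is shown approximately):

<?php

use Symfony\Component\CssSelector\CssSelectorConverter;

$converter = new CssSelectorConverter();
echo $converter->toXPath( 'sup.mw-ref' ), "\n";
// roughly: descendant-or-self::sup[@class and contains(concat(' ', normalize-space(@class), ' '), ' mw-ref ')]
echo $converter->toXPath( 'section > figure a[rel="mw:WikiLink"]' ), "\n";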

I like this section from Symfony's page, which describes why it's nicer for long-term maintenance to use CSS selectors in the code:

When you're parsing an HTML or an XML document, by far the most powerful method is XPath.

XPath expressions are incredibly flexible, so there is almost always an XPath expression that will find the element you need. Unfortunately, they can also become very complicated, and the learning curve is steep. Even common operations (such as finding an element with a particular class) can require long and unwieldy expressions.

Many developers -- particularly web developers -- are more comfortable using CSS selectors to find elements. As well as working in stylesheets, CSS selectors are used in JavaScript with the querySelectorAll() function and in popular JavaScript libraries such as jQuery, Prototype and MooTools.

CSS selectors are less powerful than XPath, but far easier to write, read and understand. Since they are less powerful, almost all CSS selectors can be converted to an XPath equivalent. This XPath expression can then be used with other functions and classes that use XPath to find elements in a document.

They seem to claim a pretty complete implementation of CSS selectors; the parts they omit are also omitted in domino/zest for the most part (because there is no layout engine or UX associated with the DOM). On the other hand, it's nice to have a ported Zest in my back pocket to hack into pieces if it turns out that XPath is hard to get working on an alternative DOM implementation.

We don't use this (I don't think) but be aware that if you create an attribute named 'xmlns' in PHP's DOM everything breaks: https://marc.info/?l=php-internals&m=155249142123136&w=2

Change 491892 had a related patch set uploaded (by C. Scott Ananian; owner: Gergő Tisza):
[mediawiki/services/parsoid@master] Add DOMCompat classes to support the DOM methods missing from PHP DOM

https://gerrit.wikimedia.org/r/491892

Change 491892 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Add DOMCompat classes to support the DOM methods missing from PHP DOM

https://gerrit.wikimedia.org/r/491892

This functionality has now been merged into the Parsoid codebase and we are starting to use it. Thanks @Tgr and @cscott. Exposing this to users outside Parsoid would be a separate task.

Another difference we found: PHP's Node::normalize() doesn't actually remove zero-length text nodes, like the spec says it should (DOM level 2: https://www.w3.org/TR/DOM-Level-2-Core/core.html#ID-normalize ; latest DOM spec: https://dom.spec.whatwg.org/#dom-node-normalize ).

Reported upstream: https://bugs.php.net/bug.php?id=78221

Change 524385 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Work around PHP DOM's failure to normalize DOM during DOM build

https://gerrit.wikimedia.org/r/524385

Change 524385 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Work around PHP DOM's failure to normalize DOM during DOM build

https://gerrit.wikimedia.org/r/524385

Looks like this is a Remex bug, not a PHP DOM bug.

Another one: T235295: MathML tags are missing xmlns attribute -- PHP's DOM refuses to accept an attribute named xmlns for some reason: (Psy Shell v0.9.9 (PHP 7.3.4-2 — cli) by Justin Hileman)

>>> $doc = new \DOMDocument(); null
=> null
>>> $node = $doc->createElement('math'); null;
=> null
>>> $node->setAttributeNS('http://www.w3.org/2000/xmlns/', 'xmlns', 'http://www.w3.org/1998/Math/MathML');
=> null
>>> $node->attributes
=> DOMNamedNodeMap {#2335
     +length: 0,
   }
>>> $node->setAttribute('xmlns', 'xyz');
=> false
>>> $node->attributes
=> DOMNamedNodeMap {#2333
     +length: 0,
   }
>>> $node->setAttribute('x', 'xyz'); null;
=> null
>>> $node->attributes
=> DOMNamedNodeMap {#2322
     +length: 1,
   }
>>> $node2 = $doc->createElementNS('http://www.w3.org/1999/xhtml', 'math'); null;
=> null
>>> $node2->setAttributeNS('http://www.w3.org/2000/xmlns/', 'xmlns', 'http://www.w3.org/1998/Math/MathML');
=> null
>>> $node2->setAttribute('xmlns', 'xyz');
=> false
>>> $node2->attributes
=> DOMNamedNodeMap {#2317
     +length: 0,
   }
>>> $node2->setAttribute('x', 'xyz'); null;
=> null
>>> $node2->attributes
=> DOMNamedNodeMap {#2330
     +length: 1,
   }
>>>

https://github.com/php/php-src/blob/php-7.2.24/ext/dom/element.c#L411

xmlns is special: it gets used to output the namespace of a tag without being a formal "attribute".

php > $doc = new \DOMDocument();
php > $node2 = $doc->createElementNS('http://www.w3.org/1998/Math/MathML', 'math');
php > echo $doc->saveXML( $node2 );
<math xmlns="http://www.w3.org/1998/Math/MathML"/>

php > $node3 = $doc->createElement('math');
php > $node3->setAttribute('xmlns', 'http://www.w3.org/1998/Math/MathML');
php > echo $doc->saveXML( $node3 );
<math xmlns="http://www.w3.org/1998/Math/MathML"/>

Yeah, but that's not how the DOM is supposed to work.

But I found an ugly workaround: you can use the (deprecated!) methods $doc->createAttributeNode() and $node->setAttributeNode() to work around the crazy PHP behavior. (This isn't even libxml behavior.)

(The reason PHP's magic behavior is awful is that the attribute "exists" in some sense, but it's completely invisible to DOMNode::setAttribute/DOMNode::hasAttribute/etc. So even if we hacked our serializer to behave the same way that $doc->saveXML does and magically 'revive' the attribute, I'd have to go through and audit every setAttribute/hasAttribute etc in Parsoid to make sure that a hidden xmlns attribute isn't going to break things by being invisible yet present...)

Change 548913 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/libs/RemexHtml@master] WIP: Hack Remex to work around PHP DOM "special case" of xmlns attributes

https://gerrit.wikimedia.org/r/548913

It gets worse:

>>> $doc = DOMDocument::loadHTML('<p>foo');  null;
=> null
>>> $node = $doc->createElement('math'); null;
=> null
>>> $node->setAttribute('xmlns', 'foo');
=> true
>>> $node->getAttribute('xmlns')
=> "foo"
>>> $node->getAttributeNode('xmlns')
=> DOMNameSpaceNode {#2310
     +nodeName: "xmlns",
     +nodeValue: "",
     +nodeType: XML_NAMESPACE_DECL_NODE,
     +prefix: "",
     +localName: "xmlns",
     +namespaceURI: "foo",
     +ownerDocument: DOMDocument {#2305 …},
     +parentNode: null,
   }
>>> $node->setAttribute('x','y');
=> DOMAttr {#2316
...
   }
>>> $node->attributes
=> DOMNamedNodeMap {#2318
     +length: 1,
   }
>>> $node->attributes->item(0)
=> DOMAttr {#2316
...
   }
>>> $node->getAttributeNode('x')
=> DOMAttr {#2316
     +nodeName: "x",
     +nodeValue: "",
     +nodeType: XML_ATTRIBUTE_NODE,
     +parentNode: DOMElement {#2321 …},
[...]
     +namespaceURI: null,
     +prefix: "",
     +localName: "x",
     +baseURI: null,
     +textContent: "",
     +name: "x",
     +specified: true,
     +value: "y",
     +ownerElement: DOMElement {#2321 …},
     +schemaTypeInfo: null,
   }
>>> $node->getAttributeNode('xmlns') instanceof \DOMAttr
=> false
>>>

Note that the attribute node for the xmlns attribute is a DOMNameSpaceNode, and the value of the attribute is stored in namespaceURI, not value. All the other attributes are instances of DOMAttr with the value in value.

Change 548937 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Workaround for missing xmlns attributes on DOMElement

https://gerrit.wikimedia.org/r/548937

Change 548913 abandoned by C. Scott Ananian:
WIP: Hack Remex to work around PHP DOM "special case" of xmlns attributes

Reason:
Abandoned in favor of I7ef44ad9c1996749e9cf4306ac4d89cc5b8cc6e0

https://gerrit.wikimedia.org/r/548913

Change 548937 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Workaround for missing xmlns attributes on DOMElement

https://gerrit.wikimedia.org/r/548937