Page MenuHomePhabricator

Remex doesn't uppercase tag/node names
Open, LowPublic


According to the DOM standard (eg and the Node#tagName (and thus also Node#nodeValue should be uppercase:

The tagName attribute’s getter must return the context object’s HTML-uppercased qualified name.

Remex -- and the standard PHP DOMDocument#loadHTML method -- use lowercase tag- and node names:

$ psysh
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2}
>>> ($html = file_get_contents( 'obama.html' )) || true;
=> true
>>> ($doc = new DOMDocument) || true;
=> true
>>> $doc->loadHTML($html);
>>> require('./tests/ZestTest.php')
=> 1
>>> $doc2 = \Wikimedia\Zest\Tests\ZestTest::loadHtml("./obama.html"); /* uses remex */
>>> $doc->documentElement->firstChild->tagName;
=> "head"
>>> $doc2->documentElement->firstChild->tagName;
=> "head"
>>> $doc->documentElement->firstChild->nodeName;
=> "head"
>>> $doc2->documentElement->firstChild->nodeName;
=> "head"

The PHP DOM implementation respects case-sensitivity (which it actually shouldn't):

>>> $doc2->createElement('p')->nodeName;
=> "p"
>>> $doc->createElement('p')->nodeName;
=> "p"
>>> $doc2->createElement('P')->nodeName;
=> "P"
>>> $doc->createElement('P')->nodeName;
=> "P"

Compare to JS in the browser:

> document.createElement('p').tagName

Remex should probably:

  1. provide an option to uppercase HTML tag names prior to passing them to createElement(), and/or
  2. allow passing in a different DOMImplementation to RemexHtml\DOM\DOMBuilder to provide proper behavior for html/

Option 1 would, in the short term, allow Parsoid to continue to use uppercase when comparing tagName strings; it would have to take care to always use uppercase when calling createElement though. This would be a bridge to option 2, once we have a proper spec-compliant DOM implementation (T215000).

Event Timeline

A couple of notes about this which cscott is certainly aware of, but in case there is any confusion for others reading this task:

Thus with JS in the browser:

> document.createElement('p').localName

As cscott says, the nodeName in the DOM spec is the "HTML-uppercased qualified name". But the uppercasing operation is the responsibility of the DOM.

PHP's DOM extension does not follow the WHATWG DOM spec. The DOM extension is explicitly an XML DOM, and as such it is required to be case-preserving.

So it's technically non-compliant for RemexHtml to uppercase the localName so that the DOM will be fooled into providing a correct nodeName. Of course it could be supported if Parsoid needed it. But my understanding is that Parsoid dealt with the problem in another way.

So I am leaning towards declining this.