Remex should offer an option to not set namespaceURI
Closed, Resolved (Public)

Description

PHP's DOMDocument#createElement is broken: it always sets the namespaceURI of the created element to null. Unfortunately, this makes the created element different from all the other elements produced by a Remex document parse, which have namespaceURI set (correctly and spec-compliantly) to http://www.w3.org/1999/xhtml.

See T215000#5003044 for full details.

We should eventually use a spec-compliant DOM implementation (T215000, T217867). In the meantime, it would be helpful if Remex exposed an option to *not* set the namespaceURI on its created documents, so that parsed elements and constructed elements would be consistent. (Inconsistent namespaces break XPath queries, for example.)
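
To illustrate the XPath problem, here is a minimal sketch using Python's standard-library ElementTree as an analogy. It is not PHP's DOMDocument or DOMXPath; the tree below is a hand-built stand-in for "parsed" versus "constructed" elements, and the element names and text are made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Stand-in for a parsed document: every element is in the XHTML namespace.
root = ET.Element(f"{{{XHTML_NS}}}body")
ET.SubElement(root, f"{{{XHTML_NS}}}p").text = "parsed"

# Stand-in for an element constructed afterwards with no namespace.
ET.SubElement(root, "p").text = "constructed"

# A path query for plain "p" only matches the element with no namespace.
no_ns = [e.text for e in root.findall(".//p")]

# A namespace-aware query only matches the element in the XHTML namespace.
with_ns = [e.text for e in root.findall(".//x:p", {"x": XHTML_NS})]

print(no_ns)    # ['constructed']
print(with_ns)  # ['parsed']
```

Neither query sees both paragraphs, which is the kind of inconsistency the description is complaining about.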

Event Timeline

Internally, Remex seems to be using the HTML namespace to trigger all kinds of HTML-specific parsing behavior, so that would have to be refactored.

I don't think it needs to be refactored. You can just add an option to DOMBuilder that tells it to ignore namespaces. TreeBuilder needs to keep track of namespaces, and it stores them in Element objects, but Element objects are just temporary state; there is no requirement for DOMBuilder to retain the information from Element.

Element namespaces are specified in detail in the HTML 5 parsing spec, and are required for compliance, in order to support MathML and SVG fragments embedded in HTML. Elements with the same tag name can have different content models depending on the namespace they are in. But once the element comes out of TreeBuilder you are free to throw away whatever you like.
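
A rough sketch of that idea, written in Python purely for illustration. `ParsedElement`, `SketchDOMBuilder` and the `suppress_html_namespace` flag are hypothetical names, not Remex's actual classes or option. The sketch also assumes that only HTML elements lose their namespace, so that embedded SVG and MathML nodes stay distinguishable.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

HTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"


@dataclass
class ParsedElement:
    """Hypothetical stand-in for the tree builder's temporary element state."""
    namespace: str
    name: str


class SketchDOMBuilder:
    """Hypothetical builder; the option name is invented for this sketch."""

    def __init__(self, suppress_html_namespace: bool = False):
        self.suppress_html_namespace = suppress_html_namespace

    def create_node(self, element: ParsedElement) -> ET.Element:
        # The tree builder still tracked the namespace while parsing;
        # the DOM builder simply chooses not to copy it onto the node.
        if self.suppress_html_namespace and element.namespace == HTML_NS:
            return ET.Element(element.name)
        return ET.Element(f"{{{element.namespace}}}{element.name}")


builder = SketchDOMBuilder(suppress_html_namespace=True)
html_node = builder.create_node(ParsedElement(HTML_NS, "p"))
svg_node = builder.create_node(ParsedElement(SVG_NS, "svg"))
print(html_node.tag)  # p
print(svg_node.tag)   # {http://www.w3.org/2000/svg}svg
```

The parsing stage is untouched here; only the final node-creation step consults the flag, which is why no refactoring of the namespace-driven parsing logic would be needed.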

Yeah, that sounds right. Option to DOMBuilder.

I'm hoping we eventually get a utility class which allows common tasks like "load an HTML string into a DOM tree" in one method call (T217850), so I'm really thinking of "option" as "option passed to the utility class", which then gets distributed to whatever part of the Remex pipeline is appropriate (in this case DOMBuilder).

Change 495460 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/libs/RemexHtml@master] Provide an option to suppress namespace for HTML elements

https://gerrit.wikimedia.org/r/495460

Change 495460 merged by jenkins-bot:
[mediawiki/libs/RemexHtml@master] Provide an option to suppress namespace for HTML elements

https://gerrit.wikimedia.org/r/495460

cscott claimed this task.