
Port domino (or another spec-compliant DOM library) to PHP
Open, Low, Public

Description

I'm biased toward domino because I've been maintaining it. But there's a need for a good PHP library that complies with the latest DOM spec and is secure -- i.e., one that deliberately doesn't implement JavaScript execution, resource loading, document.write, etc. ("Full" spec compliance means document.write, sandboxing, etc.; there are JS libraries that do this (jsdom, for example), but that's all functionality we actively don't want.)

Motivation is the long-and-growing list of bugs/inconsistencies/eccentricities in PHP's DOM implementation. See T215000: Fill gaps in PHP DOM's functionality, T217766: Flow\Exception\WikitextException: ParseEntityRef: no name, the existence of MCS' HtmlFormatter library (T217360), etc. for details.

An alternative is to port domino/etc to C directly and have it be usable as a PHP extension so we get good perf as well. If it used libxml's nodes underneath you could still do fast XPath queries, etc, using the existing DOMXPath package. (On the other hand, the relatively fast pace of change in the DOM WG may mean that tying this to the PHP release cycle is not the best idea.)

A note about priority and dependencies: We are not going to do this as part of the Parsoid port. At this time we believe that we understand Parsoid's usage of the DOM well enough that we can work around the bugs in the core PHP DOM implementation. But as a longer-term goal this would enhance maintainability and allow us to remove workarounds.

Event Timeline

cscott created this task. Mar 7 2019, 10:10 PM
Restricted Application added a subscriber: Aklapper. Mar 7 2019, 10:10 PM
ssastry triaged this task as Low priority. Mar 7 2019, 10:11 PM
ssastry updated the task description.
Anomie added subscribers: Smalyshev, Anomie.

I'm going to throw this on Core Platform Team Backlog, mainly for the possibility of making it a PHP extension since Tim and I have some experience with that from LuaSandbox and Excimer. If it gets to that point we might talk to @Smalyshev too.

cscott updated the task description. Mar 7 2019, 10:31 PM

In theory (with infinite resources, etc.) the best of all possible worlds would be a pure PHP implementation coupled with a "native" extension for more speed. Since (again in theory) both would implement the exact same DOM API anyway, this would allow us to avoid adding an extension to MediaWiki's required dependencies.

One note about porting domino in particular -- it uses metaprogramming to generate classes corresponding to all the different HTML element types (HTMLAnchorElement, etc.) from a compact specification (in htmlelts.js). Since neither PHP nor C supports that kind of metaprogramming (eval doesn't count), that part of domino would have to be rewritten as a code generator that runs during the build phase instead. Probably not a huge deal, just something to keep in mind.
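A minimal sketch of what such a build-time generator could look like, written in JavaScript (domino's language). Everything here is an assumption for illustration: the `spec` object is a made-up stand-in, not the real format of htmlelts.js, and the emitted PHP assumes a hypothetical `HTMLElement` base class with `getAttribute`/`setAttribute`.

```javascript
// Hypothetical compact spec: class name -> tag and reflected string attributes.
// This is NOT the actual structure used by domino's htmlelts.js.
const spec = {
  HTMLAnchorElement: { tag: 'a', reflectedAttrs: ['href', 'target', 'rel'] },
  HTMLQuoteElement: { tag: 'blockquote', reflectedAttrs: ['cite'] },
};

// Emit PHP source for one element class, with a getter/setter pair per
// reflected attribute. The HTMLElement base class is assumed to exist.
function generatePhpClass(className, def) {
  const methods = def.reflectedAttrs.map((attr) => {
    const suffix = attr.charAt(0).toUpperCase() + attr.slice(1);
    return [
      `    public function get${suffix}(): string {`,
      `        return $this->getAttribute('${attr}') ?? '';`,
      `    }`,
      `    public function set${suffix}(string $value): void {`,
      `        $this->setAttribute('${attr}', $value);`,
      `    }`,
    ].join('\n');
  });
  return [
    '<?php',
    '// Generated file; do not edit by hand.',
    `class ${className} extends HTMLElement {`,
    methods.join('\n\n'),
    '}',
    '',
  ].join('\n');
}

// Map every spec entry to a file name and its generated PHP source.
function generateAll(elementSpec) {
  const generated = {};
  for (const [name, def] of Object.entries(elementSpec)) {
    generated[`${name}.php`] = generatePhpClass(name, def);
  }
  return generated;
}

console.log(generateAll(spec)['HTMLAnchorElement.php']);
```

A real build step would write each entry to disk rather than printing it; the point is only that the runtime class generation becomes a one-time text transformation.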

Another useful note, while I'm brain-dumping. Part of the task would be to define an appropriate PHP binding to WebIDL. There's a good start on Packagist -- https://packagist.org/packages/esperecyan/webidl -- but it's implementation-based. Someone should write a brief document describing how WebIDL maps to PHP. Unfortunately, the only non-JavaScript language that appears to have a formal WebIDL binding description is Java, and they have "stopped work" on it and published it as a W3C note.
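To make the question concrete, here is one possible shape for the core of such a mapping, sketched in JavaScript. The table is an assumption for illustration only: it is not taken from the esperecyan/webidl package or from any published binding, and it covers only a handful of types.

```javascript
// Hypothetical WebIDL -> PHP type declaration table (illustrative, incomplete).
const PRIMITIVES = {
  DOMString: 'string',
  USVString: 'string',
  boolean: 'bool',
  short: 'int',
  'unsigned short': 'int',
  long: 'int',
  'unsigned long': 'int',
  double: 'float',
  undefined: 'void',
};

// Translate a single WebIDL type expression into a PHP type declaration.
function webidlToPhpType(idlType) {
  const t = idlType.trim();
  if (t.endsWith('?')) {
    // Nullable type T? becomes PHP's ?T.
    return `?${webidlToPhpType(t.slice(0, -1))}`;
  }
  if (/^sequence<.+>$/.test(t)) {
    // PHP has no typed arrays, so the element type is lost here.
    return 'array';
  }
  if (Object.prototype.hasOwnProperty.call(PRIMITIVES, t)) {
    return PRIMITIVES[t];
  }
  // Anything else is assumed to be an interface name, used as a class name.
  return t;
}

console.log(webidlToPhpType('DOMString?'));
console.log(webidlToPhpType('sequence<Node>'));
```

A written binding document would also have to settle the harder cases this sketch skips, such as union types, overloads, and integer range semantics.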

Tgr added a subscriber: Tgr. Mar 8 2019, 7:20 PM

> An alternative is to port domino/etc to C directly and have it be usable as a PHP extension so we get good perf as well. If it used libxml's nodes underneath you could still do fast XPath queries, etc, using the existing DOMXPath package.

AIUI most of the bugs come from libxml, not PHP directly, so that wouldn't improve the situation much.

As a general comment, an XML/DOM library is probably one of the areas where performance would be critical, so a C port would be great, but that would probably require serious resource investment. All existing PHP libraries AFAIK are based on libxml2, so I wonder whether the problems we're having reside in the PHP bindings or in libxml2, and whether it may be less effort to locate and resolve these instead of starting a new clean-slate effort. Or, alternatively, find another C/C++ DOM library and make a binding for that one?

cscott added a comment. Mar 8 2019, 8:41 PM

>> An alternative is to port domino/etc to C directly and have it be usable as a PHP extension so we get good perf as well. If it used libxml's nodes underneath you could still do fast XPath queries, etc, using the existing DOMXPath package.

> AIUI most of the bugs come from libxml, not PHP directly, so that wouldn't improve the situation much.

The idea would be to use only the very basic parts of libxml: the node tree/child list data structure. I'm pretty sure the basic childNodes accessors are bug-free. But we could give a more nuanced binding to (say) the Node#nodeType accessor so that any non-standard libxml types were mapped to the 'correct' ones; we don't have to directly expose the libxml mutators if they are broken. Similarly, we would reimplement DOMDocument#loadHTML, DOMDocument#createElement, etc. to make them spec-compliant instead of directly exposing some partly broken interface from libxml.
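A rough sketch of the nodeType idea, in JavaScript for brevity. The raw node shape (`type`, `children`) is a made-up stand-in for a libxml node, and the choice of which values to remap is an assumption: libxml reports 13 (XML_HTML_DOCUMENT_NODE) and 14 (XML_DTD_NODE), which have no counterpart among the DOM nodeType constants.

```javascript
// DOM-standard nodeType values.
const DOM_DOCUMENT_NODE = 9;
const DOM_DOCUMENT_TYPE_NODE = 10;

// libxml-specific node types outside the DOM range.
const LIBXML_HTML_DOCUMENT_NODE = 13;
const LIBXML_DTD_NODE = 14;

// Assumed fix-up table; a real binding would need to audit every libxml type.
const NODE_TYPE_FIXUPS = new Map([
  [LIBXML_HTML_DOCUMENT_NODE, DOM_DOCUMENT_NODE],
  [LIBXML_DTD_NODE, DOM_DOCUMENT_TYPE_NODE],
]);

// Thin wrapper: reuses the underlying tree structure as-is, but normalizes
// nodeType instead of exposing the raw libxml value.
class NodeWrapper {
  constructor(rawNode) {
    this._raw = rawNode; // hypothetical { type, children } stand-in
  }

  get nodeType() {
    const t = this._raw.type;
    return NODE_TYPE_FIXUPS.has(t) ? NODE_TYPE_FIXUPS.get(t) : t;
  }

  get childNodes() {
    // Child traversal is delegated directly to the underlying structure.
    return this._raw.children.map((child) => new NodeWrapper(child));
  }
}

const htmlDoc = new NodeWrapper({
  type: LIBXML_HTML_DOCUMENT_NODE,
  children: [{ type: LIBXML_DTD_NODE, children: [] }, { type: 1, children: [] }],
});
console.log(htmlDoc.nodeType, htmlDoc.childNodes.map((n) => n.nodeType));
```

The same pattern would apply to mutators: the wrapper exposes a spec-compliant method and decides internally whether the underlying libxml call can be trusted.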

> As a general comment, an XML/DOM library is probably one of the areas where performance would be critical, so a C port would be great, but that would probably require serious resource investment. All existing PHP libraries AFAIK are based on libxml2, so I wonder whether the problems we're having reside in the PHP bindings or in libxml2, and whether it may be less effort to locate and resolve these instead of starting a new clean-slate effort. Or, alternatively, find another C/C++ DOM library and make a binding for that one?

We've been looking for alternate DOM libraries and they don't seem to be common, alas.

Tgr added a comment. Mar 13 2019, 6:31 AM

Filed T218183: Audit uses of PHP DOM in Wikimedia software about listing where else such a library could be useful.

> We've been looking for alternate DOM libraries and they don't seem to be common, alas.

Something to look at: https://github.com/fitzgen/dodrio

cscott added a comment. Mon, Apr 8, 9:39 PM

Yup, I'm keeping tabs on it. Recent comments indicate that they are not feeling too optimistic about being able to update DOM in core w/o breaking backward compatibility, and they have deliberately de-scoped to include just "modern core DOM", *not* the HTML-specific DOM extensions. Of course these are entangled in various ways, so even "modern core DOM" includes the proper case of HTML tag names...