Page MenuHomePhabricator

Port domino (or another spec-compliant DOM library) to PHP
Open, LowPublic

Description

I'm biased toward domino because I've been maintaining it. But there's a need for a good latest-DOM-spec-compliant library for PHP that is secure -- ie, deliberately doesn't implement javascript execution, resource loading, document.write etc. ("Full" spec compliance means document.write, sandboxing, etc and while there are JS libraries that do this (jsdom for example) that's all functionality we actively don't want.)

Motivation is the long-and-growing list of bugs/inconsistencies/eccentricities in PHP's DOM implementation. See T215000: Fill gaps in PHP DOM's functionality, T217766: Flow\Exception\WikitextException: ParseEntityRef: no name, the existence of MCS' HtmlFormatter library (T217360), etc. for details.

An alternative is to port domino/etc to C directly and have it be usable as a PHP extension so we get good perf as well. If it used libxml's nodes underneath you could still do fast XPath queries, etc, using the existing DOMXPath package. (On the other hand, the relatively fast pace of change in the DOM WG may mean that tying this to the PHP release cycle is not the best idea.)

A note about priority and dependencies: We are not going to do this as part of the Parsoid port. At this time we believe that we understand Parsoid's usage of the DOM well enough that we can workaround the bugs in the core PHP DOM implementation. But as a longer-term goal this would enhance maintainability and allow us to remove workarounds.

Event Timeline

cscott created this task.Mar 7 2019, 10:10 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 7 2019, 10:10 PM
ssastry triaged this task as Low priority.Mar 7 2019, 10:11 PM
ssastry updated the task description. (Show Details)
Anomie added subscribers: Smalyshev, Anomie.

I'm going to throw this on #cpt-backlog, mainly for the possibility of making it a PHP extension since Tim and I have some experience with that from LuaSandbox and Excimer. If it gets to that point we might talk to @Smalyshev too.

cscott updated the task description. (Show Details)Mar 7 2019, 10:31 PM

In theory (with infinite resources, etc) the best of all possible worlds would be a pure PHP implementation coupled with a "native" extension with more speed. Since (again in theory) both are implementing the exact same DOM API anyway, this would allow us to avoid adding an extension to mediawiki's required dependencies.

One note about porting domino in particular -- it uses meta-programming to generate classes corresponding to all the different HTML element types (HTMLAnchorElement, etc) from a compact specification (in htmlelts.js). Since neither PHP nor C support that kind of metaprogramming (eval doesn't count), that part of domino would have to be rewritten as a code-generator which runs during the build-phase instead. Probably not a huge deal, just something to keep in mind.

Another useful note, while I'm brain-dumping. Part of the task would be to define an appropriate PHP binding to WebIDL. There's a good start in packagist -- https://packagist.org/packages/esperecyan/webidl -- but it's implementation based. Someone should write a brief document describing how WebIDL maps to PHP. Unfortunately, the only non-JavaScript language that appears to have a format WebIDL binding description is Java, and they have "stopped work" on it and published it as a W3C note.

Tgr added a subscriber: Tgr.Mar 8 2019, 7:20 PM

An alternative is to port domino/etc to C directly and have it be usable as a PHP extension so we get good perf as well. If it used libxml's nodes underneath you could still do fast XPath queries, etc, using the existing DOMXPath package.

AIUI most of the bugs come from libxml, not PHP directly, so that wouldn't improve the situation much.

As a general comment, XML/DOM library is probably one of the areas where performance would be critical, so C port would be great, but that would probably require serious resource investment. All existing PHP libraries AFAIK base on libxml2, so I wonder whether the problems we're having reside in PHP bindings or libxml2, and whether it may be less effort to locate and resolve these instead of starting a new clean plate effort. Or, alternatively, find another C/C++ DOM library and make a binding for that one?

cscott added a comment.Mar 8 2019, 8:41 PM

An alternative is to port domino/etc to C directly and have it be usable as a PHP extension so we get good perf as well. If it used libxml's nodes underneath you could still do fast XPath queries, etc, using the existing DOMXPath package.

AIUI most of the bugs come from libxml, not PHP directly, so that wouldn't improve the situation much.

The idea would be to use the very basic parts of libxml, the node tree/child list data structure. I'm pretty sure the basic childNodes accessors are bug-free. But we could give a more nuanced bindings to (say) the Node#nodeType accessor so that any non-standard libxml types were mapped to the 'correct' ones; we don't have to directly expose the libxml mutators if they are broken. Similary, we would reimplement DOMDocument#loadHTML, DOMDocument#createElement, etc to make them spec-compliant instead of directly exposing some partly-broken interface from libxml.

As a general comment, XML/DOM library is probably one of the areas where performance would be critical, so C port would be great, but that would probably require serious resource investment. All existing PHP libraries AFAIK base on libxml2, so I wonder whether the problems we're having reside in PHP bindings or libxml2, and whether it may be less effort to locate and resolve these instead of starting a new clean plate effort. Or, alternatively, find another C/C++ DOM library and make a binding for that one?

We've been looking for alternate DOM libraries and they don't seem to be common, alas.

Tgr added a comment.Mar 13 2019, 6:31 AM

Filed T218183: Audit uses of PHP DOM in Wikimedia software about listing where else such a library could be useful.

We've been looking for alternate DOM libraries and they don't seem to be common, alas.

Something to look at https://github.com/fitzgen/dodrio

cscott added a comment.Apr 8 2019, 9:39 PM

Yup, I'm keeping tabs on it. Recent comments indicate that they are not feeling too optimistic about being about to update DOM in core w/o breaking backward-compatibility, and they have deliberately de-scoped to just include "modern core DOM" *not* the HTML-specific DOM extensions. Of course these are entangled in various ways, so even "modern core DOM" includes the proper case of HTML tag names...

jlinehan added a subscriber: jlinehan.EditedMay 3 2019, 4:09 PM

As I previously mentioned to @cscott, who provided useful early guidance, I spent my 10% time in April working on a port of Domino.js to native PHP. Most of the core DOM functionality that was covered by Domino.js is complete, so I thought I ought to let folks know this has been going on. I didn't want to "lick the cookie," in case I ended up not getting very far.

The software is still a work in progress. The W3C and WHATWG test suites have not yet been ported. There are some inconsistencies in conventions, comments, stubs, etc. Events are not in the current code; they were implemented but I removed them to focus on core functionality and we can add them back later. See the README.md for more on what still needs to be done.

If this turns out to be something we'd want to continue with, then decisions need to be made about

  • how to handle readonly properties, accessors, etc. (__get()/__set()? {get,set}Property()? property()?)
  • how to test (W3C DOM-TS and WHATWG test suite do not target PHP, but can be made to)
  • how to integrate RemexHTML and zest.php
  • what subset of the DOM spec to support (performance and readability implications)

Anyone who wishes to be involved is welcome. I'll also continue working on it, but now that actual code exists, it might be good to talk specifics so that development goes in a direction beneficial to the Parsing team or whoever else could use it (T218183).

PHP magic methods (__get(), and __set()) used to be very slow, even Zend was discouraging use of those - I'm not sure if this is still an issue but honestly I don't recommend it. I'd go with standard {get,set}Property() -> it gives clear/understandable code and allows for proper type checking.

Anomie added a comment.May 3 2019, 5:22 PM

Cool! (:

  • how to handle readonly properties, accessors, etc. (__get()/__set()? {get,set}Property()? property()?)

__get()/__set() have extra overhead beyond that of a normal method call, and for Parsoid at least that performance impact would likely be significant. Modern MediaWiki seems to prefer {get,set}Property() rather than a property() that can both get and set, as the latter can sometimes be confusing or ambiguous.

jlinehan added a comment.EditedMay 3 2019, 6:30 PM

Welcoming all your comments!

Yes, I think (although I have not measured) __get()/__set() are non-viable here due to performance, but also type safety, readability, and all the other usual reasons. In case we wanted to be super faithful to the spec's idea of 'property', I included it.

I personally think getProperty/setProperty is the only way that is going to work . property() can't consistently detect which branch ('get' or 'set') is intended due to edge cases in default argument values, etc. But getProperty() differs the most from how things are named in the spec, so it's a decision I wanted to be made as a group.

The current implementation takes the "middle path" using property(), because then it's one method for one property, and easier for me to keep track of in the port. I always intended to take one of the extremes -- either adhering more literally to the spec with __get()/__set(), or relaxing that and using separate accessors.

Obviously, the actual backing properties are being used internally (where possible), and depending on how severely performance is impacted by the function overhead, they could be exposed in some cases. But before getting too far into that kind of thing, I'd like to have a benchmark in place to guide any optimization. This is more about a strategic decision of how to implement properties.

As I previously mentioned to @cscott, who provided useful early guidance, I spent my 10% time in April working on a port of Domino.js to native PHP. Most of the core DOM functionality that was covered by Domino.js is complete, so I thought I ought to let folks know this has been going on. I didn't want to "lick the cookie," in case I ended up not getting very far.
...
Anyone who wishes to be involved is welcome. I'll also continue working on it, but now that actual code exists, it might be good to talk specifics so that development goes in a direction beneficial to the Parsing team or whoever else could use it (T218183).

Nice work! Quick initial comment now. As far as Parsoid is concerned, assuming feature parity wrt functionality, performance will be the biggest determining factor here since Parsoid's functionality is heavily DOM based.

jlinehan added a comment.EditedMay 3 2019, 8:07 PM

Nice work! Quick initial comment now. As far as Parsoid is concerned, assuming feature parity wrt functionality, performance will be the biggest determining factor here since Parsoid's functionality is heavily DOM based.

I'm thinking an initial sanity check could be performance vs. PHP DOM on generic tasks, yes? What could be a good representative task / set of tasks? Build large document? Read, parse (with RemexHtml), modify, write? At least to get a handle on whether it's off by an order of magnitude or something and just isn't going to work.

I can use https://phabricator.wikimedia.org/T215000#4958564 for an idea of the number of calls, etc., but if there is more information available it could be helpful. Also, have you done any profiling of PHP DOM?

Anomie added a comment.May 3 2019, 8:34 PM

property() can't consistently detect which branch ('get' or 'set') is intended due to edge cases in default argument values, etc.

It looks like you can if you use func_num_args(): https://3v4l.org/A6aCc. I haven't tested the performance of that though.

jlinehan added a comment.EditedMay 3 2019, 10:48 PM

It looks like you can if you use func_num_args(): https://3v4l.org/A6aCc. I haven't tested the performance of that though.

Ah yes, that's true. It's another function call (and the branch, I guess) of added overhead, but if the difference in performance isn't serious, there might be some reasons to favor property() then.

  • getFoo/setFoo are common enough methods in the spec that it can be confusing to distinguish them from property accessors.
  • Reflected content attributes (exposed as properties like HTMLElement.name, value, data, type, etc.) having the potential to collide with class properties if blindly converted to accessors (I have not exhaustively checked this, it just seems possible)

Mostly QoL bikeshedding that can be worked around in various ways, but.

Nice work! Quick initial comment now. As far as Parsoid is concerned, assuming feature parity wrt functionality, performance will be the biggest determining factor here since Parsoid's functionality is heavily DOM based.

I'm thinking an initial sanity check could be performance vs. PHP DOM on generic tasks, yes? What could be a good representative task / set of tasks? Build large document? Read, parse (with RemexHtml), modify, write? At least to get a handle on whether it's off by an order of magnitude or something and just isn't going to work.

  • Build large document (and compare memory footprint, if possible)
  • Visit the entire DOM (this is a common operation in Parsoid)
  • Query selector performance (unless we are simply plugging in Zest)
  • Remove a child from a node with a lot of children - we initially had workarounds around domino's poor performance here but I think this was eventually fixed in Domino .. so, I doubt this will be an issue in the port.
  • And, you have a profile of common operations in Parsoid in the phab task comment you reference below

I can use https://phabricator.wikimedia.org/T215000#4958564 for an idea of the number of calls, etc., but if there is more information available it could be helpful.
Also, have you done any profiling of PHP DOM?

Not really. The closest was when we had some basic ports of a few DOM passes in T204614: Evaluate and document performance of one or two DOM transformers in node.js vs PHP. And RemexHtml perf in T204595: Evaluate and document performance of RemexHtml vs Domino.

@jlinehan, Early heads up from a project mgmt. POV. Once we are done with the porting and deployment, we might want to more seriously evaluate this ported library as a replacement for the PHP DOM. Without evaluation, especially performance, I do not want to say we will definitely be using this, but given all the bandaids we've been applying on top of PHP DOM, it is more likely we will want to consider this. I leave that evaluation up to @cscott and you.

Tgr added a comment.Wed, Sep 11, 5:18 PM

https://github.com/ivopetkov/html5-dom-document-php was mentioned in the last JetBrains newsletter as a similar library.

(Wouldn't it be nice if our libraries would be mentioned in major newsletters as well? waves at T171073 and ducks)

@jlinehan, Early heads up from a project mgmt. POV.

Thanks for the heads up. I have had my plate completely full with Better Use of Data work this last quarter, so I haven't had a chance to finish off evaluating this library. I can make some time in Q2 to make sure it gets there so it's easier for your team to make a decision about building on it.

The first thing is I need to get it loaded back in my own memory, and finish a couple dangling pieces and get a basic test setup to do some basic evaluation. From there maybe we could do some code review, with either @cscott or @Tgr or others who are interested. If we feel satisfied at that point, we can figure out who wants to work on it and what kind of things are needed or not needed from it, etc. Having me at the center of development is probably not ideal but I do feel responsible for getting things in working order and doing code review etc.

In terms of timelines, do you have an estimate for when you would want this to be ready to go? I'll leave it up to you, and I can build it into my quarterly planning that's happening now.

https://github.com/ivopetkov/html5-dom-document-php was mentioned in the last JetBrains newsletter as a similar library.

Should evaluate this and any other alternatives. Obviously there are advantages and disadvantages to maintaining a library.

(Wouldn't it be nice if our libraries would be mentioned in major newsletters as well? waves at T171073 and ducks)

Agree... Maybe we're all just modest by nature :)

https://github.com/ivopetkov/html5-dom-document-php was mentioned in the last JetBrains newsletter as a similar library.
(Wouldn't it be nice if our libraries would be mentioned in major newsletters as well? waves at T171073 and ducks)

I think the primary reason why some of that promotion doesn't happen is because we don't promise / cannot commit to ongoing maintenance wrt third-party of those libraries. We should probably move any additional discussion of this to T171073 :-)

@jlinehan, Early heads up from a project mgmt. POV.

Thanks for the heads up. I have had my plate completely full with Better Use of Data work this last quarter, so I haven't had a chance to finish off evaluating this library. I can make some time in Q2 to make sure it gets there so it's easier for your team to make a decision about building on it.

Ya, no problem. I don't think we were ready to fully take this on on our end either - I wanted to keep down the number of moving pieces during the port. I still optimistically anticipate that we'll be completely switched over to Parsoid/PHP (including switching off Parsoid/JS) by end of October. One of our explicit Q2 goals is to address any tech debt we incurred during the port. I think this task would squarely fall under that goal. Hence the early heads up and your Q2 timeline works well.

The first thing is I need to get it loaded back in my own memory, and finish a couple dangling pieces and get a basic test setup to do some basic evaluation. From there maybe we could do some code review, with either @cscott or @Tgr or others who are interested. If we feel satisfied at that point, we can figure out who wants to work on it and what kind of things are needed or not needed from it, etc. Having me at the center of development is probably not ideal but I do feel responsible for getting things in working order and doing code review etc.
In terms of timelines, do you have an estimate for when you would want this to be ready to go? I'll leave it up to you, and I can build it into my quarterly planning that's happening now.

Realistically, November is when we will probably get back to this. But, we have a bunch of other tech debt tasks as well, so we can reorder tasks to ensure you have had a chance to have another whack at this.

jlinehan added a comment.EditedWed, Sep 11, 6:05 PM

I still optimistically anticipate that we'll be completely switched over to Parsoid/PHP (including switching off Parsoid/JS) by end of October.

Great!

Realistically, November is when we will probably get back to this. But, we have a bunch of other tech debt tasks as well, so we can reorder tasks to ensure you have had a chance to have another whack at this.

Okay, I will see if this is a better fit for the start of the quarter or a mid-quarter sprint. I do know that if I leave it too long it will probably get crunched out from pressure to meet other topline quarterly goals, and I want to avoid that. Once I pin down an exact slot I'll post it here for reference.