Page MenuHomePhabricator

Port domino (or another spec-compliant DOM library) to PHP
Open, LowPublic

Description

I'm biased toward domino because I've been maintaining it. But there's a need for a good latest-DOM-spec-compliant library for PHP that is secure -- ie, deliberately doesn't implement javascript execution, resource loading, document.write etc. ("Full" spec compliance means document.write, sandboxing, etc and while there are JS libraries that do this (jsdom for example) that's all functionality we actively don't want.)

Motivation is the long-and-growing list of bugs/inconsistencies/eccentricities in PHP's DOM implementation. See T215000: Fill gaps in PHP DOM's functionality, T217766: Flow\Exception\WikitextException: ParseEntityRef: no name, the existence of MCS' HtmlFormatter library (T217360), etc. for details.

An alternative is to port domino/etc to C directly and have it be usable as a PHP extension so we get good perf as well. If it used libxml's nodes underneath you could still do fast XPath queries, etc, using the existing DOMXPath package. (On the other hand, the relatively fast pace of change in the DOM WG may mean that tying this to the PHP release cycle is not the best idea.)

A note about priority and dependencies: We are not going to do this as part of the Parsoid port. At this time we believe that we understand Parsoid's usage of the DOM well enough that we can workaround the bugs in the core PHP DOM implementation. But as a longer-term goal this would enhance maintainability and allow us to remove workarounds.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ssastry updated the task description. (Show Details)
Anomie added subscribers: Smalyshev, Anomie.

I'm going to throw this on #cpt-backlog, mainly for the possibility of making it a PHP extension since Tim and I have some experience with that from LuaSandbox and Excimer. If it gets to that point we might talk to @Smalyshev too.

In theory (with infinite resources, etc) the best of all possible worlds would be a pure PHP implementation coupled with a "native" extension with more speed. Since (again in theory) both are implementing the exact same DOM API anyway, this would allow us to avoid adding an extension to mediawiki's required dependencies.

One note about porting domino in particular -- it uses meta-programming to generate classes corresponding to all the different HTML element types ([HTMLAnchorElement](https://developer.mozilla.org/en-US/docs/Web/API/HTMLAnchorElement), etc) from a compact specification (in htmlelts.js). Since neither PHP nor C support that kind of metaprogramming (eval doesn't count), that part of domino would have to be rewritten as a code-generator which runs during the build-phase instead. Probably not a huge deal, just something to keep in mind.

Another useful note, while I'm brain-dumping. Part of the task would be to define an appropriate PHP binding to WebIDL. There's a good start in packagist -- https://packagist.org/packages/esperecyan/webidl -- but it's implementation based. Someone should write a brief document describing how WebIDL maps to PHP. Unfortunately, the only non-JavaScript language that appears to have a format WebIDL binding description is Java, and they have "stopped work" on it and published it as a W3C note.

An alternative is to port domino/etc to C directly and have it be usable as a PHP extension so we get good perf as well. If it used libxml's nodes underneath you could still do fast XPath queries, etc, using the existing DOMXPath package.

AIUI most of the bugs come from libxml, not PHP directly, so that wouldn't improve the situation much.

As a general comment, XML/DOM library is probably one of the areas where performance would be critical, so C port would be great, but that would probably require serious resource investment. All existing PHP libraries AFAIK base on libxml2, so I wonder whether the problems we're having reside in PHP bindings or libxml2, and whether it may be less effort to locate and resolve these instead of starting a new clean plate effort. Or, alternatively, find another C/C++ DOM library and make a binding for that one?

An alternative is to port domino/etc to C directly and have it be usable as a PHP extension so we get good perf as well. If it used libxml's nodes underneath you could still do fast XPath queries, etc, using the existing DOMXPath package.

AIUI most of the bugs come from libxml, not PHP directly, so that wouldn't improve the situation much.

The idea would be to use the very basic parts of libxml, the node tree/child list data structure. I'm pretty sure the basic childNodes accessors are bug-free. But we could give a more nuanced bindings to (say) the Node#nodeType accessor so that any non-standard libxml types were mapped to the 'correct' ones; we don't have to directly expose the libxml mutators if they are broken. Similary, we would reimplement DOMDocument#loadHTML, DOMDocument#createElement, etc to make them spec-compliant instead of directly exposing some partly-broken interface from libxml.

As a general comment, XML/DOM library is probably one of the areas where performance would be critical, so C port would be great, but that would probably require serious resource investment. All existing PHP libraries AFAIK base on libxml2, so I wonder whether the problems we're having reside in PHP bindings or libxml2, and whether it may be less effort to locate and resolve these instead of starting a new clean plate effort. Or, alternatively, find another C/C++ DOM library and make a binding for that one?

We've been looking for alternate DOM libraries and they don't seem to be common, alas.

Filed T218183: Audit uses of PHP DOM in Wikimedia software about listing where else such a library could be useful.

We've been looking for alternate DOM libraries and they don't seem to be common, alas.

Something to look at https://github.com/fitzgen/dodrio

Yup, I'm keeping tabs on it. Recent comments indicate that they are not feeling too optimistic about being about to update DOM in core w/o breaking backward-compatibility, and they have deliberately de-scoped to just include "modern core DOM" *not* the HTML-specific DOM extensions. Of course these are entangled in various ways, so even "modern core DOM" includes the proper case of HTML tag names...

As I previously mentioned to @cscott, who provided useful early guidance, I spent my 10% time in April working on a port of Domino.js to native PHP. Most of the core DOM functionality that was covered by Domino.js is complete, so I thought I ought to let folks know this has been going on. I didn't want to "lick the cookie," in case I ended up not getting very far.

The software is still a work in progress. The W3C and WHATWG test suites have not yet been ported. There are some inconsistencies in conventions, comments, stubs, etc. Events are not in the current code; they were implemented but I removed them to focus on core functionality and we can add them back later. See the README.md for more on what still needs to be done.

If this turns out to be something we'd want to continue with, then decisions need to be made about

  • how to handle readonly properties, accessors, etc. (__get()/__set()? {get,set}Property()? property()?)
  • how to test (W3C DOM-TS and WHATWG test suite do not target PHP, but can be made to)
  • how to integrate RemexHTML and zest.php
  • what subset of the DOM spec to support (performance and readability implications)

Anyone who wishes to be involved is welcome. I'll also continue working on it, but now that actual code exists, it might be good to talk specifics so that development goes in a direction beneficial to the Parsing team or whoever else could use it (T218183).

PHP magic methods (__get(), and __set()) used to be very slow, even Zend was discouraging use of those - I'm not sure if this is still an issue but honestly I don't recommend it. I'd go with standard {get,set}Property() -> it gives clear/understandable code and allows for proper type checking.

Cool! (:

  • how to handle readonly properties, accessors, etc. (__get()/__set()? {get,set}Property()? property()?)

__get()/__set() have extra overhead beyond that of a normal method call, and for Parsoid at least that performance impact would likely be significant. Modern MediaWiki seems to prefer {get,set}Property() rather than a property() that can both get and set, as the latter can sometimes be confusing or ambiguous.

Welcoming all your comments!

Yes, I think (although I have not measured) __get()/__set() are non-viable here due to performance, but also type safety, readability, and all the other usual reasons. In case we wanted to be super faithful to the spec's idea of 'property', I included it.

I personally think getProperty/setProperty is the only way that is going to work . property() can't consistently detect which branch ('get' or 'set') is intended due to edge cases in default argument values, etc. But getProperty() differs the most from how things are named in the spec, so it's a decision I wanted to be made as a group.

The current implementation takes the "middle path" using property(), because then it's one method for one property, and easier for me to keep track of in the port. I always intended to take one of the extremes -- either adhering more literally to the spec with __get()/__set(), or relaxing that and using separate accessors.

Obviously, the actual backing properties are being used internally (where possible), and depending on how severely performance is impacted by the function overhead, they could be exposed in some cases. But before getting too far into that kind of thing, I'd like to have a benchmark in place to guide any optimization. This is more about a strategic decision of how to implement properties.

As I previously mentioned to @cscott, who provided useful early guidance, I spent my 10% time in April working on a port of Domino.js to native PHP. Most of the core DOM functionality that was covered by Domino.js is complete, so I thought I ought to let folks know this has been going on. I didn't want to "lick the cookie," in case I ended up not getting very far.
...
Anyone who wishes to be involved is welcome. I'll also continue working on it, but now that actual code exists, it might be good to talk specifics so that development goes in a direction beneficial to the Parsing team or whoever else could use it (T218183).

Nice work! Quick initial comment now. As far as Parsoid is concerned, assuming feature parity wrt functionality, performance will be the biggest determining factor here since Parsoid's functionality is heavily DOM based.

Nice work! Quick initial comment now. As far as Parsoid is concerned, assuming feature parity wrt functionality, performance will be the biggest determining factor here since Parsoid's functionality is heavily DOM based.

I'm thinking an initial sanity check could be performance vs. PHP DOM on generic tasks, yes? What could be a good representative task / set of tasks? Build large document? Read, parse (with RemexHtml), modify, write? At least to get a handle on whether it's off by an order of magnitude or something and just isn't going to work.

I can use https://phabricator.wikimedia.org/T215000#4958564 for an idea of the number of calls, etc., but if there is more information available it could be helpful. Also, have you done any profiling of PHP DOM?

property() can't consistently detect which branch ('get' or 'set') is intended due to edge cases in default argument values, etc.

It looks like you can if you use func_num_args(): https://3v4l.org/A6aCc. I haven't tested the performance of that though.

It looks like you can if you use func_num_args(): https://3v4l.org/A6aCc. I haven't tested the performance of that though.

Ah yes, that's true. It's another function call (and the branch, I guess) of added overhead, but if the difference in performance isn't serious, there might be some reasons to favor property() then.

  • getFoo/setFoo are common enough methods in the spec that it can be confusing to distinguish them from property accessors.
  • Reflected content attributes (exposed as properties like HTMLElement.name, value, data, type, etc.) having the potential to collide with class properties if blindly converted to accessors (I have not exhaustively checked this, it just seems possible)

Mostly QoL bikeshedding that can be worked around in various ways, but.

Nice work! Quick initial comment now. As far as Parsoid is concerned, assuming feature parity wrt functionality, performance will be the biggest determining factor here since Parsoid's functionality is heavily DOM based.

I'm thinking an initial sanity check could be performance vs. PHP DOM on generic tasks, yes? What could be a good representative task / set of tasks? Build large document? Read, parse (with RemexHtml), modify, write? At least to get a handle on whether it's off by an order of magnitude or something and just isn't going to work.

  • Build large document (and compare memory footprint, if possible)
  • Visit the entire DOM (this is a common operation in Parsoid)
  • Query selector performance (unless we are simply plugging in Zest)
  • Remove a child from a node with a lot of children - we initially had workarounds around domino's poor performance here but I think this was eventually fixed in Domino .. so, I doubt this will be an issue in the port.
  • And, you have a profile of common operations in Parsoid in the phab task comment you reference below

I can use https://phabricator.wikimedia.org/T215000#4958564 for an idea of the number of calls, etc., but if there is more information available it could be helpful.
Also, have you done any profiling of PHP DOM?

Not really. The closest was when we had some basic ports of a few DOM passes in T204614: Evaluate and document performance of one or two DOM transformers in node.js vs PHP. And RemexHtml perf in T204595: Evaluate and document performance of RemexHtml vs Domino.

@jlinehan, Early heads up from a project mgmt. POV. Once we are done with the porting and deployment, we might want to more seriously evaluate this ported library as a replacement for the PHP DOM. Without evaluation, especially performance, I do not want to say we will definitely be using this, but given all the bandaids we've been applying on top of PHP DOM, it is more likely we will want to consider this. I leave that evaluation up to @cscott and you.

https://github.com/ivopetkov/html5-dom-document-php was mentioned in the last JetBrains newsletter as a similar library.

(Wouldn't it be nice if our libraries would be mentioned in major newsletters as well? waves at T171073 and ducks)

@jlinehan, Early heads up from a project mgmt. POV.

Thanks for the heads up. I have had my plate completely full with Better Use of Data work this last quarter, so I haven't had a chance to finish off evaluating this library. I can make some time in Q2 to make sure it gets there so it's easier for your team to make a decision about building on it.

The first thing is I need to get it loaded back in my own memory, and finish a couple dangling pieces and get a basic test setup to do some basic evaluation. From there maybe we could do some code review, with either @cscott or @Tgr or others who are interested. If we feel satisfied at that point, we can figure out who wants to work on it and what kind of things are needed or not needed from it, etc. Having me at the center of development is probably not ideal but I do feel responsible for getting things in working order and doing code review etc.

In terms of timelines, do you have an estimate for when you would want this to be ready to go? I'll leave it up to you, and I can build it into my quarterly planning that's happening now.

https://github.com/ivopetkov/html5-dom-document-php was mentioned in the last JetBrains newsletter as a similar library.

Should evaluate this and any other alternatives. Obviously there are advantages and disadvantages to maintaining a library.

(Wouldn't it be nice if our libraries would be mentioned in major newsletters as well? waves at T171073 and ducks)

Agree... Maybe we're all just modest by nature :)

https://github.com/ivopetkov/html5-dom-document-php was mentioned in the last JetBrains newsletter as a similar library.

(Wouldn't it be nice if our libraries would be mentioned in major newsletters as well? waves at T171073 and ducks)

I think the primary reason why some of that promotion doesn't happen is because we don't promise / cannot commit to ongoing maintenance wrt third-party of those libraries. We should probably move any additional discussion of this to T171073 :-)

@jlinehan, Early heads up from a project mgmt. POV.

Thanks for the heads up. I have had my plate completely full with Better Use of Data work this last quarter, so I haven't had a chance to finish off evaluating this library. I can make some time in Q2 to make sure it gets there so it's easier for your team to make a decision about building on it.

Ya, no problem. I don't think we were ready to fully take this on on our end either - I wanted to keep down the number of moving pieces during the port. I still optimistically anticipate that we'll be completely switched over to Parsoid/PHP (including switching off Parsoid/JS) by end of October. One of our explicit Q2 goals is to address any tech debt we incurred during the port. I think this task would squarely fall under that goal. Hence the early heads up and your Q2 timeline works well.

The first thing is I need to get it loaded back in my own memory, and finish a couple dangling pieces and get a basic test setup to do some basic evaluation. From there maybe we could do some code review, with either @cscott or @Tgr or others who are interested. If we feel satisfied at that point, we can figure out who wants to work on it and what kind of things are needed or not needed from it, etc. Having me at the center of development is probably not ideal but I do feel responsible for getting things in working order and doing code review etc.

In terms of timelines, do you have an estimate for when you would want this to be ready to go? I'll leave it up to you, and I can build it into my quarterly planning that's happening now.

Realistically, November is when we will probably get back to this. But, we have a bunch of other tech debt tasks as well, so we can reorder tasks to ensure you have had a chance to have another whack at this.

I still optimistically anticipate that we'll be completely switched over to Parsoid/PHP (including switching off Parsoid/JS) by end of October.

Great!

Realistically, November is when we will probably get back to this. But, we have a bunch of other tech debt tasks as well, so we can reorder tasks to ensure you have had a chance to have another whack at this.

Okay, I will see if this is a better fit for the start of the quarter or a mid-quarter sprint. I do know that if I leave it too long it will probably get crunched out from pressure to meet other topline quarterly goals, and I want to avoid that. Once I pin down an exact slot I'll post it here for reference.

@ssastry Okay, time for an update. I spent the last few days re-visiting the PHP library and started the necessary finishing.

Some preliminary numbers look good -- insertion performance is good, building a 50,000 element tree takes about half the time of PHP DOMDocument which is not mind-blowing but still good. I'm setting up something more comprehensive while finishing some functions and documentation, but no red flags yet in terms of performance.

I'm at an offsite this week but afterwards I will update this ticket with more details and try to set up a time to chat if people are ready for that.

@ssastry Okay, time for an update. I spent the last few days re-visiting the PHP library and started the necessary finishing.

Some preliminary numbers look good -- insertion performance is good, building a 50,000 element tree takes about half the time of PHP DOMDocument which is not mind-blowing but still good. I'm setting up something more comprehensive while finishing some functions and documentation, but no red flags yet in terms of performance.

I'm at an offsite this week but afterwards I will update this ticket with more details and try to set up a time to chat if people are ready for that.

Excellent! Thanks for circling back to this and the update! We are happy to consider this as long as there is no substantial perf. degradation, so, we'll take the 2x improvement. :-)

Also, curious about the memory footprint, especially on larger documents since we see a lot of these in production.

A PHP RFC has been accepted to implement the WHATWG Living Standard DOM as an extension in PHP 8.0.

A PHP RFC has been accepted to implement the WHATWG Living Standard DOM as an extension in PHP 8.0.

Wow it's about time!

Well, it looks like PHP 8 is anticipated in Q4 2020-Q1 2021, if they follow the release process. I'm assuming we'd all prefer to use that solution if it were available. It seems close enough to wait, assuming the current solution is stable enough to last a year plus however long it takes WMF to upgrade to PHP 8. What do you think @ssastry?

@jlinehan Unfortunately I believe that RFC was a bit overly optimistic/naive. Once they started diving into the details (with some feedback from us) they realized they couldn't actually implement the W3C spec and still maintain backwards-compliance w/ the existing DOM extension. I'm surprised the RFC went forward, honestly; I thought they'd agreed to change tack.

Anyway, I commented on the github here: https://github.com/php/php-src/pull/4709#issuecomment-555079767

An aggressive commitment to fixing the bugs even if it breaks old code would indeed be interesting, but just adding methods without fixing the bugs in the existing implementation would not make the built-in DOM any more attractive.

@jlinehan Once they started diving into the details (with some feedback from us) they realized they couldn't actually implement the W3C spec and still maintain backwards-compliance w/ the existing DOM extension

@cscott you're kidding, so the plan was/is to re-use the same extension code, just with a different interface? And a few new methods? That's not what I was expecting.

Given that I'll continue under the premise that this code will be useful (unless more info comes to light).

Well, it looks like PHP 8 is anticipated in Q4 2020-Q1 2021, if they follow the release process.

I think the relevant date would be MediaWiki requiring PHP 8, which is probably at least two years later (was 4 years for PHP 7).

A PHP RFC has been accepted to implement the WHATWG Living Standard DOM as an extension in PHP 8.0.

Wow it's about time!

Well, it looks like PHP 8 is anticipated in Q4 2020-Q1 2021, if they follow the release process. I'm assuming we'd all prefer to use that solution if it were available. It seems close enough to wait, assuming the current solution is stable enough to last a year plus however long it takes WMF to upgrade to PHP 8. What do you think @ssastry?

Yes, I think we still want to evaluate the domino port option.

To elaborate a little bit, our intention is to tie a bow around production deployment of Parsoid/PHP in the next month or so, then January (roughly) after we've retired Parsoid/JS start looking at code debt and integration issues with the Parsoid/PHP codebase. Moving away from PHP's dom extension is maybe not #1 on the list of things to do in that phase, but probably in the top 5 at least.

Given the timeline, having a spec-compliant PHP DOM implementation could even be useful to push for PHP's native DOM implementation being decent. Especially if it comes with a test suite which covers all the current PHP DOM compliance errors.

To elaborate a little bit, our intention is to tie a bow around production deployment of Parsoid/PHP in the next month or so, then January (roughly) after we've retired Parsoid/JS start looking at code debt and integration issues with the Parsoid/PHP codebase. Moving away from PHP's dom extension is maybe not #1 on the list of things to do in that phase, but probably in the top 5 at least.

Good this is helpful for me to keep things going. On my end, I have 2 other projects that are tied to OKRs that are higher priority, but the software here is still in decent shape and moving forward. I'll keep updating this ticket. Probably around EOY there will be something with adequate coverage and some more numbers.

Given the timeline, having a spec-compliant PHP DOM implementation could even be useful to push for PHP's native DOM implementation being decent. Especially if it comes with a test suite which covers all the current PHP DOM compliance errors.

It would make for a pretty good blog / mailing list post. It's still incredible to me that there is no better community-maintained reference implementation of WHATWG DOM in C that an extension could build on. Or that libxml2 is apparently so neglected...

@jlinehan, are you able to work on this this month?

@jlinehan, are you able to work on this this month?

I think so. I'm doing roadmapping this week with my managers and will update this ticket end-of-week with some sort of rough timeline.

@jlinehan, are you able to work on this this month?

I think so. I'm doing roadmapping this week with my managers and will update this ticket end-of-week with some sort of rough timeline.

Checking in here again.

@Esanders pointed me at https://github.com/TRowbotham/PHPDOM, which has similar goals. A brief examination of https://github.com/TRowbotham/PHPDOM/blob/master/src/Support/Collection/NodeSet.php (used for storing a Node's children) seems to indicate that it won't have adequate performance for arbitrary inserts (which require dynamically switching to a linked-list structure), but it seems well-enough engineered that adding this change-of-representation code wouldn't be difficult.

We should do a performance bake-off, perhaps? (Also also retrofit RemexHtml into whatever library we choose to handle the HTML parsing aspects of innerHTML setters, etc.)

@Esanders pointed me at https://github.com/TRowbotham/PHPDOM, which has similar goals.

Worth noting: "This is very much a work in progress and as a result things may be broken." and no commits since July 2018.

Worth noting: "This is very much a work in progress and as a result things may be broken." and no commits since July 2018.

Yeah, I would think it would be better from our perspective to have something that looks like something at least a few of us are familiar with (Domino, etc) and is able to be actively developed (at least 1-2 of us knows why it is the way it is), and was built in-house specifically for our needs.

But! PHPDOM might still be a good benchmark to flush out any quirks with one implementation or the other. It depends on how much work we'd need to do to get it rigged up, but it could be worthwhile to pinpoint any issues either with coverage or with perf. Of course, if it's running circles around the domino port for a bunch of reasons that can't be fixed in a week, that's a whole different conversation. That would be good to know, too!

Given that the other option is not usable out of the box, I think we should pursue our port further.

FWIW it was mentioned by @Esanders because it has an implementation of https://developer.mozilla.org/en-US/docs/Web/API/Range/cloneContents, and apparently DiscussionTools uses Range on the JavaScript side to represent "the subtree representing a comment". Domino doesn't have an implementation of Range. So there are definitely pieces we could probably lift from one or the other codebase. The licenses are compatible.

FWIW it was mentioned by @Esanders because it has an implementation of https://developer.mozilla.org/en-US/docs/Web/API/Range/cloneContents, and apparently DiscussionTools uses Range on the JavaScript side to represent "the subtree representing a comment". Domino doesn't have an implementation of Range. So there are definitely pieces we could probably lift from one or the other codebase. The licenses are compatible.

Yeah that is necessary. There may be other coverage issues if the scope has grown past the parser. Should we be building a list of DOM API dependencies for our target use cases?

Change 708619 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/extensions/DiscussionTools@master] WIP: Don't refer directly to PHP `dom` extension classes

https://gerrit.wikimedia.org/r/708619

Change 708619 abandoned by C. Scott Ananian:

[mediawiki/extensions/DiscussionTools@master] WIP: Don't refer directly to PHP `dom` extension classes

Reason:

Abandoned in favor of I3c4f41c3819770f85d68157c9f690d650b7266a3

https://gerrit.wikimedia.org/r/708619

Change 709061 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/extensions/DiscussionTools@master] WIP: Don't refer directly to PHP `dom` extension classes

https://gerrit.wikimedia.org/r/709061

Change 709061 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Don't refer directly to PHP `dom` extension classes; avoid nonstandard behavior

https://gerrit.wikimedia.org/r/709061

Aklapper added a subscriber: Tim.abdullin.

@Tim.abdullin: Per emails from Sep18 and Oct20 and https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup , I am resetting the assignee of this task because there has not been progress lately (please correct me if I am wrong!). Resetting the assignee avoids the impression that somebody is already working on this task. It also allows others to potentially work towards fixing this task. Please claim this task again when you plan to work on it (via Add Action...Assign / Claim in the dropdown menu) - it would be welcome. Thanks for your understanding!