Page MenuHomePhabricator

Make Parsoid/Zest/Remex safe for use with PHP8.4 Dom\Document
Open, Needs TriagePublic

Event Timeline

Change #1145340 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/libs/RemexHtml@master] Improve PHP 8.4 compatibility fixes

https://gerrit.wikimedia.org/r/1145340

Change #704745 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Support Dodo/PHP8.4 DOM implementations

https://gerrit.wikimedia.org/r/704745

The goal here is to move to the more standards-compliant PHP8.4 implementations as soon as WMF production is ready for them. We intend to switch to the PHP8.4 DOM classes in CI right away (https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/704745), so we don't regress.

I got some push back from PHP devs on mastodon when I suggested the PHP 8.4 implementation still had bugs we needed to work around. One issue is https://github.com/whatwg/dom/issues/849 / https://github.com/whatwg/dom/issues/769 upstream in the DOM spec: the HTML parsing spec allows a number of element and attribute names to be created which the DOM spec disallows, meaning that it is "impossible" to create a DOM-only implementation of the HTML parser. I expect this will /eventually/ be fixed upstream --- apparently at least two browser vendors are onboard ( https://github.com/whatwg/dom/pull/1079 ), and PHP would be expected to eventually follow the upstream DOM spec, but this process is going to take a while.

Other than that, it appears the PHP 8.4 parser is slightly buggy (as evidenced by Zest tests) but that's less relevant for us since we use Remex. Parsoid requires deep access to the tree builder, so we're likely to always use Remex, but it's possible some users of Remex will eventually be able to use PHP8.4's HTMLDocument::loadFromString() instead of Remex. I haven't benchmarked these against each other yet.

Anyway, this leaves the following goals:

  1. Reduce the number of 'hacks' in DOMBuilder which were specifically to work around PHP < 8.4 \DOMDocument bugs. They can be gated by a flag and then eventually bypassed at runtime when we move to PHP8.4. This will ultimately reduce complexity, at least for folks using PHP8.4+.
  2. Try to make the output of Remex better match the output of Dom\HTMLDocument::loadFromString() so that we won't break anything if we swap clients from one to the other for performance. I think we're *mostly* there, the main difference is "suppressHtmlNamespace" which actually has a corresponding Dom\HTML_NO_DEFAULT_NS option in core (although there are possible some difference in template parsing -- to be investigated). The goal here would be to be able to map a given set of Remex options to a set of PHP native options, so that "Dom\HTMLDocument::loadFromString with Dom\HTML_NO_DEFAULT_NS" yields exactly the same output (modulo bugs) as "Remex parse with 'suppressHtmlNamespace'=>true" and equivalently with the PHP default options and the Remex default options. Like I said, we're mostly there but I think we can rename some of the Remex options to make the correspondence clearer. (An example: Dom\HTMLDocument::loadFromString might handle leading newlines in <pre> differently than Remex; if so we should add a Remex option to select the "php 8.4 compatible behavior" by default and allow our existing clients to explicitly opt-in to the "current" Remex behavior for <pre> (which may or may not correspond to an earlier revision of the HTML parsing spec, this has flipped back and forth a number of times so I don't immediately know which of Remex or PHP8.4 is "right" at the current time).
  3. Encapsulate the "coerced character set" issue in a way which will be forward-compatible when this is *eventually* resolved upstream. Our current mechanism relies on a DOMException being thrown to trigger a workaround, which is good from a forward-compatibility standpoint: it is expected that eventually using these characters will Just Work and not trigger a DOMException, and our workarounds will become 'magically' unused at that point. Great. But in the meantime, we have three different workaround strategies, and they could be more clearly flagged in the DOMBuilder options:
    • (Re)throw an DOMException (hard fail, no workaround; appropriate in some cases)
    • Perform a semi-reversible transformation of element and attribute names and set a 'coerced' flag; reverse this transformation during serialization
    • Use "another" HTML parser -- not necessarily standards-compliant, could be buggy -- to create elements/attributes with "difficult" names. (This essentially generates the same output DOM we expect to eventually be able to use the DOM API to generate once https://github.com/whatwg/dom/pull/1079 is rolled out, just "today" not "tomorrow".)

There are various issues with all of these workaround methods: throwing an exception has led to production errors (T349310); performing a transformation of the names works when you use the Remex serializer, but doesn't round trip cleanly when using other serializers (like Parsoid's); and using an HTML parser to implement an HTML parser adds a lot of "needless" complexity.

Change #1145340 merged by jenkins-bot:

[mediawiki/libs/RemexHtml@master] Improve PHP 8.4 compatibility fixes

https://gerrit.wikimedia.org/r/1145340

Change #704745 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Support Dodo/PHP8.4 DOM implementations

https://gerrit.wikimedia.org/r/704745

Change #1165063 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a9

https://gerrit.wikimedia.org/r/1165063

Change #1165063 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.22.0-a9

https://gerrit.wikimedia.org/r/1165063

Change #1233229 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/services/parsoid@REL1_44] Support Dodo/PHP8.4 DOM implementations

https://gerrit.wikimedia.org/r/1233229

Change #1233235 had a related patch set uploaded (by Reedy; author: C. Scott Ananian):

[mediawiki/services/parsoid@REL1_43] Support Dodo/PHP8.4 DOM implementations

https://gerrit.wikimedia.org/r/1233235

Change #1233229 merged by jenkins-bot:

[mediawiki/services/parsoid@REL1_44] Support Dodo/PHP8.4 DOM implementations

https://gerrit.wikimedia.org/r/1233229

Change #1233235 abandoned by C. Scott Ananian:

[mediawiki/services/parsoid@REL1_43] Support Dodo/PHP8.4 DOM implementations

Reason:

I don't think this is a good idea to backport.

https://gerrit.wikimedia.org/r/1233235