Page MenuHomePhabricator

Audit uses of PHP DOM in Wikimedia software
Open, MediumPublic

Description

PHP's DOM implementation has various problems and does not support HTML5 well. The PHP port of Parsoid involves finding replacements or workarounds (see T215000: Fill gaps in PHP DOM's functionality, T217867: Port domino (or another spec-compliant DOM library) to PHP). We should make a list of all other places that make use of the PHP DOM and see if they are better served by the new set of tools.

Event Timeline

Tgr created this task.Mar 13 2019, 6:26 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 13 2019, 6:26 AM
Tgr added a comment.Mar 13 2019, 6:28 AM

T217360: Replace libxml/xpath in HtmlFormatter with Remex/zest already has its own bug.
CommonsMetadata has some very limited HTML DOM navigation, too.

I'm finding performance issues with the PHP DOM implementation in T212543: RemexHtml DOM construction performance increases non-linearly wrt HTML size as well; we could watch for those in the audit as well. (In particular, setting the namespace on a node in the PHP DOM seems to trigger a nonlinear slowdown.)

Tgr added a comment.Mar 13 2019, 9:24 PM

Here's a search for all Wikimedia-deployed code using PHP DOM classes. At a glance:

  • wmde/php-vuejs-templating uses extended HTML for templating and parses it via DOMDocument (with a comment saying TODO html5 tags can fail parsing)
  • Flow, already discussed elsewhere
  • the help API module does some lightweight URL rewriting
  • ResourceLoader does SVG parsing, not sure if that's relevant here
  • some of things in core (LocalisationCache, preprocessor, import) and some extensions (DonationInterface, GWToolset, Translate, RSS, CodeReview, FileImporter, Timeline...) use it for XML parsing, probably not relevant here
  • parser test framework (there's a comment on how that's problematic)
  • HTMLFormatter (and MobileFormatter in MobileFrontend)
  • CommonsMetadata for description template parsing, as mentioned above
  • ImageMap re-parses the parser output for post-processing
  • Wikibase uses Remex for URL post-processing
  • TextExtracts for HTML extracts
  • ZeroBanner for... whatever, it's getting axed soon anyway
ssastry triaged this task as Medium priority.Mar 13 2019, 9:46 PM
ssastry added a project: TechCom.
mobrovac moved this task from Inbox to Watching on the TechCom board.Mar 13 2019, 9:59 PM

We could use this in the PageLayoutFormatter stuff that MCR promises but hasn't yet delivered (see T209878). Right now in WikibaseMediaInfo we're having to do a series of regexs (:-().

daniel added a subscriber: daniel.EditedMar 14 2019, 7:20 AM

We could use this in the PageLayoutFormatter stuff that MCR promises but hasn't yet delivered (see T209878). Right now in WikibaseMediaInfo we're having to do a series of regexs (:-().

What do you think the MCR page layout mechanism should be used for? In my mind, that exists on a different level from DOM processing. It may indeed use DOM processing, but hopefully, it would really just compose things, not munge them. I'm thinking of each slot exposing multiple chunks of HTML that can then be composed with Mustache - but this is getting off topic.

We could use this in the PageLayoutFormatter stuff that MCR promises but hasn't yet delivered (see T209878). Right now in WikibaseMediaInfo we're having to do a series of regexs (:-().

What do you think the MCR page layout mechanism should be used for? In my mind, that exists on a different level from DOM processing. It may indeed use DOM processing, but hopefully, it would really just compose things, not munge them. I'm thinking of each slot exposing multiple chunks of HTML that can then be composed with Mustache - but this is getting off topic.

I was assuming that (some) users would want sub-block DOM processing (insert this div into the main slot's content block; move the span with this key from the main slot to the this slot; etc.), and so would want to use PHP DOM. But yes, the details are off-topic.

In general I would be happy to see an interface that allowed components to pass around DOM subtrees instead of strings, stringifying the tree only where needed for a legacy API. Then "composition" is (eventually) just subtree assembly, and we don't have to worry about poorly-constructed components leaking open tags into the rest of the content...

Tgr added a comment.Mar 14 2019, 11:21 PM

Subbu mentioned on IRC that some projects ran into difficulties using DOMDocument parsing with UTF-8; another thing a replacement could possibly fix.

Yeah, there's a more-or-less standard-but-ugly workaround that involves using mb_encode to replace everything above U+007F with an HTML entity: https://github.com/wikimedia/html-formatter/blob/5e33e3bbb327b3e0d685cc595837ccb024b72f57/src/HtmlFormatter.php#L71