Page MenuHomePhabricator

PHP's getElementsByTagName is slow prior to PHP 8.3
Open, Needs TriagePublic

Description

T317070 and T215000 both dealt with non-linear performance of getElementsByTagName in PHP DOM's implementation.

The issue was reported upstream on May 24, 2023 and fixed in the master branch by Niels Dossche nine days later.

As a temporary workaround while we wait for PHP 8.3 to be released and deployed, Parsoid's DOMCompat::getElementsByTagName may be used instead. For projects that don't have Parsoid as a dependency, an equivalent XPath query may be used as a workaround.

Code search reveals Flow, DonationsInterface, Translate, GWToolset, TextExtracts all use this method (https://codesearch.wmcloud.org/search/?q=getElementsByTagName&i=nope&files=php&excludeFiles=)

Related Objects

StatusSubtypeAssignedTask
OpenNone
ResolvedWangombe
OpenNone
ResolvedKrinkle
ResolvedJdforrester-WMF
ResolvedJdforrester-WMF
ResolvedJdforrester-WMF
ResolvedLucas_Werkmeister_WMDE
ResolvedNone
ResolvedJdforrester-WMF
ResolvedDaimona
ResolvedJdforrester-WMF
DeclinedNone
ResolvedScott_French
ResolvedScott_French
ResolvedScott_French
Resolvedcscott
ResolvedScott_French
DuplicatePRODUCTION ERRORNone
ResolvedPRODUCTION ERRORMichael
ResolvedPRODUCTION ERRORMichael
ResolvedMichael
DuplicatePRODUCTION ERRORNone
ResolvedTgr
ResolvedNone
ResolvedDAlangi_WMF
ResolvedTgr
ResolvedDAlangi_WMF
ResolvedTgr
ResolvedTgr
ResolvedAtieno
OpenNone
Resolvedbrouberol
ResolvedScott_French
ResolvedScott_French
ResolvedScott_French
ResolvedScott_French
ResolvedScott_French
ResolvedScott_French
ResolvedKrinkle
ResolvedKrinkle
ResolvedScott_French
ResolvedKrinkle
ResolvedTgr
ResolvedScott_French

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

The uses in Translate are a maintenance script and Xliff file format handler which is currently unused: https://codesearch.wmcloud.org/search/?q=getElementsByTagName&i=nope&files=php&excludeFiles=&repos=Extension:Translate

What replacements are present in MediaWiki core version 1.37 and above?

Jdforrester-WMF subscribed.

The uses in Translate are a maintenance script and Xliff file format handler which is currently unused: https://codesearch.wmcloud.org/search/?q=getElementsByTagName&i=nope&files=php&excludeFiles=&repos=Extension:Translate

What replacements are present in MediaWiki core version 1.37 and above?

The mentioned Zest::find is present in 1.37 (indeed, since 1.35.0-wmf.20):

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/vendor/+/refs/heads/REL1_37/wikimedia/zest-css/src/Zest.php#41

The easy fix is to replace $node->getElementsByTagName( $tag ) with Zest::getElementsByTagName( $node, $tag ) (which unlike find has identical signature).

Change 831201 had a related patch set uploaded (by Catrope; author: Catrope):

[mediawiki/core@master] VueComponentParser: Use Zest's getElementsByTagName() rather than PHP's

https://gerrit.wikimedia.org/r/831201

Change 831201 merged by jenkins-bot:

[mediawiki/core@master] VueComponentParser: Use Zest's getElementsByTagName() rather than PHP's

https://gerrit.wikimedia.org/r/831201

It looks like it may be a regression from 3084e72ef1053b8ead468e66baf55d5256e7e9af. getElementsByTagName() works by returning a special DOMNodeList which finds the next matching tag at each loop iteration. Before that commit, the search began at the previously found node. After that commit, it uses integer offsets. The search begins at the root node every time, iterating through the whole document until the stored number of elements has been passed.

Niels Dossche fixed it in #11330 and so the issue will be fully rectified once we migrate to PHP 8.3. I will update the task description accordingly.

tstarling renamed this task from Migrate code away from use of PHPDOM's getElementsByTagName which has non-linear perf characteristics to PHP's getElementsByTagName is slow prior to PHP 8.3.Jun 3 2023, 5:19 AM
tstarling updated the task description. (Show Details)

Note that the fix only works when iterating through the results without modifying the DOM. If any DOM modifications are made during iteration, it reverts back to the old n² algorithm.

The fix is to use iterator_to_array() first, or otherwise convert the live collection to a normal array, and iterate over the result of that. (This is the same as how things work in JavaScript in web browsers, where you'll see Array.from() used in the same way.)

The fix is to use iterator_to_array() first. […] This is the same as how things work in JavaScript in web browsers, where you'll see Array.from() used in the same way.

Good catch. I believe these tend to be used via an abstraction layer in JavaScript, though. For us, that's generally jQuery or document.querySelectorAll. Both of these return non-live collections that are thus already "fixed". Use of live collections as returned by getElementsByClassName or getElementsByTagName is fairly rare in "normal" JavaScript code, but may be used by libraries (such as jQuery), which indeed then iterate it before returning to avoid triggering expensive property accessors.

For PHP, I guess Zest counts as our abstraction layer. Although it's a bit more common for php-dom to be used directly, which, given that PHP lacks a native querySelector, means use of getElementsByTagName is very likely in practice.

Change 937477 had a related patch set uploaded (by Reedy; author: Catrope):

[mediawiki/core@REL1_39] VueComponentParser: Use Zest's getElementsByTagName() rather than PHP's

https://gerrit.wikimedia.org/r/937477

Change 937477 merged by jenkins-bot:

[mediawiki/core@REL1_39] VueComponentParser: Use Zest's getElementsByTagName() rather than PHP's

https://gerrit.wikimedia.org/r/937477

This is about an issue in PHP < 8.3, affecting versions PHP 8.1 and PHP 8.2 (of versions we support today, and until recently PHP 7.4).

An upstream fix landed in PHP 8.3. But we also landed a workaround in our own code to use Zest which is faster than the native code in PHP < 8.3, but presumably not as faster as the native code in PHP 8.3+.

Change 831201 merged by jenkins-bot:

[mediawiki/core@master] VueComponentParser: Use Zest's getElementsByTagName() rather than PHP's

https://gerrit.wikimedia.org/r/831201

The above patch landed in 2022 thus resolving the performance issue reported in this task.

I suppose what's left for this ticket, or possibly for a future task, is to undo the above workaround and leverage the native version once again – which we can do either after MediaWiki drops support for PHP 8.2 (T358667: Drop PHP 8.2 support from MediaWiki), or if we want to trade for faster code in WMF prod sooner, after WMF completes the upgrade to PHP 8.3 (T360995: Migrate Wikimedia production from PHP 8.1 to PHP 8.3).

Does that sound right? If so, I suggest we associate one of those as sub task (or resolve and file a new task for the follow-up, and associate that one).