
RemexHtml DOM construction performance increases non-linearly wrt HTML size
Open, Medium, Public


See T204595#4841459 and the analysis in that comment.

There is some inefficiency in RemexHTML's DOM construction code that could be investigated and fixed.

Event Timeline

ssastry triaged this task as Medium priority. Dec 21 2018, 11:15 PM

DOMBuilder.php uses $parent->insertBefore( $node, $refNode ) which appears to call xmlAddPrevSibling in libxml which ought to be constant time (nodes are stored as a linked list). The quadratic performance makes me think that something in here is operating on the child node *array* instead of a linked list, but I can't find it (yet).

The only suspicious thing in the code is the fact that text node normalization occurs during TreeBuilder. It's possible that, if we scan through all the child nodes looking for text nodes every time we do a text insert, we could hit some bad cases. No smoking gun though.
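The suspected failure mode can be sketched in miniature (this is an illustrative model, not RemexHtml's actual code): if every text-node insert re-scans the parent's existing children looking for a mergeable text sibling, then building a parent with n text children costs O(n^2) scan steps in total, even though each individual libxml insert (xmlAddPrevSibling on a linked list) is O(1).

```python
def insert_with_scan(children, text):
    """Append a text node after scanning all existing children for a
    mergeable text sibling -- O(len(children)) work per insert."""
    scanned = len([c for c in children])  # the suspected per-insert scan
    children.append(("#text", text))
    return scanned

def total_scans(n):
    """Total scan steps to build a parent with n text children."""
    children = []
    return sum(insert_with_scan(children, "x") for _ in range(n))

# 10x more nodes -> ~100x more scan work: the non-linear signature.
print(total_scans(100), total_scans(1000))   # 4950 499500
```

If this is what's happening, the fix is the usual one: merge only with the immediately preceding sibling (an O(1) check) rather than scanning the whole child list.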

Profiling [[en:Barack Obama]] with xdebug seems to confirm that DOMElement->insertBefore() is the slow call; I just need to figure out how to get a C-level profile for a little more insight into which code path through it is the slow one.

Part of the slowness appears to be the call to dom_reconcile_ns in node.c -- callgrind seems to confirm that the xmlSearchNsByHref call is slow. Hacking around this by changing createElementNS to createElement in DOMBuilder::createNode (which is, incidentally, the short-term workaround requested in T217708: Remex should offer an option to not set namespaceURI) makes DOMElement->insertBefore() disappear from the hot-functions list, and seems to reduce the time spent parsing [[en:Barack Obama]] from 6.8s to 4.8s on my machine.

Testing using psysh (from the zest.php directory, so we can use a helper to load HTML via Remex):

$ psysh 
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2}
>>> require('./tests/ZestTest.php');
=> 1
>>> $html100 = file_get_contents('./obama.html'); strlen($html100);
=> 2592386
>>> $html10 = file_get_contents('./obama10.html'); strlen($html10);
=> 259240
>>> $html1 = file_get_contents('./obama1.html'); strlen($html1);
=> 25923

Before the change:

>>> timeit -n5 \Wikimedia\Zest\Tests\ZestTest::parseHTML($html1) && true;
=> true
Command took 0.123703 seconds on average (0.113378 median; 0.618514 total) to complete.
>>> timeit -n5 \Wikimedia\Zest\Tests\ZestTest::parseHTML($html10) && true;
=> true
Command took 0.815729 seconds on average (0.812145 median; 4.078647 total) to complete.
>>> timeit -n5 \Wikimedia\Zest\Tests\ZestTest::parseHTML($html100) && true;
=> true
Command took 9.628351 seconds on average (9.455430 median; 48.141756 total) to complete.

After hacking it to use createElement rather than createElementNS:

>>> timeit -n5 \Wikimedia\Zest\Tests\ZestTest::parseHTML($html1) && true;
=> true
Command took 0.125495 seconds on average (0.115013 median; 0.627477 total) to complete.
>>> timeit -n5 \Wikimedia\Zest\Tests\ZestTest::parseHTML($html10) && true;
=> true
Command took 0.811814 seconds on average (0.791613 median; 4.059071 total) to complete.
>>> timeit -n5 \Wikimedia\Zest\Tests\ZestTest::parseHTML($html100) && true;
=> true
Command took 7.646966 seconds on average (7.478886 median; 38.234830 total) to complete.
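The "before" numbers above make the non-linearity easy to quantify. For a linear parser, a 10x larger input should take ~10x longer; a quick calculation over the average times from the psysh runs (sizes and times copied from above) shows where that breaks down:

```python
# Average parse times (seconds) keyed by input size (bytes), from the
# "before the change" psysh runs above.
before = {25923: 0.1237, 259240: 0.8157, 2592386: 9.6284}

sizes = sorted(before)
for small, big in zip(sizes, sizes[1:]):
    size_ratio = big / small
    time_ratio = before[big] / before[small]
    print(f"{size_ratio:.1f}x input -> {time_ratio:.1f}x time")
# 1% -> 10%:  10.0x input -> 6.6x time  (sublinear; head/infobox-heavy prefix)
# 10% -> 100%: 10.0x input -> 11.8x time (superlinear)
```

So the superlinear behavior only shows up clearly between the 10% and 100% samples, which is consistent with a low-order polynomial term that is dominated by linear costs on small inputs.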

For comparison, a test using domino is 25x faster. Using:

$ node -e "d=require('./');f=require('fs');h=f.readFileSync('./obama.html','utf-8');console.time();d.createDocument(h);console.timeEnd();"

We get:

default: 25.561ms   1%
default: 92.491ms   10%
default: 306.039ms 100%

The next hottest spot in the profile seems to be the attribute parsing code in Remex. Note that Remex uses the LazyAttributes class to avoid fully parsing the attributes in cases where we are just filtering the input. For this reason, @ssastry's original analysis in {T204595#4841459} was probably undercounting the cost of the tokenization phase and overcounting the cost of the DOMBuilder phase.
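The lazy-attribute idea can be sketched as follows (a minimal illustration; the class and method names are ours, not RemexHtml's actual API): hold on to the raw attribute text and only parse/decode it the first time somebody actually asks for an attribute, so a pure pass-through filter never pays the parsing cost.

```python
import html
import re

class LazyAttributes:
    """Parse attributes on first access, then cache the result."""

    def __init__(self, raw):
        self.raw = raw            # e.g. 'href="a&amp;b" class="x"'
        self._parsed = None
        self.parse_count = 0      # instrumentation for the demo only

    def _init(self):
        if self._parsed is None:  # parse at most once, on demand
            self.parse_count += 1
            self._parsed = {
                k: html.unescape(v)
                for k, v in re.findall(r'(\w+)="([^"]*)"', self.raw)
            }
        return self._parsed

    def get(self, name):
        return self._init().get(name)

attrs = LazyAttributes('href="a&amp;b" class="x"')
assert attrs.parse_count == 0      # a filter that ignores attrs pays nothing
assert attrs.get("href") == "a&b"  # first access triggers the parse
attrs.get("class")
assert attrs.parse_count == 1      # cached; no re-parse on later access
```

This is also why profiling is tricky here: the parse cost is attributed to whichever phase first touches the attributes, not to tokenization.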

(Note also that the timings skew sublinear here because the prefix of the document includes the <head> and infoboxes, so it isn't necessarily representative of the document body.)

Change 495461 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/libs/RemexHtml@master] Remove debugging code from TreeBuilder::adoptionAgency

Found another slowdown:

<cscott> huh, we're also spending a lot of time in CachingStack->dump() -- that's not right, that's pure debug code, we should never be in there (!)
<cscott> oh no, $this->trace( "AFE\n" . $afe->dump() . "STACK\n" . $stack->dump() );
<cscott>         private function trace( $msg ) {
<cscott>                 // print "[AAA] $msg\n";
<cscott>         }
<cscott> sigh
<cscott> removing the debugging code from TreeBuilder::adoptionAgency brings us down to 6.16s.  getting there! one more second shaved, five more to go.
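The trap in the chat above generalizes: commenting out the *body* of trace() doesn't help, because (in PHP as in most languages) the argument expression is evaluated eagerly, so the expensive dump() calls still run on every invocation. A small Python illustration of the same pattern:

```python
import timeit

class Stack:
    def __init__(self, items):
        self.items = items

    def dump(self):
        # Expensive debug-only pretty-printer.
        return "\n".join(repr(i) for i in self.items)

def trace(msg):
    pass  # body commented out -- but the argument was already built!

stack = Stack(list(range(2000)))

def with_dead_trace():
    trace("STACK\n" + stack.dump())  # dump() still pays its full cost

def without_trace():
    pass  # the actual fix: remove the call site (or guard it)

t_on = timeit.timeit(with_dead_trace, number=200)
t_off = timeit.timeit(without_trace, number=200)
print(f"dead trace() call overhead: {t_on:.3f}s vs {t_off:.6f}s")
```

The robust fixes are to delete the call sites entirely (as the patch above does) or to guard them behind a cheap flag check before building the debug string.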

After this change, Remex is:

  1%: Remex  101ms Domino  26ms ( 3.8x slowdown)
 10%: Remex  696ms Domino  92ms ( 7.5x slowdown)
100%: Remex 6435ms Domino 306ms (21x slowdown)

The relative number of calls appears to scale linearly between the 10% and 100% benchmarks:

Screenshot from 2019-03-09 12-41-22.png (612×1 px, 142 KB)

xdebug says that Tokenizer::handleAttribsAndClose accounts for 15.9 of the 20.6 total time units parsing [[en:Barack Obama]] now (after tweaking Remex to immediately initialize the LazyAttributes so the cost is accounted properly). That 15.9 is split about evenly between Tokenizer::consumeAttribs (6.6 units, most of this in Tokenizer::interpretAttribMatches via LazyAttributes::init()) and TreeBuilder\Dispatcher::startTag (6.5 units, most of this in DOMBuilder::insertElement).

So I think there are probably two separate problems here left to solve. Something's still slow in DOMBuilder::insertElement (the native insertBefore still accounts for 2.7 of this, and 1.1 is in DOMBuilder::createNode itself), and something's slow in interpretAttribMatches (1.3/0.9/0.3 is in handleCharRefs/handleNulls/handleAsciiErrors, but 2.7 is inside interpretAttribMatches itself).

It looks like Remex is also missing the SIMPLETAG optimization added in domino, which would avoid calling interpretAttribMatches entirely; it's possible that adding the SIMPLETAG/SIMPLEATTR/emitCharsWhile/getMatchingChars optimizations from domino could help, especially if they can avoid calls to interpretAttribMatches. (Some of these might already be present.)
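The core of domino's SIMPLEATTR-style optimization can be sketched like this (an illustrative Python rendering of the idea, not domino's or Remex's actual code): if an attribute value contains none of the characters that need special handling, it can be used verbatim, skipping the decoder entirely.

```python
import html

# Characters whose presence forces the slow decode path: character
# references, NULs (replaced per the HTML spec), and CRs (normalized).
SPECIAL = ("&", "\x00", "\r")

def decode_attr(value):
    if not any(ch in value for ch in SPECIAL):
        return value                # fast path: no decoding, no copies
    # Slow path sketch: NUL -> U+FFFD, then character-reference decode.
    return html.unescape(value.replace("\x00", "\ufffd"))

assert decode_attr("plain-value") == "plain-value"   # fast path
assert decode_attr("a&amp;b") == "a&b"               # slow path
```

Since the vast majority of real-world attribute values are "simple" in this sense, the fast path lets the tokenizer avoid calls into interpretAttribMatches almost all the time.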

Change 495461 merged by jenkins-bot:
[mediawiki/libs/RemexHtml@master] Remove debugging code from TreeBuilder::adoptionAgency

Some notes copied from chat:

The preg_match in attribute parsing allocates 200M (!) for a 2M input HTML file.
Node#insertBefore accounts for about 50% of the total time as well, even though I've combed through the C implementation and haven't found anything particularly evil
Basically the performance issues haven't been one big bug but a bunch of things that seem to contribute about 2s each to the (originally) 9s running time of [[en:Barack_Obama]].
And so far I've found and fixed three or so issues (debugging code in TreeBuilder, HTML namespace reconciliation, simple tag/attribute optimization) and think there are two more left
I worked on character entity decoding yesterday and got it to the point where it's effectively free (0.1s difference when you disable decoding, on a ~4s run time). Character decoding was the most expensive callee of the attribute parsing code, but now the 2s in attribute decoding that's left is all spent in "self"; none of the callees contribute significantly.
2s in attribute decoding, 2s in Node#insertBefore, 4.<something> total run time
That's about half in tokenizer and half in dom building; I think the 2s attribute decoding cost should show up in remex-as-tidy costs, too, although the lazy attribute mechanism might disguise its source.

Preg performance seems to degrade as the size of the matched string increases. Copying the result string into the match array seems to be very slow, for reasons I don't understand. A solution might be to break the regexp in half, so that once we see the initial quote on the attribute we stop the match and use strpos to find the close quote. That's roughly how tag parsing works (we scan past the bits before the first <; we don't include them in the match), but character entity decoding currently includes the prefix before the & in the match; I've got a patch to change that to an initial strpos scan, which might help with bulk character entity decoding in Text nodes. (strpos is still 2-3x faster than preg_match for simple "find a character or substring" tasks.)
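The strpos-first structure can be sketched in Python (str.find standing in for strpos, and a deliberately simplified entity regex; this is an illustration of the technique, not the actual Remex patch): scan to the next & with a cheap substring search, and only run the regex against the short candidate entity, so the potentially huge prefix never enters a match array.

```python
import html
import re

# Simplified entity pattern: named or numeric reference ending in ';'.
ENTITY = re.compile(r"&[#\w]+;")

def decode_entities(text):
    out, pos = [], 0
    while True:
        amp = text.find("&", pos)       # the strpos-style scan
        if amp < 0:
            out.append(text[pos:])      # no more entities; done
            return "".join(out)
        out.append(text[pos:amp])       # prefix copied once, by us
        m = ENTITY.match(text, amp)     # regex sees only the entity
        if m:
            out.append(html.unescape(m.group()))
            pos = m.end()
        else:
            out.append("&")             # bare ampersand, emit literally
            pos = amp + 1

assert decode_entities("a &amp; b &lt; c") == "a & b < c"
```

The key property is that the regex engine never matches (or copies) the text before the &, which is where the bulk of a large text node's bytes live.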

@subbu mentioned that @Tgr filed T218183: Audit uses of PHP DOM in Wikimedia software as a place to document some of the DOM performance (and correctness) issues.

A performance update is at T204595#5024206

We seem to be allocating a lot of memory during parsing, probably because preg_match always allocates and returns a copy of the string for the entire matched region as well as for all captured patterns. For attribute parsing in particular, these copies can grow quite large. I filed this upstream as a wishlist request for improving performance in the future by using offsets instead of copying the substrings.