Evaluate and document performance of RemexHtml vs Domino
Closed, ResolvedPublic

Description

In order to get a handle on the performance implications of porting Parsoid to PHP, we need to evaluate the different components of Parsoid.
One of these components is the tree builder that converts a string to an HTML document.
In PHP, we will use RemexHtml in place of Domino, which is used in JavaScript.
Initial experiments in Feb 2018 indicated that the performance is comparable, but those tests were minimal and we didn't document them.

This task is to do some of that evaluation a bit more systematically and document the results here.

Event Timeline

ssastry created this task.Sep 17 2018, 8:09 PM
Restricted Application added a subscriber: Aklapper.Sep 17 2018, 8:09 PM
ssastry triaged this task as Medium priority.Sep 17 2018, 8:09 PM
ssastry raised the priority of this task from Medium to High.Sep 24 2018, 8:23 PM
ssastry added a comment.EditedSep 24 2018, 9:51 PM

Preliminary numbers on my personal laptop.

-------------------------------------------------------------------------------
FILE                         SIZE      Remex    node.js    Rust     Remex-NODOM
                                      PHP-DOM   Domino   html5ever     ----
-------------------------------------------------------------------------------
/tmp/bo.html                2420589     2.51     0.86      1.67        0.45
/tmp/bo.nopb.html           1873853     2.54     0.86      1.38        0.44
/tmp/bo.html2html           3128224     3.29     1.14      3.86        0.44
/tmp/bo.nopb.html2html      2077025     2.71     0.92      2.04        0.43
/tmp/hampi.html              613901     0.39     0.57      0.44        0.13
/tmp/hampi.nopb.html         492231     0.36     0.53      0.46        0.14
/tmp/hampi.html2html         786349     0.59     0.63      0.98        0.14
/tmp/hampi.nopb.html2html    562395     0.47     0.57      0.63        0.14
/tmp/hospet.html              37293     0.02     0.40      0.03        0.03
/tmp/hospet.html2html         48396     0.04     0.39      0.07        0.02
-------------------------------------------------------------------------------

NOTES:
* The HTML file used to test HTML5 treebuilder / DOM library performance
  is generated by Parsoid's parse.js script.

* All times reported are the user component of /usr/bin/time (in seconds).

* html2html = parse the html file and dump output.
  Parsoid's html serialization tries to minimize entity encoding of quotes.
  But during html2html, quotes (in data-parsoid / data-mw) get encoded
  and bloat the HTML size.
  
* nopb = no pagebundle; the original html generated via --pbout arg to
  parse.js. This removes inline data-parsoid attributes from the HTML.

* For Remex, results are generated via tidy(text) and tidyViaDOM(text)
  functions in bin/test.php of the Remex codebase

Observations

  1. Domino / node.js seems to have a baseline overhead for all documents (even on a small 1K file, it takes about 0.35s).
  2. Domino has more predictable performance across document sizes, presumably because of better handling of attributes and entities in attributes.
  3. The last column (Remex-NODOM) indicates that when no DOM construction is involved, Remex's SAX-based approach to HTML serialization is quite performant.
  4. Overall, Remex treebuilder + build DOM + HTML5 serialize performance is good enough, and better than node.js + Domino in many cases.
  5. But we still need to figure out whether the slowdown on larger DOMs with inline data-mw and optional inline data-parsoid is caused by one or more of: (a) the PHP language / runtime, (b) the DOM library, (c) Remex.
ssastry claimed this task.Sep 24 2018, 9:59 PM
cscott added a subscriber: cscott.Sep 25 2018, 3:56 PM

Could you post your test scripts somewhere? To be fair, we should probably factor process startup and file read times out of the measurements (the 350ms overhead you're measuring). It seems like we should dig into the slow Remex performance on large documents more, though, to figure out if there are some O(N^2) tree-mutation algorithms we need to kill, and if so figure out how hard they will be to fix (i.e., are the bugs in Remex or in the PHP DOM extension?).

This was just a dump of the crude preliminary numbers that I had first collected back in Jan to see what stands out.
The next step is to actually do this more systematically for just node.js+domino and php+remex. That will be more revealing.
For that, I'll create a directory in the parsoid directory for benchmarking with test html files and scripts as well.
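
As a rough illustration of the in-process timing discussed above (not the actual benchmark script), the sketch below times a function over several iterations so that interpreter startup and file reads stay outside the measured region. The helper name `timeIterations` and the stand-in workload are hypothetical; a real run would read the HTML file once and call the parser on the in-memory string.

```javascript
// Hypothetical helper (not from the Parsoid or Remex codebases): time `fn`
// in-process so that startup and file I/O are not part of the measurement.
function timeIterations(fn, iterations) {
  const timesMs = [];
  for (let i = 0; i < iterations; i++) {
    const start = process.hrtime.bigint();
    fn();
    const end = process.hrtime.bigint();
    // hrtime.bigint() returns nanoseconds; convert to milliseconds.
    timesMs.push(Number(end - start) / 1e6);
  }
  return timesMs;
}

// Stand-in workload. In a real benchmark the HTML would be read from disk
// once, before timing starts, and the parser would be called here instead.
const html = '<p>' + 'x'.repeat(10000) + '</p>';
const times = timeIterations(() => html.split('x').length, 10);

// Iteration 0 often includes warm-up cost, so it is reported separately.
const [first, ...rest] = times;
console.log('iteration 0:', first.toFixed(3), 'ms');
console.log('later iterations:', rest.map((t) => t.toFixed(3)).join(', '));
```

Reporting iteration 0 on its own keeps one-time warm-up effects from skewing the steady-state numbers.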

ssastry added a comment.EditedDec 21 2018, 11:07 PM

Here are better raw numbers on different pages (and where they stack up wrt Parsoid HTML sizes relative to all wt parses on the Wikimedia cluster): Skating (< p50), Hospet (p50 - p95), Hampi (> p95), Berlin (p99), Barack_Obama (> p99).

Using Skating HTML size as the base unit, the relative HTML sizes are as follows: Skating = 1; Hospet = 3; Hampi = 50; Berlin = 100; Barack_Obama = 200.

For RemexHTML, here are some observations:

  • For the treebuilder, times are scaling a tad sub-linearly with output size (Skating TB time = ~1.7ms; Barack_Obama TB time = ~300ms which is ~175x)
  • For DOM building, times are scaling super-linearly with output size (Skating DOM time = ~0.9ms; Barack_Obama DOM time = ~1740ms which is ~2000x)
  • For HTML serialization, times are scaling a tad super-linearly with output size (Skating Serialization time = ~1.4ms; Barack_Obama Serialization time = ~320ms which is ~230x)
  • So, DOM construction is the main bottleneck in RemexHTML and there is some performance inefficiency that could potentially be squashed.
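
The per-stage figures above can be approximated from the raw bm.php output in this comment. The sketch below assumes that the three bm.php sections are cumulative (TB, TB + DOM, TB + DOM + serialize), so that a stage's cost is roughly the difference of medians between adjacent sections; that subtraction is an assumption of this sketch, not something bm.php reports directly.

```javascript
// Timings in ms, copied from the bm.php runs for Skating and Barack_Obama.
const runs = {
  Skating: {
    tb: [9.948, 1.736, 1.742, 1.713, 1.737, 1.709, 1.737, 1.736, 1.857, 1.745],
    tbDom: [3.161, 2.623, 2.620, 2.622, 2.672, 2.614, 2.632, 2.598, 2.629, 2.668],
    tbDomSer: [7.770, 4.060, 4.089, 4.117, 4.257, 4.058, 4.074, 4.046, 4.075, 4.120],
  },
  Barack_Obama: {
    tb: [304.975, 302.936, 296.979, 296.704, 297.346, 297.564, 300.270, 296.230, 297.780, 297.415],
    tbDom: [2032.951, 2023.776, 2031.033, 2025.496, 2024.857, 2032.032, 2082.453, 2026.822, 2038.027, 2040.230],
    tbDomSer: [2352.874, 2362.179, 2458.168, 2418.236, 2407.460, 2357.022, 2358.133, 2346.436, 2360.397, 2367.187],
  },
};

// Median is less sensitive than the mean to the slow first iteration.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Assumption: sections are cumulative, so stage cost = difference of medians.
function stageBreakdown({ tb, tbDom, tbDomSer }) {
  const t = median(tb);
  const d = median(tbDom);
  const s = median(tbDomSer);
  return { tb: t, dom: d - t, serialize: s - d };
}

const small = stageBreakdown(runs.Skating);
const large = stageBreakdown(runs.Barack_Obama);
const ratio = (stage) => large[stage] / small[stage];

for (const stage of Object.keys(small)) {
  console.log(
    stage.padEnd(10),
    small[stage].toFixed(2).padStart(9), 'ms ->',
    large[stage].toFixed(2).padStart(9), 'ms',
    '(' + Math.round(ratio(stage)) + 'x)'
  );
}
```

Comparing each ratio against the ~200x difference in HTML size shows which stage scales worse than linearly.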

For Domino

  • Both TB + DOM as well as Serialization time are scaling sub-linearly wrt HTML size and are probably also benefiting from v8 JIT optimizations.

So, the TL;DR is that up to about p95, RemexHTML performs about as well as Domino. Beyond that, RemexHTML performance degrades as the HTML size goes up, primarily in DOM construction, which could potentially be tackled separately.

I'll publish the scripts and commands I used for this test in January.

node ./bm.domino.js /tmp/Skating.html
Iteration 0:  Parse (TB + DOM)   :   28.490;  Outer HTML  :   6.620;  XML serialize :   5.386
Iteration 1:  Parse (TB + DOM)   :    8.471;  Outer HTML  :   1.417;  XML serialize :   1.845
Iteration 2:  Parse (TB + DOM)   :    8.354;  Outer HTML  :   1.034;  XML serialize :   1.679
Iteration 3:  Parse (TB + DOM)   :    5.330;  Outer HTML  :   1.059;  XML serialize :   3.802
Iteration 4:  Parse (TB + DOM)   :    4.714;  Outer HTML  :   2.029;  XML serialize :   2.772
Iteration 5:  Parse (TB + DOM)   :    3.455;  Outer HTML  :   0.899;  XML serialize :   1.550
Iteration 6:  Parse (TB + DOM)   :    4.033;  Outer HTML  :   0.919;  XML serialize :   1.512
Iteration 7:  Parse (TB + DOM)   :    5.865;  Outer HTML  :   0.883;  XML serialize :   1.325
Iteration 8:  Parse (TB + DOM)   :    4.724;  Outer HTML  :   0.724;  XML serialize :   1.154
Iteration 9:  Parse (TB + DOM)   :    2.912;  Outer HTML  :   0.661;  XML serialize :   1.100
node ./bm.domino.js /tmp/Hospet.html
Iteration 0:  Parse (TB + DOM)   :   45.057;  Outer HTML  :   7.560;  XML serialize :   7.628
Iteration 1:  Parse (TB + DOM)   :   23.617;  Outer HTML  :   2.952;  XML serialize :   4.786
Iteration 2:  Parse (TB + DOM)   :   14.486;  Outer HTML  :   6.568;  XML serialize :   5.680
Iteration 3:  Parse (TB + DOM)   :    9.025;  Outer HTML  :   2.931;  XML serialize :   4.047
Iteration 4:  Parse (TB + DOM)   :    8.325;  Outer HTML  :   1.987;  XML serialize :   3.252
Iteration 5:  Parse (TB + DOM)   :   11.106;  Outer HTML  :   1.852;  XML serialize :   2.968
Iteration 6:  Parse (TB + DOM)   :    7.442;  Outer HTML  :   2.146;  XML serialize :   3.258
Iteration 7:  Parse (TB + DOM)   :    7.562;  Outer HTML  :   4.942;  XML serialize :   2.901
Iteration 8:  Parse (TB + DOM)   :   12.990;  Outer HTML  :   2.495;  XML serialize :   3.442
Iteration 9:  Parse (TB + DOM)   :   10.012;  Outer HTML  :   1.857;  XML serialize :   6.806
node ./bm.domino.js /tmp/Hampi.html
Iteration 0:  Parse (TB + DOM)   :  144.423;  Outer HTML  :  43.292;  XML serialize :  70.509
Iteration 1:  Parse (TB + DOM)   :  112.840;  Outer HTML  :  31.632;  XML serialize :  52.093
Iteration 2:  Parse (TB + DOM)   :   92.748;  Outer HTML  :  41.794;  XML serialize :  46.466
Iteration 3:  Parse (TB + DOM)   :   70.309;  Outer HTML  :  41.553;  XML serialize :  42.913
Iteration 4:  Parse (TB + DOM)   :   73.342;  Outer HTML  :  26.583;  XML serialize :  47.678
Iteration 5:  Parse (TB + DOM)   :   63.704;  Outer HTML  :  40.255;  XML serialize :  48.051
Iteration 6:  Parse (TB + DOM)   :   99.134;  Outer HTML  :  27.847;  XML serialize :  40.734
Iteration 7:  Parse (TB + DOM)   :   64.083;  Outer HTML  :  40.394;  XML serialize :  44.302
Iteration 8:  Parse (TB + DOM)   :   65.859;  Outer HTML  :  26.992;  XML serialize :  45.843
Iteration 9:  Parse (TB + DOM)   :   63.347;  Outer HTML  :  40.355;  XML serialize :  45.641
node ./bm.domino.js /tmp/Berlin.html
Iteration 0:  Parse (TB + DOM)   :  254.249;  Outer HTML  :  84.403;  XML serialize : 102.310
Iteration 1:  Parse (TB + DOM)   :  266.656;  Outer HTML  :  89.249;  XML serialize : 105.749
Iteration 2:  Parse (TB + DOM)   :  195.958;  Outer HTML  :  66.095;  XML serialize :  89.874
Iteration 3:  Parse (TB + DOM)   :  165.916;  Outer HTML  :  65.329;  XML serialize :  91.534
Iteration 4:  Parse (TB + DOM)   :  163.639;  Outer HTML  :  65.603;  XML serialize :  87.553
Iteration 5:  Parse (TB + DOM)   :  164.836;  Outer HTML  :  62.566;  XML serialize :  85.451
Iteration 6:  Parse (TB + DOM)   :  171.222;  Outer HTML  :  67.242;  XML serialize :  90.883
Iteration 7:  Parse (TB + DOM)   :  163.140;  Outer HTML  :  58.912;  XML serialize : 122.590
Iteration 8:  Parse (TB + DOM)   :  184.547;  Outer HTML  :  62.085;  XML serialize :  88.499
Iteration 9:  Parse (TB + DOM)   :  157.829;  Outer HTML  :  63.461;  XML serialize :  90.793
node ./bm.domino.js /tmp/Barack_Obama.html
Iteration 0:  Parse (TB + DOM)   :  336.234;  Outer HTML  : 134.179;  XML serialize : 208.348
Iteration 1:  Parse (TB + DOM)   :  264.241;  Outer HTML  : 120.866;  XML serialize : 166.174
Iteration 2:  Parse (TB + DOM)   :  232.451;  Outer HTML  : 111.567;  XML serialize : 158.346
Iteration 3:  Parse (TB + DOM)   :  227.688;  Outer HTML  : 136.758;  XML serialize : 179.521
Iteration 4:  Parse (TB + DOM)   :  250.381;  Outer HTML  : 115.336;  XML serialize : 153.879
Iteration 5:  Parse (TB + DOM)   :  230.929;  Outer HTML  : 112.513;  XML serialize : 161.325
Iteration 6:  Parse (TB + DOM)   :  228.554;  Outer HTML  : 148.179;  XML serialize : 163.246
Iteration 7:  Parse (TB + DOM)   :  256.993;  Outer HTML  : 104.370;  XML serialize : 154.956
Iteration 8:  Parse (TB + DOM)   :  235.402;  Outer HTML  : 116.774;  XML serialize : 200.426
Iteration 9:  Parse (TB + DOM)   :  238.816;  Outer HTML  : 121.847;  XML serialize : 164.083
----------------------------------------------------------------------------
php ../software/RemexHtml/bin/bm.php /tmp/Skating.html
---- TB ----
Iteration 0:    9.948
Iteration 1:    1.736
Iteration 2:    1.742
Iteration 3:    1.713
Iteration 4:    1.737
Iteration 5:    1.709
Iteration 6:    1.737
Iteration 7:    1.736
Iteration 8:    1.857
Iteration 9:    1.745
---- TB + DOM ----
Iteration 0:    3.161
Iteration 1:    2.623
Iteration 2:    2.620
Iteration 3:    2.622
Iteration 4:    2.672
Iteration 5:    2.614
Iteration 6:    2.632
Iteration 7:    2.598
Iteration 8:    2.629
Iteration 9:    2.668
---- TB + DOM + serialize ----
Iteration 0:    7.770
Iteration 1:    4.060
Iteration 2:    4.089
Iteration 3:    4.117
Iteration 4:    4.257
Iteration 5:    4.058
Iteration 6:    4.074
Iteration 7:    4.046
Iteration 8:    4.075
Iteration 9:    4.120
php ../software/RemexHtml/bin/bm.php /tmp/Hospet.html
---- Tree builder ----
Iteration 0:   13.288
Iteration 1:    5.149
Iteration 2:    5.173
Iteration 3:    5.117
Iteration 4:    5.223
Iteration 5:    5.110
Iteration 6:    5.169
Iteration 7:    5.166
Iteration 8:    5.115
Iteration 9:    5.177
---- TB + DOM ----
Iteration 0:    9.289
Iteration 1:    8.682
Iteration 2:    8.736
Iteration 3:    8.773
Iteration 4:    8.786
Iteration 5:    8.886
Iteration 6:    8.770
Iteration 7:    9.865
Iteration 8:    9.454
Iteration 9:    8.996
---- TB + DOM + serialize ----
Iteration 0:   17.479
Iteration 1:   13.861
Iteration 2:   13.927
Iteration 3:   13.909
Iteration 4:   13.966
Iteration 5:   13.887
Iteration 6:   13.790
Iteration 7:   14.157
Iteration 8:   13.784
Iteration 9:   14.104
php ../software/RemexHtml/bin/bm.php /tmp/Hampi.html
---- Tree builder ----
Iteration 0:   94.538
Iteration 1:   85.690
Iteration 2:   85.734
Iteration 3:   88.332
Iteration 4:   85.093
Iteration 5:   85.343
Iteration 6:   87.033
Iteration 7:   86.536
Iteration 8:   85.986
Iteration 9:   85.376
---- TB + DOM ----
Iteration 0:  273.082
Iteration 1:  275.047
Iteration 2:  270.606
Iteration 3:  269.747
Iteration 4:  269.226
Iteration 5:  275.236
Iteration 6:  270.725
Iteration 7:  267.984
Iteration 8:  270.838
Iteration 9:  270.996
---- TB + DOM + serialize ----
Iteration 0:  354.629
Iteration 1:  353.232
Iteration 2:  356.571
Iteration 3:  356.898
Iteration 4:  353.466
Iteration 5:  354.028
Iteration 6:  359.573
Iteration 7:  354.453
Iteration 8:  352.551
Iteration 9:  356.280
php ../software/RemexHtml/bin/bm.php /tmp/Berlin.html
---- Tree builder ----
Iteration 0:  173.922
Iteration 1:  165.554
Iteration 2:  167.687
Iteration 3:  166.898
Iteration 4:  164.520
Iteration 5:  166.564
Iteration 6:  165.935
Iteration 7:  166.094
Iteration 8:  167.654
Iteration 9:  168.171
---- TB + DOM ----
Iteration 0:  786.813
Iteration 1:  780.408
Iteration 2:  780.130
Iteration 3:  780.570
Iteration 4:  790.163
Iteration 5:  785.084
Iteration 6:  782.064
Iteration 7:  784.035
Iteration 8:  778.896
Iteration 9:  788.631
---- TB + DOM + serialize ----
Iteration 0:  941.465
Iteration 1:  945.864
Iteration 2:  947.082
Iteration 3:  947.132
Iteration 4:  945.645
Iteration 5:  950.995
Iteration 6:  957.442
Iteration 7:  947.041
Iteration 8:  942.517
Iteration 9:  944.852
php ../software/RemexHtml/bin/bm.php /tmp/Barack_Obama.html
---- Tree builder ----
Iteration 0:  304.975
Iteration 1:  302.936
Iteration 2:  296.979
Iteration 3:  296.704
Iteration 4:  297.346
Iteration 5:  297.564
Iteration 6:  300.270
Iteration 7:  296.230
Iteration 8:  297.780
Iteration 9:  297.415
---- TB + DOM ----
Iteration 0: 2032.951
Iteration 1: 2023.776
Iteration 2: 2031.033
Iteration 3: 2025.496
Iteration 4: 2024.857
Iteration 5: 2032.032
Iteration 6: 2082.453
Iteration 7: 2026.822
Iteration 8: 2038.027
Iteration 9: 2040.230
---- TB + DOM + serialize ----
Iteration 0: 2352.874
Iteration 1: 2362.179
Iteration 2: 2458.168
Iteration 3: 2418.236
Iteration 4: 2407.460
Iteration 5: 2357.022
Iteration 6: 2358.133
Iteration 7: 2346.436
Iteration 8: 2360.397
Iteration 9: 2367.187
ssastry closed this task as Resolved.Dec 21 2018, 11:44 PM
cscott added a comment.EditedMar 14 2019, 4:18 AM

Could you publish your bm.php script? I'm seeing much worse performance than you for Remex on (in theory) the same benchmark: DOM building on the Parsoid output for [[en:Barack Obama]].

My test file is test.php, attached, run from a checkout of zest.php (so we can use Zest's "loadHTML" helper). In addition I need to hack ZestTest::parseHtml to pass through the 'suppressHtmlNamespace' option to Remex's DOMBuilder (this only works in git master of remex, not in a released version). Patch for that is attached, as well as my obama.html file.

I'm running on PHP 7.3.2-3 from debian/testing.



You can also reproduce my numbers with psysh:

cananian@skiffserv:~/Projects/Wikimedia/zest.php$ psysh
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2}
>>> require('./tests/ZestTest.php');
=> 1
>>> $html100 = file_get_contents('./obama.html'); strlen($html100);
=> 2592386
>>> timeit -n10 \Wikimedia\Zest\Tests\ZestTest::parseHTML($html100, [ 'suppressHtmlNamespace' => true, 'ignoreErrors' => true ]) && true;
=> true
Command took 5.166580 seconds on average (4.951046 median; 51.665798 total) to complete.
>>>

Ah -- *and I have xdebug enabled* (naturally, because I'm trying to profile the runs). Of course:

$ sudo phpenmod -s cli xdebug
$ time php test.php

real	0m5.245s
user	0m3.396s
sys	0m1.844s
$ sudo phpdismod -s cli xdebug
$ time php test.php

real	0m0.545s
user	0m0.480s
sys	0m0.068s
$ psysh 
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2}
>>> require('./tests/ZestTest.php');
=> 1
>>> $html100 = file_get_contents('./obama.html'); strlen($html100);
=> 2592386
>>> timeit -n10 \Wikimedia\Zest\Tests\ZestTest::parseHTML($html100, [ 'suppressHtmlNamespace' => true, 'ignoreErrors' => true ]) && true;
=> true
Command took 0.452644 seconds on average (0.429866 median; 4.526438 total) to complete.
>>>

Sigh. Hopefully it was a linear slowdown? I'll have to re-run my benchmarks with xdebug disabled and verify that nothing I did was actually a pessimization (I don't *think* so), but I'm going to have to switch profilers before digging into this further.

Testing with the following script (from the zest library home dir) with xdebug off:

$ psysh 
Psy Shell v0.9.9 (PHP 7.3.2-3 — cli) by Justin Hileman
>>> require 'vendor/autoload.php';
=> Composer\Autoload\ClassLoader {#2}
>>> require('./tests/ZestTest.php');
=> 1
>>> $html100 = file_get_contents('./obama.html'); strlen($html100);
=> 2592386
>>> timeit -n10 \Wikimedia\Zest\Tests\ZestTest::parseHTML($html100, [ 'suppressHtmlNamespace' => true, 'ignoreErrors' => true ]) && true;
=> true
Command took 0.389295 seconds on average (0.376101 median; 3.892954 total) to complete.

Median times in seconds on [[en:Barack Obama]] before/after each patch in recent remex history:

2.981688
pick 59806f7 Remove debugging code from TreeBuilder::adoptionAgency
2.894746
pick ea60ca7 Provide an option to suppress namespace for HTML elements
0.678349
pick 1755373 Optimize parsing of "simple tags" (no attributes, no funny characters)
0.660913
pick 4222694 Optimize handling of "simple" attribute name/values
0.580422
pick 213e60b Efficiently handle well-behaved character references
0.476107
pick 07e0276 Study the preprocessing regexp
0.476089
pick aada21e Free attribute data after it has been parsed
0.470122
pick 85eafbd Fix PlainAttributes#offsetGet / PlainAttributes#getIterator
0.470078
pick 3228312 Optimize handling of "simple" data regions
0.425061
pick 02177d3 DRY out the character reference/null/error handling
0.421703

The suppressHtmlNamespace option seems to fix the biggest bottleneck; that's the PHP performance issue in dom_reconcile_ns. But the other patches combined for another ~40% improvement, from 0.68s to 0.42s.
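
To make the per-patch effect easier to read, here is a small sketch that turns the list of medians into relative reductions. It assumes the first number is the baseline and that each number after a `pick` line is the median measured with that patch applied; the variable names are illustrative only.

```javascript
// Median seconds on [[en:Barack Obama]], copied from the list above.
const baseline = 2.981688;
const series = [
  ['59806f7', 2.894746],
  ['ea60ca7', 0.678349],
  ['1755373', 0.660913],
  ['4222694', 0.580422],
  ['213e60b', 0.476107],
  ['07e0276', 0.476089],
  ['aada21e', 0.470122],
  ['85eafbd', 0.470078],
  ['3228312', 0.425061],
  ['02177d3', 0.421703],
];

// Relative reduction contributed by each patch, against the previous median.
let previous = baseline;
const steps = series.map(([commit, after]) => {
  const step = { commit, before: previous, after, reductionPct: (1 - after / previous) * 100 };
  previous = after;
  return step;
});

// The patch with the largest single reduction.
const biggest = steps.reduce((a, b) => (b.reductionPct > a.reductionPct ? b : a));

// Combined reduction from everything after the suppress-namespace patch.
const afterNamespaceFix = series.find(([commit]) => commit === 'ea60ca7')[1];
const finalMedian = series[series.length - 1][1];
const tailPct = (1 - finalMedian / afterNamespaceFix) * 100;

for (const s of steps) {
  console.log(s.commit, s.before.toFixed(3), '->', s.after.toFixed(3), s.reductionPct.toFixed(1) + '%');
}
console.log('largest single reduction:', biggest.commit);
console.log('reduction after ea60ca7:', tailPct.toFixed(1) + '%');
```

Under these assumptions the largest single step is the suppress-namespace patch, and the remaining patches together account for a bit under 40%.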

Comparing latest Remex to domino, as benchmarked in T212543#5012929 (divide the raw times reported by the 50 iterations):

$ node -e "d=require('./');f=require('fs');h=f.readFileSync('./obama100.html','utf-8');console.time();for(var i=0;i<50;i++) d.createDocument(h);console.timeEnd();"

Results:

Prefix Bytes    Domino    Remex     Slowdown
  1%     25923    5.495ms   7.307ms  33%
 10%    259240   26.960ms  55.924ms 107%
100%   2592386  198.842ms 431.375ms 117%

The nonlinear slowdown seems to be gone, based on the 10%/100% numbers. The 1% prefix document is mostly infobox, which might indicate that the 2x slowdown is in how the bulk text in the body of the article is processed.
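
For reference, the Slowdown column in the table above is consistent with (Remex / Domino - 1) expressed as a percentage. A minimal sketch, using only the per-iteration times from the table:

```javascript
// Per-iteration times in ms, copied from the results table.
const rows = [
  { prefix: '1%', bytes: 25923, domino: 5.495, remex: 7.307 },
  { prefix: '10%', bytes: 259240, domino: 26.960, remex: 55.924 },
  { prefix: '100%', bytes: 2592386, domino: 198.842, remex: 431.375 },
];

// Slowdown of Remex relative to Domino, as a percentage.
function slowdownPct(domino, remex) {
  return (remex / domino - 1) * 100;
}

const slowdowns = rows.map((r) => Math.round(slowdownPct(r.domino, r.remex)));

rows.forEach((r, i) => {
  console.log(r.prefix.padStart(4), String(r.bytes).padStart(8), slowdowns[i] + '%');
});
```

The 10% and 100% rows land close together (107% and 117%), which is what suggests the remaining gap is roughly a constant factor rather than a nonlinear one.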

Looks like our numbers are now more in sync after you disabled xdebug. But, here is the script I used.