Page MenuHomePhabricator

Timeout parsing largish page
Open, MediumPublic

Event Timeline

ssastry triaged this task as Medium priority.Oct 2 2018, 9:14 PM

For https://en.wikipedia.org/wiki/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States, there is a lot of identical template calls. Parsoid should probably cache output of templates in case they are reused on the page. We should check how the PHP parser treats this problem. For example, if there are multiple occurrences of {{#time:U|now}} on a page, will they all report the same time? Or, will this only happen if that is wrapped in a template?

For https://en.wikipedia.org/wiki/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States, there is a lot of identical template calls. Parsoid should probably cache output of templates in case they are reused on the page. We should check how the PHP parser treats this problem. For example, if there are multiple occurrences of {{#time:U|now}} on a page, will they all report the same time? Or, will this only happen if that is wrapped in a template?

We'll revisit this after a port to PHP. We might be able to reuse the existing caching code instead of redoing it in parsoid.

ssastry changed the task status from Open to Stalled.Oct 16 2018, 5:19 PM

Wrong diagnosis. The clerks page above fails because of hitting our existing resource limits.

If I bump the limits as follows, the page parses, albeit with a long parse time. But, caching will be useful in addressing the parse time in that scenario.

                        // Max table cells per page
-                       maxTableCells: 30000,
+                       maxTableCells: 100000,
 
                        // Max transclusions per page
-                       maxTransclusions: 10000,
+                       maxTransclusions: 50000,

Wrong diagnosis

It may have been true prior to T207093 / T187848 when you filed this (well, it still exceeded resources but that was suppressed giving it time to timeout).

But note that the page mostly gets tokenized synchronously, like in T143378, since the majority of the content is wrapped in an <onlyinclude>.

Wrong diagnosis

It may have been true prior to T207093 / T187848 when you filed this (well, it still exceeded resources but that was suppressed giving it time to timeout).

Ah ok .. I didn't remember if I had tested it before I diagnosed that above or not. But, here is some interesting time profile.
I copied the wt of the page to a file. I parsed once with the <onlyinclude> as in the source wikitext and once by stripping it out.

Snippet of time profile with <onlyinclude>:

Recorded times (in ms) for various parse components

                   TOTAL PARSE TIME:   49597
                      TOTAL IO TIME:  130839
                  TOTAL CPU(?) TIME:   32491

CAVEATS:
* Multiple IO requests overlap with each other.
* CPU and IO activity overlap with each other.
--------------------------------------------------------------------------------
                       BatchRequest:  129902; count:     60; per-instance: 2165.03333
                      ConfigRequest:     637; count:      1; per-instance: 637       
                   ParamInfoRequest:     300; count:      1; per-instance: 300       

                 Total API requests:      62
         # non-batched API requests:       4
                          # batches:      58
          # API requests in batches:    2563
-------------------------------------------------------------------------------------
                                PEG:   14354
                      DOMPasses:TOP:    7296
                  HTML5 TreeBuilder:    2361
          AttributeExpander:onToken:    2124; count: 231982; per-instance: 0.00916   
                            SyncTTM:    1550
                 AsyncTTM (Partial):    1451
                  Setup Environment:    1029
         WikiLinkHandler:onWikiLink:     653; count:  11668; per-instance: 0.05597   
             SanitizerHandler:onAny:     363; count: 122608; per-instance: 0.00296   
         TemplateHandler:onTemplate:     332; count:   5326; per-instance: 0.06234   
             ParagraphWrapper:onAny:     297; count: 117036; per-instance: 0.00254   
       API response processing time:     152; count:     60; per-instance: 2.53333   
                   DOMPasses:NESTED:     126
...
...
-------------------------------------------------------------------------------------
30.11user 0.39system 0:35.60elapsed 85%CPU (0avgtext+0avgdata 718296maxresident)k
0inputs+0outputs (0major+231458minor)pagefaults 0swaps

Sample of time profile without <onlyinclude>:

Recorded times (in ms) for various parse components

                   TOTAL PARSE TIME:   33037
                      TOTAL IO TIME:   89805
                  TOTAL CPU(?) TIME:   19896

CAVEATS:
* Multiple IO requests overlap with each other.
* CPU and IO activity overlap with each other.
--------------------------------------------------------------------------------
                       BatchRequest:   88859; count:     77; per-instance: 1154.01299
                      ConfigRequest:     648; count:      1; per-instance: 648       
                   ParamInfoRequest:     298; count:      1; per-instance: 298       

                 Total API requests:      79
         # non-batched API requests:       2
                          # batches:      77
          # API requests in batches:    2563
-------------------------------------------------------------------------------------
                      DOMPasses:TOP:    7854
                  HTML5 TreeBuilder:    2752
          AttributeExpander:onToken:    1697; count: 231978; per-instance: 0.00732   
                            SyncTTM:    1655
                 AsyncTTM (Partial):    1385
                  Setup Environment:    1052
         WikiLinkHandler:onWikiLink:     852; count:  11668; per-instance: 0.07302   
                                PEG:     649
             SanitizerHandler:onAny:     458; count: 122604; per-instance: 0.00374   
             ParagraphWrapper:onAny:     427; count: 117034; per-instance: 0.00365   
         TemplateHandler:onTemplate:     384; count:   5326; per-instance: 0.0721    
           TokenStreamPatcher:onAny:     182; count: 117287; per-instance: 0.00155   
       API response processing time:     146; count:     79; per-instance: 1.8481    
           OnlyInclude:onAnyInclude:      69; count:  55429; per-instance: 0.00124   
       ExtensionHandler:onExtension:      65; count:     38; per-instance: 1.71053   
                   buildDOMFragment:      60; count:   7379; per-instance: 0.00813   
                   DOMPasses:NESTED:      48
...
...
-------------------------------------------------------------------------------------
37.72user 0.36system 0:51.35elapsed 74%CPU (0avgtext+0avgdata 528344maxresident)k

The IO time and batch time increase is simply because everything is waiting for the synchronous PEG parse to finish up.
But, the other major difference seems to be in the reported PEG time ... which doesn't make any sense to me. Is that faulty profiling? Total PEG time should not change whether we do sync tokenizing or async tokenizing unless this is causing some backtracking pathology to kick in.

But, the other major difference seems to be in the reported PEG time ... which doesn't make any sense to me. Is that faulty profiling? Total PEG time should not change whether we do sync tokenizing or async tokenizing unless this is causing some backtracking pathology to kick in.

I can look into it if you aren't already

But, the other major difference seems to be in the reported PEG time ... which doesn't make any sense to me. Is that faulty profiling? Total PEG time should not change whether we do sync tokenizing or async tokenizing unless this is causing some backtracking pathology to kick in.

I can look into it if you aren't already

No, I am not. So, go ahead!

Change 470072 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Capture async tokenizing time

https://gerrit.wikimedia.org/r/470072

Is that faulty profiling?

This was suspiciously low for something we're claiming is 10% of the parse,

PEG:     649

Change 470072 above brings that up an order of magnitude. Sync tokenizing the above still takes longer because,

unless this is causing some backtracking pathology to kick in.

is unfortunately correct. It looks like the recursive sync tokenizing call happens 3 times,
https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs#L249

An obvious place where this can be improved is,
https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs#L2174
where we do the parse just to end up backing out afterwards. But that's not being hit on this page, just an observation.

A first thought is to revive a patch like,
https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/398171
where the include contents only gets tokenized after being emitted ... but the problem there is that tokens end up as attributes of other tokens and whatnot, so there's no obvious way to ensure all that expansion happens.

Change 470072 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Capture async tokenizing time

https://gerrit.wikimedia.org/r/470072