These pages seem to be causing a lot of timeouts on the production cluster
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Capture async tokenizing time | mediawiki/services/parsoid | master | +15 -8
Related Objects
- Mentioned In
  - T288889: Parsoid API not able to deliver HTML (HTTP 504) for certain (big) articles
  - T280381: 413 error while trying to fetch using desktop api
  - T207956: Token stream patcher table start retokenizing doesn't handle non-string tokens in table attribute position
  - T198618: TOC processing should strip contents of <style> tags
  - T167349: Tidy on WMF sites removes `<style>` tags
  - T186965: Output HTML should not contain `<p><style>...</style></p>`
  - T100841: Support for dynamically enabling new wikis
  - T207093: AssertionError: undefined
- Mentioned Here
  - T288889: Parsoid API not able to deliver HTML (HTTP 504) for certain (big) articles
  - rGPARb9fa661bf0c2: Capture async tokenizing time
  - T100841: Support for dynamically enabling new wikis
  - T167349: Tidy on WMF sites removes `<style>` tags
  - T186965: Output HTML should not contain `<p><style>...</style></p>`
  - T198618: TOC processing should strip contents of <style> tags
  - T207956: Token stream patcher table start retokenizing doesn't handle non-string tokens in table attribute position
  - T143378: Heartbeat timeout on User:Pdebee
  - T187848: Fix token transformer return types
  - T207093: AssertionError: undefined
Event Timeline
For https://en.wikipedia.org/wiki/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States, there are a lot of identical template calls. Parsoid should probably cache the output of templates in case they are reused on the page. We should check how the PHP parser treats this problem. For example, if there are multiple occurrences of {{#time:U|now}} on a page, will they all report the same time? Or will this only happen if that is wrapped in a template?
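A minimal sketch of what such caching could look like, assuming a simple Map keyed on the template target plus its arguments; the names here (`TemplateCache`, `expandFn`) are hypothetical, not Parsoid's actual API:

```js
// Hypothetical sketch of per-page template output caching; expandFn
// stands in for whatever actually expands a transclusion.
class TemplateCache {
    constructor() {
        this.cache = new Map();
    }

    expand(target, args, expandFn) {
        const key = target + '|' + JSON.stringify(args);
        if (!this.cache.has(key)) {
            // Cache the promise itself so that concurrent identical
            // transclusions share a single in-flight expansion.
            this.cache.set(key, expandFn(target, args));
        }
        return this.cache.get(key);
    }
}
```

As noted above, this would only be safe for deterministic expansions; if {{#time:U|now}} inside a template yields a fresh value per call in the PHP parser, such transclusions would have to bypass the cache.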
We'll revisit this after a port to PHP. We might be able to reuse the existing caching code instead of redoing it in Parsoid.
Wrong diagnosis. The clerks page above fails because of hitting our existing resource limits.
If I bump the limits as follows, the page parses, albeit with a long parse time. But, caching will be useful in addressing the parse time in that scenario.
```
 // Max table cells per page
-maxTableCells: 30000,
+maxTableCells: 100000,
 // Max transclusions per page
-maxTransclusions: 10000,
+maxTransclusions: 50000,
```
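For context, these limits are simple per-page counters consulted during the parse; a rough sketch of the pattern (the function and field names below are illustrative, not Parsoid's actual code):

```js
// Illustrative only: how a per-page resource limit check might look.
const limits = {
    maxTableCells: 100000,
    maxTransclusions: 50000,
};

function bumpResourceUse(resourceUse, resource) {
    resourceUse[resource] = (resourceUse[resource] || 0) + 1;
    if (resourceUse[resource] > limits[resource]) {
        // Before the bump above, the clerks page tripped these counters.
        throw new Error('Resource limit exceeded: ' + resource);
    }
}
```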
> Wrong diagnosis
That may have been true prior to T207093 / T187848, when you filed this (well, it still exceeded resources, but that error was suppressed, giving it time to time out).
But note that the page mostly gets tokenized synchronously, like in T143378, since the majority of the content is wrapped in an <onlyinclude>.
Ah ok ... I don't remember whether I had tested it before I made that diagnosis above. But, here is an interesting time profile.
I copied the wikitext of the page to a file. I parsed it once with the <onlyinclude> as in the source wikitext, and once with it stripped out.
Snippet of time profile with <onlyinclude>:
```
Recorded times (in ms) for various parse components

TOTAL PARSE TIME: 49597
TOTAL IO TIME: 130839
TOTAL CPU(?) TIME: 32491

CAVEATS:
* Multiple IO requests overlap with each other.
* CPU and IO activity overlap with each other.
--------------------------------------------------------------------------------
BatchRequest: 129902; count: 60; per-instance: 2165.03333
ConfigRequest: 637; count: 1; per-instance: 637
ParamInfoRequest: 300; count: 1; per-instance: 300
Total API requests: 62
# non-batched API requests: 4
# batches: 58
# API requests in batches: 2563
-------------------------------------------------------------------------------------
PEG: 14354
DOMPasses:TOP: 7296
HTML5 TreeBuilder: 2361
AttributeExpander:onToken: 2124; count: 231982; per-instance: 0.00916
SyncTTM: 1550
AsyncTTM (Partial): 1451
Setup Environment: 1029
WikiLinkHandler:onWikiLink: 653; count: 11668; per-instance: 0.05597
SanitizerHandler:onAny: 363; count: 122608; per-instance: 0.00296
TemplateHandler:onTemplate: 332; count: 5326; per-instance: 0.06234
ParagraphWrapper:onAny: 297; count: 117036; per-instance: 0.00254
API response processing time: 152; count: 60; per-instance: 2.53333
DOMPasses:NESTED: 126
...
...
-------------------------------------------------------------------------------------
30.11user 0.39system 0:35.60elapsed 85%CPU (0avgtext+0avgdata 718296maxresident)k
0inputs+0outputs (0major+231458minor)pagefaults 0swaps
```
Sample of time profile without <onlyinclude>:
```
Recorded times (in ms) for various parse components

TOTAL PARSE TIME: 33037
TOTAL IO TIME: 89805
TOTAL CPU(?) TIME: 19896

CAVEATS:
* Multiple IO requests overlap with each other.
* CPU and IO activity overlap with each other.
--------------------------------------------------------------------------------
BatchRequest: 88859; count: 77; per-instance: 1154.01299
ConfigRequest: 648; count: 1; per-instance: 648
ParamInfoRequest: 298; count: 1; per-instance: 298
Total API requests: 79
# non-batched API requests: 2
# batches: 77
# API requests in batches: 2563
-------------------------------------------------------------------------------------
DOMPasses:TOP: 7854
HTML5 TreeBuilder: 2752
AttributeExpander:onToken: 1697; count: 231978; per-instance: 0.00732
SyncTTM: 1655
AsyncTTM (Partial): 1385
Setup Environment: 1052
WikiLinkHandler:onWikiLink: 852; count: 11668; per-instance: 0.07302
PEG: 649
SanitizerHandler:onAny: 458; count: 122604; per-instance: 0.00374
ParagraphWrapper:onAny: 427; count: 117034; per-instance: 0.00365
TemplateHandler:onTemplate: 384; count: 5326; per-instance: 0.0721
TokenStreamPatcher:onAny: 182; count: 117287; per-instance: 0.00155
API response processing time: 146; count: 79; per-instance: 1.8481
OnlyInclude:onAnyInclude: 69; count: 55429; per-instance: 0.00124
ExtensionHandler:onExtension: 65; count: 38; per-instance: 1.71053
buildDOMFragment: 60; count: 7379; per-instance: 0.00813
DOMPasses:NESTED: 48
...
...
-------------------------------------------------------------------------------------
37.72user 0.36system 0:51.35elapsed 74%CPU (0avgtext+0avgdata 528344maxresident)k
```
The IO time and batch time increase is simply because everything is waiting for the synchronous PEG parse to finish up.
But, the other major difference seems to be in the reported PEG time ... which doesn't make any sense to me. Is that faulty profiling? Total PEG time should not change whether we do sync tokenizing or async tokenizing unless this is causing some backtracking pathology to kick in.
I can look into it if you aren't already.
Change 470072 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Capture async tokenizing time
> Is that faulty profiling?
This was suspiciously low for something we're claiming is 10% of the parse:

> PEG: 649
Change 470072 above brings that up an order of magnitude. Sync tokenizing the above still takes longer because,
> unless this is causing some backtracking pathology to kick in.
is unfortunately correct. It looks like the recursive sync tokenizing call happens 3 times,
https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs#L249
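A sketch of what "capture async tokenizing time" means in practice (the helper below is illustrative, not the actual change 470072): with async tokenizing, the tokenizer runs in many small slices interleaved with IO, so each slice's synchronous work has to be added to the PEG bucket, rather than timing only the first synchronous call.

```js
// Sketch: attribute every synchronous slice of tokenizer work to the
// PEG bucket, instead of timing only the first synchronous call.
function timeSlice(profile, sliceFn) {
    const start = process.hrtime.bigint();
    const result = sliceFn(); // one synchronous slice of tokenizing
    profile.peg += Number(process.hrtime.bigint() - start) / 1e6; // ms
    return result;
}
```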
An obvious place where this can be improved is,
https://github.com/wikimedia/parsoid/blob/master/lib/wt2html/pegTokenizer.pegjs#L2174
where we do the parse just to end up backing out afterwards. But that's not being hit on this page, just an observation.
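To illustrate the kind of backtracking pathology meant here, a generic sketch (invented productions, not Parsoid's grammar): a rule that fully parses an expensive sub-production and then fails a trailing predicate pays the whole parse cost on every failed attempt.

```js
// Generic backtracking illustration; not Parsoid's tokenizer.
function parseExpensive(input, pos) {
    // Stand-in for real tokenizing work over a long span.
    let i = pos;
    while (i < input.length && input[i] !== ']') {
        i++;
    }
    return { end: i, value: input.slice(pos, i) };
}

function parseWithBackout(input, pos) {
    const result = parseExpensive(input, pos);
    if (input[result.end] === ']') {
        return result; // trailing predicate holds: keep the work
    }
    return null;       // back out: all the work above is wasted
}
```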
A first thought is to revive a patch like,
https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/398171
where the include contents only get tokenized after being emitted ... but the problem there is that tokens end up as attributes of other tokens and whatnot, so there's no obvious way to ensure all that expansion happens.
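Schematically, the problem being described (these are not Parsoid's actual token classes): a token's attribute value can itself be a token list, so a scheme that defers tokenizing include content would have to walk into every such nested position to finish the expansion.

```js
// Schematic only: a tag token whose attribute value contains further
// tokens. Lazily tokenized include content could end up buried in
// positions like attribs[0].v, where no top-level pass would see it.
const tagToken = {
    name: 'td',
    attribs: [
        {
            k: 'style',
            v: [ { type: 'template', target: 'color-cell', params: {} } ],
        },
    ],
};
```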
Change 470072 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Capture async tokenizing time
@ssastry: I don't see what exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status.
Kiwix users are continuously reporting oddities regarding content and content formatting. We are challenged to identify whether the problem is in MWoffliner or upstream. To help identify where things go wrong, earlier this year MWoffliner started being stricter about the MediaWiki backend's HTTP responses. For example, if we get HTTP 5xx errors, MWoffliner now dies. If Parsoid does not deliver because of performance issue(s), then it will probably end with an HTTP 504 or 502 error at the API backend level. This is why this ticket and all the siblings of that kind are really important to us. For the moment, one such case within the 6M articles of WPEN means MWoffliner just stops the whole scrape.
The clerks page was split up in January 2019, so it should not be a problem anymore. I imported the AFC page with its templates to my local test install. It took 22 seconds with the August 30 version of Parsoid, which is comparable to production, but only 12 seconds with the current tip of my work branch. So I guess I fixed it already. I confirmed that production has an old version of Parsoid (a14, August 8).
Generally, template-heavy pages will have different performance characteristics in Parsoid/PHP than in Parsoid/JS, so test cases that used to be slow might not be slow anymore.
@tstarling Makes sense. Considering the age of this ticket, I will focus on T288889, which is more recent and has a clearer symptom. I will monitor it and update it from time to time with new pages which don't render.