Page MenuHomePhabricator

Slow parsing / huge pages from HTML dump error logs: timeouts, crashers
Closed, ResolvedPublic

Description

The following pages are reported as timing out the Parsoid backend requests in the restbase logstash report (search for ETIMEDOUT pagebundle in https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase):

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description. (Show Details)
GWicke added a project: Parsoid.
GWicke added a subscriber: GWicke.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptFeb 8 2015, 2:18 AM
GWicke updated the task description. (Show Details)Feb 8 2015, 2:20 AM
GWicke set Security to None.
GWicke updated the task description. (Show Details)
GWicke updated the task description. (Show Details)Feb 8 2015, 2:23 AM
GWicke updated the task description. (Show Details)Feb 8 2015, 3:36 AM
GWicke updated the task description. (Show Details)
ssastry added a subscriber: ssastry.Feb 8 2015, 3:59 AM

Quick note I am recording here as I glance at this ticket on the weekend:

GWicke updated the task description. (Show Details)Feb 8 2015, 6:24 PM
Arlolra triaged this task as Medium priority.Feb 10 2015, 2:20 AM
Arlolra raised the priority of this task from Medium to High.
Arlolra moved this task from Needs Triage to Performance on the Parsoid board.
Arlolra added a subscriber: Arlolra.
marcoil moved this task from Performance to Needs Triage on the Parsoid board.Feb 13 2015, 12:51 PM
ssastry moved this task from Needs Triage to In Progress on the Parsoid board.Mar 3 2015, 9:10 PM
GWicke renamed this task from Slow parsing / huge pages from HTML dump error logs to Slow parsing / huge pages from HTML dump error logs: timeouts, crashers.Mar 5 2015, 9:31 PM

Trying some of these locally now, I get:

[info][enwiki/List_of_20th-century_classical_composers?oldid=656432750] completed parsing in 53805 ms
[info][enwiki/List_of_Detroit_Red_Wings_players?oldid=648754782] completed parsing in 49185 ms
[info][enwiki/List_of_Los_Angeles_Kings_players?oldid=648542956] completed parsing in 63519 ms

So, it looks like some of these may have been transient errors. Or, it might be that they are timing out because of cpu starvation from lots of big pages getting parsed around the same time. In any case, we should still look if processing of any of these humongous pages can be sped up anywhere.

A few more candidates from the latest dump run:

http://rest.wikimedia.org/zh.wikipedia.org/v1/page/html/2014%E5%B9%B4%E4%B8%AD%E8%8F%AF%E6%B0%91%E5%9C%8B%E6%9D%91%E9%87%8C%E9%95%B7%E9%81%B8%E8%88%89%E9%81%B8%E8%88%89%E5%8D%80%E6%8A%95%E7%A5%A8%E7%B5%90%E6%9E%9C%E5%88%97%E8%A1%A8%EF%BC%8D%E8%87%BA%E7%81%A3%E5%8C%97%E9%83%A8%E5%9C%B0%E5%8D%80/35105585
http://rest.wikimedia.org/zh.wikipedia.org/v1/page/html/2014%E5%B9%B4%E4%B8%AD%E8%8F%AF%E6%B0%91%E5%9C%8B%E6%9D%91%E9%87%8C%E9%95%B7%E9%81%B8%E8%88%89%E9%81%B8%E8%88%89%E5%8D%80%E6%8A%95%E7%A5%A8%E7%B5%90%E6%9E%9C%E5%88%97%E8%A1%A8%EF%BC%8D%E8%87%BA%E7%81%A3%E4%B8%AD%E9%83%A8%E5%9C%B0%E5%8D%80/34624388
http://rest.wikimedia.org/zh.wikipedia.org/v1/page/html/2014%E5%B9%B4%E4%B8%AD%E8%8F%AF%E6%B0%91%E5%9C%8B%E6%9D%91%E9%87%8C%E9%95%B7%E9%81%B8%E8%88%89%E9%81%B8%E8%88%89%E5%8D%80%E6%8A%95%E7%A5%A8%E7%B5%90%E6%9E%9C%E5%88%97%E8%A1%A8%EF%BC%8D%E8%87%BA%E7%81%A3%E5%8D%97%E9%83%A8%E5%9C%B0%E5%8D%80/34877225
http://rest.wikimedia.org/zh.wikipedia.org/v1/page/html/2014%E5%B9%B4%E4%B8%AD%E8%8F%AF%E6%B0%91%E5%9C%8B%E9%84%89%E9%8E%AE%E5%B8%82%E5%8D%80%E6%B0%91%E4%BB%A3%E8%A1%A8%E9%81%B8%E8%88%89%E9%81%B8%E8%88%89%E5%8D%80%E6%8A%95%E7%A5%A8%E7%B5%90%E6%9E%9C%E5%88%97%E8%A1%A8/35031515
http://rest.wikimedia.org/ur.wikipedia.org/v1/page/html/%D9%86%D8%A7%D9%85_%D9%85%D9%82%D8%A7%D9%85%D8%A7%D8%AA_%D8%A7%DB%92/1269282
http://rest.wikimedia.org/sq.wikipedia.org/v1/page/html/Marsi/1467248
http://rest.wikimedia.org/ro.wikipedia.org/v1/page/html/List%C4%83_de_nume_rom%C3%A2ne%C8%99ti_-_litera_B/8954060
http://rest.wikimedia.org/ro.wikipedia.org/v1/page/html/List%C4%83_de_nume_rom%C3%A2ne%C8%99ti_-_litera_C/8954061
http://rest.wikimedia.org/ru.wikipedia.org/v1/page/html/%D0%9A%D0%BB%D0%B8%D0%BC%D0%B0%D1%82_%D0%9C%D0%B8%D0%BD%D1%81%D0%BA%D0%B0/69923818
http://rest.wikimedia.org/pt.wikipedia.org/v1/page/html/G%C3%A9neros_de_angiosp%C3%A9rmicas/40814271
http://rest.wikimedia.org/pt.wikipedia.org/v1/page/html/Lista_de_munic%C3%ADpios_do_Brasil_por_popula%C3%A7%C3%A3o/41954872
http://rest.wikimedia.org/pt.wikipedia.org/v1/page/html/Lista_de_munic%C3%ADpios_do_Brasil_por_%C3%A1rea/41954891
http://rest.wikimedia.org/pl.wikipedia.org/v1/page/html/Zwi%C4%85zki_chemiczne_wed%C5%82ug_pierwiastk%C3%B3w/40829624

Ding, ding ding .. http://localhost:8000/sqwiki/Marsi is a winner .... we have a candidate to debug.

@ssastry, you are welcome ;)

Looks like this could be another instance of an infinite loop like the one we fixed in https://gerrit.wikimedia.org/r/#/c/189036/ .. to be continued later.

This snippet here is sufficient to send this into a tailspin .. Try it with "node parse --trace peg --prefix sqwiki < /tmp/wt"

foo <ref name="Marsi">{{cite article
| author = NASA
| title = The Lure of Hematite
| quote = bar
| newspaper =
| date = March 28, 2001  
| pages = 1
| url = http://science.nasa.gov/science-news/science-at-nasa/2001/ast28mar_1/
| accessdate = October 6, 2014
}}</ref>
ssastry added a comment.EditedApr 21 2015, 5:11 AM

Aha .. what fun! Check this out: https://sq.wikipedia.org/w/api.php?action=expandtemplates&text={{cite%20article|%20author%20=%20NASA|%20title%20=%20The%20Lure%20of%20Hematite|%20quote%20=%20Marsi%20%28greqisht:%20Ares%29%20eshte%20Zoti%20i%20luftes.%20Planeti%20ndoshta%20e%20mori%20kete%20emer%20per%20shkak%20te%20ngjyres%20se%20kuqe,%20dhe%20gjithashtu%20Marsi%20eshte%20referuar%20disa%20here%20si%20Planeti%20i%20Kuq.|%20newspaper%20=%20|%20date%20=%20March%2028,%202001|%20pages%20=%201|%20url%20=%20http://science.nasa.gov/science-news/science-at-nasa/2001/ast28mar_1/|%20accessdate%20=%20October%206,%202014}}

I'll save you the suspense. Expanding that transclusion gives you ....

=================================
Stampa:Cite_article
---------------------------------
<ref>{{cite article
| author = NASA
| title = The Lure of Hematite
| quote = Marsi (greqisht: Ares) eshte Zoti i luftes. Planeti ndoshta e mori kete emer per shkak te ngjyres se kuqe, dhe gjithashtu Marsi eshte referuar disa here si Planeti i Kuq.
| newspaper = 
| date = March 28, 2001
| pages = 1
| url = http://science.nasa.gov/science-news/science-at-nasa/2001/ast28mar_1/
| accessdate = October 6, 2014
}}</ref>
---------------------------------

and thus, you have the infinite loop! This output above is with "--dump tplsrc" flags passed to parse.js ... but, if you parse the result of the expandtemplates api call above, this is what you get.

Not sure what the other pages will yield .. but, something to check later is if this expandtemplates behavior is broken or not.

GWicke updated the task description. (Show Details)Jun 4 2015, 11:04 AM
GWicke updated the task description. (Show Details)

Created subtask for the Marsi infinite loop. Summary: MW tosses Parsoid a few bullets, Parsoid shoots itself in the foot. The fix will be in Parsoid.

As for [[List of 20th-century classical composers]], running it on my laptop gives 214 seconds of CPU usage in HHVM and 47 seconds in node. But running MW's expandtemplates on the whole article text only takes 10 seconds, which suggests that batching (T45888) could be a big win.

Note that this is 214s on a minimal test wiki with just Cite and ParserFunctions, I am not adding a lot of unnecessary startup overhead.

I'd also like to explore the 47s spent in node to see if there is any low-hanging fruit. For reference, MW transforms this article to HTML in 18s.

As for [[List of 20th-century classical composers]], running it on my laptop gives 214 seconds of CPU usage in HHVM and 47 seconds in node. But running MW's expandtemplates on the whole article text only takes 10 seconds, which suggests that batching (T45888) could be a big win.

Definitely appears to be the case. If this is only for expandtemplates (and not parse which runs into extension state issues), this might be doable.

Note that this is 214s on a minimal test wiki with just Cite and ParserFunctions, I am not adding a lot of unnecessary startup overhead.

I'd also like to explore the 47s spent in node to see if there is any low-hanging fruit. For reference, MW transforms this article to HTML in 18s.

I am curious to see how much memory usage and GC pressure there is. I imagine tokens and array creation contribute a lot to it -- I do remember the token transformation managers create a lot of arrays.

Another interesting experiment would be to see how node time changes if you had all the expandtemplates output cached in memory and available "instantly", i.e. how much of the 47 sec is i/o wait time vs. cpu time.

I'd also like to explore the 47s spent in node to see if there is any low-hanging fruit. For reference, MW transforms this article to HTML in 18s.

Also, another useful number to get would be to look at the total time spent in DOM transformations (which there are a *lot* of) and that should give us a good first cut sense of where that low hanging fruit might be.

I'd also like to explore the 47s spent in node to see if there is any low-hanging fruit. For reference, MW transforms this article to HTML in 18s.

Also, another useful number to get would be to look at the total time spent in DOM transformations (which there are a *lot* of) and that should give us a good first cut sense of where that low hanging fruit might be.

I just did some basic instrumentation in mediawiki.DOMPostProcessors.js and it looks like < 10% time is spent in DOM transforms across all invocations for this page. Same with the BO page.

I'd also like to explore the 47s spent in node to see if there is any low-hanging fruit. For reference, MW transforms this article to HTML in 18s.

Also, another useful number to get would be to look at the total time spent in DOM transformations (which there are a *lot* of) and that should give us a good first cut sense of where that low hanging fruit might be.

I just did some basic instrumentation in mediawiki.DOMPostProcessors.js and it looks like < 10% time is spent in DOM transforms across all invocations for this page. Same with the BO page.

Scratch that. I will have to redo the tests at a different time with a faster / reliable internet connection. I don't think the BO page takes 40s to parse and so it might have been a bad internet connection that is skewing the percentages for me.

Another interesting experiment would be to see how node time changes if you had all the expandtemplates output cached in memory and available "instantly", i.e. how much of the 47 sec is i/o wait time vs. cpu time.

That was 47s of CPU time, measured by taking a delta of the "time" column in ps before and after test execution. I didn't measure wall clock time.

Also, another useful number to get would be to look at the total time spent in DOM transformations (which there are a *lot* of) and that should give us a good first cut sense of where that low hanging fruit might be.

I'm not entirely sure if I'm abusing perf in the correct way. The following figures were generated by running perf report with the relevant function in -p, which selects all samples with the relevant function anywhere in the stack trace. Then I used the -x flag to exclude non-matching samples, and took the sample count from the resulting header.

FunctionEvent countRelative
prepareDOM77925798815.52%
wrapTemplates47296665593.35%
markTreeBuilderFixups38882610532.76%
computeDSR16457487741.17%
migrateTemplateMarkerMetas14561180581.03%
cleanupAndSaveDataParsoid12054366980.85%
handleUnbalancedTables4711300110.33%
migrateTrailingNLs3853050600.27%
handlePres1927091970.14%
processRefs111,065,8910.08%
unpackDOMFragments38,803,3740.03%
stripDoubleTDs2,239,8520.00%
stripEmptyElements643,8880.00%
stripMarkerMetas577,3340.00%
handleLIHack00.00%
handleLinkNeighbours00.00%
handleTableCellTemplates00.00%
logWikitextFixup00.00%
--
total of above21,920,285,63015.54%
doPostProcess20,060,665,05514.22%
total141,072,432,443100.00%

I am curious to see how much memory usage and GC pressure there is. I imagine tokens and array creation contribute a lot to it -- I do remember the token transformation managers create a lot of arrays.

By the same method as above, v8::internal::Heap::AllocateRaw() takes 4.6% of CPU.

I've been running a few tests with templates that expand to an empty string, to quantify the overhead of invoking a template. With MW involved as per normal, it is about 5ms of client-side CPU per template. Extrapolating, that explains about 16s of the previous 47s test case.

With the HTTP request stubbed out, replaced with a setImmediate() callback, it is about 2.6ms per template. With an immediate synchronous callback, it is about 2.5ms.

So the client-side overhead of making an async HTTP request is about 2.5ms, and that's how much we can expect to save by batching -- maybe 8s of client-side CPU in the motivating test case. So if the batched request is efficient enough to take less than 8s in HHVM, we may be able to improve overall latency, even if you assume an infinitely large HHVM cluster with no concurrency limit. We would also get a ~17% reduction in Parsoid CPU usage and ~95% reduction in MW CPU usage.

@GWicke suggested that my overhead calculations above might be biased due to HHVM causing bus contention and context switching in the normal case but not in the stubbed-out case. This seems to be the case. I re-ran the test with HTTP concurrency limited to 1, to eliminate context switching and reduce bus contention. The result was 0.9ms per template, down from 2.6ms.

@GWicke suggested that my overhead calculations above might be biased due to HHVM causing bus contention and context switching in the normal case but not in the stubbed-out case. This seems to be the case. I re-ran the test with HTTP concurrency limited to 1, to eliminate context switching and reduce bus contention. The result was 0.9ms per template, down from 2.6ms.

Sorry, I screwed up that one, ab was timing out and not doing as many iterations as I thought. Actually, with HTTP concurrency reduced to 1, the per-template client-side overhead hardly changes, it is still around 2.5ms.

@tstarling, could you check the throughput with https://github.com/jeffbski/bench-rest, to establish a baseline for a simple http request from node on your hardware?

tstarling added a comment.EditedJul 15 2015, 11:37 PM

With bench-rest, 100,000 iterations against an 11KB static file served by Apache on the same host:

  • concurrency=10: 1582 req/s, 64.525s total client CPU time, implying 0.64ms client-side CPU overhead per request, 0.63ms mean time between responses
  • concurrency=1 : 1365 req/s, 65.017s total client CPU time, implying 0.65ms client-side CPU overhead per request, 0.73ms mean time between responses

Apache was observed to be using about 20% of one CPU during the c=10 test, node about 105%.

Full results: P975

@tstarling, that's very close to the ~0.4ms I saw for small responses (150 bytes or so).

I do wonder where the other 2ms or so are coming from in the Parsoid test.

With https://gerrit.wikimedia.org/r/#/c/244588/ all the pages listed here (except for the Marsi page which has a bug filed against it for detecting <ref> loops) parse to completion on my laptop .. almost all of them in < 40 sec. A couple take over 60 sec.

@ssastry: Wow, that's an impressive result for a one-line change!

I should try without that patch as well just to see how many of them were fixed by this patch vs. how many were already parsing successfully without .. but anywhere where there is a node with 1000s of children, the patch could potentially improve latency since it eliminates the O(n^2) part.

ssastry closed this task as Resolved.Oct 9 2015, 12:52 AM
ssastry claimed this task.

Yup, all of them have similar timings on master as well. I am going to close this ticket. Let us open new tickets for any individual pages that have poor performance and is reproducible locally.