
Parse for these pages fails with OOM
Closed, ResolvedPublic

Description

$ node parse --useBatchAPI --page List_of_Marvel_Cinematic_Universe_films --oldid 684790839
..
FATAL ERROR: JS Allocation failed - process out of memory
Abort (core dumped)

Looks like this failure is specific to that oldid because the following one parses to completion in < 20 s.

$ node parse --useBatchAPI --page List_of_Marvel_Cinematic_Universe_films

The diff between those two oldids (https://en.wikipedia.org/w/index.php?title=List_of_Marvel_Cinematic_Universe_films&type=revision&diff=684799743&oldid=684790839) shows some include-directive changes.

Another page is: enwiki/List_of_doping_cases_in_athletics?oldid=684239207

Event Timeline

ssastry raised the priority of this task from to Medium.
ssastry updated the task description.
ssastry added a project: Parsoid.
ssastry subscribed.

The relevant change here is:

- | colspan="1" | Paul Rudd<ref name="Phase3Update">
+ | colspan="1" | Paul Rudd<ref name="Phase3Update" />
ssastry renamed this task from "Parse for this page fails with OOM" to "Parse for these pages fails with OOM". Oct 10 2015, 4:18 AM
ssastry updated the task description.
ssastry set Security to None.

This seems to be a problem with retokenizing

<ref>Nick Harris: [http://www.dailymail.co.uk/sport/othersports/article-2323161/Paul-Edwards-takes-UK-Athletics-UK-Sport-Kings-College-London-court-drug-tests.html London's Olympic drugbusters 'misled me for years over the test that got me banned for life'], Daily Mail, 11 May 2013</ref><ref>{{Cite news|

via this.tokenizer.tokenizeTableCellAttributes(attributishPrefix) in reparseTemplatedAttributes in dom.t.TableFixups.js -- that call seems to send the tokenizer into a tailspin.
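For context, the "exp*" in the eventual fix's commit message suggests exponential backtracking in a repetition production. Here is an analogous JavaScript sketch (not Parsoid's actual tokenizer grammar) of how a nested repetition can blow up on a long, non-matching input:

```javascript
// Analogous illustration only -- not Parsoid's actual grammar.
// Nested quantifiers like (a+)+ force a backtracking engine to try
// exponentially many ways of splitting the input before it can fail.
function timeMatch(n) {
  const input = 'a'.repeat(n) + '!'; // the trailing '!' guarantees failure
  const start = Date.now();
  /^(a+)+$/.test(input);
  return Date.now() - start;
}

// Each additional 'a' roughly doubles the running time.
console.log(timeMatch(20), 'ms vs', timeMatch(24), 'ms');
```

A tokenizer production with this shape can look fine on short inputs and then spin (or exhaust memory accumulating backtrack state) on pathological ones like the long ref-heavy cell contents above.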

Separately, the following seems broken in that same function.

// Try to re-parse the attributish text content
var attributishPrefix = attributishContent.txt.match(/^[^|]+\|/)[0];

It sees

<ref>Nick Harris: [http://www.dailymail.co.uk/sport/othersports/article-2323161/Paul-Edwards-takes-UK-Athletics-UK-Sport-Kings-College-London-court-drug-tests.html London's Olympic drugbusters 'misled me for years over the test that got me banned for life'], Daily Mail, 11 May 2013</ref><ref>{{Cite news|url=http://query.nytimes.com/gst/fullpage.html?sec=health&res=9804E1DE113FF93AA15753C1A962958260|work=The New York Times|title=SPORTS PEOPLE: TRACK AND FIELD; Shot-Putter Is Banned|date=1994-10-29|accessdate=2010-05-08}}</ref><ref>UK Anti-doping: [http://www.ukad.org.uk/anti-doping-rule-violations/current-violations/search/P40 Current sanctions]</ref>

and blindly looks for a pipe character. That test is broken in cases like this one, where the pipe occurs inside extension content, so it needs a smarter test than a blind regexp.

The tokenizer production broke with the Aug 26th deploy (https://www.mediawiki.org/wiki/Parsoid/Deployments#Wednesday.2C_August_26.2C_2015_around_1:10_pm_PT:_Y.C2.A0Deployed_44d657de), specifically with the cleanups applied there.

This also explains the mysterious p99 bump at the end of August in http://grafana.wikimedia.org/#/dashboard/db/parsoid-wt2html-trends?panelId=21&fullscreen. Until now, I had no satisfactory explanation for it: I kept coming back to the express deploy and kept dismissing it as an unlikely source. I first brought up the end-of-August uptick on IRC and blamed the express deploy, then dismissed that theory; I raised it again last week in email and IRC conversations and dismissed it again. However, I never thought to look at the Aug 26th deploy (instead of the 25th deploy!).

Anyway, that uptick in the p99 numbers is because all of these pages had been OOM-ing since Aug 26th, which effectively pushed up the p99 values.

cc @GWicke, @tstarling, @cscott -- so, there is some value in paying attention to long-term p99 trends after all, and in asking why the trends changed.

ssastry raised the priority of this task from Medium to High.

So, there are two separate problems, as I noted in T115072#1717106. The first is the tokenizer production breakage, which is what needs addressing as part of this ticket.

The second, which I also identified there, is a longstanding correctness problem in the reparsing code in some scenarios -- that functionality fix is outside the scope of this ticket.

A third thing to do is to start graphing the SIGABRT event rate by sending graphite events whenever that happens. It is likely we would have picked this up much more quickly if we had had that monitoring in place.

The sense of joy with which you write, @ssastry, it's palpable.

May the force (of perception) be with you!

Change 244887 had a related patch set uploaded (by Arlolra):
Fix OOM issue: our old favourite (exp*)

https://gerrit.wikimedia.org/r/244887

Change 244887 merged by jenkins-bot:
Fix OOM issue: our old favourite (exp*)

https://gerrit.wikimedia.org/r/244887

> A third thing to do is to start graphing the SIGABRT event rate by sending graphite events whenever that happens. It is likely we would have picked this up much more quickly if we had had that monitoring in place.

Created T115185 for this.

> The second, which I also identified there, is a longstanding correctness problem in the reparsing code in some scenarios -- that functionality fix is outside the scope of this ticket.

Created T115186.

ssastry removed a subscriber: gerritbot.