
Parsoid timing out or failing when trying to parse specific user page
Closed, ResolvedPublic

Description

Parsoid load went up significantly this morning in eqiad (the cluster that serves live traffic). I traced the issue to Parsoid's inability to parse most revisions of a specific talk page on cebwiki, which often results in timeouts or in the worker dying.

All of this seems to be caused by REST-API-Crawler-Google/1.0, which is trying to parse all revisions of the page.

An extract from parsoid logs:

"Timed out processing: cebwiki/Gumagamit:Lsjbot/Kartrutor2?oldid=12301712"
"worker 25444 died (1), restarting."
"Timed out processing: cebwiki/Gumagamit:Lsjbot/Kartrutor2?oldid=12301924"
"worker 25274 died (1), restarting."
"Timed out processing: cebwiki/Gumagamit:Lsjbot/Kartrutor2?oldid=12301844"
"worker 25434 died (1), restarting."
"Timed out processing: cebwiki/Gumagamit:Lsjbot/Kartrutor2?oldid=12301924"
"worker 25464 died (1), restarting."
"Timed out processing: cebwiki/Gumagamit:Lsjbot/Kartrutor2?oldid=12301963"
"worker 25133 died (1), restarting."

The page is long and makes extensive use of a Lua module, https://ceb.wikipedia.org/wiki/Module:KML

While the cluster load is more or less under control, this is causing workers to die, and thus in-flight requests from real users might fail.

This should thus be treated with the highest priority.

Event Timeline

Joe renamed this task from Parsoid unable to parse specific user page to Parsoid timing out or failing when trying to parse specific user page.Jan 18 2017, 8:53 AM
Joe claimed this task.
Joe added a project: User-Joe.

Isolating a single request, I see that most of the time is spent executing

v8::internal::VisitWeakList<v8::internal::JSFunction>

and that parsing, even when successful, takes ~180 seconds.

I am now trying to determine why the worker dies.

Strace gives little more information, besides the fact that for each of these pages Parsoid makes hundreds of preprocessing requests to the MW API. Maybe some recursion limit is being reached?
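For context, a minimal sketch of the kind of preprocessing call involved. This is illustrative only, not Parsoid's actual code: the `preprocessUrl` helper is an assumption, though `action=expandtemplates` is a real MediaWiki action API module used for template expansion.

```javascript
// Sketch (assumption, not Parsoid code): build the kind of MediaWiki
// action API request issued to expand a transclusion during preprocessing.
// A page with many transclusions triggers many such round trips.
function preprocessUrl(apiBase, wikitext, title) {
  const params = new URLSearchParams({
    action: 'expandtemplates', // server-side template/Lua expansion
    prop: 'wikitext',          // return the expanded wikitext
    title: title,              // page context for the expansion
    text: wikitext,            // fragment to expand, e.g. '{{KML}}'
    format: 'json',
  });
  return `${apiBase}?${params}`;
}
```

Hundreds of these sequential-ish round trips per parse would by itself account for a large share of the ~180s wall-clock time observed above.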

mobrovac subscribed.

The request limit in Parsoid is set to 110s, after which the worker commits suicide.

I will blacklist this specific title in RESTBase for now.
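A minimal sketch of how a per-request time limit like the 110s one can work in a Node.js worker. This is an assumption for illustration, not Parsoid's actual implementation; `withTimeout`, `handleParse`, and the exit-on-timeout handling are hypothetical names.

```javascript
// Sketch (assumption, not Parsoid code) of a hard per-request deadline.
const REQUEST_TIMEOUT_MS = 110 * 1000;

// Race the parse promise against a timer; reject if the timer wins.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('request timeout')), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Treating a timeout as fatal and exiting non-zero would produce
// exactly the "worker NNNNN died (1), restarting." log lines above,
// since the cluster master restarts dead workers.
function handleParse(parsePromise) {
  return withTimeout(parsePromise, REQUEST_TIMEOUT_MS).catch((err) => {
    console.error('Timed out processing request:', err.message);
    process.exit(1);
  });
}
```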

The PHP parser also gives up with lots of errors like this on the page:

...
S08W039 Lua error: too many expensive function calls.
...

Anyway, it looks like Parsoid is missing some resource limit / state needed to detect this scenario. That said, bot-driven pages (which are basically proxies for a database table) are usually the ones that give Parsoid trouble.
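The "too many expensive function calls" error above comes from Scribunto's expensive-function budget in the PHP parser (configured by `$wgExpensiveParserFunctionLimit`). A sketch of the same budget idea, as the kind of resource limit Parsoid could apply; the names and the default limit of 100 are assumptions for illustration:

```javascript
// Sketch (assumption): a Scribunto-style budget. Each "expensive"
// operation charges the counter; exceeding the budget aborts the parse
// with the same error the PHP parser emits on this page.
function makeExpensiveGuard(limit = 100) {
  let used = 0;
  return function charge() {
    if (++used > limit) {
      throw new Error('too many expensive function calls');
    }
  };
}
```

With such a guard, a pathological page fails fast with a bounded error instead of tying up a worker until the 110s deadline kills it.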

> I will blacklist this specific title in RESTBase for now.

Marco, is this something that Ops could do in case of fire? If so, is the procedure written down somewhere?

@elukey apparently this needs a code deploy, which means accepting a pull request on GitHub (sic), where not everyone from Ops has the ability to merge a PR (I do, as I'm an admin of the wikimedia GitHub org, but YMMV). You then need to check that into the Gerrit-based deploy repo, and RESTBase is then deployed with some Ansible recipe (sic, again) instead of scap3 or Trebuchet.

So we definitely need someone from the Services team to do it.

The correct way to handle this would, of course, be to allow Ops to control at least part of the blacklist via Puppet, or at least to standardize the deployment process.

I remember there was talk of moving RESTBase to scap3, but I don't think that has happened yet.

Joe moved this task from Backlog to Doing on the User-Joe board.