
Introduce various limits during parsing to deal with pathological page scenarios
Closed, ResolvedPublic

Description

For about the last 5-6 hours (since around 6-7am CT, Dec 9), Parsoid cluster load has been at 80% and higher, and there have also been a lot of CPU timeouts. It turns out this is T119883 in full glory: a bunch of bots are making multiple edits per minute (each in the <10 byte range) on a really large page. So, every minute, multiple parse requests are queued via RESTBase for this large page, which is going to time out anyway. Here is another specimen.

I think it is time to institute various parsing limits within Parsoid until we get around to being able to deal with these pages. Here are some possible limit features to consider:

  • Size of wikitext
  • Size of an individual list
  • Size of an individual table
  • Number of transclusions
  • Number of images
  • Expected size of the DOM (based on the number of tokens constructed, which would be fed into the HTML tree builder)

Given that these pathological pages will never yield a result and are only going to make the cluster sluggish, it makes sense to detect these failure scenarios early and return an HTTP 500. As Parsoid gets stronger muscles to deal with these "use wikitext as a database" scenarios, we can progressively relax the limits.
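A rough sketch of what such configurable limits and an early bail-out could look like is below. The config shape, default values, and error type are illustrative assumptions for this task, not Parsoid's actual code.

```typescript
// Illustrative sketch only: names, thresholds, and the error type are
// assumptions, not Parsoid's real implementation.
interface ParseLimits {
    wtSizeBytes: number;        // total wikitext size
    listItemCount: number;      // items in a single list
    tableCellCount: number;     // cells in a single table
    transclusionCount: number;  // transclusions on the page
    imageCount: number;         // images on the page
    tokenCount: number;         // proxy for the expected DOM size
}

const defaultLimits: ParseLimits = {
    wtSizeBytes: 1_000_000,
    listItemCount: 30_000,
    tableCellCount: 30_000,
    transclusionCount: 10_000,
    imageCount: 10_000,
    tokenCount: 1_000_000,
};

class LimitExceededError extends Error {
    constructor(readonly resource: keyof ParseLimits, readonly actual: number, readonly limit: number) {
        super(`${resource} limit exceeded: ${actual} > ${limit}`);
    }
}

// Cheap checks run up front (or as counts accumulate during tokenization)
// so a doomed parse fails fast instead of running until the CPU timeout.
function checkLimit(limits: ParseLimits, resource: keyof ParseLimits, actual: number): void {
    if (actual > limits[resource]) {
        throw new LimitExceededError(resource, actual, limits[resource]);
    }
}

// Example: reject oversized wikitext before any parsing happens.
function assertWikitextSize(wikitext: string, limits: ParseLimits = defaultLimits): void {
    checkLimit(limits, 'wtSizeBytes', Buffer.byteLength(wikitext, 'utf8'));
}
```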

Event Timeline

ssastry raised the priority of this task to High.
ssastry updated the task description. (Show Details)
ssastry added a project: Parsoid.
ssastry added subscribers: ssastry, Services, tstarling.

Change 257944 had a related patch set uploaded (by Subramanya Sastry):
WIP: T120972: Introduce configurable wt2html/html2wt limits

https://gerrit.wikimedia.org/r/257944

Change 257944 merged by jenkins-bot:
T120972: Introduce configurable wt2html/html2wt limits

https://gerrit.wikimedia.org/r/257944

ssastry closed this task as Resolved. Dec 13 2015, 4:18 PM
ssastry claimed this task.

This is now deployed. We return an HTTP 413 (Payload Too Large) error for these requests.

By cutting out all requests with wikitext > 1M, list items > 30K, or table cells > 30K, there have been zero request timeouts (as logged in Kibana) and about 12 CPU timeouts in the 40+ hours since this was deployed. This has also kept the Ganglia load graph almost flat.
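For illustration, the request-side behavior described above could be sketched roughly as follows. This is an Express-style handler with the deployed thresholds hard-coded; the route, helper names, and overall shape are assumptions, not Parsoid's actual service code.

```typescript
import express from 'express';

// Thresholds as described above (hard-coded here for illustration;
// in practice they would come from configuration).
const WT_SIZE_LIMIT_BYTES = 1_000_000;   // ~1M of wikitext
const LIST_ITEM_LIMIT = 30_000;          // items per list
const TABLE_CELL_LIMIT = 30_000;         // cells per table

const app = express();

// Accept raw wikitext; the body-parser cap is deliberately generous so
// that the explicit check below, not the parser, produces the 413.
app.post('/wikitext/to/html', express.text({ type: '*/*', limit: '10mb' }), (req, res) => {
    const wikitext = typeof req.body === 'string' ? req.body : '';
    if (Buffer.byteLength(wikitext, 'utf8') > WT_SIZE_LIMIT_BYTES) {
        // Fail fast: a page this large would only tie up a worker until
        // the CPU timeout, so reject it immediately.
        res.status(413).send('Payload Too Large: wikitext exceeds the configured limit');
        return;
    }
    // The list-item (LIST_ITEM_LIMIT) and table-cell (TABLE_CELL_LIMIT)
    // checks can only run once tokenization has counted those structures,
    // but a violation there would abort the parse and map to the same 413.
    res.status(200).type('text/html').send(renderHtml(wikitext));
});

// Stand-in for the real wt2html pipeline.
function renderHtml(wikitext: string): string {
    return `<!-- parsed ${wikitext.length} chars of wikitext -->`;
}

app.listen(8000);
```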

The urwiki bot-edited pages that caused the severe load spikes that prompted this task are covered by the list-item limit. Looking at Parsoid logs on various nodes, besides those urwiki pages, there have been a handful of pages that hit the table-cell limit. The other big source of HTTP 413s is T75412: OCG Attribution request times out regularly -- about 50 of those requests an hour exceed the 1M wikitext size limit.

ssastry set Security to None.
ssastry removed a subscriber: gerritbot.

Addressing T119883: Investigate inefficiencies in DOM construction and passes for large wikitext pages should help us increase these limits, maybe to 50K list items and table cells and a 1.5M wikitext size.