Page MenuHomePhabricator

Excessive backtracking in attribute_preprocessor_text_line when parsing table cell causes 'hanging' workers in production
Closed, DeclinedPublic

Description

Several busy ('hanging') workers in production were backtracking when parsing pathological tables in http://el.wikipedia.org/wiki/%CE%A0%CE%BF%CF%81%CE%B5%CE%AF%CE%B1_%CF%84%CF%89%CE%BD_%CE%BA%CF%85%CF%80%CF%81%CE%B9%CE%B1%CE%BA%CF%8E%CE%BD_%CE%BF%CE%BC%CE%AC%CE%B4%CF%89%CE%BD_%CF%83%CF%84%CE%B1_%CE%BA%CF%8D%CF%80%CE%B5%CE%BB%CE%BB%CE%B1_%CE%95%CF%85%CF%81%CF%8E%CF%80%CE%B7%CF%82
I tracked this down by attaching the node debugger to those workers.

Backtracking when parsing table cells with optional attributes is hard to avoid, but in this case there might be a bug in cache key construction for memoization. The presence of plenty of quotes additionally slows down potential-attribute parsing here.

I have some WIP code that speeds things up a lot by avoiding to parse attributes with clearly invalid names, but get some failures in tests where the PHP parser simply strips invalid attribute names. Needs further investigation.


Version: unspecified
Severity: normal

Details

Reference
bz51457

Event Timeline

bzimport raised the priority of this task from to Low.Nov 22 2014, 2:03 AM
bzimport added a project: Parsoid-Tokenizer.
bzimport set Reference to bz51457.

Change 74527 had a related patch set uploaded by GWicke:
Bug 51457: Avoid some table attribute parsing backtracking

https://gerrit.wikimedia.org/r/74527

Change 74527 merged by jenkins-bot:
Bug 51457: Avoid some backtracking when tokenizing table attributes

https://gerrit.wikimedia.org/r/74527

Lowering priority as this patch improves parse times from hours to minutes. I'm not closing this bug yet as there might be more optimization potential here, and there might be other hang causes too. We'll see how it goes after the next deployment.

This is not a problem in production any more. There have not been major hangs in the last weeks: https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Parsoid%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1377540766&g=cpu_report&z=large

I'm keeping this bug open, but am de-assigning it in case somebody else would like to continue optimizing attribute tokenization.

That page seems to parse pretty reasonably these days. Too bad there's no oldid provided here.

[info][elwiki/Πορεία_των_κυπριακών_ομάδων_στα_ευρωπαϊκά_κύπελλα_ποδοσφαίρου?oldid=4859043] redirecting to revision 4859043
[info][elwiki/Πορεία_των_κυπριακών_ομάδων_στα_ευρωπαϊκά_κύπελλα_ποδοσφαίρου?oldid=4859043] started parsing
[info][elwiki/Πορεία_των_κυπριακών_ομάδων_στα_ευρωπαϊκά_κύπελλα_ποδοσφαίρου?oldid=4859043] completed parsing in 15396 ms
[info][elwiki/Πορεία_των_κυπριακών_ομάδων_στα_ευρωπαϊκά_κύπελλα_ποδοσφαίρου?oldid=1678641] started parsing
[info][elwiki/Πορεία_των_κυπριακών_ομάδων_στα_ευρωπαϊκά_κύπελλα_ποδοσφαίρου?oldid=1678641] completed parsing in 12970 ms