Parsoid renders trailing newlines in tables as paragraphs, making ptwiki infoboxes huge
Closed, ResolvedPublic

Description

As reported on
https://pt.wikipedia.org/wiki/Project:Editor_Visual/Coment%C3%A1rios?oldid=36203746#Infobox
there is too much space in the infobox when we open
https://pt.wikipedia.org/wiki/Albert_Einstein?veaction=edit
(tested on Google Chrome 28.0.1500.52 and Firefox 21.0, using Linux Mint 15).

Maybe this is a problema with a local template or CSS, but I was unable to figure that out by inpecting the HTML code...

The problem do not happen on enwiki:
https://en.wikipedia.org/wiki/Albert_Einstein?veaction=edit


Version: unspecified
Severity: normal

bzimport added a project: Parsoid.Via ConduitNov 22 2014, 1:59 AM
bzimport set Reference to bz50122.
He7d3r created this task.Via LegacyJun 24 2013, 7:06 PM
He7d3r added a comment.Via ConduitJun 24 2013, 7:08 PM

Created attachment 12626
http://s24.postimg.org/l4s9um8s5/userbox.jpg

Attached:

Jdforrester-WMF added a comment.Via ConduitJun 24 2013, 7:23 PM

This is probably due to CE's tinkering with templates so that they float correctly; the scale of the change is I imagine a template-specific issue, but it would be good to confirm.

matmarex added a comment.Via ConduitJun 24 2013, 7:29 PM

Actually, this looks like VE is inserting an additional empty cell at the beginning of every infobox table row.

Jdforrester-WMF added a comment.Via ConduitJun 24 2013, 7:35 PM

(In reply to comment #3)

Actually, this looks like VE is inserting an additional empty cell at the
beginning of every infobox table row.

s/VE/Parsoid/g ?

Catrope added a comment.Via ConduitJul 1 2014, 12:26 AM

This is now looking much better, but it's still not fixed. There's lots of whitespace added at the bottom, causing the infobox to be much much taller than it should be.

Moving to Parsoid: the table is also too big in http://parsoid-lb.eqiad.wikimedia.org/ptwiki/Albert_Einstein?oldid=39283789

This happens because, in the expansion of the template, lots and lots of newlines (more specifically, the pattern newline-newline-space or sometimes newline-newline-newline-space) are produced. You can see these in http://pt.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvexpandtemplates&format=yamlfm&titles=Albert_Einstein (search for "\n\n\n \n\n"). This seems to be due to templates like https://pt.wikipedia.org/w/index.php?title=Predefini%C3%A7%C3%A3o:Info&action=edit with lots of if statements (in the 200s) that leak newlines.

The result is a table that looks like:
{|

-
Realrowshere
-

....

-
Lastrowhere
-

[lots of newlines]

}

The wikitext parser ignores these newlines, but Parsoid converts them all to paragraphs inside of a row with a single cell. In this case, this row contains about 232 paragraphs and has a total height of 5615px.

I haven't been able to reproduce this locally though. Parsoid seems to quash the newlines just fine, even if I try to generate them with {{echo}}.

ssastry added a comment.Via ConduitJul 2 2014, 10:47 PM

(In reply to Roan Kattouw from comment #5)

I haven't been able to reproduce this locally though. Parsoid seems to quash
the newlines just fine, even if I try to generate them with {{echo}}.

This might be some weird tokenizer thing ... after a lot of trial-and-error, I have a reduced test case where the tokenizer output subtly changes beyond a threshold of newlines ... I'll continue on this later, but with a wikitext that is 136 lines long (including table closing, ending, and other tags), the tokenizer just generates a tr-tag without a td-tag => no p-br mess. At 138 lines long, the tokenizer seems to generate a td-tag after which then induces p-br wrapping since all the newlines are considered "content".

https://gist.githubusercontent.com/subbuss/32c0fe97734381e1f026/raw/58c2815289a24a42753da14085518696bddb57e9/gistfile1.txt has the p-br .. now try removing 2 of the empty lines at the beginning or end. and boom, the p-brs disappear.

To be continued ... running out now ...

cscott added a comment.Via ConduitJul 2 2014, 10:49 PM

Perhaps we're hitting a limitation in our lookahead and/or backtracking?

ssastry added a comment.Via ConduitJul 3 2014, 4:27 AM

Line 224 of mediawiki.tokenizer.utils.js (which is a heuristic that looks upto 200 chars ahead to determine if we have non-whitespace content) is the problem. If the table has more than 200 chars whitespace, this effectively means that the tokenizer assumes that valid content follows and inserts a <td> there.

In this context, another relevant consideration is Parsoid's implicit-td-insertion behavior (which is what is kicking in above). I don't know why we have this behavior. See my comments around line 5500 of https://gerrit.wikimedia.org/r/#/c/133957/2/tests/parser/parserTests.txt (which is the WIP patch to add tidy support to parser tests).

So, two things to check here:
(1) why do we have implicit-td-insertion behavior?
(2) what is the best way to fix the 200-char lookahead issue in the inline_breaks test in the tokenizer?

Fixing either one will fix the problem for this test case.

gerritbot added a comment.Via ConduitJul 3 2014, 9:05 PM

Change 144061 had a related patch set uploaded by Subramanya Sastry:
(Bug 50122) Removed fragile lookahead in inline_breaks + improve perf.

https://gerrit.wikimedia.org/r/144061

gerritbot added a comment.Via ConduitJul 3 2014, 11:19 PM

Change 144061 merged by jenkins-bot:
(Bug 50122) Removed fragile lookahead in inline_breaks + improve perf.

https://gerrit.wikimedia.org/r/144061

ssastry added a comment.Via ConduitJul 7 2014, 8:06 PM

This fix is now deployed. The implicit-td-insertion behavior will be investigated separately as part of tidy support work.

Add Comment