Paragraph wrapping bug
Closed, ResolvedPublic

Description

Reported by jackmcbarn on IRC:

"a blank line gets lost (right under the Reformat proposal header) at http://parsoid-prod.wmflabs.org/_rtselser/enwiki/Module_talk:Zh?oldid=606919578"

Running this locally, and looking at the output HTML around the header, I see a bunch of text nodes after the <h2> that are occuring bare without any paragraph-wrapping. Some block-element (possibly misnested) is causing this to get suppressed.

To be investigated.


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=56601
https://bugzilla.wikimedia.org/show_bug.cgi?id=66400
https://bugzilla.wikimedia.org/show_bug.cgi?id=68395
https://bugzilla.wikimedia.org/show_bug.cgi?id=69629

Details

Reference
bz64901
bzimport set Reference to bz64901.
ssastry created this task.May 5 2014, 3:20 PM
ssastry added a comment.EditedMay 8 2014, 11:35 PM

This bug can be reproduced with the following wikitext snippet. Parsoid doesn't wrap foo1 and foo2 in p-tags but (php parser + tidy combo) does.

foo1 <blockquote>bar1</blockquote>

foo2 <blockquote>bar2</blockquote>

That said, this requires some investigation of what is going on since we do have explicit tests with this expectation of paragraph-wrapping behavior. See https://github.com/wikimedia/mediawiki-core/blob/master/tests/parser/parserTests.txt#L1469-L1515

Since the PHP parser passes this test, it is Tidy that is adding the p-wrappers. We are targeting the PHP parser output whereas we should be targeting (PHP parser + Tidy) output.

Bare text is fine in HTML5, while it was illegal in XHTML4. IIRC this is the reason for tidy's aggressive p wrapping behavior.

This is all moot though as we are dealing with content written against this tidy behavior, so we might have to emulate at least some of it to preserve semantics. We should however check whether it makes sense to follow XHTML4 rules in all cases before adopting tidy behavior wholesale.

The lost line sounds like a somewhat orthogonal issue, likely with selser. It might be helped by paragraph wrapping, but ideally we would not depend on that to make selser work.

Perhaps the page CSS has the styles applied to P?

I will look into adding a 'tidy' option to parser tests to allow writing tests which target PHP+tidy behavior.

(In reply to Gabriel Wicke from comment #2)

The lost line sounds like a somewhat orthogonal issue, likely with selser.
It might be helped by paragraph wrapping, but ideally we would not depend on
that to make selser work.

No, this is not a selser issue. It is a p-wrapping issue. Since _rtselser modifies the top-level, all the top-level node separators go through regular wts and without a p-tag, the text-node following the p-tag doesn't get 2 newlines, just one.

(In reply to ssastry from comment #4)

(In reply to Gabriel Wicke from comment #2)

> The lost line sounds like a somewhat orthogonal issue, likely with selser.
> It might be helped by paragraph wrapping, but ideally we would not depend on
> that to make selser work.

No, this is not a selser issue. It is a p-wrapping issue. Since _rtselser
modifies the top-level, all the top-level node separators go through regular
wts and without a p-tag, the text-node following the p-tag doesn't get 2
newlines, just one.

That said, our wts constraints for separators for a p-tag is probably not accounting for this scenario. Perhaps the <P>-#TEXT constraint could be fixed.

(In reply to ssastry from comment #4)

(In reply to Gabriel Wicke from comment #2)

> The lost line sounds like a somewhat orthogonal issue, likely with selser.
> It might be helped by paragraph wrapping, but ideally we would not depend on
> that to make selser work.

No, this is not a selser issue. It is a p-wrapping issue. Since _rtselser
modifies the top-level, all the top-level node separators go through regular
wts and without a p-tag, the text-node following the p-tag doesn't get 2
newlines, just one.

Let me rephrase: There's no rule that selser has to work this way. We could also use our dsr information to substring the original wikitext between elements if the HTML text content there was not modified.

So selser is working the way it's currently designed, but there might still be room to improve the way our selser implementation solves the wider selser problem.

ssastry added a comment.EditedMay 16 2014, 11:01 PM

(In reply to Gabriel Wicke from comment #6)

(In reply to ssastry from comment #4)
> (In reply to Gabriel Wicke from comment #2)
>
> > The lost line sounds like a somewhat orthogonal issue, likely with selser.
> > It might be helped by paragraph wrapping, but ideally we would not depend on
> > that to make selser work.
>
> No, this is not a selser issue. It is a p-wrapping issue. Since _rtselser
> modifies the top-level, all the top-level node separators go through regular
> wts and without a p-tag, the text-node following the p-tag doesn't get 2
> newlines, just one.

Let me rephrase: There's no rule that selser has to work this way. We could
also use our dsr information to substring the original wikitext between
elements if the HTML text content there was not modified.

So selser is working the way it's currently designed, but there might still
be room to improve the way our selser implementation solves the wider selser
problem.

Actually, I misspoke. I remembered adding this selser functionality to separators already. So, I looked into the code again.

So, let us use a test snippet:

<blockquote>a</blockquote> b

c <blockquote>d</blockquote>

Save it, generate html, edit it with a comment on the end and run selser on it as follows:

[subbu@earth lib] node parse --trace selser --selser --oldtextfile /tmp/foobar --oldhtmlfile /tmp/foobar.html < /tmp/foobar.edited.html

and see this in the output:

[WTS]          | TEXT: " b\n\nc " ; SOL: false
[WTS]          | ---> SEP: ""
[WTS]          | sol false  b
c 
[WTS]          | ---> OUT: " b\nc "

The absence of the P-wrapping tag leads to the "b\n\nc" to be a single text-node. Since there is no diff-markers on text-nodes (something we can fix by augmenting our dom-diff maybe), the text nodes go through regular WTS and looks like our regular WTS is normalizing newlines there.

The newline normalization for content inside text nodes is probably happening to preserve html2html semantics of not introducing p-tags inside text-nodes where there weren't any before. So, makes sense on that front.

fwiw, the "add tidy support to parserTests" patches are https://gerrit.wikimedia.org/r/134160 and https://gerrit.wikimedia.org/r/134230

  • Bug 66400 has been marked as a duplicate of this bug. ***

Change 155735 had a related patch set uploaded by Cscott:
Sync parserTests with core.

https://gerrit.wikimedia.org/r/155735

Change 155735 merged by jenkins-bot:
Sync parserTests with core.

https://gerrit.wikimedia.org/r/155735

Change 157038 had a related patch set uploaded by Subramanya Sastry:
WIP (Bug 64901) Fix paragraph-wrapping to match PHP parser + Tidy combo

https://gerrit.wikimedia.org/r/157038

Change 157038 merged by jenkins-bot:
(Bug 64901) Fix paragraph-wrapping to match PHP parser + Tidy combo

https://gerrit.wikimedia.org/r/157038

Now fixed and rt-tested and looking good so far. Will be deployed on Monday (Sep 29).