Page MenuHomePhabricator

doBlockLevels() inserts <p> and </p> randomly with no regard for HTML validity
Open, LowPublic

Description

After doBlockLevels() sees an empty line, it starts hunting for a place to insert a start tag. It does this by checking if the line contains various assorted open or close tags (div, blockquote, th, etc.) -- if it gets a match, then it skips the line, and continues on to the next line. If there is no match, then it inserts the <p> at the start of the line.

With a <p> tag already open, the algorithm for </p> insertion is much the same in the opposite direction, i.e. presence of div etc. forces immediate insertion of </p> at the start of the line (regardless of where in the line the actual tag was found).

This is not equivalent to an HTML parsing algorithm. As an example of the mischief this can cause, consider the input:

aaaa

<div></div><span id="1">
bbbb
</span>
<span id="2">
cccc
<div></div>
</span>

This produces the output:

<p>aaaa
</p>
<div></div><span id="1">
<p>bbbb
</span>
<span id="2">
cccc
</p>
<div></div>
<p></span>
</p>

The empty line following "aaaa" sets a flag, and the first empty div suppresses insertion of the <p> on that line, so the "bbbb" line gets the <p>. Then with the second empty div, we trigger insertion of </p> prior. Thus the paragraph straddles the boundary between the first two spans. The final </span> does not suppress anything, so it gets wrapped in its own paragraph.

This is a reduction of the Rachael Flatt test case on Parsing/Replacing Tidy and so blocks T89331, since Depurate handles such broken HTML differently to Tidy.

Related Objects

Mentioned In
T361663: Parsoid incorporates text after a balanced template into the template if immediately preceded by an image
T357090: Error in generated HTML
T311827: Linter does not detect P inside SPAN, where SPAN is inside DIV based content.
T303705: Linter incorrectly lists a page on which the closing <span> tag is on a different line
T275475: <div> followed by <span>.*\n throws lint errors
T322775: Allow <thead> <tbody> <tfoot> as literal HTML tags in Wikitext
T269549: Inconsistency between parsers handling paragraphs in a table division
T257651: Reply tool messes up replying to users with signatures followed by html comments
T253233: Provide magic word for 'soft' newline insertion from template markup.
T11996: Multiline tags in lists should be output more intelligently
T230683: New syntax for multiline list items / talk page comments
T208901: TemplateStyles breaks a paragraph if a file is inserted inline
T198219: Line based p-wrapping can't match Remex
T68655: Multiline code tag results in strange html
T192909: Dirty diff and other weird corruption in table edit
T187906: Text breaks out of p tag unexpectedly
T187065: <del> tag with multi-paragraph content improperly handled by parser
T176363: New high-priority Linter category for a subset of misnested tags whose behavior will change with a HTML5 parser
T92040: Parsoid generates invalid HTML5 when a list is nested inside a <small> tag
T161306: Investigate P-wrapping oddity that introduces long horizontal no-wrap lines on many navboxes on shwiki
Mentioned Here
T253233: Provide magic word for 'soft' newline insertion from template markup.
P11223 More P-wrapping comments.
T55784: [EPIC] Use Parsoid HTML for all page views
T253072: [EPIC] Remove most p-wrapping from mediawiki
T230683: New syntax for multiline list items / talk page comments
T227953: Annoyance: Parser mis-"tidies" end of embedded <span> inside a <div> unless it has additional EOL context, Linter reports as a "Missing end tag","Stripped tag" error pair.
T11207: Single newlines sometimes create paragraphs
T123023: Newlines in <div> tags cause <p> tags to wrap wrong content
T114072: <section> tags for MediaWiki sections
T110004: DOM Pass for wrapping bare text found in <body> and other "block" (in html4-parlance) nodes like <blockquote>, <td>, <th>.
T89331: Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@ssastry, I'm mainly wondering

a) how far down the road towards maintaining two DOM-based parsers you are willing to walk, and
b) whether you still aim to make Parsoid a full replacement for the PHP parser by adding missing features like link updates and some hook support.

To me, the worst possible outcome appears to be paying for maintainance of two almost-identical parsers for a long time, but only having bi-directional conversion support in one of the two. I am a bit worried that we are currently drifting in this direction, but would be happy to learn that this is not the case, and you are actually aiming for one of the two main alternatives (PHP parser to replace Parsoid, or Parsoid to replace PHP parser).

To repeat, the concerns are not about having one parser on the WMF cluster serving both read and edit views, but about how to get to that goal.

The place we are moving towards is bringing the output of the two parsers closer together. That will enable Parsoid to replace the PHP parser on the WMF cluster without having to break a ton of stuff all at once. For example, all the stuff that breaks with replacing Tidy with HTML5Depurate will also break with Parsoid. By replacing Tidy with a full HTML5 parser, we get closer to being able to use Parsoid output for read views as well. So, yes the plan for the WMF cluster is still to have Parsoid be the default engine for read and edit views.

@ssastry, thanks for the clarification.

Do you plan to implement new DOM-based logic like section wrapping in both Parsoid and PHP?

Can you take the broader discussion somewhere else please? Maybe file a separate task for it?

The changes to the core parser will require equivalent changes (which will hopefully be a cleanup from the current ugliness in paragraph-wrapper) in Parsoid. We should co-ordinate these changes a bit. Stuff like https://github.com/wikimedia/parsoid/blob/2f68a9cb504a332ea2dec3ea948ebfcffcc4d141/lib/config/WikitextConstants.js#L123-L158 will likely change / be removed.

Given that we will always have some edge cases in p-wrapping behavior (whether done as regexps on a string, or done based on stack emulations on a token stream), I personally don't think this task should be a blocker for replacing Tidy as long as (1) we can determine that a small subset of page renderings will break (2) we can identify those and (3) make fixes to the affected templates / page wikitext.

Since <p>-wrapping is not actually necessary for HTML5 validity any more -- it's just a matter of "legacy" compatibility with current stylesheets/gadgets/etc -- I'd like to see if we can gradually deprecate it. In ten years, I'd love for us to be at the point where we don't do <p>-wrapping at all, it would simplify a lot of things. In the near term, I agree that we can not stress about the edge cases too much, and then hopefully gradually turn off <p>-wrapping in some of the stranger edge cases, with appropriate notice/community involvement/etc.

Another weirdness that I discovered caused by this behavior:

<span style="color:red">
*a
*b
</span>

generates

<p><span style="color:red"></p>
<ul>
<li>a</li>
<li>b</li>
</ul>
<p></span></p>

Tidy takes and for some unknown reason converts that to:

<ul>
<li><span style="color:red">a</span></li>
<li><span style="color:red">b</span></li>
</ul>

whereas Parsoid and RemexHTML convert that to:

<p><span style="color:red"></span></p>
<ul>
<li>a</li>
<li>b</li>
</ul>
<p></p>

So, it looks like doBlockLevels weirdness combines with Tidy weirdness to generate the intended result. But, there is no such cancellation effect with Parsoid and Remex. So, either the wikitext would have to be fixed or doBlockLevels would have to be fixed.

This can be fixed in Parsoid by changing the p-wrapping algorithm and/or doing it on the DOM.

So, I think this bug is restricted to <span>.

Compare https://www.mediawiki.org/api/rest_v1/page/html/User:SSastry_(WMF)%2FSandbox/2444123 (Parsoid) and https://www.mediawiki.org/w/index.php?title=User:SSastry_(WMF)/Sandbox&oldid=2444123&action=parsermigration-edit (Tidy vs RemexHTML). So, both RemexHTML and Parsoid differ from Tidy only in the <span> tag case.

I think this is because we did add p-wrapping compatibility to RemexHTML. And, Parsoid also has a DOM pass for similar compatibility. *BUT*, I think both of these use "isFormattingTag(..)" as a check. However, <span> is not a formatting tag in the HTML5 spec. See the comment in isSplittableTag in https://gerrit.wikimedia.org/r/#/c/328021/4/pwrap.js which is the algorithm that RemexHTML effectively implements. So, this is not as bad as we thought. If we wanted to fix this, we could add explicit checks for <span> (and any other tag affected by this) to Parsoid and RemexHTML OR we could break Tidy compatibility here and have wikitext be fixed up.

So, I think this bug is restricted to <span>.

Compare https://www.mediawiki.org/api/rest_v1/page/html/User:SSastry_(WMF)%2FSandbox/2444123 (Tidy vs RemexHTML) and https://www.mediawiki.org/w/index.php?title=User:SSastry_(WMF)/Sandbox&oldid=2444123&action=parsermigration-edit (Parsoid). So, both RemexHTML and Parsoid differ from Tidy only in the <span> tag case.

I think this is because we did add p-wrapping compatibility to RemexHTML. And, Parsoid also has a DOM pass for similar compatibility. *BUT*, I think both of these use "isFormattingTag(..)" as a check. However, <span> is not a formatting tag in the HTML5 spec. See the comment in isSplittableTag in https://gerrit.wikimedia.org/r/#/c/328021/4/pwrap.js which is the algorithm that RemexHTML effectively implements. So, this is not as bad as we thought. If we wanted to fix this, we could add explicit checks for <span> (and any other tag affected by this) to Parsoid and RemexHTML OR we could break Tidy compatibility here and have wikitext be fixed up.

So, at least for Parsoid, this is not because of anything that Parsoid does special ... but the result of domino's standard html5 tree building algorithm. So, if we want to fix this in Parsoid, we need to add special-case handling.

ssastry lowered the priority of this task from Medium to Low.Apr 11 2018, 8:23 PM
ssastry added subscribers: Tgr, Ltrlg, matmarex and 2 others.

We won't fix this in the PHP parser since it requires DOM context which it doesn't have. In any case, since the goal is to make Parsoid the wikitext parser for mediawiki, and Parsoid can handle this better, there isn't a reason to hack this in the PHP parser at this point. But, leaving this open if someone feels an itch and wants to tackle it. But, I'll lower the priority on this.

This comment was removed by ssastry.

Unfortunately (?) Parsoid is now bug-compatible with PHP in this regard (example from T227953):

$ (echo '{{left margin|5em|{{smaller|' ; echo 'Foo<br/>' ; echo 'Bar<br/>' ; echo 'Bat}}}}') | bin/parse.js --domain en.wikisource.org --usePHPPreProcessor=false
[...]
<div style="margin-left:5em;" about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"stx":"html","a":{"style":"margin-left:5em;"},"sa":{"style":"margin-left:{{{1}}};"},"dsr":[0,54,null,null],"pi":[[{"k":"1"},{"k":"2"}]]}' data-mw='{"parts":[{"template":{"target":{"wt":"left margin","href":"./Template:Left_margin"},"params":{"1":{"wt":"5em"},"2":{"wt":"{{smaller|\nFoo&lt;br/>\nBar&lt;br/>\nBat}}"}},"i":0}}]}'>
<p><span style="font-size: 83%;">
Foo<br/>
Bar<br/></span></p>
Bat</div><link rel="mw:PageProp/Category" href="./Category:Pages_using_left-margin_with_text" about="#mwt1" data-parsoid='{"stx":"simple","a":{"href":"./Category:Pages_using_left-margin_with_text"},"sa":{"href":"Category:Pages using left-margin with text"}}'/>
</body></html>

We also migrate the </p> before Bat. Alas.

Unfortunately (?) Parsoid is now bug-compatible with PHP in this regard (example from T227953):

Absolutely, we had to. We worked hard to get there. :-) But, once we switch over to Parsoid, we can lint pages and undo.

Is there a bug or FIXME in Parsoid which indicates where this crazy behavior is implemented aka where to emit lint messages and then eventually where to remove the back-compat code?

All in ParagraphWrapper token transformer code.

Some discussion in T230683#5793251 and T230683#5798972, although that has more to do with general doBlockLevels brokenness and not necessarily <p> tag insertion in doBlockLevels. Eg, compare:

::<dl><dd>foo
bar</dd></dl>

with

**<dl><dd>foo
bar</dd></dl>

The results vary because doBlockLevels has no idea where it is in the list structure when it sees the newline after foo -- so it outputs some random tags, and then Remex has to clean the mess up the best it can.

In response to csott-

See https://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium/Help&oldid=9909322#Why_does_templating_something_generate_an_error? for another example of possible doBlockLevels weirdness. ( the template concerned being: https://en.wikisource.org/wiki/Template:Left). Using the template inside a definition list generated a lint-error. The level generation glitch here seems to be dependent on a line feed before the closing block level tag (in this instance a DIV). When I removed that line-feed https://en.wikisource.org/wiki/Template:Left-shift it wrapped nicely.(No LintErrors).

Another example https://species.wikimedia.org/w/index.php?title=Cattleya_aclandiae&oldid=7533297

here the transition is UL -> DIV -> TABLE. Here the DIV generated by the Commons template is unexpectedly terminated meaning the box around the commons category mis-renders. (The BOX around the DIV should entirely enclose the content.) I can't find any obvious stray linespace in the template markup.

Another example https://species.wikimedia.org/w/index.php?title=Cattleya_aclandiae&oldid=7533297

here the transition is UL -> DIV -> TABLE. Here the DIV generated by the Commons template is unexpectedly terminated meaning the box around the commons category mis-renders. (The BOX around the DIV should entirely enclose the content.) I can't find any obvious stray linespace in the template markup.

Separate issue. <table> cannot be placed inside <ul>/<li>.

https://en.wikisource.org/wiki/Template:Cl-act-p/testcases#Testing_start_and_end_version_for_higher_level.

Here the only additional whitespace is the line feeds between the template starts, the nominal text and the template end call as far as I can determine.

Given the normal behaviour of mediawiki when not calling templates, it would not be unreasonable to expect the two examples to behave identically (collapsing line feeds), as they should be reasonably expected to generate identical code ( if it were not for the white space handling).

The first test case here was the INTENDED use of the template, and was until the parser migration, how the template HAD been used (It was removed from a lot of pages because the changes in whitespace handling, had broken the presentation on those pages.).

@ShakespeareFan00: T134469#6134833 talks about "UL -> DIV -> TABLE" (which lacks a required LI). Then T134469#6135113 suddenly talks about "TABLE in a LI" (no DIV mentioned). I'm afraid that this task is becoming an unfocused ticket for random catch-alls which would not be helpful.

I've removed the followup comment you mention. If you think the UL->DIV->TABLE issue should be moved to a new ticket, so this one doesn't become a "laundry" list. I'm happy to oblige (once I have a clearly idea about what I think might be going on).
~~~~

Put some additional comments (and a test case) re p-wrapping behavior from @ShakespeareFan00 in P11223. To @Aklapper's point, this is already a grab bag point to some degree. The legacy parser will be replaced entirely by Parsoid at some point (T55784: [EPIC] Use Parsoid HTML for all page views) and then we'll have Parsoid's slightly-less-broken version to deal with. In reality, the best way to fix p-wrapping is to remove it entirely in most cases (T253072) if we can get consensus on that.

… In ten years, I'd love for us to be at the point where we don't do <p>-wrapping at all …

4+ years on and I am still wasting hours debugging mystery formatting behaviour and inserting literal <p> tags in wikipages as a workaround because the parser can't grasp that <div> is a perfectly fine container for a text node, and behaves in effect like a per-line regex parser. And every year that passes adds more of these wonky workarounds that are going to make actually fixing this harder (cause more breakage) and require more manual labour to undo once the trigger is finally pulled.

Can we find some way, now, to opt out of the parser being "helpful" on a per-project or preferably more granular level?

Maybe a flag in a template (ala. TemplateStyles) that turns p-wrapping off entirely for the contents of that template? A property for a namespace ditto? A magic word in a page?

A new html-alike wikimarkup syntax (extension tag) along the lines of <nopwrap>…</nopwrap> that just disables p-wrapping inside and otherwise just outputs its input?

A Beta Feature to turn off p-wrapping on save? If we can selectively enable HHVM, surely we can figure out a way to selectively enable Parsoid with bug-compatibility disabled?

In use-case terms, I keep running into this in the context of white-space: pre-wrap: that is, a container inside which I want line breaks to be preserved. These will usually look like…

<div style="white-space: pre-wrap">first line
second line
third line</div>

…because the extra newline at the start and end of the block are non-intuitive for users and otherwise requires ugly hacks with negative top/bottom margins to compensate for. But the p-wrapping ends up producing…

<div style="white-space: pre-wrap">first line
<p>second line
third line</p></div>

…which produces inconsistence between the first line and the rest, and inserts visual gaps in a (seemingly) random way.

So to fool the parser you can use a <p> instead of a <div> as your container, but since not all elements can be contained in a <p> you then get unpredictable results when someone puts other content inside it (this is a template, recall). And white-space: pre-wrap makes every instance of this painfully visible.

p-wrapping needs to be killed with fire and ASAP, is what I'm saying!

And if that relatively small bit is waiting on The Great Unified Hypersuperduper Modern Parser For The Future (2000 Pro Extreme)™, we need to get them uncoupled somehow. Quirky legacy syntax that holds back modern parsers is nothing new on the web: but we lack the standards-mode flag that the wider web is using to enable such transitions.

I bet something like __NO_P_WRAP__ would be fairly easy to support. Would it get enough adoption to get us closer to our goal of turning it off by default?

I bet something like __NO_P_WRAP__ would be fairly easy to support. Would it get enough adoption to get us closer to our goal of turning it off by default?

Probably depends on what one considers to qualify as "enough" and how much "closer".

If we're worrying about hypothetical (but likely, of course) user scripts and styles that might break on enwp articles, then, no, probably not (little incentive to fix). If we're talking about cases where p-wrapping has been specifically identified as a problem… yes, I think so. If you've butted your head against this you'll jump at the chance to escape it, and a lot of these cases will be in templates (as that's mostly where this level of control matters).

But this kind of magic word approach is an escape hatch for those bitten by the old behaviour and not really a great tool for getting wholesale to the new behaviour. For a significant impact on that I think you'd need something like per-page __MODERN_MARKUP__ encompassing much more than just p-wrapping and a Special:LintErrors (or static list) the community could work on.

In use-case terms, I keep running into this in the context of white-space: pre-wrap: that is, a container inside which I want line breaks to be preserved.

Out of interest what does <poem> do? as being able to do

<div class=__poem>

<div>

with the pre-wrap set as template style for __poem means template styles can be used to format, something that's not as easy to do with the current poem tags, which can't as far as I know accept CSS classing. (maybe that's a different ticket.)

If you can move stuff from dedicated tags into templatestyles backed templates, they are much easier for interface admins to maintain, I think?

Please remove unneeded full quotes when replying, to keep things readable. Thanks!