Page MenuHomePhabricator

doBlockLevels() inserts <p> and </p> randomly with no regard for HTML validity
Open, LowPublic

Description

After doBlockLevels() sees an empty line, it starts hunting for a place to insert a start tag. It does this by checking if the line contains various assorted open or close tags (div, blockquote, th, etc.) -- if it gets a match, then it skips the line, and continues on to the next line. If there is no match, then it inserts the <p> at the start of the line.

With a <p> tag already open, the algorithm for </p> insertion is much the same in the opposite direction, i.e. presence of div etc. forces immediate insertion of </p> at the start of the line (regardless of where in the line the actual tag was found).

This is not equivalent to an HTML parsing algorithm. As an example of the mischief this can cause, consider the input:

aaaa

<div></div><span id="1">
bbbb
</span>
<span id="2">
cccc
<div></div>
</span>

This produces the output:

<p>aaaa
</p>
<div></div><span id="1">
<p>bbbb
</span>
<span id="2">
cccc
</p>
<div></div>
<p></span>
</p>

The empty line following "aaaa" sets a flag, and the first empty div suppresses insertion of the <p> on that line, so the "bbbb" line gets the <p>. Then with the second empty div, we trigger insertion of </p> prior. Thus the paragraph straddles the boundary between the first two spans. The final </span> does not suppress anything, so it gets wrapped in its own paragraph.

This is a reduction of the Rachael Flatt test case on Parsing/Replacing Tidy and so blocks T89331, since Depurate handles such broken HTML differently to Tidy.

Related Objects

Mentioned In
T257651: Reply tool messes up replying to users with signatures followed by html comments
T253233: Provide magic word for 'soft' newline insertion from template markup.
T11996: Multiline tags in lists should be output more intelligently
T230683: New syntax for multiline list items / talk page comments
T208901: TemplateStyles breaks a paragraph if a file is inserted inline
T198219: Line based p-wrapping can't match Remex
T68655: Multiline code tag results in strange html
T192909: Dirty diff and other weird corruption in table edit
T187906: Text breaks out of p tag unexpectedly
T187065: <del> tag with multi-paragraph content improperly handled by parser
T176363: New high-priority Linter category for a subset of misnested tags whose behavior will change with a HTML5 parser
T92040: Parsoid generates invalid HTML5 when a list is nested inside a <small> tag
T161306: Investigate P-wrapping oddity that introduces long horizontal no-wrap lines on many navboxes on shwiki
Mentioned Here
T253233: Provide magic word for 'soft' newline insertion from template markup.
P11223 More P-wrapping comments.
T55784: [EPIC] Use Parsoid HTML for all page views
T253072: [EPIC] Remove most p-wrapping from mediawiki
T230683: New syntax for multiline list items / talk page comments
T227953: Annoyance: Parser mis-"tidies" end of embedded <span> inside a <div> unless it has additional EOL context, Linter reports as a "Missing end tag","Stripped tag" error pair.
T11207: Single newlines sometimes create paragraphs
T123023: Newlines in <div> tags cause <p> tags to wrap wrong content
T114072: <section> tags for MediaWiki sections
T110004: DOM Pass for wrapping bare text found in <body> and other "block" (in html4-parlance) nodes like <blockquote>, <td>, <th>.
T89331: Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptMay 5 2016, 6:42 AM
tstarling updated the task description. (Show Details)May 5 2016, 9:23 AM
cscott added a comment.May 5 2016, 5:00 PM

@subbu points out that there's a related Parsoid bug (T110004: DOM Pass for wrapping bare text found in <body> and other "block" (in html4-parlance) nodes like <blockquote>, <td>, <th>.) since Parsoid, with accurate knowledge of the in-progress HTML5 tree structure, doesn't faithfully emulate what the PHP's inaccurate view of the tree structure leads it to do.

May be worth mentioning that I've got an HTML5 tree builder written in PHP in my back pocket. It might be useful either for identifying the places where PHP is "doing it wrong" (and thus adding them to a tracking category for fixup) and/or implementing "doing it right".

BTW, this DOM pass I mention in T110004 is now part of HTML5depurate. That DOM pass doesn't change the behavior of the doBlockLevels p-wrapping or Parsoid's faithful emulation of the same.

However, a DOM pass in Parsoid & HTML5Depurate might be a better solution compared to the hacky p-wrapping solution in doBlockLevels. But, that is a different ball of wax.

Not sure how easy it is to fix this bug without DOM knowledge .. you would need to construct some kind of DOM structure information to deal with this.

To some degree these kinds of limitations are inherent in the design of the PHP 'parser'. Fixing them all properly would amount to rewriting Parsoid in PHP, which would be a very expensive endeavor.

What is your current thinking on the longer term parser roadmap? Do you see us consolidating on a single parser at some point?

tstarling triaged this task as Medium priority.May 6 2016, 2:10 AM

The intention of the current code is superficially similar to the tidy p-wrapping algorithm: to wrap text nodes and inline elements which are children of the root node. After supposedly detecting entry into a block element, $inBlockElem is set, suppressing emission of <p> tags until exit from the block element is detected and $inBlockElem is cleared.

However, the definition of a "block element" is much narrower than the HTML 4 definition. Significantly it does not include <div>, so paragraph wrapping continues inside a div element. It includes only table, tr, td, th, h1-h6, pre, p, ul, ol, dl and li.

The inclusion of li etc. is strange since you could imagine wanting to have multi-paragraph list items:

<ul><li>p1

p2
</li></ul>

Produces the evidently broken output:

<ul><li>p1p2</li></ul>

Parsoid at least leaves the line breaks in. The user might reasonably expect <ul><li><p>p1</p><p>p2</p></li></ul> , which is valid HTML 4.

Similarly you might want to have multi-paragraph table rows. This explains the on-wiki use of <p> in input text. I can't find any open bug for this.

My next job will be to analyse Parsoid's ParagraphWrapper.js, maybe we can take some ideas from it. For the test case in the description it gives:

<p>aaaa</p>

<div></div><p><span id="1"></span></p>
<p>bbbb

<span id="2">
cccc</span></p>
<div></div>
<p></p>

Which is valid HTML, at least, but the early close of the span tag does not correspond to the user's intention and would potentially cause visible changes compared to MW+tidy.

To some degree these kinds of limitations are inherent in the design of the PHP 'parser'. Fixing them all properly would amount to rewriting Parsoid in PHP, which would be a very expensive endeavor.

I don't think that's true at all. I think if we can produce a token stream transform specification of doBlockLevels(), then we can implement it in MW without much difficulty. DBL runs so late that there's not much interaction with other main-pass syntax to consider.

My next job will be to analyse Parsoid's ParagraphWrapper.js, maybe we can take some ideas from it.

I very painstakingly tried to mimic the combined effect of PHP parser + Tidy on this .... but, there are still edge cases where Parsoid's output differs because we use the HTML5 parser to parse the tag soup. So, there isn't a whole lot you will learn new from analyzing Parsoid's p-wrapper, except maybe that it is based on a token stream and a state machine of sorts instead of being regexp based ... but maybe that is what you meant. The current code also predates some bugs (long since fixed) in the token transformers that this worked around. I have always wanted to rewrite it in the PreWrapper's FSM style, but it never rose high enough in priority.

Suppose doBlockLevels does the following:

  • At the root or in a blockquote, a <p> is automatically opened
  • Inside all other elements, the contents is buffered until the first empty line is encountered. Then the element contents up to that point is wrapped in <p></p>, and a second <p> is opened. If a <p> was already open in an ancestor nesting level, it is closed in the ancestor buffer.
  • On encountering the end of an element, if a <p> is open at the same stack level, close it.
  • When a p is open, encountering a block-level element causes the <p> to be closed, and then after the block-level element is closed, another <p> is automatically opened.

Also, unrelated to p-wrapping, I think we should close HTML tags opened in wikitext list items at the end of the relevant list item, like we do with bold/italic in doAllQuotes.

This algorithm is in terms of a balanced HTML token stream, and there is the question of what to do when the input is not balanced. We could preserve the unbalanced stream for Tidy to clean up. To maintain the stack state, when an end tag is found that doesn't match the current start tag, we could just look up the stack for an ancestor to close. Even a fairly crude attempt like this would be much better than what we have now.

Another possibility is to try something equivalent to HTML 5 parsing, keeping in mind that we are not reserializing. We are just keeping an HTML-like stack state while modifying the event stream.

Another possibility is to try something equivalent to HTML 5 parsing, keeping in mind that we are not reserializing. We are just keeping an HTML-like stack state while modifying the event stream.

Other features like proper section wrapping (see T114072) will benefit from a DOM as well, so +1 for moving towards HTML5 parsing. But, as you know, that is basically what Parsoid is already doing.

ssastry added a comment.EditedMay 10 2016, 5:14 PM

Here are my thoughts on several different pieces at once.

Once we replace Tidy with HTML5depurate, it opens up the possibility of moving some things to DOM passes in HTML5depurate (as I indicated in T134469#2267605). The question that comes up is the usual one -- what to do with 3rd party installs which continues to be an unresolved question for mediawiki the software package ... i.e the same qn. that dogs Parsoid will dog this as well. So, given that, the best we can do is some hacky simulation of tag stacks. I don't have a preference / objection to this approach. Sometimes that is the best that can be done. We even have a couple of these in Parsoid as well for handling parse scenarios before we get to the DOM so that we can mimic core parser behavior.

Given that we will always have some edge cases in p-wrapping behavior (whether done as regexps on a string, or done based on stack emulations on a token stream), I personally don't think this task should be a blocker for replacing Tidy as long as (1) we can determine that a small subset of page renderings will break (2) we can identify those and (3) make fixes to the affected templates / page wikitext.

@GWicke, as for your earlier qn. about consolidating around a single parser, I think no one will disagree that it is desirable. That is a strawman goal in my mind. The sticking points have been around how to get there given (a) the corpus of wikitext, templates, gadgets, tools, html scrapers, extensions based on php-parser-lifecycle-hooks that are tied to current toolchain and HTML output (b) the 3rd party wikis that exist out there and absence of clarity around what kind of support is desired and what can be supported reasonably, and who has to support them.

All that said, given long-term maintenance considerations, ease of developing newer features and functionality, it seems clear to me that the direction to migrate towards is more DOM-based structural semantics for wikitext and less string-based semantics.

Given that, the strategy that we have taken is the following:

  • gradually migrate the output of PHP parser and Parsoid closer together .. this happens in both directions. Fix Parsoid's output / add hacks to Parsoid in some cases. Fix PHP parser output to move closer to Parsoid's interpretation in other cases
  • gradually address technical debt in the corpus by deprecating broken behavior, fixing content via bots, editors, template authors .. whatever works best.
  • gradually shift behavior of wikitext and templates to be more determistic and structural (DOM_based)

We've come to this strategy based on what is doable and what is less likely to run into serious troubled waters and allowing all the wider set of dependent tools, etc. to migrate. Given how involved even Parsoid's data-mw split has been involving identifying processes, protocols, negotation about timing, this is understandable. Tidy replacement with a HTML5 parser is going to be one big accomplishment. But, as we've seen there are lots and lots of existing behavior that has to be accounted for and migrated / broken in careful ways. So, I don't see a strategy (given requirements earlier) that allows for a sudden consolidation around Parsoid or PHP parser (since that will have to implement all of Parsoid's functionality while being highly performant).

@ssastry, I'm mainly wondering

a) how far down the road towards maintaining two DOM-based parsers you are willing to walk, and
b) whether you still aim to make Parsoid a full replacement for the PHP parser by adding missing features like link updates and some hook support.

To me, the worst possible outcome appears to be paying for maintainance of two almost-identical parsers for a long time, but only having bi-directional conversion support in one of the two. I am a bit worried that we are currently drifting in this direction, but would be happy to learn that this is not the case, and you are actually aiming for one of the two main alternatives (PHP parser to replace Parsoid, or Parsoid to replace PHP parser).

To repeat, the concerns are not about having one parser on the WMF cluster serving both read and edit views, but about how to get to that goal.

The place we are moving towards is bringing the output of the two parsers closer together. That will enable Parsoid to replace the PHP parser on the WMF cluster without having to break a ton of stuff all at once. For example, all the stuff that breaks with replacing Tidy with HTML5Depurate will also break with Parsoid. By replacing Tidy with a full HTML5 parser, we get closer to being able to use Parsoid output for read views as well. So, yes the plan for the WMF cluster is still to have Parsoid be the default engine for read and edit views.

@ssastry, thanks for the clarification.

Do you plan to implement new DOM-based logic like section wrapping in both Parsoid and PHP?

Can you take the broader discussion somewhere else please? Maybe file a separate task for it?

The changes to the core parser will require equivalent changes (which will hopefully be a cleanup from the current ugliness in paragraph-wrapper) in Parsoid. We should co-ordinate these changes a bit. Stuff like https://github.com/wikimedia/parsoid/blob/2f68a9cb504a332ea2dec3ea948ebfcffcc4d141/lib/config/WikitextConstants.js#L123-L158 will likely change / be removed.

cscott added a comment.Jun 8 2016, 9:26 PM

Given that we will always have some edge cases in p-wrapping behavior (whether done as regexps on a string, or done based on stack emulations on a token stream), I personally don't think this task should be a blocker for replacing Tidy as long as (1) we can determine that a small subset of page renderings will break (2) we can identify those and (3) make fixes to the affected templates / page wikitext.

Since <p>-wrapping is not actually necessary for HTML5 validity any more -- it's just a matter of "legacy" compatibility with current stylesheets/gadgets/etc -- I'd like to see if we can gradually deprecate it. In ten years, I'd love for us to be at the point where we don't do <p>-wrapping at all, it would simplify a lot of things. In the near term, I agree that we can not stress about the edge cases too much, and then hopefully gradually turn off <p>-wrapping in some of the stranger edge cases, with appropriate notice/community involvement/etc.

Another weirdness that I discovered caused by this behavior:

<span style="color:red">
*a
*b
</span>

generates

<p><span style="color:red"></p>
<ul>
<li>a</li>
<li>b</li>
</ul>
<p></span></p>

Tidy takes and for some unknown reason converts that to:

<ul>
<li><span style="color:red">a</span></li>
<li><span style="color:red">b</span></li>
</ul>

whereas Parsoid and RemexHTML convert that to:

<p><span style="color:red"></span></p>
<ul>
<li>a</li>
<li>b</li>
</ul>
<p></p>

So, it looks like doBlockLevels weirdness combines with Tidy weirdness to generate the intended result. But, there is no such cancellation effect with Parsoid and Remex. So, either the wikitext would have to be fixed or doBlockLevels would have to be fixed.

This can be fixed in Parsoid by changing the p-wrapping algorithm and/or doing it on the DOM.

ssastry added a comment.EditedApr 13 2017, 12:31 PM

So, I think this bug is restricted to <span>.

Compare https://www.mediawiki.org/api/rest_v1/page/html/User:SSastry_(WMF)%2FSandbox/2444123 (Parsoid) and https://www.mediawiki.org/w/index.php?title=User:SSastry_(WMF)/Sandbox&oldid=2444123&action=parsermigration-edit (Tidy vs RemexHTML). So, both RemexHTML and Parsoid differ from Tidy only in the <span> tag case.

I think this is because we did add p-wrapping compatibility to RemexHTML. And, Parsoid also has a DOM pass for similar compatibility. *BUT*, I think both of these use "isFormattingTag(..)" as a check. However, <span> is not a formatting tag in the HTML5 spec. See the comment in isSplittableTag in https://gerrit.wikimedia.org/r/#/c/328021/4/pwrap.js which is the algorithm that RemexHTML effectively implements. So, this is not as bad as we thought. If we wanted to fix this, we could add explicit checks for <span> (and any other tag affected by this) to Parsoid and RemexHTML OR we could break Tidy compatibility here and have wikitext be fixed up.

So, I think this bug is restricted to <span>.

Compare https://www.mediawiki.org/api/rest_v1/page/html/User:SSastry_(WMF)%2FSandbox/2444123 (Tidy vs RemexHTML) and https://www.mediawiki.org/w/index.php?title=User:SSastry_(WMF)/Sandbox&oldid=2444123&action=parsermigration-edit (Parsoid). So, both RemexHTML and Parsoid differ from Tidy only in the <span> tag case.

I think this is because we did add p-wrapping compatibility to RemexHTML. And, Parsoid also has a DOM pass for similar compatibility. *BUT*, I think both of these use "isFormattingTag(..)" as a check. However, <span> is not a formatting tag in the HTML5 spec. See the comment in isSplittableTag in https://gerrit.wikimedia.org/r/#/c/328021/4/pwrap.js which is the algorithm that RemexHTML effectively implements. So, this is not as bad as we thought. If we wanted to fix this, we could add explicit checks for <span> (and any other tag affected by this) to Parsoid and RemexHTML OR we could break Tidy compatibility here and have wikitext be fixed up.

So, at least for Parsoid, this is not because of anything that Parsoid does special ... but the result of domino's standard html5 tree building algorithm. So, if we want to fix this in Parsoid, we need to add special-case handling.

Nirmos added a subscriber: Nirmos.Jun 28 2017, 6:31 AM

Is this the same as T123023?

Is this the same as T11207?

ssastry lowered the priority of this task from Medium to Low.Apr 11 2018, 8:23 PM
ssastry added subscribers: Tgr, Ltrlg, matmarex and 2 others.

We won't fix this in the PHP parser since it requires DOM context which it doesn't have. In any case, since the goal is to make Parsoid the wikitext parser for mediawiki, and Parsoid can handle this better, there isn't a reason to hack this in the PHP parser at this point. But, leaving this open if someone feels an itch and wants to tackle it. But, I'll lower the priority on this.

This comment was removed by ssastry.
Izno added a subscriber: Izno.Jul 14 2019, 1:39 AM

Unfortunately (?) Parsoid is now bug-compatible with PHP in this regard (example from T227953):

$ (echo '{{left margin|5em|{{smaller|' ; echo 'Foo<br/>' ; echo 'Bar<br/>' ; echo 'Bat}}}}') | bin/parse.js --domain en.wikisource.org --usePHPPreProcessor=false
[...]
<div style="margin-left:5em;" about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"stx":"html","a":{"style":"margin-left:5em;"},"sa":{"style":"margin-left:{{{1}}};"},"dsr":[0,54,null,null],"pi":[[{"k":"1"},{"k":"2"}]]}' data-mw='{"parts":[{"template":{"target":{"wt":"left margin","href":"./Template:Left_margin"},"params":{"1":{"wt":"5em"},"2":{"wt":"{{smaller|\nFoo&lt;br/>\nBar&lt;br/>\nBat}}"}},"i":0}}]}'>
<p><span style="font-size: 83%;">
Foo<br/>
Bar<br/></span></p>
Bat</div><link rel="mw:PageProp/Category" href="./Category:Pages_using_left-margin_with_text" about="#mwt1" data-parsoid='{"stx":"simple","a":{"href":"./Category:Pages_using_left-margin_with_text"},"sa":{"href":"Category:Pages using left-margin with text"}}'/>
</body></html>

We also migrate the </p> before Bat. Alas.

Unfortunately (?) Parsoid is now bug-compatible with PHP in this regard (example from T227953):

Absolutely, we had to. We worked hard to get there. :-) But, once we switch over to Parsoid, we can lint pages and undo.

Is there a bug or FIXME in Parsoid which indicates where this crazy behavior is implemented aka where to emit lint messages and then eventually where to remove the back-compat code?

All in ParagraphWrapper token transformer code.

Some discussion in T230683#5793251 and T230683#5798972, although that has more to do with general doBlockLevels brokenness and not necessarily <p> tag insertion in doBlockLevels. Eg, compare:

::<dl><dd>foo
bar</dd></dl>

with

**<dl><dd>foo
bar</dd></dl>

The results vary because doBlockLevels has no idea where it is in the list structure when it sees the newline after foo -- so it outputs some random tags, and then Remex has to clean the mess up the best it can.

In response to csott-

See https://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium/Help&oldid=9909322#Why_does_templating_something_generate_an_error? for another example of possible doBlockLevels weirdness. ( the template concerned being: https://en.wikisource.org/wiki/Template:Left). Using the template inside a definition list generated a lint-error. The level generation glitch here seems to be dependent on a line feed before the closing block level tag (in this instance a DIV). When I removed that line-feed https://en.wikisource.org/wiki/Template:Left-shift it wrapped nicely.(No LintErrors).

Qgil removed a subscriber: Qgil.Feb 10 2020, 1:23 PM

Another example https://species.wikimedia.org/w/index.php?title=Cattleya_aclandiae&oldid=7533297

here the transition is UL -> DIV -> TABLE. Here the DIV generated by the Commons template is unexpectedly terminated meaning the box around the commons category mis-renders. (The BOX around the DIV should entirely enclose the content.) I can't find any obvious stray linespace in the template markup.

Izno added a comment.May 13 2020, 9:10 PM

Another example https://species.wikimedia.org/w/index.php?title=Cattleya_aclandiae&oldid=7533297

here the transition is UL -> DIV -> TABLE. Here the DIV generated by the Commons template is unexpectedly terminated meaning the box around the commons category mis-renders. (The BOX around the DIV should entirely enclose the content.) I can't find any obvious stray linespace in the template markup.

Separate issue. <table> cannot be placed inside <ul>/<li>.

This comment was removed by ShakespeareFan00.
This comment was removed by ShakespeareFan00.
ShakespeareFan00 added a comment.EditedMay 18 2020, 8:52 AM

https://en.wikisource.org/wiki/Template:Cl-act-p/testcases#Testing_start_and_end_version_for_higher_level.

Here the only additional whitespace is the line feeds between the template starts, the nominal text and the template end call as far as I can determine.

Given the normal behaviour of mediawiki when not calling templates, it would not be unreasonable to expect the two examples to behave identically (collapsing line feeds), as they should be reasonably expected to generate identical code ( if it were not for the white space handling).

The first test case here was the INTENDED use of the template, and was until the parser migration, how the template HAD been used (It was removed from a lot of pages because the changes in whitespace handling, had broken the presentation on those pages.).

@ShakespeareFan00: T134469#6134833 talks about "UL -> DIV -> TABLE" (which lacks a required LI). Then T134469#6135113 suddenly talks about "TABLE in a LI" (no DIV mentioned). I'm afraid that this task is becoming an unfocused ticket for random catch-alls which would not be helpful.

This comment was removed by ShakespeareFan00.

I've removed the followup comment you mention. If you think the UL->DIV->TABLE issue should be moved to a new ticket, so this one doesn't become a "laundry" list. I'm happy to oblige (once I have a clearly idea about what I think might be going on).
~~~~

Put some additional comments (and a test case) re p-wrapping behavior from @ShakespeareFan00 in P11223. To @Aklapper's point, this is already a grab bag point to some degree. The legacy parser will be replaced entirely by Parsoid at some point (T55784: [EPIC] Use Parsoid HTML for all page views) and then we'll have Parsoid's slightly-less-broken version to deal with. In reality, the best way to fix p-wrapping is to remove it entirely in most cases (T253072) if we can get consensus on that.

This comment was removed by ShakespeareFan00.