Page MenuHomePhabricator

doBlockLevels() inserts <p> and </p> randomly with no regard for HTML validity
Open, LowPublic

Description

After doBlockLevels() sees an empty line, it starts hunting for a place to insert a start tag. It does this by checking if the line contains various assorted open or close tags (div, blockquote, th, etc.) -- if it gets a match, then it skips the line, and continues on to the next line. If there is no match, then it inserts the <p> at the start of the line.

With a <p> tag already open, the algorithm for </p> insertion is much the same in the opposite direction, i.e. presence of div etc. forces immediate insertion of </p> at the start of the line (regardless of where in the line the actual tag was found).

This is not equivalent to an HTML parsing algorithm. As an example of the mischief this can cause, consider the input:

aaaa

<div></div><span id="1">
bbbb
</span>
<span id="2">
cccc
<div></div>
</span>

This produces the output:

<p>aaaa
</p>
<div></div><span id="1">
<p>bbbb
</span>
<span id="2">
cccc
</p>
<div></div>
<p></span>
</p>

The empty line following "aaaa" sets a flag, and the first empty div suppresses insertion of the <p> on that line, so the "bbbb" line gets the <p>. Then with the second empty div, we trigger insertion of </p> prior. Thus the paragraph straddles the boundary between the first two spans. The final </span> does not suppress anything, so it gets wrapped in its own paragraph.

This is a reduction of the Rachael Flatt test case on Parsing/Replacing Tidy and so blocks T89331, since Depurate handles such broken HTML differently to Tidy.

Related Objects

Event Timeline

tstarling created this task.May 5 2016, 6:42 AM
Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptMay 5 2016, 6:42 AM
tstarling updated the task description. (Show Details)May 5 2016, 9:23 AM
cscott added a comment.May 5 2016, 5:00 PM

@subbu points out that there's a related Parsoid bug (T110004: DOM Pass for wrapping bare text found in <body> and other "block" (in html4-parlance) nodes like <blockquote>, <td>, <th>.) since Parsoid, with accurate knowledge of the in-progress HTML5 tree structure, doesn't faithfully emulate what the PHP's inaccurate view of the tree structure leads it to do.

May be worth mentioning that I've got an HTML5 tree builder written in PHP in my back pocket. It might be useful either for identifying the places where PHP is "doing it wrong" (and thus adding them to a tracking category for fixup) and/or implementing "doing it right".

BTW, this DOM pass I mention in T110004 is now part of HTML5depurate. That DOM pass doesn't change the behavior of the doBlockLevels p-wrapping or Parsoid's faithful emulation of the same.

However, a DOM pass in Parsoid & HTML5Depurate might be a better solution compared to the hacky p-wrapping solution in doBlockLevels. But, that is a different ball of wax.

Not sure how easy it is to fix this bug without DOM knowledge .. you would need to construct some kind of DOM structure information to deal with this.

To some degree these kinds of limitations are inherent in the design of the PHP 'parser'. Fixing them all properly would amount to rewriting Parsoid in PHP, which would be a very expensive endeavor.

What is your current thinking on the longer term parser roadmap? Do you see us consolidating on a single parser at some point?

tstarling triaged this task as Normal priority.May 6 2016, 2:10 AM

The intention of the current code is superficially similar to the tidy p-wrapping algorithm: to wrap text nodes and inline elements which are children of the root node. After supposedly detecting entry into a block element, $inBlockElem is set, suppressing emission of <p> tags until exit from the block element is detected and $inBlockElem is cleared.

However, the definition of a "block element" is much narrower than the HTML 4 definition. Significantly it does not include <div>, so paragraph wrapping continues inside a div element. It includes only table, tr, td, th, h1-h6, pre, p, ul, ol, dl and li.

The inclusion of li etc. is strange since you could imagine wanting to have multi-paragraph list items:

<ul><li>p1

p2
</li></ul>

Produces the evidently broken output:

<ul><li>p1p2</li></ul>

Parsoid at least leaves the line breaks in. The user might reasonably expect <ul><li><p>p1</p><p>p2</p></li></ul> , which is valid HTML 4.

Similarly you might want to have multi-paragraph table rows. This explains the on-wiki use of <p> in input text. I can't find any open bug for this.

My next job will be to analyse Parsoid's ParagraphWrapper.js, maybe we can take some ideas from it. For the test case in the description it gives:

<p>aaaa</p>

<div></div><p><span id="1"></span></p>
<p>bbbb

<span id="2">
cccc</span></p>
<div></div>
<p></p>

Which is valid HTML, at least, but the early close of the span tag does not correspond to the user's intention and would potentially cause visible changes compared to MW+tidy.

To some degree these kinds of limitations are inherent in the design of the PHP 'parser'. Fixing them all properly would amount to rewriting Parsoid in PHP, which would be a very expensive endeavor.

I don't think that's true at all. I think if we can produce a token stream transform specification of doBlockLevels(), then we can implement it in MW without much difficulty. DBL runs so late that there's not much interaction with other main-pass syntax to consider.

My next job will be to analyse Parsoid's ParagraphWrapper.js, maybe we can take some ideas from it.

I very painstakingly tried to mimic the combined effect of PHP parser + Tidy on this .... but, there are still edge cases where Parsoid's output differs because we use the HTML5 parser to parse the tag soup. So, there isn't a whole lot you will learn new from analyzing Parsoid's p-wrapper, except maybe that it is based on a token stream and a state machine of sorts instead of being regexp based ... but maybe that is what you meant. The current code also predates some bugs (long since fixed) in the token transformers that this worked around. I have always wanted to rewrite it in the PreWrapper's FSM style, but it never rose high enough in priority.

Suppose doBlockLevels does the following:

  • At the root or in a blockquote, a <p> is automatically opened
  • Inside all other elements, the contents is buffered until the first empty line is encountered. Then the element contents up to that point is wrapped in <p></p>, and a second <p> is opened. If a <p> was already open in an ancestor nesting level, it is closed in the ancestor buffer.
  • On encountering the end of an element, if a <p> is open at the same stack level, close it.
  • When a p is open, encountering a block-level element causes the <p> to be closed, and then after the block-level element is closed, another <p> is automatically opened.

Also, unrelated to p-wrapping, I think we should close HTML tags opened in wikitext list items at the end of the relevant list item, like we do with bold/italic in doAllQuotes.

This algorithm is in terms of a balanced HTML token stream, and there is the question of what to do when the input is not balanced. We could preserve the unbalanced stream for Tidy to clean up. To maintain the stack state, when an end tag is found that doesn't match the current start tag, we could just look up the stack for an ancestor to close. Even a fairly crude attempt like this would be much better than what we have now.

Another possibility is to try something equivalent to HTML 5 parsing, keeping in mind that we are not reserializing. We are just keeping an HTML-like stack state while modifying the event stream.

Another possibility is to try something equivalent to HTML 5 parsing, keeping in mind that we are not reserializing. We are just keeping an HTML-like stack state while modifying the event stream.

Other features like proper section wrapping (see T114072) will benefit from a DOM as well, so +1 for moving towards HTML5 parsing. But, as you know, that is basically what Parsoid is already doing.

ssastry added a comment.EditedMay 10 2016, 5:14 PM

Here are my thoughts on several different pieces at once.

Once we replace Tidy with HTML5depurate, it opens up the possibility of moving some things to DOM passes in HTML5depurate (as I indicated in T134469#2267605). The question that comes up is the usual one -- what to do with 3rd party installs which continues to be an unresolved question for mediawiki the software package ... i.e the same qn. that dogs Parsoid will dog this as well. So, given that, the best we can do is some hacky simulation of tag stacks. I don't have a preference / objection to this approach. Sometimes that is the best that can be done. We even have a couple of these in Parsoid as well for handling parse scenarios before we get to the DOM so that we can mimic core parser behavior.

Given that we will always have some edge cases in p-wrapping behavior (whether done as regexps on a string, or done based on stack emulations on a token stream), I personally don't think this task should be a blocker for replacing Tidy as long as (1) we can determine that a small subset of page renderings will break (2) we can identify those and (3) make fixes to the affected templates / page wikitext.

@GWicke, as for your earlier qn. about consolidating around a single parser, I think no one will disagree that it is desirable. That is a strawman goal in my mind. The sticking points have been around how to get there given (a) the corpus of wikitext, templates, gadgets, tools, html scrapers, extensions based on php-parser-lifecycle-hooks that are tied to current toolchain and HTML output (b) the 3rd party wikis that exist out there and absence of clarity around what kind of support is desired and what can be supported reasonably, and who has to support them.

All that said, given long-term maintenance considerations, ease of developing newer features and functionality, it seems clear to me that the direction to migrate towards is more DOM-based structural semantics for wikitext and less string-based semantics.

Given that, the strategy that we have taken is the following:

  • gradually migrate the output of PHP parser and Parsoid closer together .. this happens in both directions. Fix Parsoid's output / add hacks to Parsoid in some cases. Fix PHP parser output to move closer to Parsoid's interpretation in other cases
  • gradually address technical debt in the corpus by deprecating broken behavior, fixing content via bots, editors, template authors .. whatever works best.
  • gradually shift behavior of wikitext and templates to be more determistic and structural (DOM_based)

We've come to this strategy based on what is doable and what is less likely to run into serious troubled waters and allowing all the wider set of dependent tools, etc. to migrate. Given how involved even Parsoid's data-mw split has been involving identifying processes, protocols, negotation about timing, this is understandable. Tidy replacement with a HTML5 parser is going to be one big accomplishment. But, as we've seen there are lots and lots of existing behavior that has to be accounted for and migrated / broken in careful ways. So, I don't see a strategy (given requirements earlier) that allows for a sudden consolidation around Parsoid or PHP parser (since that will have to implement all of Parsoid's functionality while being highly performant).

@ssastry, I'm mainly wondering

a) how far down the road towards maintaining two DOM-based parsers you are willing to walk, and
b) whether you still aim to make Parsoid a full replacement for the PHP parser by adding missing features like link updates and some hook support.

To me, the worst possible outcome appears to be paying for maintainance of two almost-identical parsers for a long time, but only having bi-directional conversion support in one of the two. I am a bit worried that we are currently drifting in this direction, but would be happy to learn that this is not the case, and you are actually aiming for one of the two main alternatives (PHP parser to replace Parsoid, or Parsoid to replace PHP parser).

To repeat, the concerns are not about having one parser on the WMF cluster serving both read and edit views, but about how to get to that goal.

The place we are moving towards is bringing the output of the two parsers closer together. That will enable Parsoid to replace the PHP parser on the WMF cluster without having to break a ton of stuff all at once. For example, all the stuff that breaks with replacing Tidy with HTML5Depurate will also break with Parsoid. By replacing Tidy with a full HTML5 parser, we get closer to being able to use Parsoid output for read views as well. So, yes the plan for the WMF cluster is still to have Parsoid be the default engine for read and edit views.

@ssastry, thanks for the clarification.

Do you plan to implement new DOM-based logic like section wrapping in both Parsoid and PHP?

Can you take the broader discussion somewhere else please? Maybe file a separate task for it?

The changes to the core parser will require equivalent changes (which will hopefully be a cleanup from the current ugliness in paragraph-wrapper) in Parsoid. We should co-ordinate these changes a bit. Stuff like https://github.com/wikimedia/parsoid/blob/2f68a9cb504a332ea2dec3ea948ebfcffcc4d141/lib/config/WikitextConstants.js#L123-L158 will likely change / be removed.

cscott added a comment.Jun 8 2016, 9:26 PM

Given that we will always have some edge cases in p-wrapping behavior (whether done as regexps on a string, or done based on stack emulations on a token stream), I personally don't think this task should be a blocker for replacing Tidy as long as (1) we can determine that a small subset of page renderings will break (2) we can identify those and (3) make fixes to the affected templates / page wikitext.

Since <p>-wrapping is not actually necessary for HTML5 validity any more -- it's just a matter of "legacy" compatibility with current stylesheets/gadgets/etc -- I'd like to see if we can gradually deprecate it. In ten years, I'd love for us to be at the point where we don't do <p>-wrapping at all, it would simplify a lot of things. In the near term, I agree that we can not stress about the edge cases too much, and then hopefully gradually turn off <p>-wrapping in some of the stranger edge cases, with appropriate notice/community involvement/etc.

Another weirdness that I discovered caused by this behavior:

<span style="color:red">
*a
*b
</span>

generates

<p><span style="color:red"></p>
<ul>
<li>a</li>
<li>b</li>
</ul>
<p></span></p>

Tidy takes and for some unknown reason converts that to:

<ul>
<li><span style="color:red">a</span></li>
<li><span style="color:red">b</span></li>
</ul>

whereas Parsoid and RemexHTML convert that to:

<p><span style="color:red"></span></p>
<ul>
<li>a</li>
<li>b</li>
</ul>
<p></p>

So, it looks like doBlockLevels weirdness combines with Tidy weirdness to generate the intended result. But, there is no such cancellation effect with Parsoid and Remex. So, either the wikitext would have to be fixed or doBlockLevels would have to be fixed.

This can be fixed in Parsoid by changing the p-wrapping algorithm and/or doing it on the DOM.

ssastry added a comment.EditedApr 13 2017, 12:31 PM

So, I think this bug is restricted to <span>.

Compare https://www.mediawiki.org/api/rest_v1/page/html/User:SSastry_(WMF)%2FSandbox/2444123 (Parsoid) and https://www.mediawiki.org/w/index.php?title=User:SSastry_(WMF)/Sandbox&oldid=2444123&action=parsermigration-edit (Tidy vs RemexHTML). So, both RemexHTML and Parsoid differ from Tidy only in the <span> tag case.

I think this is because we did add p-wrapping compatibility to RemexHTML. And, Parsoid also has a DOM pass for similar compatibility. *BUT*, I think both of these use "isFormattingTag(..)" as a check. However, <span> is not a formatting tag in the HTML5 spec. See the comment in isSplittableTag in https://gerrit.wikimedia.org/r/#/c/328021/4/pwrap.js which is the algorithm that RemexHTML effectively implements. So, this is not as bad as we thought. If we wanted to fix this, we could add explicit checks for <span> (and any other tag affected by this) to Parsoid and RemexHTML OR we could break Tidy compatibility here and have wikitext be fixed up.

So, I think this bug is restricted to <span>.
Compare https://www.mediawiki.org/api/rest_v1/page/html/User:SSastry_(WMF)%2FSandbox/2444123 (Tidy vs RemexHTML) and https://www.mediawiki.org/w/index.php?title=User:SSastry_(WMF)/Sandbox&oldid=2444123&action=parsermigration-edit (Parsoid). So, both RemexHTML and Parsoid differ from Tidy only in the <span> tag case.
I think this is because we did add p-wrapping compatibility to RemexHTML. And, Parsoid also has a DOM pass for similar compatibility. *BUT*, I think both of these use "isFormattingTag(..)" as a check. However, <span> is not a formatting tag in the HTML5 spec. See the comment in isSplittableTag in https://gerrit.wikimedia.org/r/#/c/328021/4/pwrap.js which is the algorithm that RemexHTML effectively implements. So, this is not as bad as we thought. If we wanted to fix this, we could add explicit checks for <span> (and any other tag affected by this) to Parsoid and RemexHTML OR we could break Tidy compatibility here and have wikitext be fixed up.

So, at least for Parsoid, this is not because of anything that Parsoid does special ... but the result of domino's standard html5 tree building algorithm. So, if we want to fix this in Parsoid, we need to add special-case handling.

Nirmos added a subscriber: Nirmos.Jun 28 2017, 6:31 AM

Is this the same as T123023?

Is this the same as T11207?

ssastry lowered the priority of this task from Normal to Low.Apr 11 2018, 8:23 PM
ssastry added subscribers: Tgr, Ltrlg, matmarex and 2 others.

We won't fix this in the PHP parser since it requires DOM context which it doesn't have. In any case, since the goal is to make Parsoid the wikitext parser for mediawiki, and Parsoid can handle this better, there isn't a reason to hack this in the PHP parser at this point. But, leaving this open if someone feels an itch and wants to tackle it. But, I'll lower the priority on this.

This comment was removed by ssastry.
Izno added a subscriber: Izno.Jul 14 2019, 1:39 AM

Unfortunately (?) Parsoid is now bug-compatible with PHP in this regard (example from T227953):

$ (echo '{{left margin|5em|{{smaller|' ; echo 'Foo<br/>' ; echo 'Bar<br/>' ; echo 'Bat}}}}') | bin/parse.js --domain en.wikisource.org --usePHPPreProcessor=false
[...]
<div style="margin-left:5em;" about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"stx":"html","a":{"style":"margin-left:5em;"},"sa":{"style":"margin-left:{{{1}}};"},"dsr":[0,54,null,null],"pi":[[{"k":"1"},{"k":"2"}]]}' data-mw='{"parts":[{"template":{"target":{"wt":"left margin","href":"./Template:Left_margin"},"params":{"1":{"wt":"5em"},"2":{"wt":"{{smaller|\nFoo&lt;br/>\nBar&lt;br/>\nBat}}"}},"i":0}}]}'>
<p><span style="font-size: 83%;">
Foo<br/>
Bar<br/></span></p>
Bat</div><link rel="mw:PageProp/Category" href="./Category:Pages_using_left-margin_with_text" about="#mwt1" data-parsoid='{"stx":"simple","a":{"href":"./Category:Pages_using_left-margin_with_text"},"sa":{"href":"Category:Pages using left-margin with text"}}'/>
</body></html>

We also migrate the </p> before Bat. Alas.

Unfortunately (?) Parsoid is now bug-compatible with PHP in this regard (example from T227953):

Absolutely, we had to. We worked hard to get there. :-) But, once we switch over to Parsoid, we can lint pages and undo.

Is there a bug or FIXME in Parsoid which indicates where this crazy behavior is implemented aka where to emit lint messages and then eventually where to remove the back-compat code?

All in ParagraphWrapper token transformer code.