Replace Tidy in MW parser with HTML 5 parse/reserialize
Open, NormalPublic

Description

Since the effect of running Tidy on MW Parser main pass output is poorly specified, I suggest parsing the MW Parser output using the HTML 5 algorithm and then reserializing the DOM for output.

This is what Parsoid is already doing, and Gabriel reports that the behaviour is similar to Tidy.

MWTidy::tidy() would become an abstract wrapper for the following backends:

  • External tidy
  • Internal tidy
  • New web service (Html5Depurate)
  • Existing pure-PHP code in Parser.php around line 1326, labelled "bug #2702"
  • Future pure-PHP code. When a compliant pure-PHP HTML 5 parser becomes available, it could be used as a low-performance backend to replace the bug 2702 code.

A new configuration variable has been introduced to control backend selection ($wgTidyConfig).

if ( $wgUseTidy ) {
  $wgTidyConfig = array(
     'cmd' => $wgTidyBin
  );
}

See: https://www.mediawiki.org/wiki/Parsing/Replacing_Tidy

Update June 2016

Backend abstraction is complete. The new web service (Html5Depurate) is basically complete. Packages are available in apt.wikimedia.org. Tim is working on a pure PHP equivalent.

We created a testing system which renders a large sample of articles with both Tidy and Depurate, generates screenshots, and compares the results visually.

In order to reduce the number of visible differences for an initial deployment, we added a "compatibility" endpoint to the Depurate API, which mimics Tidy's p-wrapping behaviour, and marks empty li, p and tr elements with a class so that they can be hidden with CSS.

Despite this, we still see significant differences, such as:

  • Navbox lists composed of nowrap spans sometimes end up being completely nowrapped, running off the right margin, either due to editor error or a MediaWiki parser bug which generates invalid HTML.
  • Active formatting element (AFE) reconstruction causes certain unclosed tags such as <i> to run on to the end of the page instead, instead of running on to the end of the enclosing element.

The main question now is: what should our deployment plan be?

  • Are we close enough now in visual diff testing to call that part of the project done? (96.79% showed less than 1% differences, 93.35% rendered with pixel-perfect accuracy.)
  • What tools should we provide to editors to migrate the remaining broken pages? Some issues (e.g. adjacent nowrap spans) are difficult to detect automatically.

Related Objects

StatusAssignedTask
OpenNone
ResolvedArlolra
OpenNone
OpenNone
OpenNone
Openssastry
Opentstarling
Resolvedssastry
Resolvedbd808
OpenNone
OpenNone
Resolvedcscott
OpenNone
OpenNone
Opentstarling
Resolvedtstarling
ResolvedElitre
OpenNone
There are a very large number of changes, so older changes are hidden. Show Older Changes

With whitespace and tbody normalization, and serialization differences resolved, here are the remaining parser test failures. They are grouped by the tidy behaviour that html5depurate does not currently support, with tests appearing in multiple categories when appropriate:

p-wrapping:

  • Non-word characters don't terminate tag names + tidy
  • Block tag on one line (<div>)
  • Block tag on one line (<blockquote>)
  • Block tag on both lines (<div>)
  • Block tag on both lines (<blockquote>)
  • Multiple lines without block tags
  • Empty lines between lines with block tags
  • Bug 15491: <ins>/<del> in blockquote (2)
  • 3a. Indent-Pre and block tags (single-line html)
  • 3b. Indent-Pre and block tags (multi-line html)
  • 4. Indent-Pre and extension tags
  • Implicit <td> after a |-
  • <pre> tags should be recognized in an explicit <td> context, but not in an implicit <td> context
  • Parsoid: Round-trip tables directly followed by content (bug 51219)
  • Horizontal ruler (should it add that extra space?)
  • Horizontal ruler -- Supports content following dashes on same line
  • Template with thumb image (with link in description)
  • Templates: 2. Inside a block tag
  • Templates: P-wrapping: 1c. Templates on consecutive lines
  • Templates: Wiki Tables: 1a. Fostering of entire template content
  • Templates: Wiki Tables: 2. Fostering of partial template content
  • Templates: Ugly nesting: 4. Divs opened/closed across templates
  • Image with link tails
  • Media link with nasty text
  • Closing of closed but not open table tags
  • Fuzz testing: Parser14
  • Fuzz testing: URL adjacent extension (no space, dirty; pre)
  • Nesting tags, paragraphs on lines which begin with <div>
  • Bug 6200: paragraphs inside blockquotes (no extra line breaks)
  • Bug 6200: paragraphs inside blockquotes (extra line break on close)
  • Bug 33845 (2): Headings become bold in TOC when they contain a blockquote
  • HRs: 1. Single line

Empty element removal:

  • Isolated close tags should be treated as literal text
  • Empty lines between lines with block tags
  • Empty pre; pre inside other HTML tags (bug 54946)
    • Removes empty pre tag per description
  • Block tag pre
  • Lists should be recognized in an implicit <td> context
    • Removes empty table element
  • Test the li-hack (The PHP parser relies on Tidy for the hack)
    • Removes empty <li> elements
  • Unbalanced closing non-block tags don't break a list (php parser relies on Tidy to fix up)
  • 1. List embedded in a formatting tag
  • 2. List embedded in a formatting tag
  • Sanitizer: Closing of closed but not open tags
  • Fuzz testing: Parser14
  • Nesting tags, paragraphs on lines which begin with <div>

Whitespace transformation newline -> space

  • Multiple lines without block tags
  • Templates: Don't escape already nowiki-escaped text in template parameters
  • Templates: Preserve blank parameter names in other positions
  • Image with link tails
  • Bug 6200: paragraphs inside blockquotes (no extra line breaks)

Whitespace collapse

  • Indent-Pre: Newlines in comments shouldn't affect sol state

Repair of dl/dd/dt content model (note: expected output has fixme comment)

  • Definition Lists: Nesting: Test 2 (Parsoid only)
  • Definition Lists: Nesting: Test 3 (Parsoid only)
  • Definition Lists: Mixed Lists: Test 1
  • Definition Lists: Mixed Lists: Test 11
  • Definition Lists: Weird Ones: Test 1
  • Definition Lists: colons occurring in tags

Repair of ul/ol/li content model:

  • Multiple list tags generated by templates
  • Unbalanced closing block tags break a list (php parser relies on Tidy to fix up)

Different resolution of mismatched start/end tags

  • Unbalanced closing non-block tags don't break a list (php parser relies on Tidy to fix up)
  • 1. List embedded in a formatting tag
  • 2. List embedded in a formatting tag

Repair of block element inside inline element by split/duplication of inline element (even with duplicated ID attribute!):

  • Bug 33845 (2): Headings become bold in TOC when they contain a blockquote
  • Multiple tags in TOC

Repair of <a> content model per HTML 4:

  • Media link with nasty text

Insertion of empty <p> for no reason:

  • Bug 33845 (2): Headings become bold in TOC when they contain a blockquote
  • Multiple tags in TOC
  • Empty <p> tag in TOC, removed by Sanitizer (T92892)

Notable test cases with whitespace marked up, aligned with additional spaces for comparison:

Unbalanced closing non-block tags don't break a list

in    <p><span>¶</p>¶       <ul> <li>      a</span><span></li>¶<li>      b</span></li> </ul>
tidy                        <ul>¶<li><span>a</span>      </li>¶<li><span>b</span></li>¶</ul>
html5 <p><span>¶</span></p>¶<ul> <li>      a<span></span></li>¶<li>      b       </li> </ul>


Unbalanced closing block tags break a list

in    <div>¶<ul> <li>a</div><div></li>¶             <li>b</div></li> </ul>
tidy  <div>¶<ul>¶<li>a</li>¶</ul>¶</div>¶<div>¶<ul>¶<li>b      </li>¶</ul>¶</div>
html5 <div>¶<ul> <li>a</li> </ul> </div> <div>¶     <li>b      </li>       </div>

1. List embedded in a formatting tag

in    <p><small>¶</p>               ¶<ul> <li>·foo              </li> </ul>¶<p></small>¶</p>
tidy                                 <ul>¶<li><small>foo</small></li>¶</ul>
html5 <p><small>¶</small></p><small>¶<ul> <li>·foo              </li> </ul>¶</small><p><small></small>¶</p>

2. List embedded in a formatting tag

in    <p><small>¶</p>                       ¶<ul> <li>       a</li>¶        <li>       b</small></li></ul>
tidy                                         <ul>¶<li><small>a</small></li>¶<li><small>b</small></li>¶</ul>
html5 <p><small>¶</small></p><small>¶</small><ul> <small><li>a</li>¶</small><li><small>b</small></li></ul>

Bug 33845 (2): Headings become bold in TOC when they contain a blockquote

in            <div·id="toc"·class="toc"> <div·id="toctitle"> <h2>Contents</h2> </div>¶<ul>¶<li·class="toclevel-1·tocsection-1"><a·href="#Quote"><span·class="tocnumber">1</span>·<span·class="toctext">Quote</span></a></li>¶</ul>¶</div>¶       ¶<h2><span·class="mw-headline"·id="Quote">             <blockquote>Quote                                                     </blockquote></span><span·class="mw-editsection"><span·class="mw-editsection-bracket">[</span><a·href="/index.php?title=Main_Page&amp;action=edit&amp;section=1"·title="Edit·section:·Quote">edit</a><span·class="mw-editsection-bracket">]</span></span></h2>
tidy  <p></p>¶<div·id="toc"·class="toc">¶<div·id="toctitle">¶<h2>Contents</h2>¶</div>¶<ul>¶<li·class="toclevel-1·tocsection-1"><a·href="#Quote"><span·class="tocnumber">1</span>·<span·class="toctext">Quote</span></a></li>¶</ul>¶</div>¶<p></p>¶<h2><span·class="mw-headline"·id="Quote"></span></h2>¶<blockquote>¶<p><span·class="mw-headline"·id="Quote">Quote</span></p>¶</blockquote>¶<p>   <span·class="mw-editsection"><span·class="mw-editsection-bracket">[</span><a·href="/index.php?title=Main_Page&amp;action=edit&amp;section=1"·title="Edit·section:·Quote">edit</a><span·class="mw-editsection-bracket">]</span></span></p>
html5         <div·id="toc"·class="toc"> <div·id="toctitle"> <h2>Contents</h2> </div>¶<ul>¶<li·class="toclevel-1·tocsection-1"><a·href="#Quote"><span·class="tocnumber">1</span>·<span·class="toctext">Quote</span></a></li>¶</ul>¶</div>¶       ¶<h2><span·class="mw-headline"·id="Quote">             <blockquote>Quote                                                     </blockquote></span><span·class="mw-editsection"><span·class="mw-editsection-bracket">[</span><a·href="/index.php?title=Main_Page&amp;action=edit&amp;section=1"·title="Edit·section:·Quote">edit</a><span·class="mw-editsection-bracket">]</span></span></h2>

Some thoughts:

Pretty much all of these tidy behaviours can theoretically have user-visible effects if the resulting elements are styled. The main question is whether they represent good, usable input markup going forward.

I think Tidy's whitespace handling is generally wrong and can be discarded. In cases where it is user-visible, it is probably harmful. This would fix T20851, T31252, T40800.

p-wrapping is one I need to think about.

p-wrapping advantages:

  • Better machine readability? The HTML 5 algorithm for paragraph detection is complex.
  • More elements for the user to customise with CSS
  • Backwards compatibility.

p-wrapping disadvantages:

  • Would need to be implemented.
  • In HTML 5, a paragraph cannot necessarily be represented as a document subtree since <ins> and <del> elements can straddle paragraph boundaries. So in principle, a paragraph is purely a layout concept, not a DOM concept.

Empty element removal: again this is a case of Tidy assuming it understands HTML when it doesn't. I think getting rid of empty element removal would be beneficial. It would probably fix T7859, T29786, T49673, T87414.

Duplication of inline elements that wrap block elements is obviously not a good idea. Getting rid of this behaviour would fix T11737, maybe T11661.

Treating <a> as an inline element is obviously not a good idea, this is T73962.

Majr added a subscriber: Majr.Sep 8 2015, 2:37 PM

For the "List embedded in a formatting tag" tests, T92040 is relevant which also happens to be bit of a head-scratcher -- Parsoid (and HTM5depurate) emit well-formed HTML that the tree builder accepts, but a W3C validator whines about. Our current thinking about that is T92040#1106499.

I ask this out of complete ignorance, because as far as I can tell there is no mention here of the newer HTML5 Tidy effort.

Has anyone looked at http://www.html-tidy.org/ ??

Is this the code that is under consideration or is the tidy referred to here the stale code on sourceforge?

"HTML tidy" was released Sep 4, so it didn't exist when this task was created. So certainly a look is probably in order.

That said, one of the goals here is to (a) deliberately "do less" -- that is, tidy should be just an HTML5 parse/reserialize, and not attempt to do all the crazy "clean up" things which the original tidy did (like munging whitespace all over the place to make the output "look prettier"), and (b) actually document exactly what tidy does (that is, if we need to do anything more than simple parse/reserialize, write that behavior up as a proper spec). http://www.html-tidy.org/documentation/ is an improvement on "old tidy", but is still a long way from a proper behavior specification.

GWicke added a comment.EditedSep 28 2015, 4:03 PM

@MarkAHershberger, I have been following https://github.com/htacg/tidy-html5 since its start. Their goals seem to be just a fix-up for HTML5 tags, which as @cscott says is not quite what we are looking for.

That said, one of the goals here is to (a) deliberately "do less" -- that is, tidy should be just an HTML5 parse/reserialize,

(Did you intend to have a "(b)" somewhere?)

It looks like the html-tidy effort will provides an interface for parsing/reserializing. I'm not familiar with all the needs here, but it looks like that could be an answer.

If it isn't what is needed -- maybe you want a DOM tree produced and this wouldn't do that right now -- then with active developers it certainly would be worth asking them if they could implement what is needeed.

@MarkAHershberger, tidy always parsed & re-serialized; the issue is about what kind of munging it's doing in between.

cscott added a comment.EditedSep 28 2015, 4:43 PM

There is a (b) up there.

I do expect that as time goes on we will want to do more manipulations in mediawiki-core on the DOM. My next goal would be to implement the sanitizer (or as much as possible of it) as a well-specified set of transformations on the DOM, instead of (as it is currently) an adhoc set of regular expressions.

The documentation you link is actually to old tidy, I believe. The PHP interface is code we wrote in-house, IIUC. The tidy Node interface was "much like" the W3C DOM, but not actually the same thing. That's exactly the sort of ad-hoc-ism we would like to eliminate.

@GWicke, and, from what I can see of tidy's history, it looks like getting those problems addressed in tidy wasn't going to happen since the last release for tidy was eons ago. With a living community working on tidy, would it be reasonable to ask them to address the issues? Does the new tidy have those same issues?

cscott added a comment.EditedSep 28 2015, 4:44 PM

@MarkAHershberger The new tidy is certainly better than old tidy. But what mediawiki really wants is no tidy at all.

(Which is to say: no ad hoc manipulations of output, beyond what is in the HTML5 spec.)

There is a (b) up there.

Sorry, I didn't use find and my eyes skipped over it.

GWicke added a comment.EditedSep 28 2015, 4:46 PM

Does the new tidy have those same issues

Basically, yes. Last I looked, it was still not using the HTML5 parsing algorithm, and still implements all kinds of heuristics that are doing more bad than good.

Does the new tidy have those same issues

Basically, yes. Last I looked, it was still not using the HTML5 parsing algorithm, and still implements all kinds of heuristics that are doing more bad than good.

And, again, look at T4542: [DO NOT USE] HTML Tidy issues (tracking) [superseded by the #Tidy tag] to understand the issues that ad hoc parsing and manipulation have sown for us. We need to chop that tree down, not prune it slightly.

The documentation you link is actually to old tidy, I believe.

It is. And it doesn't look like the w3c DOM. AFAICT, they've simply said that they have a compatible interface.

The PHP interface is code we wrote in-house, IIUC. The tidy Node interface was "much like" the W3C DOM, but not actually the same thing. That's exactly the sort of ad-hoc-ism we would like to eliminate.

This seems like the perfect time to ask them for a W3C DOM and get rid of the adhocism.

@MarkAHershberger The thing you would like them to do does not seem like the thing they are doing. But it's time to open a new task (if you like) and redirect the discussion there. The task description here is "replace tidy with HTML5 parse/reserialize", not "update tidy".

ssastry added a comment.EditedSep 28 2015, 4:56 PM

@MarkAHershberger, Given a string, you need a HTML5 parser to parse it into a DOM -- that and nothing less and nothing more. HTML5depurate is one such solution. The old Tidy is not (the new Tidy may get there, but at that point, we should be able to replace one html5 parser with another fairly easily).

Elitre added a subscriber: Elitre.Nov 19 2015, 4:07 PM

https://gerrit.wikimedia.org/r/279669 has some code which might be relevant to a pure-PHP implementation. Still feeling things out with that code though; I'm trying to avoid a full HTML5 tokenizer pass if I can.

With the new compatible mode (after https://gerrit.wikimedia.org/r/#/c/281865/ ), the number of parser test failures post-normalization is reduced from 100 to 47.

Visual diff testing flagged a case which was not covered by parser tests: <b/> is treated by the HTML 5 parsing algorithm as identical to <b> except with a parse error emitted, whereas Tidy treats it as a self-closing tag like in XML. Such tags are used by the {{hands}} template on enwiki.

cscott added a comment.May 3 2016, 8:11 PM

From discussion on wiki:

Editors have been using null tags for years like "<b/>" or "<span/>" (beyond null nowiki, "<nowiki/>") as escape-tokens to allow lead/trailing-spaces or leading semicolon ";" with only 4-to-7-character tokens rather than the 40-to-80-character nowiki tags, to avoid extra bytes in each template, with the wp:expansion depth limit being only 2,000 kb rather than 2.5 mb or more. -Wikid77 (talk) 17:55, 3 May 2016 (UTC)

So it's a workaround for T14974? (Implicit newline insertion before * # : ; {|) Obviously <nowiki/> is the "right" way to do that, I guess (if we can't think of something better) -- I wonder if there's anyway to tell how many pages would really be broken (read, exceed the expansion depth limit) by replacing <b/> with <nowiki/>? Perhaps we could bump the limit from 2MB to 2.5MB at the same time as we get rid of the hack? C. Scott Ananian (talk) 19:58, 3 May 2016 (UTC)

This case could easily be cleaned up in the Sanitizer before the wikitext gets to tidy. Depending on whether we want to encourage this or not, we could:

  1. Strip <TAG/>, where TAG is not in a small whitelist. This moves the behavior from tidy to the Sanitizer, but means it's not an official "feature" of mediawiki.
  2. Replace "<TAG/>" with "&lt;TAG/>", if TAG is not in the small whitelist, encouraging template authors to use "<nowiki/>" instead. (We could even help with this initial conversion, the Sanitizer patch would just prevent the problem from recurring.)
  3. Replace "<TAG/>" with "<TAG>". This would be consistent with HTML5 parsing semantics and thus leave less "special case cruft" in the mediawiki codebase long-term, but would probably be uglier in the short term, as we'd have escaping boldface everywhere. This would also be the behavior if we did nothing, but eventually replaced tidy with depurate.
  4. Other options? Maybe emit a warning category, to aid in cleanup?

Part of solving this is probably ensuring we've got the right tools to migrate existing wikitext to match whatever change we make.

cscott edited the task description. (Show Details)May 4 2016, 7:56 PM
brion added a subscriber: brion.Jun 1 2016, 8:27 PM
tstarling edited the task description. (Show Details)Jun 7 2016, 1:23 AM

The ArchCom-RFC office hour today (E203) was dedicated to this. Summary is captured in the description of E203, and the full transcript is captured at P3228. Much of the meeting was spent discussing alternative approaches to Html5Depurate, with the clarification that it is still the plan of record.

The plan (subject to modification based on initial meetings and experience):

  • Meeting with Operations about Html5Depurate instances
  • Meeting with Community-Liaisons about rollout strategy
  • Rollout Html5Depurate instances
  • Rollout special page+gadget
  • Publicize the migration + enlist help in identifying showstoppers
  • Rollout full Tidy->Html5Depurate transition on first wikis
  • Roll out further based on initial results

@GWicke made the point that third party deployments need to be considered sooner rather than later, but we tabled that part of the conversation in this meeting.

Status of this RFC (from my understanding): this is not "approved" yet, but is "in progress" (see T137860 for what "in progress" means)

JJMC89 added a subscriber: JJMC89.Sep 22 2016, 10:06 PM
Pchelolo moved this task from Backlog to watching on the Services board.Oct 12 2016, 10:25 PM
Pchelolo edited projects, added Services (watching); removed Services.