Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool
Open, Normal, Public


Since the effect of running Tidy on MW Parser main pass output is poorly specified, I suggest parsing the MW Parser output using the HTML 5 algorithm and then reserializing the DOM for output.

This is what Parsoid is already doing, and Gabriel reports that the behaviour is similar to Tidy.

MWTidy::tidy() would become an abstract wrapper for the following backends:

  • External tidy
  • Internal tidy
  • New web service (Html5Depurate)
  • Existing pure-PHP code in Parser.php around line 1326, labelled "bug #2702"
  • Future pure-PHP code. When a compliant pure-PHP HTML 5 parser becomes available, it could be used as a low-performance backend to replace the bug 2702 code.
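
As a rough illustration of the wrapper-plus-backends shape described above, here is a minimal Python sketch. It is not MediaWiki code: the class names, the `driver` config key, and the two stand-in backends are assumptions made up for this example, and the real backends (external Tidy, internal Tidy, Html5Depurate, the bug 2702 code) are not modelled.

```python
from abc import ABC, abstractmethod


class TidyDriver(ABC):
    """Common interface every backend implements (hypothetical name)."""

    @abstractmethod
    def tidy(self, html: str) -> str:
        """Return a cleaned-up version of the parser output."""


class PassThroughDriver(TidyDriver):
    """Stand-in backend that returns its input unchanged."""

    def tidy(self, html: str) -> str:
        return html


class StripTrailingSpaceDriver(TidyDriver):
    """Second stand-in, only here to show that drivers are swappable."""

    def tidy(self, html: str) -> str:
        return "\n".join(line.rstrip() for line in html.split("\n"))


# Registry of driver names; the names are illustrative, not MediaWiki's.
DRIVERS = {
    "passthrough": PassThroughDriver,
    "rstrip": StripTrailingSpaceDriver,
}


def make_driver(config: dict) -> TidyDriver:
    """Pick a backend from a config mapping with an assumed 'driver' key."""
    name = config.get("driver")
    if name not in DRIVERS:
        raise ValueError(f"unknown tidy driver: {name!r}")
    return DRIVERS[name]()


def tidy(html: str, config: dict) -> str:
    """Facade playing the role that MWTidy::tidy() has in the proposal."""
    return make_driver(config).tidy(html)
```

Callers only ever see `tidy()`, so a new backend can be added to the registry without touching them.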

A new configuration variable, $wgTidyConfig, has been introduced to control backend selection:

if ( $wgUseTidy ) {
  $wgTidyConfig = array(
     'cmd' => $wgTidyBin,
     // ...
  );
}


Update June 2016

Backend abstraction is complete. The new web service (Html5Depurate) is basically complete. Packages are available. Tim is working on a pure-PHP equivalent.

We created a testing system which renders a large sample of articles with both Tidy and Depurate, generates screenshots, and compares the results visually.
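
The comparison step of such a testing system could look roughly like the sketch below. It assumes the metric is the fraction of pixel positions that differ between two screenshots, which is a guess at what the percentages in this task measure; the actual tool and its metric are not described here. Screenshots are modelled as plain nested lists of RGB tuples, and all names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def diff_fraction(a: Image, b: Image) -> float:
    """Fraction of pixel positions at which the two images differ.

    Images of different sizes are compared over the larger bounding box;
    a position that exists in only one image counts as a difference.
    """
    height = max(len(a), len(b))
    width = max(
        max((len(row) for row in a), default=0),
        max((len(row) for row in b), default=0),
    )
    total = height * width
    if total == 0:
        return 0.0
    differing = 0
    for y in range(height):
        row_a = a[y] if y < len(a) else []
        row_b = b[y] if y < len(b) else []
        for x in range(width):
            pixel_a = row_a[x] if x < len(row_a) else None
            pixel_b = row_b[x] if x < len(row_b) else None
            if pixel_a != pixel_b:
                differing += 1
    return differing / total


def classify(fraction: float) -> str:
    """Bucket a page by how much its two renderings differ."""
    if fraction == 0.0:
        return "pixel-perfect"
    if fraction < 0.01:
        return "under 1%"
    return "needs review"
```

A real run would load the two screenshots with an imaging library and feed the resulting pixel grids to `diff_fraction()`.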

In order to reduce the number of visible differences for an initial deployment, we added a "compatibility" endpoint to the Depurate API, which mimics Tidy's p-wrapping behaviour, and marks empty li, p and tr elements with a class so that they can be hidden with CSS.
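
To make the "mark empty elements with a class" idea concrete, here is a small Python sketch. It is not how Depurate does it: it assumes the input is already a well-formed XML fragment (real parser output often is not), and the class name `empty-elt` is a placeholder invented for this example.

```python
import xml.etree.ElementTree as ET

EMPTY_CLASS = "empty-elt"  # hypothetical class name
MARKED_TAGS = {"li", "p", "tr"}


def is_empty(elem: ET.Element) -> bool:
    """True if the element has no children and only whitespace text."""
    return len(elem) == 0 and not (elem.text or "").strip()


def mark_empty(fragment: str) -> str:
    """Add EMPTY_CLASS to empty li, p and tr elements in a fragment."""
    # Wrap in a synthetic root so that several top-level elements parse.
    root = ET.fromstring(f"<root>{fragment}</root>")
    for elem in root.iter():
        if elem.tag in MARKED_TAGS and is_empty(elem):
            classes = elem.get("class", "").split()
            if EMPTY_CLASS not in classes:
                classes.append(EMPTY_CLASS)
            elem.set("class", " ".join(classes))
    # Serialize the children of the synthetic root back to a string.
    return "".join(
        ET.tostring(child, encoding="unicode", method="html")
        for child in root
    )
```

A stylesheet rule such as `.empty-elt { display: none; }` would then hide the marked elements without removing them from the DOM.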

Despite this, we still see significant differences, such as:

  • Navbox lists composed of nowrap spans sometimes end up being completely nowrapped, running off the right margin, either due to editor error or a MediaWiki parser bug which generates invalid HTML.
  • Active formatting element (AFE) reconstruction causes certain unclosed tags such as <i> to run on to the end of the page, instead of running on to the end of the enclosing element.

The main question now is: what should our deployment plan be?

  • Are we close enough now in visual diff testing to call that part of the project done? (96.79% showed less than 1% differences, 93.35% rendered with pixel-perfect accuracy.)
  • What tools should we provide to editors to migrate the remaining broken pages? Some issues (e.g. adjacent nowrap spans) are difficult to detect automatically.

[...] has some code which might be relevant to a pure-PHP implementation. I'm still feeling things out with that code, though; I'm trying to avoid a full HTML5 tokenizer pass if I can.

With the new compatibility mode, the number of parser test failures post-normalization is reduced from 100 to 47.

Visual diff testing flagged a case which was not covered by parser tests: <b/> is treated by the HTML 5 parsing algorithm as identical to <b> except with a parse error emitted, whereas Tidy treats it as a self-closing tag like in XML. Such tags are used by the {{hands}} template on enwiki.

cscott added a comment. May 3 2016, 8:11 PM

From discussion on wiki:

Editors have been using null tags for years like "<b/>" or "<span/>" (beyond null nowiki, "<nowiki/>") as escape-tokens to allow lead/trailing-spaces or leading semicolon ";" with only 4-to-7-character tokens rather than the 40-to-80-character nowiki tags, to avoid extra bytes in each template, with the wp:expansion depth limit being only 2,000 kb rather than 2.5 mb or more. -Wikid77 (talk) 17:55, 3 May 2016 (UTC)

So it's a workaround for T14974? (Implicit newline insertion before * # : ; {|) Obviously <nowiki/> is the "right" way to do that, I guess (if we can't think of something better) -- I wonder if there's any way to tell how many pages would really be broken (read, exceed the expansion depth limit) by replacing <b/> with <nowiki/>? Perhaps we could bump the limit from 2MB to 2.5MB at the same time as we get rid of the hack? C. Scott Ananian (talk) 19:58, 3 May 2016 (UTC)

This case could easily be cleaned up in the Sanitizer before the wikitext gets to tidy. Depending on whether we want to encourage this or not, we could:

  1. Strip <TAG/>, where TAG is not in a small whitelist. This moves the behavior from Tidy to the Sanitizer, but means it's not an official "feature" of MediaWiki.
  2. Replace "<TAG/>" with "&lt;TAG/>", if TAG is not in the small whitelist, encouraging template authors to use "<nowiki/>" instead. (We could even help with this initial conversion; the Sanitizer patch would just prevent the problem from recurring.)
  3. Replace "<TAG/>" with "<TAG>". This would be consistent with HTML5 parsing semantics and thus leave less "special case cruft" in the MediaWiki codebase long-term, but would probably be uglier in the short term, as we'd have escaping boldface everywhere. This would also be the behavior if we did nothing, but eventually replaced Tidy with Depurate.
  4. Other options? Maybe emit a warning category, to aid in cleanup?
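
For comparison, options 1 to 3 above can be sketched as a single regex pass. This is a Python illustration only, not a Sanitizer patch: the whitelist contents, the mode names, and the decision to ignore tags with attributes are all assumptions made for the example.

```python
import re

# Hypothetical whitelist of tags that may keep their self-closing form.
SELF_CLOSING_WHITELIST = {"nowiki", "br", "hr"}

# Matches <TAG/> or <TAG />; tags with attributes are ignored for simplicity.
SELF_CLOSING_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\s*/>")


def rewrite_self_closing(text: str, mode: str) -> str:
    """Apply option 1 ('strip'), 2 ('escape') or 3 ('open') to <TAG/>."""
    if mode not in ("strip", "escape", "open"):
        raise ValueError(f"unknown mode: {mode!r}")

    def repl(match):
        tag = match.group(1)
        if tag.lower() in SELF_CLOSING_WHITELIST:
            return match.group(0)  # whitelisted: leave untouched
        if mode == "strip":
            return ""  # option 1: drop the tag entirely
        if mode == "escape":
            return "&lt;" + match.group(0)[1:]  # option 2: show it literally
        return f"<{tag}>"  # option 3: HTML5 reading, an unclosed start tag

    return SELF_CLOSING_RE.sub(repl, text)
```

Running all three modes over a sample of template wikitext would give a quick feel for how visible each option is to readers.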

Part of solving this is probably ensuring we've got the right tools to migrate existing wikitext to match whatever change we make.

cscott updated the task description. May 4 2016, 7:56 PM
brion added a subscriber: brion. Jun 1 2016, 8:27 PM
tstarling updated the task description. Jun 7 2016, 1:23 AM

The ArchCom-RFC office hour today (E203) was dedicated to this. Summary is captured in the description of E203, and the full transcript is captured at P3228. Much of the meeting was spent discussing alternative approaches to Html5Depurate, with the clarification that it is still the plan of record.

The plan (subject to modification based on initial meetings and experience):

  • Meeting with Operations about Html5Depurate instances
  • Meeting with Community-Liaisons about rollout strategy
  • Roll out Html5Depurate instances
  • Roll out special page+gadget
  • Publicize the migration + enlist help in identifying showstoppers
  • Roll out full Tidy->Html5Depurate transition on first wikis
  • Roll out further based on initial results

@GWicke made the point that third party deployments need to be considered sooner rather than later, but we tabled that part of the conversation in this meeting.

Status of this RFC (from my understanding): this is not "approved" yet, but is "in progress" (see T137860 for what "in progress" means)

JJMC89 added a subscriber: JJMC89. Sep 22 2016, 10:06 PM
Pchelolo moved this task from Backlog to watching on the Services board. Oct 12 2016, 10:25 PM
Pchelolo edited projects, added Services (watching); removed Services.
jrbs added a subscriber: jrbs. Apr 25 2017, 6:16 PM
Krinkle removed a subscriber: Krinkle. Apr 26 2017, 12:55 AM
ssastry renamed this task from Replace Tidy in MW parser with HTML 5 parse/reserialize to Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool.

The task summary is out of date: Depurate is no longer being used, and we are instead using Tim's pure-PHP RemexHtml library. Once T185753 ("MediaWiki should default to using RemexHtml for tidy") is completed and all Wikimedia wikis are using Remex for tidy, I think we can consider this resolved.

Prod added a subscriber: Prod. Mon, Mar 5, 8:11 PM