Page MenuHomePhabricator

Parsoid and Tidy differ in how they deal with misnested tags
Closed, DeclinedPublic

Description

I edited https://www.mediawiki.org/w/index.php?title=MediaWiki-Vagrant&oldid=946962 with VisualEditor. Everything after the "Quick Start" section appeared in monospace.

I repro'd with the smaller test page https://www.mediawiki.org/wiki/User:S_Page_(WMF)/VE_span_bug

I think the issue lies in

<p style="font-size: 1.2em; margin-top: 1.2em;">When Vagrant is done configuring your machine, browse to <span class=plainlinks>http://127.0.0.1:8080/</span> to find your MediaWiki instance. The admin password is <code>vagrant<code>.</span>

Note its incorrect second opening <code> instead of closing the code, and a second closing </span>. Yet regular wiki parsing can handle it.


Version: unspecified
Severity: normal
URL: https://www.mediawiki.org/wiki/User:S_Page_(WMF)/VE_span_bug

Details

Reference
bz63699

Related Objects

Event Timeline

bzimport raised the priority of this task from to Needs Triage.
bzimport set Reference to bz63699.
Spage created this task.Apr 8 2014, 8:28 PM

(In reply to spage from comment #0)

Note its incorrect second opening <code> instead of closing the code, and a
second closing </span>. Yet regular wiki parsing can handle it.

This is Tidy doing the cleanup. PHP parser does not do the fixup automatically. This is a known issue -- we have a bunch of these scenarios now where Parsoid and Tidy fixup broken code differently. We'll never match Tidy behavior (which is not always the desired behavior in any case). But, as for this specific scenario, fixing this will require us to use heuristics in the tokenizer and more complexity. At this time, it is not clear if the additional complexity is worth it, except for easy to auto-fix scenarios.

  • Bug 63798 has been marked as a duplicate of this bug. ***
Arlolra removed GWicke as the assignee of this task.Nov 25 2014, 7:56 PM
Arlolra set Security to None.
Arlolra triaged this task as Normal priority.Nov 26 2014, 11:56 PM
Arlolra added a subscriber: Arlolra.
ssastry moved this task from Backlog to Robustness on the Parsoid board.Dec 22 2014, 12:37 AM
marcoil moved this task from Robustness to Backlog on the Parsoid board.Feb 13 2015, 12:48 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 13 2015, 4:50 PM

fwiw, one of the people behind tidy-html5 says he tested and found that this issue was fixed.

fwiw, one of the people behind tidy-html5 says he tested and found that this issue was fixed.

Thanks @MarkAHershberger. What we are getting at with the Tidy replacement and which applies here as well is that we really don't need Tidying or any such thing in mediawiki. What we need is a HTML5 parser (something that takes a string and parses it according to the well-specified HTML5 parsing algorithm).

We don't need all the extra things that the old Tidy did. So, if the new tidy-html5 project implemented what we needed, it would simply be a HTML5 tree builder that takes a string, and builds a DOM. If we needed (for backward compatibility reasons) to implement additional post-processing on the DOM (to do some of the things the old/in-production Tidy does), then we would need a full DOM API. But, if we determine as part of the work in T89331 that we can drop all the extra things the old Tidy does, then we may be able to not bother with the DOM API, as long as the tool generated a faithful serialization of the parsed DOM tree.

T89331 does not mandate an external HTML5 parsing service. If we determine that tidy-html5 is "good enough", then, it would be an acceptable fallback for mediawiki installations that are happy with it. In fact, wikis don't even need to replace the "old tidy" if they don't care about all the bugs filed against it. Part of the impetus to T89331 is to pay down technical debt. It also moves PHP parser and Parsoid output closer. It also brings us closer to reducing complexity in Parsoid, and eventually in the glorious future, to maybe have a simple bidirectional parser entirely in core without need for an external service. We can chat more on IRC if you have more questions. I didn't add this explanation to T89331 since that ticket is already very long with a lot of discussion.

ssastry closed this task as Declined.Dec 18 2017, 10:16 PM

Tidy is on the way to being removed next year. Parsoid and RemexHTML will handle broken HTML markup much more similarly since they both using HTML5 tree building to fix up broken markup.