
Consider not outputting content model violations to help editing clients
Open, Medium, Public

Description

For example,

'''[[File:Image.jpg|thumb]]'''

outputs,

<b data-parsoid='{"dsr":[0,30,3,3]}'><figure class="mw-default-size" typeof="mw:Image/Thumb" data-parsoid='{"optList":[{"ck":"thumbnail","ak":"thumb"}],"dsr":[3,27,2,2]}'><a href="./File:Image.jpg" data-parsoid='{"a":{"href":"./File:Image.jpg"},"sa":{"href":"File:Image.jpg"},"dsr":[5,25,null,null]}'><img resource="./File:Image.jpg" src="//upload.wikimedia.org/wikipedia/commons/thumb/7/78/Image.jpg/220px-Image.jpg" data-file-width="500" data-file-height="500" data-file-type="bitmap" height="220" width="220" data-parsoid='{"a":{"resource":"./File:Image.jpg","height":"220","width":"220"},"sa":{"resource":"File:Image.jpg"}}'/></a></figure></b>

and,

<small>
* hi
* ho
</small>

outputs,

<p data-parsoid='{"dsr":[0,7,0,0]}'><small data-parsoid='{"stx":"html","autoInsertedEnd":true,"dsr":[0,7,7,0]}'></small></p><small data-parsoid='{"stx":"html","autoInsertedEnd":true,"autoInsertedStart":true,"dsr":[7,17,0,0]}'>
<ul data-parsoid='{"dsr":[8,17,0,0]}'><li data-parsoid='{"dsr":[8,12,1,0]}'>hi</li>
<li data-parsoid='{"dsr":[13,17,1,0]}'>ho</li></ul></small>
<p data-parsoid='{"dsr":[18,26,0,0]}'><small data-parsoid='{"stx":"html","autoInsertedStart":true,"dsr":[18,26,0,8]}'></small></p>

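Both outputs render identically to the PHP parser / Remex combo, but VisualEditor marks them as uneditable because a block element (the figure, the list) ends up nested inside a formatting element (b, small).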
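The pattern shared by both examples can be checked mechanically. The following is a minimal sketch, not Parsoid code: it scans already-serialized HTML with Python's standard `html.parser` and reports each block element whose nearest relevant ancestor is a formatting element. The `FORMATTING`, `BLOCK`, and `VOID` tag sets and the `find_violations` helper are assumptions made for illustration and are not exhaustive.

```python
from html.parser import HTMLParser

# Illustrative, non-exhaustive tag sets (assumptions, not Parsoid's own lists).
FORMATTING = {"b", "i", "em", "strong", "small", "big", "s", "u", "code", "tt", "font"}
BLOCK = {"p", "ul", "ol", "li", "dl", "div", "figure", "table", "blockquote", "pre"}
VOID = {"img", "br", "hr", "meta", "link", "input", "wbr"}


class NestingChecker(HTMLParser):
    """Collects (formatting_tag, block_tag) pairs where a block sits in a formatting tag."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open elements, outermost first
        self.violations = []  # (formatting ancestor, nested block) pairs

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            # Walk ancestors from the innermost outwards and stop at the first
            # block or formatting element: only that one decides the verdict.
            for ancestor in reversed(self.stack):
                if ancestor in BLOCK:
                    break
                if ancestor in FORMATTING:
                    self.violations.append((ancestor, tag))
                    break
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open element.
            while self.stack.pop() != tag:
                pass


def find_violations(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.violations


# Simplified versions of the two outputs above (data-parsoid attributes dropped).
print(find_violations('<b><figure><a><img src="x"/></a></figure></b>'))
print(find_violations('<p><small></small></p><small><ul><li>hi</li><li>ho</li></ul></small>'))
```

On these simplified inputs the first call reports the figure inside `<b>` and the second reports the list inside `<small>`. A check like this only flags the nesting; it does not repair it.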

From the commit message of https://github.com/wikimedia/parsoid/commit/d473791ea982178af7a0fe15aff5cf8e21aaa5e8:

However, in T68749, there's a request/argument to avoid nesting blocks
in formatting tags, since VE can't handle that data model violation and
marks the nodes as uneditable. In the past, we've supported that
inconsistently. For example, the case in "1. List embedded in a
formatting tag" is uneditable, despite it being a popular pattern.
In "2. Treebuilder fixup of formatting elt", only one of two images was
editable.

We should consider reopening T68749 and adding a pass to solve the issue
generally. Note that the Parsoid output now matches Remex, so any
change to the DOM explicitly introduces a difference to support
editing clients, and should be noted as such.

Event Timeline

ssastry moved this task from Needs Triage to Linting on the Parsoid board.

From https://www.mediawiki.org/wiki/Parsing/Notes/HTML5_Compliance:

The situation with 3. is a bit more complicated. Tidy does a better job of compliance with (3) than Parsoid or any of the proposed Tidy replacements (HTML5Depurate or RemexHTML). But Tidy is HTML4 compliant and does too much, so emulating it is not the solution. The non-compliance in Parsoid and the others exists because the HTML5 tree builder has to be lenient in what it accepts, so the serialize(parse(html5-string)) operation does not guarantee that content model constraints will be enforced. The HTML5 tree builder algorithm used to parse input strings is deliberately designed this way because of the vast number of non-compliant documents out there. So we cannot rely on the tree builder to fix up content model constraints. If we wanted to ensure compliant output, we would have to either rely on a post-processor to fix up the output (more feasible) or never generate non-compliant output in the first place (less feasible). With Parsoid, this post-processing pass is further complicated by the fact that it has to fix up DSR offsets, as well as any other private round-tripping information (much less serious going forward as we remove more and more of it), so that selective serialization continues to function properly.
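To make the DSR concern concrete, here is a small sketch using the values from the first example in the description. It assumes each `dsr` array reads as [start offset, end offset, opening-markup width, closing-markup width] into the wikitext source; that reading is consistent with the numbers shown above but is not spelled out in this task. The `dsr_parts` helper is hypothetical.

```python
wikitext = "'''[[File:Image.jpg|thumb]]'''"


def dsr_parts(src, dsr):
    """Split the source range of a node into (opening markup, content, closing markup).

    Assumes dsr = [start, end, open_width, close_width]; null widths are treated as 0.
    """
    start, end, open_w, close_w = dsr
    open_w = open_w or 0
    close_w = close_w or 0
    return (
        src[start:start + open_w],
        src[start + open_w:end - close_w],
        src[end - close_w:end],
    )


# dsr values copied from the <b>, <figure> and <a> nodes in the first example.
print(dsr_parts(wikitext, [0, 30, 3, 3]))         # ("'''", '[[File:Image.jpg|thumb]]', "'''")
print(dsr_parts(wikitext, [3, 27, 2, 2]))         # ('[[', 'File:Image.jpg|thumb', ']]')
print(dsr_parts(wikitext, [5, 25, None, None]))   # ('', 'File:Image.jpg|thumb', '')
```

Under that reading, a pass that moves the `<figure>` out of the `<b>`, or splits the `<b>` around it, would leave nodes whose offsets no longer describe the wikitext they came from unless the pass recomputes them.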