Whole sections put into one line
Open, NormalPublic

Description

See this edit, where whole sections have been put into a single line by VE (including the titles of several sections). Also, look at the mess done on the categories.

NicoV created this task.Dec 27 2015, 11:11 AM
NicoV updated the task description. (Show Details)
NicoV raised the priority of this task from to Needs Triage.
NicoV added projects: VisualEditor, Parsoid.
NicoV added a subscriber: NicoV.
Restricted Application added subscribers: StudiesWorld, Aklapper. · View Herald TranscriptDec 27 2015, 11:11 AM

So, I think there are two or three independent issues here.

Problem 1:

This is the Parsoid HTML for the revision that was edited in VE. It is clear that there is something broken there if you look at the list in the section "Acteurs principaux" and the rest of the page following that. Everything looks wrapped in a <small> tag.

Here is a wikitext snippet from that list in question.

* [[Barry Watson]] (V. F. : Vincent Barazzoni) : Matt Camden <small>(régulier saisons 1-6 et 9, invité saisons 7-8 et 10)
* [[Jessica Biel]] (V. F. : [[Julie Turin]]) : Mary Camden Rivera (régulière saisons 1-6, invitée saisons 7-8 et 10)
* [[David Gallagher]] (V. F. : [[Jimmy Redler]], [[Hervé Grull]] et [[Pascal Grull]]) : Simon Camden (saisons 1-10)
* [[Lorenzo Brino]] : Samuel « Sam » Camden (invité saisons 3-5, régulier saisons 6-11)
* [[Nikolas Brino]] : David Camden (invité saisons 3-5, régulier saisons 6-11)</small>
* [[George Stults]] (V. F. : [[Olivier Jankovic]])</small : Kevin Kinkirk (invité saison 6, régulier saisons 7-11)
* [[Adam LaVorgna]] (V. F. : [[Romain Redler]]) : Robbie Palmer (invité saison 4, régulier saisons 5-7)
* [[Tyler Hoechlin]] (V. F. : [[Tony Marot]]) : Martin Brewer <small>(saisons 8-11)
* [[Rachel Blanchard]] (V. F. : Dominique Vallée) : Roxanne Richardson (invitée puis régulière saisons 7-8)
* [[Ashlee Simpson]] (V. F. : Naïke Fauveau) : Cecilia Smith <small>(saisons 7-8)

In any case, Parsoid is unable to cleanly recover from this. Parsoid doesn't like unclosed <small> tags in a list item because a HTML5 tree builder doesn't like them.

If you open this HTML snippet in a browser (after saving it to a file),

<ul>
<li> a <small> b
<li> c
</ul>
<p>x</p>
<p>y</p>
<h2> Heading </h2>
<p>z</p>

you will see similar effects. In any case, looks like Tidy obviously does something very different which is why the effect of these markup errors don't propagate outside the list. Because of T4542: [DO NOT USE] HTML Tidy issues (tracking) [superseded by the #Tidy tag], we are currently working to replace Tidy as part of T89331. When that happens, this page will start looking broken without Parsoid as well.

If this is an one-off case, the simplest solution is to fix the wikitext. However, if this pattern is common, then we'll have to figure out a solution to this. The simplest solution would be run bots to fix this usage. But, if that solution doesn't work, then, there are other tricks up our sleeve. One of which is T114444: [RFC] Introduce notion of DOM scopes in wikitext which would also fix this problem through scoping. But, that is also some ways away. In the worst case, we'll have to implement a DOM pass in both the Tidy-replaced PHP parser and in Parsoid.

So, this first piece is all about the broken HTML.

Maybe Problem 2

I imagine the editing experience in VE is / was also horrible with this. Not sure what happens with extensive edits to this page in VE. Ideally nothing bad would happen -- it just happens to be differently rendering HTML. But, it would be good if some VE folks could do such extensive edits on this page and see what kind of HTML is generated for Parsoid. Anyway, we ned to dump the HTML out of VE to see that it looks reasonable given the edits.

Maybe Problem 3
I suspect this problem might be in Parsoid's HTML -> wikitext transformation code which got totally tripped up by the edited HTML. Now, whether this was a problem with the HTML that Parsoid got from VE or a problem with Parsoid's html -> wt code is unclear. Something that we'll know by trying some complex edits in VE with the old revision and then dumping the HTML for debugging.

ssastry updated the task description. (Show Details)Dec 28 2015, 4:51 PM
ssastry triaged this task as Normal priority.
ssastry set Security to None.
NicoV added a subscriber: Bgwhite.Dec 31 2015, 1:42 PM

@ssastry

Could Parsoid detect this kind of problem (Problem 1) and emit a warning in its ouput, so that the tool using Parsoid (VE here) would prevent the edit and give the user explanation of the kind of problem detected ? That would prevent such extensive damages and let users know about existing problems in articles.

For bots, maybe Check Wiki (or other bot tools) could detect unclosed small tags, @Bgwhite

Note that unclosed <small> (or any other) tags are not the problem per se. There are zillions of those on wiki pages and we handle them properly.

It is their presence in list items in this case that seems to be the problem. There might be ways in which we could detect when a tag is auto-closed "very far away" from the opening tag. If that tag happens to be a formatting tag like <small> or <big> or whatever else, we might be able to wrap that entire block in a marker that VE knows is uneditable along with some error info. So, we would add a mw:Error typeof (as in https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec#Error_handling ) and add more info in the json properties.

But, this is not going to happen right away.