Page MenuHomePhabricator

New high-priority Linter category for a subset of misnested tags whose behavior will change with a HTML5 parser
Closed, ResolvedPublic

Description

T176353: Parsoid closes unclosed <del> tag while Tidy doesn't highlighted one more source of differing behavior between a HTML4 and HTML5 parser.

But, this is a more general problem that is not limited to the <del> tag. Based on my understanding of the HTML5 spec, the following set of tags are going to be affected similarl: HTML5 tags - HTML4 block tags - HTML5 formatting tags - HTML5 void tags - those that won't be sanitized away - a few others.

Some of the tags in this category are: abbr, cite, del, sub, sup, span.

So, the linting rule is: If we have a tag in the above set, and it is auto-closed, then, page rendering is changed when Tidy is replaced.

Right now, this behavior is already tracked by Linter, but it will likely show up in the misnested-tag or in missing-end-tag categories. This ticket is a proposal to surface these in a category of its own since this will actually affect rendering.

Event Timeline

ssastry updated the task description. (Show Details)

Change 379430 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Detect misnested tags that have different behavior in HTML5 vs HTML4

https://gerrit.wikimedia.org/r/379430

Grr .. Remex, Domino, Tidy, and Parsoid (not just Domino) each do different things on the set of tags. Of course the Remex testing is mediated by PHP parser output, so, there is that as well. So, this makes for an interesting exploration of what the right behavior is. I used Parsoid output as the determiner of HTML5 spec behavior which is broken. So, I have to do more investigation here.

The set of tags I am currently investigating are: ABBR, BDI, BDO, CITE, DATA, DEL, DFN, INS, KBD, MARK, Q, RB, RP, RT, RTC, RUBY, SAMP, SPAN, SUB, SUP, TIME, VAR ... wrt to how rendering changes when they wrap html4-block tags or when they have missing end tags.

Okay, so, I tripped up myself because I was mixing mediawiki parser output and html5 fixups in my tests. But, it looks like Domino, Remex, and Firefox all agree on built output when I feed it specific kinds of HTML.

But, the linting that I want to identify here is specific to the fixups that Remex does when given certain kinds of misnested output (generated by the doBlockLevels code -- another symptom of T134469: doBlockLevels() inserts <p> and </p> randomly with no regard for HTML validity). Parsoid mimic this p-wrapping behavior so Parsoid does similar kinds of fixups. However, Tidy does something different. Given that wikipedias have been built with that output, we need to fix it.

Specifically on wikitext of this form:

<small>a

b</small>

PHP parser (and Parsoid) generates the following output:

<p><small>a</p>

<p>b</small></p>

Tidy generates

<p><small>a</small></p>
<p><small>b</small></p>

The AFE reconstruction algorithm in HTML5 tree building also does a similar thing => Both Remex and Parsoid generate the same output as above.

However, consider this wikitext:

<span>a

b</span>

PHP parser (and Parsoid) generates the following output:

<p><span>a</p>

<p>b</span></p>

Tidy generates

<p><span>a</span></p>
<p><span>b</span></p>

But, a HTML5 parser (and hence Remex and Parsoid) generates:

<p><span>a</span></p>
<p>b</p>

See how the second paragraph doesn't get the span tag whereas Tidy adds it there as well. So, all the above tags are subject to this behavior and that is what this linter category tracks. I will verify that all the 22 tags above are actually subject to this behavior.

Confirmed with the list of 22 tags above:
Input wikitext

<abbr>x

y</abbr>

<bdi>x

y</bdi>

<bdo>x

y</bdo>

<cite>x

y</cite>

<data>x

y</data>

<del>x

y</del>

<dfn>x

y</dfn>

<ins>x

y</ins>

<kbd>x

y</kbd>

<mark>x

y</mark>

<q>x

y</q>

<rb>x

y</rb>

<rp>x

y</rp>

<rt>x

y</rt>

<rtc>x

y</rtc>

<ruby>x

y</ruby>

<samp>x

y</samp>

<span>x

y</span>

<sub>x

y</sub>

<sup>x

y</sup>

<time>x

y</time>

<var>x

y</var>

Tidy output

<p><abbr>x</abbr></p>
<p><abbr>y</abbr></p>
<p><bdi>x</bdi></p>
<p><bdi>y</bdi></p>
<p><bdo>x</bdo></p>
<p><bdo>y</bdo></p>
<p><cite>x</cite></p>
<p><cite>y</cite></p>
<p><data>x</data></p>
<p><data>y</data></p>
<p><del>x</del></p>
<p><del>y</del></p>
<p><dfn>x</dfn></p>
<p><dfn>y</dfn></p>
<p><ins>x</ins></p>
<p><ins>y</ins></p>
<p><kbd>x</kbd></p>
<p><kbd>y</kbd></p>
<p><mark>x</mark></p>
<p><mark>y</mark></p>
<p><q>x</q></p>
<p><q>y</q></p>
<p><rb>x</rb></p>
<p><rb>y</rb></p>
<p><rp>x</rp></p>
<p><rp>y</rp></p>
<p><rt>x</rt></p>
<p><rt>y</rt></p>
<p><rtc>x</rtc></p>
<p><rtc>y</rtc></p>
<p><ruby>x</ruby></p>
<p><ruby>y</ruby></p>
<p><samp>x</samp></p>
<p><samp>y</samp></p>
<p><span>x</span></p>
<p><span>y</span></p>
<p><sub>x</sub></p>
<p><sub>y</sub></p>
<p><sup>x</sup></p>
<p><sup>y</sup></p>
<p><time>x</time></p>
<p><time>y</time></p>
<p><var>x</var></p>
<p><var>y</var></p>

Remex output (reformatted; matches Parsoid output)

<p><abbr>x</abbr></p>
<p>y</p>
<p><bdi>x</bdi></p>
<p>y</p>
<p><bdo>x</bdo></p>
<p>y</p>
<p><cite>x</cite></p>
<p>y</p>
<p><data>x</data></p>
<p>y</p>
<p><del>x</del></p>
<p>y</p>
<p><dfn>x</dfn></p>
<p>y</p>
<p><ins>x</ins></p>
<p>y</p>
<p><kbd>x</kbd></p>
<p>y</p>
<p><mark>x</mark></p>
<p>y</p>
<p><q>x</q></p>
<p>y</p>
<p><rb>x</rb></p>
<p>y</p>
<p><rp>x</rp></p>
<p>y</p>
<p><rt>x</rt></p>
<p>y</p>
<p><rtc>x</rtc></p>
<p>y</p>
<p><ruby>x</ruby></p>
<p>y</p>
<p><samp>x</samp></p>
<p>y</p>
<p><span>x</span></p>
<p>y</p>
<p><sub>x</sub></p>
<p>y</p>
<p><sup>x</sup></p>
<p>y</p>
<p><time>x</time></p>
<p>y</p>
<p><var>x</var></p>
<p>y</p>

Change 379430 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Linter: Flag misnested tags with different behavior in HTML5 vs Tidy

https://gerrit.wikimedia.org/r/379430

This error is picking up a lot of unclosed HTML tags which we already have a different category for; example.

And another example with a list instead: example.

This category should put pages in one or the other category, not both. But, yes, some pages which were previously in the missing-end-tag category might end up here instead. But, nevertheless, I will take a look to check if there are any useless issues being flagged and can be tweaked.

And another example with a list instead: example.

REMEX & PARSOID
[subbu@earth maintenance] echo "*<span style='font-size:30%'>a\n*<span>b" | php parse.php --tidy
<ul>
<li><span style="font-size:30%">a</span></li>
<li><span>b</span></li>
</ul>

TIDY
[subbu@earth maintenance] echo "*<span style='font-size:30%'>a\n*<span>b" | php parse.php --tidy
<ul>
<li><span style="font-size:30%">a</span></li>
<li><span style="font-size:30%"><span>b</span></span></li>
</ul>

So, rather than missing end tag, this new category is the right category for this issue. As you can see, this will affect rendering because of how Tidy and a HTML5 parser fix up the broken markup. As I said, once the page ends up in this category, the same error won't pu tthe page in the missing-end-tag category.

Oh, I see that in the description.

Although looks like I need to tweak the detection for instances like this one: https://en.wikipedia.org/w/index.php?title=Talk:Romany_Jones&action=parsermigration-edit&lintid=40886426 .. since the output from tidy and from parsoid seems identical ... so, it is just a missing-end-tag error and not a more serious error affecting tidy replacement.

Although looks like I need to tweak the detection for instances like this one: https://en.wikipedia.org/w/index.php?title=Talk:Romany_Jones&action=parsermigration-edit&lintid=40886426 .. since the output from tidy and from parsoid seems identical ... so, it is just a missing-end-tag error and not a more serious error affecting tidy replacement.

Ugh .. this one is a bit messy .. I need to account for <a> tags and link-trails in my code to flag this issue.

Although looks like I need to tweak the detection for instances like this one: https://en.wikipedia.org/w/index.php?title=Talk:Romany_Jones&action=parsermigration-edit&lintid=40886426 .. since the output from tidy and from parsoid seems identical ... so, it is just a missing-end-tag error and not a more serious error affecting tidy replacement.

Ugh .. this one is a bit messy .. I need to account for <a> tags and link-trails in my code to flag this issue.

https://gerrit.wikimedia.org/r/#/c/381382 should take care of fixing the false positives

Change 381382 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Linter: Improve detection of html5-misnesting issues

https://gerrit.wikimedia.org/r/381382

Change 381382 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Linter: Improve detection of html5-misnesting issues

https://gerrit.wikimedia.org/r/381382

Okay, the fixes have been deployed and my tests show that they work as intended and eliminate the false positives from this category. I'll check if it is possible to purge existing entries from the database .. If not, as the pages get edited (null or actually), the false positives will go away.

I ran the replication-safe version of DELETE FROM linter WHERE linter_cat=12; on all wikis so the html5-misnesting category should be empty and will start to fill up again as pages are edited.