Page MenuHomePhabricator

Table parsing diffs: Parsoid adds implicit <td>s after a |- if explicit pipe is not present
Closed, ResolvedPublic

Description

This has been known for a long time, but this is beginning to be the cause of some parsing and rendering diffs.

See this test in parserTests.txt with copious notes.

## Note that Parsoid output differs from PHP and PHP+tidy here.
## The lack of <tr> tags in the PHP output is arguably a bug in the
## PHP parser, which tidy then compounds by fostering the content
## entirely out of the table.  Parsoid recognizes the table context
## and generates <tr> and <td> wrappers as needed.  Hopefully nobody
## depends on PHP's treatment of broken table markup!
!! test
Implicit <td> after a |-
!! options
parsoid=wt2html,html2html
!! wikitext
{|
|-
a
|}
!! html/php
<table>

a
</table>

!! html/php+tidy
<p>a</p>
!! html/parsoid
<table>
<tr><td>a</td></tr>
</table>
!! end

As it turns out, this does matter in production.

  1. https://www.mediawiki.org/wiki/Parsoid/Round-trip_testing/Diffs#Implicit_.3Ctd.3E_insertion
  2. compare https://en.wikipedia.org/w/index.php?title=Hartmut_Briesenick&oldid=677208396 vs http://parsoid-lb.eqiad.wikimedia.org/enwiki/Hartmut_Briesenick?oldid=677208396. The reduced test case is https://www.mediawiki.org/wiki/User:SSastry_%28WMF%29/Tests/Tables/1 vs http://parsoid-lb.eqiad.wikimedia.org/mediawikiwiki/User%3ASSastry_%28WMF%29%2FTests%2FTables%2F1?oldid=1856312

In the general case, Parsoid's PEG tokenizer doesn't know whether the line that follows a "|-" is going to be a <td> markup or not. The tokenizer assumes it is going to and adds a <td> if none is present. However, in some odd scenarios like the above, existing pages seem to be relying on the fact that the absence of the <td> leads to the <table> being closed (and then dropped by Tidy since the table is empty -- Tidy likes to drop empty elements).

So, once we replace Tidy (T89331) with a HTML5 parser / serializer, all these pages will see empty-table-ghosts coming back from the dead. So, Parsoid fixes for this will be dependent on how we decide to handle this Tidy fixup scenario.

Related Objects

Event Timeline

ssastry raised the priority of this task from to Medium.
ssastry updated the task description. (Show Details)
ssastry subscribed.

The bug reported in T132668: Table render bug on mw.org is because of this wikitext generated by the {{TNT}} template:

{| class="tpl-infobox ext-infobox ext-status-beta" cellspacing="0"
|+ '''[[Special:MyLanguage/Manual:Extensions|MediaWiki extensions manual]]'''
|- class="tpl-infobox-header ext-infobox-header"
! colspan="2" style="padding-top: 0.5em;" | [[File:Crystal Clear action run.png|link=:Special:MyLanguage/Template:Extension#Content|left|40px]] <span style="font-size: 130%;">Kartographer</span><br />
[[Extension status|Release status:]] beta[[Category:MIT licensed extensions]][[Category:beta status extensions]]
|-
[[Category:MediaWiki extensions without a screenshot]]
| style="vertical-align: top" | [[Special:MyLanguage/Template:Extension#type|'''Implementation''']]
| [[Manual:Tag extensions|Tag]][[Category:Tag extensions]], [[Manual:ContentHandler|ContentHandler]][[Category:ContentHandler extensions]]
...

The category is allocated its only <td>. If we fix this implicit-<td> issue, the category link will get fostered out which is fine we handle that scenario,

Change 335492 had a related patch set uploaded (by Arlolra):
T109897: Remove implicit_table_data_tag rule

https://gerrit.wikimedia.org/r/335492

Change 335492 merged by jenkins-bot:
T109897: Remove implicit_table_data_tag rule

https://gerrit.wikimedia.org/r/335492