
Allow <thead> <tbody> <tfoot> as literal HTML tags in Wikitext
Open, Needs Triage, Public

Description

This is a down-scoped version of T6740: thead, tbody, tfoot for wikitable syntax which proposed "wikitext-ish" vertical-bar-and-braces syntax for these table features. This task is "just" for allowing these tags to be included as literal HTML in wikitext. (Previously: T5156, and I'm sure others.)

The core is a 1- or 2-line patch to the Sanitizer to allow these tags through when present as tag literals in wikitext. But that opens up a number of possible issues that would need to be understood and worked through:

  • As discussed in T6740#8383265 and T289817#8225410, the jquery.tablesorter.js component "already does" this, so we'd want to make sure a literal <thead> doesn't break tablesorter when used.
  • Need to think through what attributes should be allowed on these elements. Probably the attributes the sanitizer allows for <table> and <tr> are appropriate?
  • As mentioned in T5156#74956 we'd have to verify that Remex Tidy handles corner cases, such as nesting or "orphaned" <tr>/<thead> elements, correctly. (This is *probably* correct and/or reasonable, since Remex is based on HTML5 semantics, but we haven't actually *tested* this corner of its behavior AFAIK...)
  • There will be interactions with "wikitext" table syntax. What happens if you start a wikitext table with {| and then insert a literal <thead> tag? Is that behavior consistent in Parsoid and the legacy parser, and the exact behavior something we want to support forever? The legacy parser's table-handling is pretty janky (T134469, etc) -- should we try to explicitly disable <thead> and friends if we're not in a "literal HTML" table, so we don't get inconsistencies between the legacy parser and Parsoid?
  • Do the <thead> etc elements play nicely with the current WMF skins and article CSS?
  • Do the <thead> etc elements play nicely with the transformations done for mobile?
  • How does this interact with <caption>/<colgroup>/<col> and <table summary="..."> and the scope and headers attributes? (See https://developer.mozilla.org/en-US/docs/Learn/HTML/Tables/Advanced). Perhaps we want to roll this out as part of a more-complete "HTML5 table" feature?
  • Should we also support a "default" <thead> element -- that is, in the absence of an explicit <thead>, if the first row(s) of a table contain nothing but "th" cells, should they be hoisted into a default <thead>? Is that even useful, if the <thead> contains no class or id attributes? (And how would that hoisting interact with something like {{#attr}} (T230658)?) (This is probably a big enough feature chunk to merit a subtask of its own.)
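To make the last bullet concrete, here is a minimal sketch of what the "default <thead>" test could look like. It is only an illustration: the function name and the row model (each row as a list of cell tag names) are hypothetical, and nothing here reflects how the legacy parser, Parsoid, or Remex actually represent tables.

```python
def count_default_thead_rows(rows, has_explicit_thead=False):
    """Return how many leading rows could be hoisted into a default <thead>.

    `rows` is a hypothetical model of a table: a list of rows, each row a
    list of cell tag names ("th" or "td"). A row qualifies only if it is
    non-empty and contains nothing but "th" cells.
    """
    # An explicit <thead> always wins; nothing is hoisted in that case.
    if has_explicit_thead:
        return 0
    count = 0
    for cells in rows:
        # The header run ends at the first empty row or row with a "td".
        if not cells or any(tag != "th" for tag in cells):
            break
        count += 1
    return count


# One all-"th" row followed by a mixed row: only the first row qualifies.
header_rows = count_default_thead_rows([["th", "th"], ["th", "td"], ["td", "td"]])
```

This sketch deliberately says nothing about attributes on the hoisted <thead>, which is the part of the question that interacts with {{#attr}}.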

Improving table support is probably worth putting on the Content-Transform-Team roadmap, but not until after the Parsoid-Read-Views migration.

Event Timeline

The HTML versions of the table elements already get cleaned up by something in PHP. Today, the literal wikitext

<table>
  <tr>
    <th>ABC</th>
  </tr>
  <td>DEF</td>
</table>

gets cleaned up to

<table>
  <tbody>
    <tr>
      <th>ABC</th>
    </tr>
    <tr>
      <td>DEF</td>
    </tr>
  </tbody>
</table>

(You'll notice two cleanups there, which maybe speaks to one of your questions.) I'd speculate that at least the tbody addition is generally anticipated, so allowing explicit tbody et al. must avoid breaking that old behavior for some lengthy period, if not indefinitely. That is, unless a tbody is found while parsing the table, MediaWiki should continue to auto-insert these. There are definitely use cases for manually adding an explicit tbody: supporting tables where you can collapse multiple parts of the table is a feature request I've heard a few times, and from memory makeCollapsible currently operates on tbody (and now I see where this discussion came from). I suspect thead would have much the same rationale.
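The backward-compatibility rule suggested here can be sketched roughly as follows. This is an assumption-laden illustration, not the existing cleanup code: the function name and the flat list-of-tag-names model are hypothetical, and it covers only the tbody wrapping, not the second cleanup (wrapping a stray <td> in a <tr>).

```python
def auto_insert_tbody(children):
    """Wrap loose rows in a tbody unless the table already has one.

    `children` is a hypothetical model of a table's direct children as a
    list of tag names, e.g. ["caption", "tr", "tr"]. The result keeps
    non-row children as plain strings and represents an inserted tbody
    as the tuple ("tbody", [rows...]).
    """
    # Explicit tbody present: leave the author's structure untouched.
    if "tbody" in children:
        return list(children)
    result = []
    run = []  # consecutive loose <tr> children collected so far
    for tag in children:
        if tag == "tr":
            run.append(tag)
            continue
        if run:
            result.append(("tbody", run))
            run = []
        result.append(tag)
    if run:
        result.append(("tbody", run))
    return result


# No explicit tbody: the loose rows after an explicit thead get wrapped.
wrapped = auto_insert_tbody(["thead", "tr", "tr"])
```

Note that real HTML5 tree construction wraps loose rows even when another tbody exists, so whether "leave it untouched" is the right call for mixed tables is exactly the kind of corner case that would need testing.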

There will be interactions with "wikitext" table syntax. What happens if you start a wikitext table with {| and then insert a literal <thead> tag? Is that behavior consistent in Parsoid and the legacy parser, and the exact behavior something we want to support forever? The legacy parser's table-handling is pretty janky (T134469, etc) -- should we try to explicitly disable <thead> and friends if we're not in a "literal HTML" table, so we don't get inconsistencies between the legacy parser and Parsoid?

I know that someone or other from Content Transform (if not you, probably Subbu) has suggested that literal HTML tables shouldn't play nice with wikitext tables forever, and to be honest I'm in favor of that. We just need some sort of linting for it, with an associated migration period. This change might be motivation for that, given all those questions.

I think we still ultimately need a wikitext syntax for representing these (tfoot and tbody at least, I think). Ad hoc wikitables in the wild occasionally have multiple bodies and multiple repeated column heads (hopefully something that sticky table headers can make unnecessary at some point), and occasionally a full-width table footer, most often holding a key of some sort (which should be above the table).

Do the <thead> etc elements play nicely with the current WMF skins and article CSS?

As noted, tbody is already injected. thead would need review. The only system that adds it is makeCollapsible.js. We'd have to look to see how many > tr > th stylings there are in the wild that might break if thead were supported. I don't anticipate tfoot as an interesting question.
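As a rough idea of how that survey might start, the sketch below flags selectors that use the child-combinator pattern "> tr > th". The function name and the input format (a plain list of selector strings) are hypothetical; a real survey would need an actual CSS parser and access to on-wiki stylesheets and TemplateStyles, and flagged selectors would still need manual review.

```python
import re

# Matches a child-combinator chain ending in "> tr > th", with optional
# whitespace. The trailing \b keeps "th" from matching inside "thead".
CHILD_TR_TH = re.compile(r">\s*tr\s*>\s*th\b")


def selectors_needing_review(css_selectors):
    """Return the selectors that contain a '> tr > th' child chain."""
    return [sel for sel in css_selectors if CHILD_TR_TH.search(sel)]


flagged = selectors_needing_review([
    ".wikitable > tbody > tr > th",  # depends on th rows living in tbody
    ".wikitable th",                 # descendant selector, unaffected
    "table>tr>th.x",                 # child chain without whitespace
])
```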

<caption>, <table summary="..."> and the scope and headers attributes?

In general, I don't see this as an issue beyond the 'normal' HTML validation perspective. We (editors) treat Perfectly Marked Up Tables as a Best Effort thing, adding scopes as we go, etc. Right now linting is pretty weak, simply following the rendering model (so I guess the fostering algorithm might catch misordering or misparenting of the variety of the prospective elements). The HTML domain model could (perhaps should) be checked by the linter to ensure Good Formedness. I don't think there's a task for that. It would obviously have other uses, but I anticipate that would be some work.

Regarding <table summary="..."> particularly, the summary attribute is deprecated in HTML 5 (well, WHATWG HTML, IDK if the last HTML 5 version released considered it deprecated) and should be in the T173944: Linter should lint for obsolete HTML attributes pile (though that says obsolete, we should be checking deprecated attributes also).

<colgroup>/<col>

These are also filtered out of wikitext similarly to this task and I know of no system that adds them in JS. Work done to support this task might reasonably also look into T2986: [tables] Please implement COL, COLGROUP. What you can do with styling the 'logical children' of those elements is not much at all (the only styles that can target a col's "children" are border, background, width, and visibility, see CSS2.1, basically only the ones that had support in an attribute form), so there's not a lot of value to adding them that I know of (can't even style text alignment; the only alternative for such kinds of stylings is :nth-child(), available only for scripts and TemplateStyles).

[Scribunto]

Pretty sure the mw.html library in Scribunto does no validation of its own and instead relies on sanitizing, but this is another point to double-check.

Should we also support a "default" <thead> element -- that is, in the absence of an explicit <thead>, if the first row(s) of a table contain nothing but "th" cells, should they be hoisted into a default <thead>? Is that even useful, if the <thead> contains no class or id attributes? (And how would that hoisting interact with something like {{#attr}} (T230658)?) (This is probably a big enough feature chunk to merit a subtask of its own.)

Probably sufficiently desirable if makeCollapsible wants it, and I haven't really noticed that logic to fail on tables that were designed for it. OTOH there are probably many tables out there not designed for it, so we might need to lint some things? I know that there are still several old-style tables like:

<table>
  <tr>
    <th colspan="2">This should really be a &lt;caption&gt;
  <tr>
    <th>
    <th>
</table>

But those would be taken care of by catching the multiple rows as expected. There are also cases like Template:Navbox which will be divified at some point in the future, but I don't think those are an issue.

Regarding {{#attr}} in particular, even if that had consensus in the form you've suggested, <thead> is similar to <ul> in a lot of ways, as in your comment there regarding list containers. An automatic one could reasonably be presumed to be of a certain quality even without other attributes, just as <tbody> is today. (A general improvement, as demonstrated above, might be to add a .mw-auto-insertion class, just to make it obvious to scripts and/or even the parser that placement of the elements may not be perfect. It might reasonably also be used to mark up the insertion of other things like <ul> and <figure>.)

Need to think through what attributes should be allowed on these elements. Probably the attributes the sanitizer allows for <table> and <tr> are appropriate?

I would like to avoid adding support for deprecated or obsolete attributes in HTML, so per MDN, I'd prefer to add just the global attributes that are already supported in wikitext (id, style, aria-*, etc. etc.). Users haven't had access to these elements before, and with the existing globals they should be able to find other ways to deal with their lack of align and the like.
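A minimal sketch of that "globals only" policy, purely for illustration: the attribute set below is an assumption (the comment names only id, style, and aria-*; class, lang, dir, and title are added here as guesses), and the function is hypothetical and unrelated to the real Sanitizer code in MediaWiki core.

```python
# Assumed allowlist. Only "id", "style" and aria-* come from the comment
# above; the remaining names are illustrative guesses.
GLOBAL_ATTRS = {"id", "class", "style", "lang", "dir", "title"}
SECTION_TAGS = {"thead", "tbody", "tfoot"}


def filter_section_attrs(tag, attrs):
    """Keep only global attributes on a table section element.

    `attrs` maps attribute names to values. Presentational attributes
    such as align or bgcolor are dropped rather than rejected.
    """
    if tag not in SECTION_TAGS:
        raise ValueError(f"not a table section element: {tag}")
    kept = {}
    for name, value in attrs.items():
        lowered = name.lower()
        if lowered in GLOBAL_ATTRS or lowered.startswith("aria-"):
            kept[lowered] = value
    return kept


cleaned = filter_section_attrs(
    "thead", {"id": "head", "align": "center", "aria-label": "Totals"}
)
```

Under this policy an editor who wants centered header text would use style or a class instead of align.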

Change 948593 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/core@master] Sanitizer: Update to HTML 5 and allow some elements

https://gerrit.wikimedia.org/r/948593