
Parsing of table cells in Parsoid
Open, Low, Public

Description

Parsoid is designed around doing early tokenization of wikitext in the PEG parser and having the resulting tokens transformed by token transformers downstream.

Not all wikitext constructs lend themselves to this treatment. Table cell syntax is one of the problematic cases because it is very context-sensitive: what a "|" means depends on its context.
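To make the context-sensitivity concrete, here is a minimal Python sketch. It is not Parsoid code: `split_cell` is a hypothetical helper, and it assumes a drastically simplified wikitext in which a "|" inside `[[...]]` or `{{...}}` is never a cell-attribute separator and the `||` same-line cell separator does not exist.

```python
from typing import Optional, Tuple


def split_cell(line: str) -> Tuple[Optional[str], str]:
    """Split one '|'-prefixed cell line into (attributes, content).

    Hypothetical helper for illustration; real wikitext has many more cases.
    """
    if not line.startswith("|"):
        raise ValueError("expected a line starting with '|'")
    body = line[1:]
    depth = 0  # nesting depth inside [[...]] or {{...}}
    i = 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("[[", "{{"):
            depth += 1
            i += 2
        elif pair in ("]]", "}}"):
            depth = max(0, depth - 1)
            i += 2
        elif body[i] == "|" and depth == 0:
            # A top-level pipe: text before it is read as cell attributes.
            return body[:i].strip(), body[i + 1:].strip()
        else:
            i += 1
    # No top-level pipe: the whole body is cell content.
    return None, body.strip()


if __name__ == "__main__":
    for sample in ('| style="color:red" | foo',
                   "| [[Page|label]]",
                   "| {{tpl|arg}} text"):
        print(sample, "->", split_cell(sample))
```

In the first sample the second "|" separates attributes from content. In the other two, the same character is a link or template argument separator, so the cell has no attributes. The tokenizer has to track this nesting state to decide which reading applies.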

( Tangent: This syntactic mess also bites us when we are converting HTML to wikitext, not just wikitext to HTML. )

T88495, T48811, T69857, T69850, T52603, T46498, T178927, T112300 are all table-parsing bugs arising from this mismatch.

The description of T88495 has a helpful discussion of how this affects table attributes: either attributes are left out of cells, or content outside cells is treated as attributes. For the purposes of this task, we are concerned with the first problem. It is handled by the TableFixups DOM handlers in Parsoid, which have a bunch of special-case handling for specific use cases, yet we keep discovering unhandled scenarios ( T178927, T112300 ) and edge cases.
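The first problem (attributes left out of cells) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the `TEMPLATES` table, the `{{red}}` template, and the helper names are hypothetical, and none of this mirrors the actual TableFixups implementation. It only shows why splitting a cell before template expansion loses attributes that a template generates.

```python
import re
from typing import Optional, Tuple

# Hypothetical template table, invented for this example only.
TEMPLATES = {"red": 'style="color:red" | warning'}


def expand(text: str) -> str:
    """Replace each {{name}} with its expansion; unknown names are kept."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: TEMPLATES.get(m.group(1), m.group(0)),
                  text)


def split_cell(line: str) -> Tuple[Optional[str], str]:
    """Split a '|'-prefixed cell at the first '|' outside {{...}}."""
    body = line[1:]
    depth = 0  # nesting depth inside {{...}}
    i = 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth = max(0, depth - 1)
            i += 2
        elif body[i] == "|" and depth == 0:
            return body[:i].strip(), body[i + 1:].strip()
        else:
            i += 1
    return None, body.strip()


def tokenize_then_expand(line: str) -> Tuple[Optional[str], str]:
    """Split the cell first; the template is opaque at that point."""
    attrs, content = split_cell(line)
    return attrs, expand(content)


def expand_then_tokenize(line: str) -> Tuple[Optional[str], str]:
    """Expand templates first, then split the resulting wikitext."""
    return split_cell(expand(line))


if __name__ == "__main__":
    print(tokenize_then_expand("| {{red}}"))
    print(expand_then_tokenize("| {{red}}"))
```

With early tokenization, the cell `| {{red}}` ends up with no attributes and the text `style="color:red" | warning` as its content. Splitting after expansion recovers `style="color:red"` as the attribute and `warning` as the content. A fixup pass that runs after expansion has to reconstruct the second result from the first.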

So, short of fixing table syntax (which is not an option right now), we can either:

  • stop tokenizing table cells in the PEG tokenizer and instead tokenize them *after* templates are expanded, by combining wikitext across template boundaries in TokenStreamPatcher. This would be a big, serious change, and it goes against the grain of our strategy of moving towards template parsing that is independent of the top-level content.
  • rethink how we handle table cell reparsing in the DOM pass so that we don't have all this special-purpose handling, these edge cases, and missing support. To be clear, the code in TableFixups is somewhat generic, but its design appears to be a bit lacking, and that is what we might need to revisit at some point.