Use consistent serializers in OutputTransform stages
Open, Needs Triage, Public

Description

We use two or three different HTML serializers in the OutputTransform pipeline:

  1. Sanitizer: The legacy parser uses a serializer which is very careful to always quote attributes, and always use double-quotes (not single quotes). This is central to how the Sanitizer works, as several sanitization steps and regexps which run on output HTML rely on these conventions. See Sanitizer::safeEncodeTagAttributes; the methods Sanitizer::stripAllTags and Sanitizer::removeSomeTags should be verified against these assumptions as well.
  2. HtmlHelper: Passes which use HtmlHelper use Remex to serialize mutated HTML. Two different formatters are used, an "html5 format" and a RemexCompatFormatter which (in theory!) is compatible with the Sanitizer.
  3. Parsoid: Parsoid historically uses its own XmlSerializer, exposed via methods in classes like DOMUtils. This formatter tried hard to minimize the size of output HTML by (for example) dynamically selecting between single-quoted and double-quoted attributes based on which would yield the shortest string after whatever HTML entity escaping was necessary. Other than attribute quoting, it tried to be as compatible with the HTML5 "standard" serialization as possible, although there are likely some small differences in (e.g.) how <pre> tags are dumped, as the HTML5 standard was tweaked in some regards there over time.
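
To make the difference in quoting policy concrete, here is a minimal Python sketch contrasting an "always double-quote" serializer with a "pick the shorter form" serializer. It is not the actual Sanitizer or XmlSerializer code; the function names and the exact entities used for escaping (`&quot;`, `&#39;`) are assumptions made for illustration.

```python
DQ = '"'
SQ = "'"


def escape_attr(value: str, quote: str) -> str:
    """Escape an attribute value for use inside the given quote character."""
    # '&' must be escaped first so later entities are not double-escaped.
    value = value.replace("&", "&amp;")
    if quote == DQ:
        return value.replace(DQ, "&quot;")
    return value.replace(SQ, "&#39;")


def serialize_always_double(name: str, value: str) -> str:
    """Sanitizer-style policy (assumed): always emit double-quoted attributes."""
    return f'{name}="{escape_attr(value, DQ)}"'


def serialize_shortest(name: str, value: str) -> str:
    """Size-minimizing policy (assumed): emit whichever quoting is shorter.

    Ties go to double quotes.
    """
    double = f'{name}="{escape_attr(value, DQ)}"'
    single = f"{name}='{escape_attr(value, SQ)}'"
    return single if len(single) < len(double) else double


if __name__ == "__main__":
    value = 'say "hi"'
    print(serialize_always_double("title", value))
    print(serialize_shortest("title", value))
```

For a value that itself contains double quotes, the two policies produce different (but equivalent) HTML, so any later step that assumes one form will not recognize the other.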

The two main areas of incompatibility are (1) the handling of void tags, that is, whether /> or > is used; and (2) attribute quoting. We should try to unify these across Parsoid and the legacy parser/MediaWiki core code base. This would avoid "mysterious" test failures where slashes appear or disappear in expected output depending on the exact transformations applied to the HTML. It would also avoid possible Sanitizer or regexp failures caused by the Sanitizer's assumptions being invalidated by the serializations done by (2) or (3).
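
Both failure modes can be illustrated with a short Python sketch. The regexp below is hypothetical: it stands in for any post-processing step that assumes double-quoted attributes and is not copied from the Sanitizer. The `serialize_void` helper is likewise an illustrative name, not an existing API.

```python
import re

# Hypothetical attribute matcher that assumes every value is double-quoted.
ATTR_RE = re.compile(r'([a-zA-Z_:][-\w:.]*)="([^"]*)"')


def attrs_seen(tag_html: str) -> dict:
    """Return the attributes a double-quote-only regexp manages to find."""
    return dict(ATTR_RE.findall(tag_html))


def serialize_void(name: str, self_closing: bool) -> str:
    """Emit a void element either as '<br />' or as '<br>'."""
    return f"<{name} />" if self_closing else f"<{name}>"


if __name__ == "__main__":
    # With consistent double quoting, both attributes are found.
    print(attrs_seen('<a href="x" class="y">'))
    # With mixed quoting, the single-quoted href is silently missed.
    print(attrs_seen("<a href='x' class=\"y\">"))
    # The same void element serialized two ways fails a string comparison.
    print(serialize_void("br", True) == serialize_void("br", False))
```

The second call silently drops `href`, which is the kind of invalidated assumption described above. The final comparison shows why expected test output can differ by a slash even though the markup is equivalent.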

Related: