Parsoid support for 'general' vs 'nowiki' strip markers
Open, Low, Public

Description

In the legacy parser, 'general' strip markers have doBlockLevels, LanguageConversion, and some other passes applied to them (see T381617#10387908), while 'nowiki' strip markers do not.
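To make the difference concrete, here is a toy model in Python (not MediaWiki code). It assumes the two marker types differ only in when they are unstripped: 'general' before the late passes, 'nowiki' after them. The `StripState` class, the marker format, and `late_pass` (a stand-in for something like language conversion) are all hypothetical names made up for this sketch.

```python
class StripState:
    """Toy strip state holding two kinds of markers (hypothetical, for illustration)."""

    def __init__(self):
        self.general = {}
        self.nowiki = {}
        self._next_id = 0

    def add(self, kind, html):
        # Simplified marker format; the real one differs.
        marker = f"\x7fUNIQ-{kind}-{self._next_id}-QINU\x7f"
        self._next_id += 1
        getattr(self, kind)[marker] = html
        return marker

    def unstrip(self, kind, text):
        # Replace every marker of the given kind with its stored HTML.
        for marker, html in getattr(self, kind).items():
            text = text.replace(marker, html)
        return text


def late_pass(text):
    """Stand-in for a late pass (e.g. language conversion); it rewrites one word."""
    return text.replace("colour", "color")


def render(text, state):
    text = state.unstrip("general", text)  # general content becomes visible to late passes
    text = late_pass(text)
    text = state.unstrip("nowiki", text)   # nowiki content is restored untouched
    return text


state = StripState()
source = (
    "colour "
    + state.add("general", "<b>colour</b>")
    + " "
    + state.add("nowiki", "<i>colour</i>")
)
result = render(source, state)
print(result)
```

In this model the late pass rewrites the content behind the 'general' marker and leaves the content behind the 'nowiki' marker alone, which is the behavioral split described above.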

Parsoid is never going to implement legacy doBlockLevels, but we could (should?) tag the HTML generated from 'general' strip markers for language conversion.

This applies to Special page transclusions as well as extension-generated HTML, although perhaps not *intentionally*. See, for instance, T391109: Consider parsing transcluded special pages as raw HTML.

On the other hand, maybe we'll just deprecate the 'general' strip marker type and/or the 'isHTML' return mode (in favor of 'isRawHTML', T381617).

Event Timeline

One major difference is that if you use recursiveTagParse() and output the result in a strip marker, using a general strip marker means that links still work, whereas with a nowiki strip marker, I don't think they do.

That's something to keep an eye on for sure. I had the opposite (but probably related) issue with the {{#interwikilink}} parser function: I needed to use "raw html mode" (which uses a strip marker under the hood) to ensure the generated links showed up in the output.

Change #1138909 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Don't run doBlockLevels (etc) on Special page and "scary" transclusions

https://gerrit.wikimedia.org/r/1138909

See T3319: doBlockLevels inserts pre-tags in a text created by an extension for some history. Note the original rationale for applying doBlockLevels was (T3319#53982):

The reason Gabriel put doBlockLevels() last, after unstrip(), is because he wanted the output to be valid XHTML. That means not having block-level elements inside other block-level elements. doBlockLevels() will automatically detect nested block-level elements, and will break up the outer one to produce valid HTML.

The solution I suggested, in private discussion years ago, is to have doBlockLevels() scan the stripped HTML for block-level elements, and if they are present, to treat the strip marker as a block-level element. Then the resulting string can be unstripped to produce valid XHTML.

We can actually do that with Remex now, and probably should. This is a special case of T114445: [RFC] Balanced templates.
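A rough sketch of the "scan the stripped HTML for block-level elements" idea, written in Python with the standard library's `html.parser` rather than Remex, purely to illustrate the shape of the check. The `BLOCK_TAGS` set is an assumed and deliberately incomplete list, and `contains_block_element` and `classify_markers` are hypothetical helpers, not existing parser functions.

```python
from html.parser import HTMLParser

# Assumed, incomplete list of block-level element names (illustration only).
BLOCK_TAGS = {
    "div", "p", "table", "ul", "ol", "li", "dl", "pre", "blockquote",
    "h1", "h2", "h3", "h4", "h5", "h6", "hr",
}


class _BlockDetector(HTMLParser):
    """Records whether any start tag seen so far is in BLOCK_TAGS."""

    def __init__(self):
        super().__init__()
        self.found_block = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in BLOCK_TAGS:
            self.found_block = True


def contains_block_element(html):
    """Return True if the HTML fragment contains a block-level start tag."""
    detector = _BlockDetector()
    detector.feed(html)
    detector.close()
    return detector.found_block


def classify_markers(strip_state):
    """Map each marker to 'block' or 'inline' based on its stripped HTML."""
    return {
        marker: "block" if contains_block_element(html) else "inline"
        for marker, html in strip_state.items()
    }


example = classify_markers({
    "MARKER-0": "<table><tr><td>1</td></tr></table>",
    "MARKER-1": "<b>bold</b>",
})
print(example)
```

A block-level pass could then consult this classification and treat a 'block' marker like a block-level element instead of wrapping it in a paragraph.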

MSantos added a project: Essential-Work.
MSantos subscribed.

We might need to select a list of pages to add to the VisualDiff sample to ensure these cases are getting tested.