Page MenuHomePhabricator

Consider parsing transcluded special pages as raw HTML
Open, Needs TriagePublic

Description

The Parser code that handles special page transclusion flags the transcluded content as HTML. This prevents us from applying certain wikitext transformations to it, but not all of them. In particular, it is still subject to doBlockLevels.

Today I was experimenting with special page transclusion for T389892, and noticed something odd: the page uses a Mustache template indented with spaces. When this is added to the (transcluded) page output, the double-space indentation is therefore turned into <pre> tags in doBlockLevels. Note that this would apply to any piece of HTML indented with spaces, not necessarily mustache templates.

Now, obviously I can go and reindent the template. However, it seems odd to me that the Parser is applying these transformations to the HTML of a special page. The same HTML is (likely) used as-is on the special page itself, where it is already functional, and does not need additional transformations. Exactly two months ago, r1101112 introduced a new isRawHTML output mode for T381617. To me, it would make sense to switch special page transclusion to use that rather than isHTML, as that would prevent unwanted transformations. However, I imagine there might still be valid reasons to apply at least some of the transformation, and at the very least this would require a review of existing transcludable special pages (thankfully there aren't too many).

Event Timeline

Hi Content-Transform-Team! I'd be curious to hear your opinion on this, if possible. We're probably going with a workaround for T389892, but it'd be nice to consider this for the future.

Yes, this is mentioned in T381709: Parsoid support for 'general' vs 'nowiki' strip markers as a side effect of any transclusion which includes its output using the 'general' strip marker, and as mentioned in T381617: XSS in Charts extension combined with TemplateStyles, LanguageConverter, ToC I'd like to eventually deprecate the 'isHTML' return type in favor of 'isRawHTML' if at all possible.

And indeed Parsoid transcludes Special pages without invoking doBlockLevels on them; T381709 is the "bug" for this, which we might resolve by just defining Parsoid's way as "the right way".

I'm interested in your thoughts on the way forward if we decide to change this, and whether there's any way we could proactively audit for special page transclusions before just trying to switch things over and see what breaks.

Change #1138909 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Don't run doBlockLevels (etc) on Special page and "scary" transclusions

https://gerrit.wikimedia.org/r/1138909

And indeed Parsoid transcludes Special pages without invoking doBlockLevels on them; T381709 is the "bug" for this, which we might resolve by just defining Parsoid's way as "the right way".

I'm interested in your thoughts on the way forward if we decide to change this, and whether there's any way we could proactively audit for special page transclusions before just trying to switch things over and see what breaks.

I lack basically the entire context on Parser & Parsoid, but it sounds reasonable. I guess it ultimately boils down to how/where the isHTML mode is being used, aside from special page transclusion. I'm not sure if it's possible, but maybe for things that are just isHTML, could we also transform them as isRawHTML and log if there are differences? Let it sit for say a month, see what we get, then make the change.

For special pages in particular, I think the change is relatively safe. As I wrote in the task description, the HTML being transcluded is almost always the same HTML used on the special page itself, where it is output with no extra transformations. The "almost always" is due to special pages that change their behaviour when included. I suppose that could be audited manually, as there aren't many instances, and the negated ones can probably be filtered out. If we want to be super confident, we could do the logging as described above. Then mention it in the release notes as a breaking change and call it a day.