
Add compatibility support for nowiki unstripping in extensions in Parsoid's wt->html transformations
Closed, Resolved, Public



  • In this document, preprocess-to-wt refers to the preprocessing concept in the core parser (Parser::preprocess, Parser::replaceVariables).
  • Given an extension tag X and wikitext string S="<X>str</X>", preprocess-to-wt(S) = S, i.e. extension X's content is opaque to preprocess-to-wt
  • But, sometimes, as an editor, you might want to construct 'str' in some fashion before passing it on to X. For example, {{my-map|coords=123:456}} might want to construct the appropriate input, and if you used <map>{{{coords}}}</map> inside the template (I made up this syntax), this won't do what you think it does.
  • The {{#tag:..}} parser function exists for exactly such use cases.
    • Given S="{{#tag:X|str}}", preprocess-to-wt(S)="<X>".preprocess-to-wt(str)."</X>"
    • So, {{#tag:map|{{{coords}}}}} will give you <map>123:456</map>
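The opacity rule above can be sketched with a toy model. This is illustrative Python only, not actual parser code: `ARGS` stands in for the template's arguments, and the regexes handle just enough syntax to show the difference between the two invocation forms.

```python
import re

# Hypothetical template arguments, standing in for {{{coords}}} etc.
ARGS = {"coords": "123:456"}

def expand(s: str) -> str:
    # Step 1: expand {{{name}}} template arguments.
    s = re.sub(r"\{\{\{(\w+)\}\}\}",
               lambda m: ARGS.get(m.group(1), m.group(0)), s)
    # Step 2: {{#tag:X|str}} -> <X> + preprocess-to-wt(str) + </X>
    return re.sub(r"\{\{#tag:(\w+)\|(.*?)\}\}",
                  lambda m: f"<{m.group(1)}>{expand(m.group(2))}</{m.group(1)}>", s)

def preprocess_to_wt(s: str) -> str:
    # Extension tag content is opaque: copy <X>...</X> spans verbatim,
    # running expansion only on the text outside them.
    out, pos = [], 0
    for m in re.finditer(r"<(\w+)>.*?</\1>", s, re.S):
        out.append(expand(s[pos:m.start()]))
        out.append(m.group(0))  # opaque: untouched
        pos = m.end()
    out.append(expand(s[pos:]))
    return "".join(out)

print(preprocess_to_wt("<map>{{{coords}}}</map>"))    # <map>{{{coords}}}</map>
print(preprocess_to_wt("{{#tag:map|{{{coords}}}}}"))  # <map>123:456</map>
```

The first call leaves the argument unexpanded (opaque extension content); the second expands it before re-wrapping in the tag, matching the {{#tag:..}} behavior described above.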

Problem TLDR

  • In the core parser, parse-to-html(preprocess-to-wt(wt)) != parse-to-html(wt). This is because in the latter case, the core parser leaves the wikitext in a partially-transformed state when it runs preprocess-to-wt internally, a state which extensions can (and sometimes have to!) inspect.
  • In Parsoid, parse-to-html(preprocess-to-wt(wt)) == parse-to-html(wt) since that is how Parsoid's pipeline is structured, i.e. there is no way to do the latter without fully preprocessing wt first. Decoupled processing is the Parsoid mojo.
  • For direct xml-tag invocations, this difference between Parsoid & core parser is not an issue since the content of the xml-tag is not touched by preprocess-to-wt.
  • But, if an extension tag is used via {{#tag:..}}, this difference between Parsoid & the core parser leads to rendering differences.


In the core parser,

  • In HTML-output mode, nowiki uses are replaced with a strip marker, and extensions know about this and use unstripNoWiki to deal with these however they want.
  • In preprocessing mode, nowiki uses are left alone. So, if a {{#tag:..}} parser function is preprocessed, it is transformed to the XML version and the nowiki will get passed through to the extension.
    • If the extension DOES NOT deal with wikitext, this can be a problem since the nowiki is now a nonsensical tag for the extension. Ex: syntaxhighlight, etc.
    • If the extension deals with wikitext, this should be okay since the nowiki is just wikitext and will get handled properly when the wikitext is processed. Ex: ref, poem, etc.
      • Except if the extension uses the nowiki as a hack to tunnel wikitext content without needing to add a lot of escapes. This is what *some* templates do via Scribunto. These templates expect a nowiki strip marker and strip them. This effectively changes wikitext semantics of their template arguments. But, on the other hand, templates can do whatever they want with their arguments.
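The strip-marker mechanism described above can be sketched with a toy model. This is illustrative only: the real logic lives in MediaWiki's StripState class, and the marker format here merely mimics the UNIQ/QINU convention.

```python
import re

class StripState:
    """Toy stand-in for MediaWiki's StripState (names are illustrative)."""
    def __init__(self):
        self.items = {}
        self.counter = 0

    def add_nowiki(self, content: str) -> str:
        # Replace content with an opaque marker later stages skip over.
        marker = f"\x7fUNIQ-nowiki-{self.counter:08d}-QINU\x7f"
        self.items[marker] = content
        self.counter += 1
        return marker

    def unstrip_nowiki(self, text: str) -> str:
        # What extensions do via unstripNoWiki / mw.text.unstripNoWiki.
        for marker, content in self.items.items():
            text = text.replace(marker, content)
        return text

state = StripState()

def strip_nowikis(wt: str) -> str:
    # HTML-output mode: every <nowiki>..</nowiki> becomes a strip marker.
    return re.sub(r"<nowiki>(.*?)</nowiki>",
                  lambda m: state.add_nowiki(m.group(1)), wt, flags=re.S)

stripped = strip_nowikis("{{#tag:syntaxhighlight|<nowiki>foo</nowiki>}}")
print(stripped)                        # contains an opaque UNIQ...QINU marker
print(state.unstrip_nowiki(stripped))  # nowiki content restored
```

An extension that receives `stripped` and does not unstrip sees the nonsensical marker string, which is exactly the syntaxhighlight-style problem noted above; one that calls the unstrip step recovers the original content.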

This only makes a difference for:

  • Extensions that are invoked via the {{#tag:..}} parser function and where the extension deals with the strip markers introduced by the core parser.
    • Ex: {{#tag:syntaxhighlight|<nowiki>foo</nowiki>}}
  • Wikitext-processing extensions where there is no xml-like invocation syntax (ex: Scribunto)
    • Ex: {{#invoke:some-template|<nowiki>''foo''</nowiki>}}
    • Here, the template exploits knowledge of parser internals, i.e. it knows that the legacy parser uses strip markers for nowikis and then proceeds to call unstripNoWiki on them!
      • Templates should not know about parser internals. This leaks implementation details into content. That said, nowiki is special. It is an escaping mechanism. So, the parser should expose a clean API for looking at nowiki content to everyone. For now, unstripNoWiki could be considered an API of sorts, but some cleanup might be worth it in the future.

More broadly,

  • This problem is not limited to the <nowiki> tag, although <nowiki> is the most common scenario where this problem manifests, because sometimes an extension's content might have substrings that could be confused for wikitext in {{#tag:..}} usage.
  • So, if you have extensions X and Y, <X><Y>foo</Y></X> and {{#tag:X|<Y>foo</Y>}} can do different things in the core parser and Parsoid. Note that <Y> might have special meaning within <X> and might not be an extension tag usage at all, as it would be at the top level.
    • In Parsoid, both forms reduce to <X><Y>foo</Y></X>. But, in the core parser, an extension X will get a strip-marker for Y in the #tag parser function form and will have to call unstripGeneral on its input before doing anything.
    • So, if you do a code search, you will see extensions call one of unstripNoWiki, unstripGeneral, unstripBoth on their input before proceeding. And, where they don't do this, you get the various leaking strip marker bug reports in phab. Parsoid will likely solve this problem by not introducing strip markers in the first place.
      • Strictly speaking, this isn't entirely true. Parsoid handles DOM fragment tunnelling by leaving behind marker HTML tags with fragment ids which are then always unpacked. But, there are still likely edge cases where HTML content is embedded in attributes and other places which Parsoid doesn't have access to. In those edge cases, HTML marker ids with "mw:DOMFragment" typeof attributes will be left behind (equivalent to the core parser's strip state markers).

Solution strategies

So, overall, it looks like the only real issue here is wrt nowiki usage, and we will only solve that problem here, i.e. in the X,Y pair example above, we only deal with the special case where Y=nowiki.

Soln 1 (Naive, won't work)

The most obvious (and naive) solution would be to change preprocess-to-wt(S) so it doesn't treat <nowiki>..</nowiki> substrings inside S as opaque - they are always stripped.

  • This immediately solves the problem for all non-wikitext extensions for whom <nowiki> has no special meaning.
  • But, it breaks usages for all wikitext extensions where <nowiki> has special meaning. So, this doesn't work.

Soln 2 (will work, needs time, not a short-term solution)

This nowiki usage in {{#tag:X|str}} mostly exists because of the need to escape characters in str from wikitext processing. Heredoc syntax (T114432) can completely solve the problem for them. But, we don't have that implemented at this time. Secondly, even after implementing it, all wikitext usages will have to migrate over to heredoc usage where they need the protection. So, this is not a short-term project at this time. Had we done this 2-3 years back, we probably would have solved this by now.

Soln 3 (might work)

Extensions register a config flag telling us whether they deal with wikitext or not. So, for extensions that register this flag, we implement Soln 1.

But, this won't do anything for Scribunto. We just throw up our hands and tell template editors: sorry, you can't use the hack you've been using. As it turns out, enwiki has already dealt with this for some templates at this point (Ex: Row numbers). But, there may be other templates and other wikis where this hack is being used. We can wait it out / lint this and hope for the best.

Soln 4 (should work)

This is just a tweak of Soln 3. Instead of extensions telling us whether they deal with wikitext or not, they tell us that they need "bc-unstrip-nowiki-support".

So, we implement Soln 1 for any extension that wants this support. The default value for this setting is false. So, SyntaxHighlight, Scribunto, CharInsert, and maybe a few other handful of extensions might register this flag in the config and that should be that!

We need to work through the details of this solution and implement it.
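To make the shape concrete, a registration for this flag might look something like the sketch below. This is hypothetical: the flag name comes from the description above, but the exact extension.json attribute structure here is invented for illustration and is not the merged implementation.

```json
{
	"name": "SyntaxHighlight",
	"attributes": {
		"Parsoid": {
			"bc-unstrip-nowiki-support": true
		}
	}
}
```

With the default being false, only the handful of extensions named above (SyntaxHighlight, Scribunto, CharInsert, ...) would carry such a stanza.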

Things to resolve

  • Do extensions opt in to the bc-unstrip-nowiki-support? At first glance, it seems opt-in is better than opt-out since (a) it makes the reliance on this behavior explicit, which makes it easier to phase this out in the future, and (b) it lets us incrementally improve Parsoid's compatibility with the legacy parser.
  • What is the specific mechanism we want to provide extensions for opt-in/opt-out? Some proposals below (without having thought through how / whether they will work)
    • Extensions implement a marker interface
    • Extensions set some config value in extension.json
    • Something else.

Related Objects

Event Timeline

I'd like to make "protecting template arguments with <nowiki>" a first-class escape mechanism, rather than something that is only used by a few "special" parser functions/extensions. Basically, I don't like creating new unusual corner case features. So I'd rather this be opt-out, and *always* unstrip the arguments (or at least, always unstrip them if they're of the special form matching ^<nowiki>.*</nowiki>$). Alternatively, if there's an opt-in, it would be along with using the rest of our "improved semantics" package, aka registering your parser function globally instead of via onParserFirstCallInit, etc.

I think subbu and I are fundamentally thinking about this the same way, just that he wants the uniform mechanism to be heredocs. So he wants only a few special cases to get the special nowiki handling because eventually we'll move them all to heredoc escaping and turn the special nowiki off. That might work as well.

The question, I guess, is which requires more content migration. My feeling is that the amount of *content* which uses arguments which specifically start/end with <nowiki> open/close tags is small enough that we can globally introduce that as an escape mechanism without breaking other content, but maybe not small enough to ever hope to completely rewrite. And looking at the elephant from another direction, @subbu sees the amount of content as being small enough that we can feasibly migrate the entirety to heredoc and eliminate the corner case entirely.

One of my major considerations is that "opt-out" will probably be a breaking change whereas "opt-in" will not be; opt-in is the path to "eventual consistency" for this use case between Parsoid & the core parser. We could get there far enough and then switch the behavior from opt-in to opt-out (if necessary, by introducing a new flag).

But, I am not opposed to opt-out if we can quickly identify all extensions that would need to opt-out. This set will likely include all (and only those) extensions that deal with wikitext (except for Scribunto). So, Cite, Poem, ImageMap, indicator are the obvious ones that pop to mind. We'll have to co-ordinate deployment across all those repos as well.

Change 816086 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/core@master] WIP: Prototype hack to handle nowikis in args of {{#tag:ext|...}}

Change 817900 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Add config support for nowiki stripping for ext args in #tag calls

Heredoc support would have solved this problem in all use cases. But, in the absence of that, our options are as follows.

In the Wikimedia cluster, SyntaxHighlight is the only extension that relies on this, and as such, I think the simple hack proposed in #4 in the description (and implemented in the patches linked to this task) is sufficient to keep us moving along.

As for Scribunto, the uses are going to be in Lua modules. For uses where template authors are exploiting the presence of strip markers, the nowiki wrap is a convention between page editors and module authors (to avoid adding lots of escaping) where the module strips the nowiki first. So we cannot unconditionally unstrip nowikis when invoking them. Our options are:

  1. We deprecate this usage as unsupported. As it turns out, editors have already proceeded down this route (see T203293#7603239). We could provide other first-class support for this in the future.
  2. The unstrip signal would have to come from templatedata (for lua modules) if we want to support this usage. Scott would like to introduce a raw-wikitext type for these args. If we do want to go the templatedata route, we will have to fix the #invoke parser function (like the #tag parser function). Note that the nowiki would have to be stripped in #invoke, not any of its caller wikitext transclusions.
  3. Wait for heredoc support to land so editors can escape blocks of wikitext (effectively allowing them to use the raw-wikitext type inline without needing TemplateData signals). This is probably the ideal / preferred solution since the nowiki wrappers being used and stripped in solution #2 are really a convention / hack that editors have come up with to deal with the lack of heredoc-like support.

If we want to consolidate behind the heredoc strategy eventually, #1 in the list above is probably a good short-term strategy for now. If so, we would need to add the deprecation info to our communications in this section.

Note that I don't think heredoc solves the /entire/ problem. It ensures that "special characters" aren't going to be mistaken for template/parser function delimiters, but it is the ultimate consumer of the parameter who needs to signal whether they want "the source text" or "html" or "expanded wikitext" or something else. Given:

this is '''bold''' but I can use a vertical bar | yay!

Some clients will eventually want the '''bold''' part rendered; one of the original use cases was to provide a robust quoting mechanism so that `{{cite}}` could be applied to larger regions of wikitext without having to rewrite it to escape special characters, but in that case the "first parameter to cite" definitely *does* want to eventually be rendered as wikitext/have embedded templates expanded/etc.

So I think something like the on-demand typing is still necessary in addition to heredocs to allow the extension author to ultimately decide how they want to handle the argument contents, even after heredocs ensures they arrive intact.

Also I've had a bit of a hard think about how templatedata interacts with the extension mechanism. The whole point of templatedata was that it was an *optional* type system, and in fact editors could "improve" the types of templates over time. I'm a little bit uncertain about how editors would react if "bad templatedata" caused *rendering* issues, not just *editing* issues. We will probably have some issues of this type with balanced templates, so this isn't novel, but it is still worth thinking about.

Certainly we don't want to crash an extension implementation if it is expecting type X and the templatedata is changed to give it type Y instead. I think ultimately this points to the "demand-driven types" being a better mechanism for those extensions which can be ported to use it, since when the extension demands an argument in "type X" it should get type X, although templatedata can be a helpful hint, especially in legacy cases where the legacy code is effectively demanding "a string" and we have great latitude in how exactly to construct that string.

I did a codesearch on enwiki to find other uses of this nowiki hack and found only one other instance, and it turns out that it is only used on the doc page of the template. I did a quick search on dewiki, itwiki, ptwiki a bit randomly and didn't find this hack used on those other wikis.

So, the only major use seemed to have been the Row numbers template which has been fixed as I noted above.

It should be possible to decline support for this usage in Scribunto modules.

Change 817900 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Add config support for nowiki stripping for ext args in #tag calls


Looks like I needed to look at this search as well. It turns up a use of this nowiki hack too, but it isn't used anywhere.

I reopened T272507: mw.text.unstripNoWiki doesn't unstrip when using Parsoid for the non-hacky uses of nowiki so we decouple that work from this task.

Change 822177 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/core@master] Add functionality to enable Scribunto to support Parsoid properly

I expect that if my current fix for T272507 sticks, it might also support the hacky nowiki uses that templates / modules employ to minimize escaping.

Change 822641 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.16.0-a19

Change 822641 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.16.0-a19

Change 816086 merged by jenkins-bot:

[mediawiki/core@master] Added Parsoid support for nowiki stripping in args of {{#tag:ext|...}}

Change 826408 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/core@master] Add support to enable Scribunto & Parsoid to handle nowikis properly

Change 826408 merged by jenkins-bot:

[mediawiki/core@master] Add support to enable Scribunto & Parsoid to handle nowikis properly

Change 822177 abandoned by Subramanya Sastry:

[mediawiki/core@master] Add functionality to enable Scribunto to support Parsoid properly


In favour of Ied0295feab06027a8df885b3215435e596f0353b

I can't find the related task at the moment, so I'm dumping some content from slack here for future reference.

<mapframe> ... </mapframe> -- the semantics are that the contents between the tags are completely unprocessed. This is one of the few relatively reliable escape mechanisms. Of course (like with <poem> or <pre>) the extension could recursively invoke the parser on it, but it is given to the extension untouched.

{{#tag|mapframe|I can use {{templates}} here}} is given to the extension after expansion, I believe (could be wrong about this!). It is explicitly recommended in for example.

As with all things, there are probably ways to get different behavior if you "try hard", aka explicitly give the <tag> contents to a recursive invocation of the parser in the first case, and use the "strip state hack" to protect the template-y arguments in the second case [and that is what this particular task is about].

My personal goal w/ the extension API is to provide a more uniform interface, so that the extension author will basically have the same mechanisms available no matter how the extension (parser function, behavior switch, ....) is invoked, and they call $arg->asRawText() or $arg->asHtml() or $arg->asExpandedWikitext() or whatever to get the format they want.
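That proposed shape could be sketched as follows. Python is used purely for illustration; only the method names echo the proposal above, and the expansion/parse bodies are trivial stand-ins, not real parser behavior.

```python
from dataclasses import dataclass

@dataclass
class Argument:
    """Sketch of a uniform extension-argument API (hypothetical)."""
    raw: str  # source text, exactly as the author typed it

    def as_raw_text(self) -> str:
        # The untouched source text.
        return self.raw

    def as_expanded_wikitext(self) -> str:
        # Stand-in for template/argument expansion (preprocess-to-wt).
        return self.raw.replace("{{tmpl}}", "expanded")

    def as_html(self) -> str:
        # Stand-in for a full wt->html parse of the expanded wikitext.
        wt = self.as_expanded_wikitext()
        return wt.replace("'''", "<b>", 1).replace("'''", "</b>", 1)

arg = Argument("this is '''bold''' and {{tmpl}}")
print(arg.as_raw_text())           # this is '''bold''' and {{tmpl}}
print(arg.as_expanded_wikitext())  # this is '''bold''' and expanded
print(arg.as_html())               # this is <b>bold</b> and expanded
```

The point of the design is that the *consumer* picks the format it wants, rather than each invocation style (tag vs. parser function) imposing a different one.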

The legacy parser API is also pretty ad hoc, and it's not well documented how to get the format you want from each type of extension point (extension tag, parser function, behavior switch, ..).

Note that we already have a similar mechanism for the output of an extension point (tag, parser function, etc) which can almost always be either "expanded wikitext" or "HTML". Sometimes the wikitext output is subject to further processing, sometimes not; sometimes you have to use strip state in order to access the "HTML" output. But there is some means to get different types of output, it's just not handled in a uniform way.

(This is expanding scope, but I'd also like to see the input and output types explicitly synchronized, to more cleanly handle the {{foo|{{bar}}}} case, and I'd like to add some structured data type like JSON or a name/value map (for parameters, attributes, etc) to the universe.)