Background
- In this document, preprocess-to-wt refers to the preprocessing concept in the core parser (Parser::preprocess, Parser::replaceVariables).
- Given an extension tag X and wikiext string S="<X>str</X>", preprocess-to-wt(S) = S i.e. extension X's content is opaque to preprocess-to-wt
- But sometimes, as an editor, you might want to construct 'str' in some fashion before passing it on to X. For example, {{my-map|coords=123:456}} might want to construct the appropriate input from its arguments. If you used <map>{{{coords}}}</map> inside the template (I made up this syntax), it won't do what you might expect, since {{{coords}}} is never expanded.
- {{#tag:..}} parser function exists for exactly use cases like this.
- Given S="{{#tag:X|str}}", preprocess-to-wt(S)="<X>".preprocess-to-wt(str)."</X>"
- So, {{#tag:map|{{{coords}}}}} will give you <map>123:456</map>.
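The two behaviours above can be sketched with a toy model in Python. This is only an illustration under simplifying assumptions: the names EXT_TAGS, expand and preprocess_to_wt are made up here and are not the core parser's API, and the regexes only handle flat, non-nested input.

```python
import re

# Toy model only; none of these names exist in the core parser.
EXT_TAGS = ("map",)  # hypothetical set of registered extension tags

_OPAQUE = re.compile(r"<(%s)>.*?</\1>" % "|".join(EXT_TAGS), re.S)
_ARG = re.compile(r"\{\{\{(\w+)\}\}\}")
_TAG_FN = re.compile(r"\{\{#tag:(\w+)\|(.*?)\}\}", re.S)


def expand(wt, args):
    """Expand {{{arg}}} uses, then rewrite {{#tag:X|str}} to <X>str</X>."""
    wt = _ARG.sub(lambda m: args.get(m.group(1), m.group(0)), wt)
    return _TAG_FN.sub(
        lambda m: "<%s>%s</%s>" % (m.group(1), m.group(2), m.group(1)), wt
    )


def preprocess_to_wt(wt, args):
    """Expand everything except the content of direct <X>...</X> uses."""
    out, pos = [], 0
    for m in _OPAQUE.finditer(wt):
        out.append(expand(wt[pos:m.start()], args))
        out.append(m.group(0))  # extension content is opaque: copied as-is
        pos = m.end()
    out.append(expand(wt[pos:], args))
    return "".join(out)


args = {"coords": "123:456"}
direct = preprocess_to_wt("<map>{{{coords}}}</map>", args)
via_tag = preprocess_to_wt("{{#tag:map|{{{coords}}}}}", args)
```

In this model, direct keeps the unexpanded {{{coords}}} inside <map>, while via_tag carries the expanded coordinates.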
Problem TLDR
- In the core parser, parse-to-html(preprocess-to-wt(wt)) != parse-to-html(wt). This is because, in the latter case, the core parser runs preprocess-to-wt internally and leaves the wikitext in a partially-transformed state, which extensions can (and sometimes have to!) inspect.
- In Parsoid, parse-to-html(preprocess-to-wt(wt)) == parse-to-html(wt) since that is how Parsoid's pipeline is structured, i.e. there is no way to do the latter without fully preprocessing wt first. Decoupled processing is the Parsoid mojo.
- For direct xml-tag invocations, this difference between Parsoid & core parser is not an issue since the content of the xml-tag is not touched by preprocess-to-wt.
- But, if an extension tag is used via {{#tag:..}}, this difference between Parsoid & the core parser leads to rendering differences.
Details
In the core parser,
- In HTML-output mode, nowiki uses are replaced with a strip marker, and extensions know about this and use unstripNoWiki to deal with these however they want.
- In preprocessing mode, nowiki uses are left alone. So, if a {{#tag:..}} parser function is preprocessed, it is transformed to the XML-tag version and any nowiki in its argument gets passed through to the extension as-is.
- If the extension DOES NOT deal with wikitext, this can be a problem since the nowiki is now a nonsensical tag for the extension. Ex: syntaxhighlight, etc.
- If the extension deals with wikitext, this should be okay since the nowiki is just wikitext and will get handled properly when the wikitext is processed. Ex: ref, poem, etc.
- Except if the extension uses the nowiki as a hack to tunnel wikitext content without needing to add a lot of escapes. This is what *some* templates do via Scribunto. These templates expect a nowiki strip marker and strip them. This effectively changes wikitext semantics of their template arguments. But, on the other hand, templates can do whatever they want with their arguments.
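To make the two modes concrete, here is a toy Python model of what an extension receives for {{#tag:syntaxhighlight|<nowiki>foo</nowiki>}}. It is a sketch under assumptions: the StripState class, the marker format, and the syntaxhighlight handler are simplified stand-ins invented for illustration, not the core parser's or the extension's real code.

```python
import re

_NOWIKI = re.compile(r"<nowiki>(.*?)</nowiki>", re.S)


class StripState:
    """Toy stand-in for the core parser's strip state (simplified markers)."""

    def __init__(self):
        self.nowiki = {}

    def add_nowiki(self, content):
        # Made-up marker format; the real one differs.
        marker = "\x7fUNIQ-nowiki-%d-QINU\x7f" % len(self.nowiki)
        self.nowiki[marker] = content
        return marker

    def unstrip_nowiki(self, text):
        for marker, content in self.nowiki.items():
            text = text.replace(marker, content)
        return text


def tag_arg_html_mode(arg, state):
    """HTML-output mode: each nowiki is replaced with a strip marker."""
    return _NOWIKI.sub(lambda m: state.add_nowiki(m.group(1)), arg)


def tag_arg_preprocess_mode(arg):
    """Preprocessing mode: nowiki is left alone."""
    return arg


def syntaxhighlight(src, state):
    """Hypothetical non-wikitext extension that unstrips nowiki markers."""
    return "<pre>" + state.unstrip_nowiki(src) + "</pre>"


state = StripState()
arg = "<nowiki>foo</nowiki>"
html_mode = syntaxhighlight(tag_arg_html_mode(arg, state), state)
preprocessed = syntaxhighlight(tag_arg_preprocess_mode(arg), state)
```

In HTML-output mode the handler recovers plain foo from the marker. In preprocessing mode it sees a literal <nowiki> tag that means nothing to it.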
This only makes a difference for:
- Extensions that are invoked via the {{#tag:..}} parser function and where the extension deals with the strip markers introduced by the core parser.
- Ex: {{#tag:syntaxhighlight|<nowiki>foo</nowiki>}}
- Wikitext-processing extensions where there is no xml-like invocation syntax (ex: Scribunto)
- Ex: {{#invoke:some-template|<nowiki>''foo''</nowiki>}}
- Here, the template exploits knowledge of parser internals, i.e. it knows that the legacy parser uses strip markers for nowikis and then proceeds to call unstripNoWiki on them!
- Templates should not know about parser internals. This leaks implementation details into content. That said, nowiki is special. It is an escaping mechanism. So, the parser should expose a clean API for looking at nowiki content to everyone. For now, unstripNoWiki could be considered an API of sorts, but some cleaning up might be worth it in the future.
More broadly,
- This problem is not limited to the <nowiki> tag, although <nowiki> is the most common scenario where it manifests, because the string passed to an extension in {{#tag:..}} usage sometimes contains substrings that might be confused for wikitext.
- So, if you have extensions X and Y, <X><Y>foo</Y></X> and {{#tag:X|<Y>foo</Y>}} can do different things in the core parser and Parsoid. Note that <Y> might have special meaning within <X> and, unlike at the top level, might not be an extension tag usage at all.
- In Parsoid, both forms reduce to <X><Y>foo</Y></X>. But, in the core parser, an extension X will get a strip-marker for Y in the #tag parser function form and will have to call unstripGeneral on its input before doing anything.
- So, if you do a code search, you will see extensions call one of unstripNoWiki, unstripGeneral, unstripBoth on their input before proceeding. And, where they don't do this, you have the various leaking strip marker bug reports in phab. Parsoid will likely solve this problem by not introducing strip markers in the first place.
- Strictly speaking, this isn't entirely true. Parsoid handles DOM fragment tunnelling by leaving behind marker HTML tags with fragment ids which are then always unpacked. But, there are still likely edge cases where HTML content is embedded in attributes and other places which Parsoid doesn't have access to. In those edge cases, HTML marker ids with "mw:DOMFragment" typeof attributes will be left behind (equivalent to the core parser's strip state markers).
Solution strategies
So, overall, it looks like the only real issue here is wrt nowiki usage. Here, we will only solve for that problem, i.e. in the X, Y pair example above, we only deal with the special case where Y=nowiki.
Soln 1 (Naive, won't work)
The most obvious (and naive) solution would be to change preprocess-to-wt(S) so it doesn't treat <nowiki>..</nowiki> substrings inside S as opaque, i.e. the nowiki is always stripped.
- This immediately solves the problem for all non-wikitext extensions for whom <nowiki> has no special meaning.
- But, it breaks usages for all wikitext extensions where <nowiki> has special meaning. So, this doesn't work.
Soln 2 (will work, needs time, not a short-term solution)
This nowiki usage in {{#tag:X|str}} mostly exists because of the need to escape characters in str from wikitext processing. Heredoc syntax (T114432) can completely solve the problem for these uses. But, we don't have that implemented at this time. Secondly, even after it is implemented, all wikitext usages will have to migrate over to heredoc usage where they need the protection. So, this is not a short-term project at this time. Had we done this 2-3 years back, we probably would have solved this by now.
Soln 3 (might work)
Extensions register a config flag telling us whether they deal with wikitext or not. So, for extensions that register this flag, we implement soln 1.
But, this won't do anything for Scribunto. We just throw up our hands and tell template editors: sorry, you can't use the hack you've been using. As it turns out, enwiki has already dealt with this for some templates at this point (Ex: Row numbers). But, there may be other templates and other wikis where this hack is being used. We can wait it out / lint this and hope for the best.
Soln 4 (should work)
This is just a tweak of Soln 3. Instead of extensions telling us whether they deal with wikitext or not, they tell us that they need "bc-unstrip-nowiki-support".
So, we implement Soln 1 for any extension that wants this support. The default value for this setting is false. So, SyntaxHighlight, Scribunto, CharInsert, and maybe a handful of other extensions might register this flag in the config and that should be that!
We need to work through the details of this solution and implement it.
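As a rough illustration of Soln 4, the Python sketch below models {{#tag:..}} handling with a per-extension opt-in. Everything in it is an assumption made for illustration: the BC_UNSTRIP_NOWIKI registry and tag_parser_function are hypothetical names, and "stripping" is modelled as dropping the <nowiki> wrapper and passing its content through, which is only one possible reading of Soln 1.

```python
import re

# Hypothetical registry: extension tag name -> opted in to
# "bc-unstrip-nowiki-support". Only the flag name comes from the proposal;
# the registry shape is made up for this sketch.
BC_UNSTRIP_NOWIKI = {"syntaxhighlight": True}

_NOWIKI = re.compile(r"<nowiki>(.*?)</nowiki>", re.S)


def tag_parser_function(name, arg, registry=BC_UNSTRIP_NOWIKI):
    """Model {{#tag:name|arg}} in preprocessing mode under Soln 4."""
    if registry.get(name, False):  # the default for the setting is false
        # Assumed reading of Soln 1: unwrap nowiki for opted-in extensions.
        arg = _NOWIKI.sub(lambda m: m.group(1), arg)
    return "<%s>%s</%s>" % (name, arg, name)


opted_in = tag_parser_function("syntaxhighlight", "<nowiki>foo</nowiki>")
untouched = tag_parser_function("ref", "<nowiki>''x''</nowiki>")
```

The opted-in extension gets foo without the wrapper. The wikitext extension, which never registered the flag, still sees the <nowiki> and can process it as wikitext.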
Things to resolve
- Do extensions opt in to the bc-unstrip-nowiki-support? At first glance, it seems opt-in is better than opt-out since (a) it makes the reliance on this behavior explicit, which makes it easier to phase this out in the future, and (b) it lets us incrementally improve Parsoid's compatibility with the legacy parser.
- What is the specific mechanism we want to provide extensions for opt-in/opt-out? Some proposals below (without having thought through how / whether they will work):
- Extensions implement a marker interface
- Extensions set some config value in extension.json
- Something else.