Make SyntaxHighlight extension compatible with Parsoid
Closed, Resolved (Public)


Parsoid has its own extension API - see
In this first phase, we are targeting tag-hook extensions for migration.
The SyntaxHighlight extension needs an update to work directly with Parsoid.

Event Timeline

Arlolra triaged this task as Medium priority. Feb 25 2021, 6:55 PM

Parsoid is currently broken here because of T289545: Parsoid doesn't respect strip state markers found in preprocessor output. If that is fixed, Parsoid's current extension processing strategy will continue to work.

Even if we want to make the code work with Parsoid natively, T289545 needs fixing in some fashion. Once that is done and T287216 is completed, the rest of the changes should be fairly straightforward.

Change 743026 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/core@master] WIP: Process nowikis in extensionSubstitution always

The semantics of <my_tag><nowiki>blah blah</nowiki></my_tag> don't require the nowiki to be stripped before the contents are passed on to the extension. This behavior is consistent across all extensions.

But what are the semantics of {{#tag:my_tag|<nowiki>blah blah</nowiki>}}? There are two possibilities here:

  1. We are protecting blah blah from being processed before it is passed into the extension, so the nowiki should be stripped, i.e. this should effectively be preprocessed to the wikitext <my_tag>blah blah</my_tag>. SyntaxHighlight seems to fall in this camp.
  2. The nowiki is actually part of the argument and should not be stripped, i.e. this should effectively be preprocessed to the wikitext <my_tag><nowiki>blah blah</nowiki></my_tag>. Cite falls in this camp. So, given {{#tag:ref|<nowiki>'''x''</nowiki>}}, the nowiki is really part of the argument to the <ref> tag and the x should not be italicized.
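The two readings can be made concrete with a small sketch. This is an illustration only, not code from the core parser or Parsoid: the function name `expand_tag_function` and its `strip_nowiki` parameter are made up for this example, and the regex handles only simple, non-nested nowiki pairs.

```python
import re

# Matches a simple, non-nested <nowiki>...</nowiki> pair (illustrative only).
NOWIKI_RE = re.compile(r"<nowiki>(.*?)</nowiki>", re.DOTALL)


def expand_tag_function(tag: str, arg: str, strip_nowiki: bool) -> str:
    """Turn {{#tag:tag|arg}} into the equivalent <tag>...</tag> wikitext.

    strip_nowiki=True models possibility 1 (the nowiki only protects the
    argument); strip_nowiki=False models possibility 2 (the nowiki is part
    of the argument).
    """
    if strip_nowiki:
        # Drop the nowiki wrapper but keep what it was protecting.
        arg = NOWIKI_RE.sub(lambda m: m.group(1), arg)
    return f"<{tag}>{arg}</{tag}>"


arg = "<nowiki>blah blah</nowiki>"
print(expand_tag_function("my_tag", arg, strip_nowiki=True))
print(expand_tag_function("my_tag", arg, strip_nowiki=False))
```

The same input yields two different wikitext strings, and nothing in the input itself says which one the author intended.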

From a wikitext-processing standpoint, without additional information, there is really no way to distinguish between the two behaviors. A wikitext processor (core parser or Parsoid) has to pick one or the other consistently. Parsoid has a clearly decoupled processing pipeline where it preprocesses templates first and processes extensions next, so it uses behavior #2 for all extensions. That breaks expectations for extensions that rely on behavior #1, in this case specifically SyntaxHighlight.

More generally, what are the semantics of {{#tag:my_tag|<your_tag>blah blah</your_tag>}}? This illustrates the problem better. In a decoupled processing model where templates and parser-functions are evaluated first and extensions evaluated after, it is clear that the output would be <my_tag><your_tag>blah blah</your_tag></my_tag> and so that is what Parsoid would consistently evaluate this to.
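A rough sketch of such a decoupled model, assuming a toy two-stage pipeline: `preprocess` and `process_extensions` are hypothetical names, and the regexes cover only the simple cases used in this comment, not real wikitext.

```python
import re
from typing import Callable, Dict

# Stage 1 pattern: a flat {{#tag:name|arg}} call (no nested templates).
TAG_FN_RE = re.compile(r"\{\{#tag:(\w+)\|(.*?)\}\}", re.DOTALL)
# Stage 2 pattern: an attribute-less <name>...</name> extension tag.
EXT_TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def preprocess(wikitext: str) -> str:
    """Stage 1: expand #tag parser functions into plain tag wikitext."""
    return TAG_FN_RE.sub(
        lambda m: f"<{m.group(1)}>{m.group(2)}</{m.group(1)}>", wikitext
    )


def process_extensions(wikitext: str, handlers: Dict[str, Callable[[str], str]]) -> str:
    """Stage 2: hand each outermost known tag's raw contents to its handler."""
    def dispatch(m: "re.Match[str]") -> str:
        handler = handlers.get(m.group(1))
        # Unknown tags are left untouched.
        return handler(m.group(2)) if handler else m.group(0)

    return EXT_TAG_RE.sub(dispatch, wikitext)


expanded = preprocess("{{#tag:my_tag|<your_tag>blah blah</your_tag>}}")
print(expanded)
print(process_extensions(expanded, {"my_tag": lambda body: f"[my_tag got: {body}]"}))
```

Because stage 2 only ever sees the output of stage 1, the my_tag handler receives the inner tag as literal text and never has to know about any intermediate parse state.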

So, what happens with the core parser? Its wikitext processing isn't decoupled in the way Parsoid's is. The core parser expands the arguments of a parser function, and where an argument contains an extension tag, that tag is represented by a strip marker. Extensions therefore need to somehow be aware of this parse state, which seems to be how bugs like T16562: UNIQ key exposed when feeding strip markers into {{#tag:source and the various other strip state bugs materialize. SyntaxHighlight works around this by unstripping nowiki markers, which implicitly leads to behavior #1 above. So this behavior, while desirable, seems to be a side effect of a bug fix (T16562) and not necessarily deliberate.
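To show why the extension ends up needing to know about strip markers, here is a simplified model. The `StripState` class below is a stand-in written for this comment, not the core parser's class, and its marker format is deliberately simplified rather than the one core generates.

```python
from typing import Dict, Tuple


class StripState:
    """Toy strip state: swaps protected content for opaque markers."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[str, str]] = {}

    def add(self, kind: str, content: str) -> str:
        # Simplified marker format, chosen for this sketch only.
        marker = f"\x7fUNIQ-{kind}-{len(self._items):08d}-QINU\x7f"
        self._items[marker] = (kind, content)
        return marker

    def unstrip(self, text: str, kind: str) -> str:
        # Put back only the content that was stripped for the given tag kind.
        for marker, (item_kind, content) in self._items.items():
            if item_kind == kind:
                text = text.replace(marker, content)
        return text


def naive_highlight(arg: str) -> str:
    # Unaware of strip markers: whatever arrives is emitted verbatim.
    return f"<pre>{arg}</pre>"


def marker_aware_highlight(arg: str, state: StripState) -> str:
    # Unstrips nowiki markers first, which amounts to behavior #1.
    return f"<pre>{state.unstrip(arg, 'nowiki')}</pre>"


state = StripState()
# The parser function's argument arrives with the nowiki already replaced.
arg = state.add("nowiki", "blah blah")
leaked = naive_highlight(arg)
fixed = marker_aware_highlight(arg, state)
print("UNIQ" in leaked, fixed)
```

The naive handler leaks the marker into its output, which is the T16562 style of bug. The marker-aware handler avoids the leak, and dropping the nowiki wrapper comes along as a consequence of that fix.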

With extension tags that wrap wikitext (e.g. ref, pre), Parsoid's decoupled processing model yields the same results as the core parser. However, for extension tags that don't deal with wikitext (syntaxhighlight, math, ce, hiero), we get incompatible behavior between Parsoid and the core parser. With the core parser, it is up to extensions to figure out how to deal with this, and sometimes they don't. See this section on this page for how the output differs with syntaxhighlight, math, ce, and hiero.

But Parsoid's output, while different from the core parser's, is consistent across all extensions and doesn't require extensions to deal with this at all, and I think this is the more defensible behavior. It also means that Parsoid effectively treats nowikis in #tag parser function arguments as not strippable: nowiki tags are treated like any other extension tag, not deserving of special treatment.

So, all said and done, what do we do now? One possibility is to keep Parsoid's behavior as is, but introduce a backward-compatibility config flag in extension registrations that enables special nowiki handling for the extensions that need it (like SyntaxHighlight). But perhaps we should wait and see how common this usage is before actually introducing it.
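As a sketch of what such an opt-in could look like: everything here is hypothetical, including the `TagRegistration` type and the `strip_nowiki_in_tag_args` flag name, and it does not reflect Parsoid's actual registration format.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TagRegistration:
    name: str
    # Hypothetical backward-compatibility flag; off by default so that
    # nowiki is treated like any other tag unless an extension opts in.
    strip_nowiki_in_tag_args: bool = False


# Hypothetical registry: only syntaxhighlight opts in to nowiki stripping.
REGISTRY = {
    "syntaxhighlight": TagRegistration("syntaxhighlight", strip_nowiki_in_tag_args=True),
    "ref": TagRegistration("ref"),
}

_NOWIKI_RE = re.compile(r"<nowiki>(.*?)</nowiki>", re.DOTALL)


def tag_function_to_wikitext(name: str, arg: str) -> str:
    """Expand {{#tag:name|arg}}, honoring the per-extension opt-in."""
    registration = REGISTRY.get(name, TagRegistration(name))
    if registration.strip_nowiki_in_tag_args:
        arg = _NOWIKI_RE.sub(lambda m: m.group(1), arg)
    return f"<{name}>{arg}</{name}>"


print(tag_function_to_wikitext("syntaxhighlight", "<nowiki>a|b</nowiki>"))
print(tag_function_to_wikitext("ref", "<nowiki>''x''</nowiki>"))
```

Keeping the default off preserves the consistent behavior for every extension that doesn't ask for the special case.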

Presumably when the nowikis are meant for template (parserfunction) argument escaping, we'd want them to use heredoc syntax (T114432) instead.

T203293 is an example of using unstrip in Lua modules to protect | in the arguments to the template.

Change 743026 abandoned by Subramanya Sastry:

[mediawiki/core@master] WIP: Process nowikis in extensionSubstitution always


Change 816086 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/core@master] WIP: Prototype hack to handle nowikis in args of {{#tag:ext|...}}

Change 816227 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/extensions/SyntaxHighlight_GeSHi@master] WIP: Add Parsoid support for syntaxhighlight

Change 816227 merged by jenkins-bot:

[mediawiki/extensions/SyntaxHighlight_GeSHi@master] Add Parsoid support for syntaxhighlight

Change 816086 merged by jenkins-bot:

[mediawiki/core@master] Added Parsoid support for nowiki stripping in args of {{#tag:ext|...}}