Page MenuHomePhabricator

Parsoid doesn't respect strip state markers found in preprocessor output
Closed, DuplicatePublic

Description

StripState markers protect content that need to be protected from further "wikitext processing". So, where the return value from Parser::replaceValues has strip state markers set, Parsoid needs to actually tunnel that content through rather than unwrap and let it go through the rest of the Parsoid pipeline.

For example, some parser functions might return HTML not wikitext, and where the content might contain wikitext characters, we are now going to potentially mangle that output.

The reason we haven't run into this so far seems to be because none of the extensions deployed on the Wikimedia cluster register parser functions whose return values have the "isHTML" flag set.

Event Timeline

Change 714443 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Document bug in DataAccess.php as a FIXME

https://gerrit.wikimedia.org/r/714443

ssastry triaged this task as Medium priority.Aug 24 2021, 12:15 AM

But, what is surprising is that this hasn't resulted in any noticeable bug report (unless I missed it) from any 3rd party parsoid users.

Related, not but dupe .. but, we could still close this as a dupe of that task and update the description there.

The general problem is how to handle strip state markers in preprocessor output. Strip markers are inserted for things other than html-returning parser functions. So we just need to investigate all of them and figure out how to handle this.

For parser functions, we may need to introduce a different internal method for Parsoid's usecase to pass along isHTML information back to Parsoid so we than handle it properly rather than mess with strip markers in the return value, i.e. pass back better typed return values.

And, similarly for other cases where they are being used or maybe it is safe to simply unwrap them in those other cases.

Change 714443 abandoned by Subramanya Sastry:

[mediawiki/services/parsoid@master] Document bug in DataAccess.php as a FIXME

Reason:

https://gerrit.wikimedia.org/r/714443

The general problem is how to handle strip state markers in preprocessor output.

So can we merge this into T257606 then?

i.e. pass back better typed return values.

That sounds fine for top level parser function calls but maybe not for those nested in templates? We'd need a native preprocessor in place.

Well, T178588: Handle parser functions returning raw html and T279094: Support for parser functions using "isHTML"? are bugs due to this missing functionality, but I think strip state is slightly different from "parser functions returning html not wikitext" which is what those two bugs are about. But *one thing* stripstate is used for is to tunnel raw html through the parser.

Here is an example. See the difference for the SyntaxHighlightExtension: Core parser output vs Parsoid.

Change 714443 restored by Subramanya Sastry:

[mediawiki/services/parsoid@master] Document bug in DataAccess.php as a FIXME

https://gerrit.wikimedia.org/r/714443

Change 714443 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Document bug in DataAccess.php as a FIXME

https://gerrit.wikimedia.org/r/714443

Here is an example. See the difference for the SyntaxHighlightExtension: Core parser output vs Parsoid.

Not entirely accurate.

Parsoid passes OT::PREPROCESS flag which effectively decouples the preprocessing and the extension expansion pieces into two independent steps. So, Parsoid loses info between {{#tag:syntaxhighlight|<nowiki>...</nowiki>}} and <syntaxhighlight><nowiki..</nowiki></syntaxhighlight> syntactic forms.

However, when OT::HTML flag is passed (as is the case when the content is being parsed to HTML all the way without a decoupled preprocessing step), SyntaxHighlight receives preprocessed content with strip marker info intact which it exploits to distinguish between {{#tag:syntaxhighlight|<nowiki>...</nowiki>}} and <syntaxhighlight><nowiki..</nowiki></syntaxhighlight> syntactic forms.

In the Parsoid case, a 'general' strip marker is used for the nowiki (since it isn't being expanded) whereas in the legacy parser case, a 'nowiki' strip marker is used for the nowiki (which is then stripped out in SyntaxHighlight).

Needs more investigation to untangle this fully.

I think there's a general pattern (both here and in Scribunto module arguments especially) of using <nowiki> as a way to protect template arguments from the preprocessor. Scribuno in particular is in the habit of reaching into the strip state to get the original text that was inside the <nowiki> as a way to get the "raw" argument text.

We could probably just handle this in Parsoid's tokenizer, ie special case:

{{templatename|<nowiki>...</nowiki>|..}}

as an alternative template option syntax that means "pass this argument literally".

Some examples at https://en.wikipedia.org/wiki/User:Cscott/T289545

After pondering that for a bit, realized that this probably won't work since Parsoid calls the core preprocessor for expanding templates.

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/743026 might do the trick ... need to verify that doesn't break anything else.

Anyway, this discussion is now tangential to this task. Since we determined that, we should probably move additional discussion to T272939 or file a different task.

Okay, the work in this task includes T272939 and T279094. The first one is resolved. The second one is not. As such, at this point, this is just a dupe of T279094 and I will close this as such.