
Extension API: strip state issues
Open, High, Public

Description

We need to figure out how Parsoid interacts with the concept of 'strip state' from the legacy parser.

The reason is that many existing extensions explicitly interact with strip state, for various reasons:

  • Protecting rawHTML content from the sanitizer
  • As an ad-hoc escape mechanism for template arguments ({{Foo|bar=<nowiki>something with |</nowiki>}} and then fetch the raw contents of the bar argument from the strip state)
  • Explicitly as part of the Scribunto API
  • Other reasons yet to be discovered

There may not be one solution here, but as we port new extensions to Parsoid we need to come up with guidance for how the various uses of strip state get translated into the Parsoid extension API. This may (or may not) include adding an explicit "strip state" type API mechanism, or longer-term solutions (such as heredoc arguments instead of the ad-hoc <nowiki> argument escapes). There may be low-hanging fruit for certain uses (like tunneling content past the sanitizer) that can be forked off this task.

Event Timeline

ssastry triaged this task as Medium priority. Jul 15 2020, 5:49 PM
ssastry moved this task from Needs Triage to Bugs & Crashers on the Parsoid board.
ssastry moved this task from Bugs & Crashers to Missing Functionality on the Parsoid board.
ssastry added a project: Parsoid-Rendering.

I filed T270127 as well, but we should probably figure out what strategy we want to adopt here: narrowly support strip state solutions as in this task, or temporarily support some form of shared parser object as in T270127, with the understanding that we will probably want some solution of the nature outlined here.

We can probably have something that vaguely looks like a strip state to handle "content which can't be represented as a string".

That is, when we invoke a legacy parser function which expects only strings as arguments, we take our token sequence, and anything in our token sequence which isn't "actually" a string gets turned into strip state. The parser function will eventually return a string, and we comb through that string for strip state markers and turn them back into the original tokens.
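A minimal sketch of that round trip, in Python for brevity. Everything here is hypothetical: `Token`, `TokenStripState`, `call_legacy_function`, and the marker format (only loosely modeled on the legacy UNIQ/QINU markers) are illustration names, not Parsoid or core APIs.

```python
import re
import secrets
from dataclasses import dataclass, field


@dataclass
class Token:
    """Hypothetical stand-in for a non-string Parsoid token."""
    name: str


@dataclass
class TokenStripState:
    """Hypothetical map from opaque string markers to the original tokens."""
    # A random per-instance nonce keeps user text from guessing a marker.
    nonce: str = field(default_factory=lambda: secrets.token_hex(8))
    items: dict = field(default_factory=dict)

    def flatten(self, tokens):
        """Turn a token sequence into one string, stripping non-strings."""
        out = []
        for tok in tokens:
            if isinstance(tok, str):
                out.append(tok)
            else:
                marker = f"\x7fUNIQ-{self.nonce}-{len(self.items)}-QINU\x7f"
                self.items[marker] = tok
                out.append(marker)
        return "".join(out)

    def unflatten(self, text):
        """Comb the returned string for markers and restore the tokens."""
        pattern = re.compile(
            "(" + re.escape(f"\x7fUNIQ-{self.nonce}-") + r"\d+"
            + re.escape("-QINU\x7f") + ")"
        )
        # Marker-shaped text that is not in the map stays a plain string.
        return [self.items.get(part, part) for part in pattern.split(text) if part]


def call_legacy_function(fn, tokens):
    """Invoke a string-only function and restore stripped tokens afterwards."""
    state = TokenStripState()
    return state.unflatten(fn(state.flatten(tokens)))
```

The sketch assumes the legacy function passes markers through untouched; a function that rewrites or truncates its input could damage a marker, in which case the token would be lost.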

The same thing goes for extensions which want to tunnel specific things to the output. They can add strip markers and items to the strip state, and we'll turn them into DOMFragments (our version of a "tunnel").

MSantos raised the priority of this task from Medium to High. Sep 19 2023, 3:54 PM

I think the basic idea of the above comment was that we should have a "native" representation of strip state in Parsoid, aka an opaque token on the input and an opaque DOM subtree on the output, and that we should pass that through Parsoid unmodified where possible and interface it with the output expected of the Parser. That is, an opaque strip state token on the input becomes an opaque subtree in the output, and we'll have a DOM-aware stripstate handler that will substitute in the appropriate HTML into the appropriate subtree. Waving hands a little bit about balanced output.

Put another way, right now the interface is that StripState is a mapping from a "raw HTML string" to an "opaque string marker", where the marker is constructed to make it pass through "all" string-based processing stages unaffected, while also not allowing it to be "spoofed" by user-generated content.

We would be tweaking that to say that StripState is a mapping from "raw HTML subtree" to an "opaque Parsoid token", where we have somewhat stronger guarantees that the token will make it through Parsoid unaffected and be unspoofable by users, with the token's parse result being a subtree marker in the DOM. Then in a final step we insert the "raw HTML" into the appropriate subtree of the DOM, using something like Element.innerHTML to ensure it doesn't break other parts of the tree.

(There's also a detail lurking here about how the Sanitizer interacts: Parsoid tends to sanitize "late" (i.e., on the entire generated DOM), whereas legacy strip state is sometimes used explicitly to bypass the Sanitizer. We probably just want to make it configurable: at the time you insert the content into the strip state, you have an optional flag to say whether or not the inserted content should be included in sanitization. But we might need explicit unspoofable markers in the DOM itself to guide the Sanitizer to include/exclude certain subtrees.)
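One way the optional flag and the in-DOM marker could fit together, as a rough Python sketch using the standard library's ElementTree as a stand-in DOM. The attribute name, the tag allowlist, and both function names are assumptions made up for illustration; none of them come from Parsoid or the core Sanitizer.

```python
import secrets
import xml.etree.ElementTree as ET

# Hypothetical allowlist and marker attribute, for illustration only.
ALLOWED_TAGS = {"div", "p", "b", "i", "span"}
EXEMPT_ATTR = "data-hypothetical-sanitizer-exempt"


def insert_strip_content(parent, html, nonce, sanitize=True):
    """Append strip state content; sanitize=False tags it as exempt."""
    fragment = ET.fromstring(html)
    if not sanitize:
        # Only code that knows the per-parse nonce can produce a valid marker.
        fragment.set(EXEMPT_ATTR, nonce)
    parent.append(fragment)
    return fragment


def sanitize_late(root, nonce):
    """Late pass over the whole tree that skips validly marked subtrees."""
    for child in list(root):
        if child.get(EXEMPT_ATTR) == nonce:
            del child.attrib[EXEMPT_ATTR]  # marker has served its purpose
            continue  # leave the exempt subtree untouched
        if child.tag not in ALLOWED_TAGS:
            # Simplification: this also discards the element's tail text.
            root.remove(child)
            continue
        # A marker with the wrong nonce is treated as spoofed and dropped.
        child.attrib.pop(EXEMPT_ATTR, None)
        for name in [a for a in child.attrib if a.startswith("on")]:
            del child.attrib[name]
        sanitize_late(child, nonce)
```

The nonce is generated once per parse (for example with `secrets.token_hex(8)`) and never exposed to page content, which is what would make the marker unspoofable in this sketch.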

Step 1, I think, is coming up with a "Strip Marker" token in the tokenizer (not necessarily corresponding to string input matching the current strip marker format in core, but it probably wouldn't hurt to start with that), and having it "parse" to an appropriate "hole" marker in the parsed output (maybe a <meta> tag?).

Step 2 is writing a "post-post-processing" pass that substitutes strip state markers in Parsoid output. This is probably where things get a bit hairy, since (a) you need to be 'rich attribute' aware in order to find the strip state markers inside attributes, and (b) you have to think through what happens if your strip state contains block content and you're trying to insert it into a non-block context, or if the strip state content is unbalanced. But I think (b) has reasonable "don't do that, then" answers in the short term, since in theory strip state is implementor-controlled and not user-generated content.

Step 3 (which could be done simultaneously with step 2) is to expose the 'cleaner' version of the strip state API in the Parsoid extension API. Initially that's probably just a method like makeStripStateMarker(DocumentFragment $contents): Element, which returns the appropriate marker element (<meta> tag?) that will, at the end, have the DocumentFragment substituted in for it. That's just for Parsoid-native extensions which emit output containing strip state markers as DOM, and it would probably be used primarily to make the given DocumentFragment bypass the sanitizer. Most actual use in extensions will probably rely on the step 1 support, in which Parsoid recognizes strip state markers embedded in generated wikitext and parses them into the appropriate marker.
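To make the hole-marker and substitution ideas concrete, here is a rough Python sketch with ElementTree standing in for the DOM. `DomStripState`, the `typeof` value, and the `data-key` attribute are hypothetical; only the `makeStripStateMarker` name echoes the proposal above. It ignores markers inside rich attributes as well as the block/inline and balance questions, and it simplifies the fragment to a single element.

```python
import secrets
import xml.etree.ElementTree as ET

# Hypothetical typeof value identifying a strip state "hole".
HOLE_TYPEOF = "hypothetical:StripMarker"


class DomStripState:
    """Hypothetical DOM-aware strip state: opaque key -> stored subtree."""

    def __init__(self):
        self.nonce = secrets.token_hex(8)  # makes keys hard to forge
        self.fragments = {}

    def make_strip_state_marker(self, fragment):
        """Store a subtree and return the <meta> hole that stands in for it."""
        key = f"{self.nonce}-{len(self.fragments)}"
        self.fragments[key] = fragment
        return ET.Element("meta", {"typeof": HOLE_TYPEOF, "data-key": key})

    def substitute(self, root):
        """Final pass: replace every known hole with its stored subtree."""
        # Snapshot the tree first, so substituted content is not rescanned.
        for parent in list(root.iter()):
            for index, child in enumerate(list(parent)):
                if child.tag != "meta" or child.get("typeof") != HOLE_TYPEOF:
                    continue
                fragment = self.fragments.get(child.get("data-key"))
                if fragment is None:
                    continue  # unknown key: leave the hole as-is
                fragment.tail = child.tail  # keep text that followed the hole
                parent.remove(child)
                parent.insert(index, fragment)
```

An extension would call `make_strip_state_marker` while building its output and emit the returned element; the substitution runs once at the very end. Whether holes with unknown keys should be left in place or removed is an open choice that this sketch does not settle.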

^ I pointed @Arlolra at this task as a reasonable way to handle "special page transclusions not at top level": the nested special page transclusion should stick the HTML into strip state, and Parsoid will pull the HTML out of the strip state and tokenize it as a DomFragment token.

I have been hoping for Parsoid to fix T200704 "for free" (and there seems to be some belief it will), so I thought it would be a good idea to mention it here, should Parsoid decide to formalize "stripping".

Change #1088365 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Parsoid DataAccess: support strip state in parser function results

https://gerrit.wikimedia.org/r/1088365

@lzno can you recheck T200704 once https://gerrit.wikimedia.org/r/1088365 is deployed (next week, hopefully)?

Change #1088365 merged by jenkins-bot:

[mediawiki/core@master] Parsoid DataAccess: support strip state in parser function results

https://gerrit.wikimedia.org/r/1088365