
Extension API: strip state issues
Open, High, Public

Description

We need to figure out how Parsoid interacts with the concept of 'strip state' from the legacy parser.

The reason is that many existing extensions explicitly interact with strip state, for various reasons:

  • Protecting rawHTML content from the sanitizer
  • As an ad-hoc escape mechanism for template arguments ({{Foo|bar=<nowiki>something with |</nowiki>}} and then fetch the raw contents of the bar argument from the strip state)
  • Explicitly as part of the Scribunto API
  • Other reasons yet to be discovered

There may not be one solution here, but as we port new extensions to Parsoid we need to come up with guidance for how the various uses of strip state get translated into the Parsoid extension API. This may (or may not) include adding an explicit "strip state" type API mechanism, or longer-term solutions (such as heredoc arguments instead of the ad-hoc <nowiki> argument escapes). There may be low-hanging fruit for certain uses (like tunneling content past the sanitizer) that can be forked off this task.

Event Timeline

ssastry triaged this task as Medium priority. Jul 15 2020, 5:49 PM
ssastry moved this task from Needs Triage to Bugs & Crashers on the Parsoid board.
ssastry moved this task from Bugs & Crashers to Missing Functionality on the Parsoid board.
ssastry added a project: Parsoid-Rendering.

I filed T270127 as well, but we should probably figure out what strategy we want to adopt here: narrowly support strip-state solutions as in this task, or temporarily support some form of shared parser object as in T270127, with the understanding that we will probably want a solution of the nature outlined here eventually.

We can probably have something that vaguely looks like a strip state to handle "content which can't be represented as a string".

That is, when we invoke a legacy parser function which expects only strings as arguments, we take our token sequence, and anything in our token sequence which isn't "actually" a string gets turned into strip state. The parser function will eventually return a string, and we comb through that string for strip state markers and turn them back into the original tokens.

The same mechanism works for extensions which want to tunnel specific content to the output: they can add strip markers and items to the strip state, and we'll turn them into DOMFragments (our version of a "tunnel").
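As a rough illustration of the round trip described above, here is a minimal sketch in Python (Parsoid itself is PHP; the `StripState` class, its methods, and the marker format here are illustrative inventions, not MediaWiki's actual implementation): non-string tokens become unguessable marker strings before a string-only parser function runs, and the result string is combed for markers afterwards.

```python
import re
import secrets

class StripState:
    """Sketch of a legacy-style strip state: a mapping from an
    unguessable marker string to an opaque (non-string) item."""

    def __init__(self):
        # Random prefix so user-generated content can't spoof a marker.
        self._prefix = f"\x7fUNIQ-{secrets.token_hex(8)}"
        self._items = {}

    def add(self, item) -> str:
        """Replace a non-string item with an opaque marker string."""
        marker = f"{self._prefix}-{len(self._items)}\x7f"
        self._items[marker] = item
        return marker

    def unstrip(self, text: str) -> list:
        """Turn a result string back into a token sequence, replacing
        markers with the original items."""
        pattern = re.escape(self._prefix) + r"-\d+\x7f"
        out, pos = [], 0
        for m in re.finditer(pattern, text):
            if m.start() > pos:
                out.append(text[pos:m.start()])
            out.append(self._items[m.group(0)])
            pos = m.end()
        if pos < len(text):
            out.append(text[pos:])
        return out

ss = StripState()
token = {"type": "dom-fragment"}            # stand-in for a non-string token
arg = "before " + ss.add(token) + " after"  # argument passed as a plain string
result = "[" + arg + "]"                    # a string-only "parser function"
tokens = ss.unstrip(result)
# tokens == ["[before ", {"type": "dom-fragment"}, " after]"]
```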

MSantos raised the priority of this task from Medium to High. Sep 19 2023, 3:54 PM

I think the basic idea of the above comment was that we should have a "native" representation of strip state in Parsoid, i.e. an opaque token on the input side and an opaque DOM subtree on the output side, and that we should pass it through Parsoid unmodified where possible while interfacing with the output the Parser expects. That is, an opaque strip-state token in the input becomes an opaque subtree in the output, and we'll have a DOM-aware strip-state handler that substitutes the appropriate HTML into the appropriate subtree. (Waving hands a little bit about balanced output.)

Put another way, right now the interface is that StripState is a mapping from a "raw HTML string" to an "opaque string marker", where the marker is constructed to make it pass through "all" string-based processing stages unaffected, while also not allowing it to be "spoofed" by user-generated content.

We would be tweaking that to say that StripState is a mapping from a "raw HTML subtree" to an "opaque Parsoid token", where we have somewhat stronger guarantees that the token will make it through Parsoid unaffected and can't be spoofed by users, with the token's parse result being a subtree marker in the DOM. Then in a final step we insert the "raw HTML" into the appropriate subtree of the DOM, using something like Element.innerHTML to ensure it doesn't break other parts of the tree.

(There's also a detail lurking here about how the Sanitizer interacts: Parsoid tends to sanitize "late" (i.e., on the entire generated DOM), whereas legacy strip state is sometimes used explicitly to bypass the Sanitizer. We probably just want to make it configurable: at the time you insert content into the strip state, an optional flag says whether the inserted content should be included in sanitization. But we might need explicit unspoofable markers in the DOM itself to guide the Sanitizer to include or exclude certain subtrees.)

Step 1, I think, is coming up with a "Strip Marker" token in the tokenizer (not necessarily corresponding to string input matching the current strip marker format in core, though it probably wouldn't hurt to start with that), and having it "parse" to an appropriate "hole" marker in the parsed output (maybe a <meta> tag?).

Then in step 2 you'd write a "post post processing" pass that substitutes strip state markers in Parsoid output, which is probably where things get a bit hairy, since (a) you need to be 'rich attribute' aware in order to find the strip state markers inside attributes, etc, and (b) you have to think through what happens if your strip state contains block content and you're trying to insert it into a non-block context, or if the strip state content is unbalanced, etc. But I think (b) has reasonable "don't do that, then" type answers in the short term, since in theory strip state is implementor-controlled, not user-generated content.

Step 3 (which could be done simultaneously with step 2) is to expose the 'cleaner' version of the strip state API in the Parsoid extension API. Probably initially that's just a method like makeStripStateMarker(DocumentFragment $contents): Element, which returns the appropriate marker element (<meta> tag?) that will, at the end, have the DocumentFragment substituted in for it. That's just for Parsoid-native extensions which emit output containing strip-state markers as DOM, and would probably be used primarily to make the given DocumentFragment bypass the sanitizer. Most actual use in extensions will probably rely on the step 1 support, which has Parsoid recognizing strip state markers embedded in generated wikitext and parsing them into the appropriate marker.
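The "hole marker plus final substitution pass" idea from steps 1–3 could be sketched as follows. This is Python with xml.etree standing in for the real DOM (Parsoid is PHP); the make_strip_state_marker name mirrors the method proposed above, and the mw:StripMarker typeof value is an assumed placeholder, not an existing Parsoid convention.

```python
import xml.etree.ElementTree as ET

class DomStripState:
    """Sketch of a DOM-aware strip state: markers are opaque <meta>
    "hole" elements rather than strings, and a final post-processing
    pass substitutes the stored fragment into the tree."""

    def __init__(self):
        self._fragments = {}

    def make_strip_state_marker(self, fragment: ET.Element) -> ET.Element:
        """Return a <meta> hole element to embed in extension output;
        the fragment is substituted in for it at the end."""
        marker_id = str(len(self._fragments))
        self._fragments[marker_id] = fragment
        hole = ET.Element("meta")
        hole.set("typeof", "mw:StripMarker")  # assumed marker convention
        hole.set("data-id", marker_id)
        return hole

    def substitute(self, root: ET.Element) -> None:
        """Final pass: replace each hole element with its fragment.
        (Snapshot the tree first so mutation doesn't upset iteration;
        fragments are assumed not to contain further markers.)"""
        for parent in list(root.iter()):
            for i, child in enumerate(list(parent)):
                if child.tag == "meta" and child.get("typeof") == "mw:StripMarker":
                    frag = self._fragments[child.get("data-id")]
                    parent.remove(child)
                    parent.insert(i, frag)

ss = DomStripState()
doc = ET.Element("body")
p = ET.SubElement(doc, "p")
p.text = "before "
raw = ET.fromstring("<b>raw html</b>")   # "raw HTML" kept out of band
p.append(ss.make_strip_state_marker(raw))
ss.substitute(doc)
# ET.tostring(doc) == b'<body><p>before <b>raw html</b></p></body>'
```

Note that because the substitution happens on the finished DOM, inserting the fragment as a subtree (rather than splicing a string) is what keeps unbalanced or block-level content from breaking other parts of the tree, per the Element.innerHTML analogy above.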

^ I pointed @Arlolra at this task as a reasonable way to handle "special page transclusions not at top level": the nested special page transclusion should stick its HTML into the strip state, and Parsoid will pull the HTML out of the strip state and tokenize it as a DomFragment token.