Page MenuHomePhabricator

PP passes that run `atTopLevel` omit HTML stashed in data-mw
Open, MediumPublic

Description

In particular, there're the captions stored for inline media.

The cite extension has special handling for this,
https://github.com/wikimedia/parsoid/blob/1e920f1498b52bcabc3d675d2ac73e34d1113c26/lib/ext/Cite/References.js#L337-L373

but we should maybe find a more general mechanism.

Event Timeline

Change 638195 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Add a localization html2html pass

https://gerrit.wikimedia.org/r/638195

Change 638195 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Add a localization html2html pass

https://gerrit.wikimedia.org/r/638195

Change 641308 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a17

https://gerrit.wikimedia.org/r/641308

Change 641308 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a17

https://gerrit.wikimedia.org/r/641308

Yeah, I'd expect language converter markup to be affected too (as another source of "a DOM tree serialized and embedded in data-mw"). There's a similar issue w/r/t traveresal -- some users will want to be sure to traverse even embedded HTML. I have a vague recollection there's a phab task and/or a generic traversal mechanism somewhere for that.

I keep saying this, but if our internal representation more naturally allowed "DOM tree valued attributes" this wouldn't keep coming up as a corner case.

I have a vague recollection there's a phab task and/or a generic traversal mechanism somewhere for that.

We have ParsoidExtensionAPI:: processHTMLHiddenInDataAttributes()
https://github.com/wikimedia/parsoid/blob/master/src/Ext/ParsoidExtensionAPI.php#L485

And also code in ContentUtils:;shiftDSR()
https://github.com/wikimedia/parsoid/blob/master/src/Utils/ContentUtils.php#L178-L226

Change 787550 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] Only lint content defined by a specific ref

https://gerrit.wikimedia.org/r/787550

Change 787578 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] [WIP] Linter html stashed in data-mw of mw:Extension/references

https://gerrit.wikimedia.org/r/787578

Change 787550 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Only lint content defined by a specific ref

https://gerrit.wikimedia.org/r/787550

Change 787578 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Lint html stashed in data-mw of mw:Extension/references

https://gerrit.wikimedia.org/r/787578

Change 792236 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/vendor@master] Bump parsoid to 0.16.0-a8

https://gerrit.wikimedia.org/r/792236

Change 792236 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.16.0-a8

https://gerrit.wikimedia.org/r/792236

Change 928130 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] Lint stripped tags in nested content

https://gerrit.wikimedia.org/r/928130

We have ParsoidExtensionAPI:: processHTMLHiddenInDataAttributes()
https://github.com/wikimedia/parsoid/blob/master/src/Ext/ParsoidExtensionAPI.php#L485

Note that in https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/928564 processHTMLHiddenInDataAttributes was renamed processAttributeEmbeddedHTML

We also have embedsHTMLInAttributes as a property, which could be synonymous with outputHasCoreMwDomSpecMarkup if we institute a general mechanism.

(I still think we need an quote/unquote mechanism for outputHasCoreMwDomSpecMarkup so we can "leave" parsoid context and then "re-enter" it for extensions which generate mixed content; in this context that means we'd skip attribute handling in some subtree, and then restart it in a lower subtree.)

Change 821281 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Rich attribute support

https://gerrit.wikimedia.org/r/821281

Change 932012 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] DOMTraverser: Add option to process attribute embedded HTML

https://gerrit.wikimedia.org/r/932012

Change 932038 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] WIP: Enable processing of attribute-embedded-HTML in some DOM passes

https://gerrit.wikimedia.org/r/932038

Change 932012 merged by jenkins-bot:

[mediawiki/services/parsoid@master] DOMTraverser: Add option to process attribute embedded HTML

https://gerrit.wikimedia.org/r/932012

Change 933664 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Enable processing of attribute-embedded-HTML in some DOM passes

https://gerrit.wikimedia.org/r/933664

Change 932038 had a related patch set uploaded (by Arlolra; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Enable processing of attribute-embedded-HTML in some DOM passes

https://gerrit.wikimedia.org/r/932038

Change 932038 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Enable processing of attribute-embedded-HTML in some DOM passes

https://gerrit.wikimedia.org/r/932038

Change 928130 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Add tests for linting stripped tags in nested content

https://gerrit.wikimedia.org/r/928130

Change 936777 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a16

https://gerrit.wikimedia.org/r/936777

Change 936777 merged by jenkins-bot:

[mediawiki/vendor@master] Bump parsoid to 0.18.0-a16

https://gerrit.wikimedia.org/r/936777

This may be done now with all the above patches and with https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/940414 that is explicit about which passes run on nested pipelines. We now also have support for extensions to expose HTML embedded in their attributes. So, this is effectively done. Parts of the rich attribute support work can improve the internal APIs and potentially address performance. But, in terms of correctness, I think we are done for now as far as I can tell.

Change 951170 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/services/parsoid@master] Document DOM processors need to process attribute-embedded-HTML

https://gerrit.wikimedia.org/r/951170

Change 951170 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Document DOM processors need to process attribute-embedded-HTML

https://gerrit.wikimedia.org/r/951170

Change 956459 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.18.0-v23

https://gerrit.wikimedia.org/r/956459

Change 956459 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.18.0-v23

https://gerrit.wikimedia.org/r/956459

Change 962092 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Rich attribute support, phase 1a

https://gerrit.wikimedia.org/r/962092

Change 962673 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/vendor@master] Add wikimedia/json-codec 2.0.0

https://gerrit.wikimedia.org/r/962673

Change 962673 merged by jenkins-bot:

[mediawiki/vendor@master] Add wikimedia/json-codec 2.2.1

https://gerrit.wikimedia.org/r/962673

This may be done now with all the above patches and with https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/940414 that is explicit about which passes run on nested pipelines. We now also have support for extensions to expose HTML embedded in their attributes. So, this is effectively done. Parts of the rich attribute support work can improve the internal APIs and potentially address performance. But, in terms of correctness, I think we are done for now as far as I can tell.

I interpret this task as being about "the more general mechanism" not the specific workarounds we've applied so far.

In that vein: @ihurbain is working on the content postprocessing pipeline, and now there are more ways content gets "squashed" together than just Parsoid. In addition to subtrees containing extension content mediated by Parsoid, we also have @daniel's MCR mechanism with the combineOutput method of RenderedRevision that can result in output from several different sources being combined in the HTML tree ultimately run through post-processing.

My initial thought was that we'd label these inside the HTML with a <div> or <span> wrapper, as floated in comments on https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/831077/ (Bikeshed alert):

typeof="mw:Raw" means Parsoid shouldn't touch it, except for subtrees marked mw:DOM; and typeof="mw:DOM" is applied to the root element in Parsoid's output and means parsoid /can/ touch it, apart from subtrees marked mw:Raw. That collapses the entire tunneling issue to "your tunneled HTML can't contain typeof="mw:DOM" but anything else is protected from parsoid".

Then extensions don't have to register an option, they just need to ensure that their top-level element has an appropriate typeof; and if they were just outputting something from parsoid it will have the typeof="mw:DOM" on the top-level element by default so they don't need to do anything special.

The if an MCR pass wants to protect its squashed ParserOutput content from Parsoid interference, all it has to do is slap a <div typeof="mw:Raw"> around it.

That doesn't work great with the regexp-based postprocessors we currently have inherited from our legacy codebase; regexps aren't great at matching up <div..> and </div> appropriately. Perhaps markup inspired by @ihurbain's annotation markup is better, something like:

<meta typeof="start" data-id="random uuid">
 .... 
<meta typeof="end" data-id="random uuid">

so long as the UUIDs are distinct and matching, the regexp will always be able to match and skip the appropriate section.