Page MenuHomePhabricator

data-mw info is clobbered by template annotations
Open, MediumPublic

Description

When a node that would normally contain data-mw info is in first encapsulation position, we drop all that info in favour of the template info.

This is fine for serialization, since the template would take precedence, but other uses of the HTML may still have a need for it.

In particular, the media post-processing pass from T153080 wants to make use of options stored there,
https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/463125/14/lib/wt2html/pp/processors/wrapTemplates.js

An earlier version of Cite extension touched on it as well,
https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/436192/
but that was refactored out in,
https://github.com/wikimedia/parsoid/commit/d8cc76c56f150966359d303884c2b25cfde249b5

@ssastry asks,

So, the question is if we should unconditionally stop clobbering the previously existing data-mw info and provide access to it in the output HTML in some other systematic way.

Event Timeline

matmarex added a subscriber: matmarex.

This issue also limits the ability of VE to handle template-generated references, in particular being able to reuse them by name (since the name data is clobbered). Related: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/463595

LGoto triaged this task as Medium priority.Apr 17 2020, 4:22 PM
LGoto moved this task from Backlog to Needs Investigation on the Parsoid board.

Seems like a "standard" way of combining data-mw (or data-parsoid) when there are multiple pieces on the same node would be useful.
Right now we just add keys together, but that only works as long as the keys don't conflict.
It might be better to switch to using a top-level array instead, which works even when the keys conflict.

typeof attributes can be added to the same attribute, separated by whitespace, and in fact the hasTypeOf machinery in DOMUtils takes special care to ensure that works. But it's quite possible we need to enforce use of an addTypeOf helper to keep us from clobbering the typeofs when we set them, which is probably your point...

I think conceptually, what data-mw represents is information about "content generators" (templates and extensions currently). Conceptually (ignoring the multi-part template mess for just a moment), generators are nested, i.e. a template can generate an extension can generate something else ... or an extension can generate a template can generate something else. So, what we need is a nesting representation, not an aggregation representation.

Aggregation right now is only available for template generators for the multi-part content templating scenarios via the "parts" key. But, conceptually, every piece of the parts key would represent an individual generator. Beyond that, we don't need aggregation.

So, if we figure out a nesting mechanism, I think we may be set.

Now that it is not 2013 and we know a whole lot more about all our generator types and our general future direction (nested content generators generating well-formed DOM fragments) I will see if I can work out a solution that follows a coherent logic. What I don't know is if there is a backward compatible solution or if we need to do a version bump. But, if we are going to do a version bump, we should probably co-ordinate that with <figure-inline> to <span> changes for inline images.

Given that the nested generator with collapsed wrappers can only ever happen when the template wrapper also happens to be an extension content wrapper, we can handle it by adding a 'nested-data-mw' key to the parts array element as follows:

data-mw={
   "parts": [
      { "template": {"target": {"wt": .. , "href": ... }, "params": { "key": { "wt": .. } }, "i": ...  }, nested-data-mw: { 'typeof': 'mw:Extension/blah', ... } },
   ]
}

The typeof is redundant, but might help clients.

When there are multiple parts to the templated content, it shouldn't be possible for the wrapper to have multiple typeof annotations and hence there cannot be any nesting of data-mw objects either.

Change 607632 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Support nested data-mw objects to avoid clobbering useful info

https://gerrit.wikimedia.org/r/607632

Change 615323 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Support nested data-mw/typeof in extensions

https://gerrit.wikimedia.org/r/615323

I proposed a more regular syntax for this on Signal, that is extensible and not special-cased for templates.
The idea was that we switch to data-mw-nested which has an about id as a top-level key. So:

<span typeof="mw:Transclusion" about="#mwt1" data-mw='{"parts":[.....]}'>...

is short for

<span typeof="mw:Transclusion" about="#mwt1" data-mw-nested='{"mwt1":{"parts":[....]},....}'>...

The outermost always goes in a data-mw, and when you get a clobber any non-outermost go in data-mw-nested, but from an API perspective internal to Parsoid we try to use getDataMwNested(aboutid) instead of getDataMw so that we never lose info. (The about id should always be unique.)

The advantage of nesting in data-mw rather than having a top-level data-mw-nested is that we can manage the inlining / outlining of the attribute (which we'll need to do for purpose of HTML payload size) easily. We don't have to mess with multiple attributes.

In any case, having a nested-dmw inside data-mw isn't tied to templates. It can be arbitrary and can use the aboutid-based key scheme you are proposing which makes sense. And, getDataMwNested(aboutid) can always check if aboutid matches the aboutid on the node in which case it returns the default outer level one.

So it seems that there are two use cases here. In one, there's a strict nesting: {{1x|{{1x|{{1x foo}} bar}} bat}} which either generates <span><span><span>foo</span></span><span> or else collapses the three templates into one span wrapper but then needs to describe all three template invocations.

Consider: {{1x|{{1x|{{1x|foo}}<p>bar}}<p>bat}} -- I need the different abouts to describe the first, second, and third paragraphs as belonging to different templates.

My proposal above was:

<p about="#mwt1 #mwt2 #mwt3" data-mw-nested='{"mwt1":{"parts":....},"mwt2":{"parts":...},"mwt3":{"parts:...."}'>
foo
</p>
<p about="#mwt1 #mwt2">
bar
</p>
<p about="#mwt1">
bat
</p>

and that *ought* to also apply if some of those nested thingies were extensions or other language variant markup, or other things with data-mw. For compatibility we'd treat data-mw as shorthand for setting data-mw[first element of about attribute]=getattribute(data-mw) when fetching data-mw-nested.

But in this use case since there's strict nesting you could also do something like what I understand to be proposed in https://gerrit.wikimedia.org/r/607632 --- you need to fix it to add multiple elements to the about attribute, and the each layer of nestedDmw is assumed to pop the first element off the about attribute and describe that:

<p about="#mwt1 #mwt2 #mwt3" typeof="mw:Transclusion" data-mw='{"parts":[...],nestedDmw:{"parts":[.....],nestedDmw:{"parts":[...]}}'>foo</p>
<p about="#mwt1 #mwt2">bar</p>
<p about="#mwt1">bat</p>

(The externalized data-mw map would also change.)

But not all use cases have as elegant a mapping. Embedded captions and language variant markup, for example:

$ echo '-{foo {{1x|bar}} bat}-' | php bin/parse.php --domain zh.wikipedia.org
<p data-parsoid='{"dsr":[0,22,0,0]}'><span typeof="mw:LanguageVariant" data-mw-variant='{"disabled":{"t":"foo &lt;span about=\"#mwt1\" typeof=\"mw:Transclusion\" data-parsoid=&apos;{\"pi\":[[{\"k\":\"1\"}]],\"dsr\":[6,16,null,null]}&apos; data-mw=&apos;{\"parts\":[{\"template\":{\"target\":{\"wt\":\"1x\",\"href\":\"./Template:1x\"},\"params\":{\"1\":{\"wt\":\"bar\"}},\"i\":0}}]}&apos;>bar&lt;/span> bat"}}' data-parsoid='{"fl":[],"src":"-{foo {{1x|bar}} bat}-","dsr":[0,22,2,2]}'></span></p>

Here we also have embedded data-mw, but we're hacking around the issue by expanding it as a full attribute and then embedding the HTML with that attribute. This bloats the HTML, because you have to fetch data-mw-variant to render the language variant and so you have to fetch the full data-mw and data-mw-parsoid even if you don't want it.

What I really want is to change the underlying mapping so that the top-level data-mw can also include the nested data-mw. Again, something like:

<span typeof="mw:LanguageVariant" data-mw-nested='{"mwt1":{"parts":[....]}}' data-mw-variant='{"disabled":{"t": "foo <span about=\"#mwt1"...."}'></span>

This proposal is probably not perfect yet. We have to consider how it is serialized when we fetch the data-mw as a json blob instead of embedded as an attribute, and whatever scheme we use has to be generalized to data-parsoid as well, since galleries and language variant markup (at least) needs to be able to preserve data-parsoid information for embedded selser when editing.

But what I'm hoping for is a generic system that can be decoupled from the specifics of template/extension/gallery/language variant markup and provides a means for getting a bit beyond the limits of HTML's attribute system. One way of thinking about things is as a shadow DOM, that (for example) the {{1x|{{1x|{{1x foo}} bar}} bat}} case is "really":

<div about="#mwt1"><div about="#mwt2"><div about="#mwt3"><p>foo</p></div><p>bar</p></div><p>bat</p></div>

And what we're doing is providing a generic mechanism for "hiding" the outer '<div>'s and putting all of their attributes on the <p> tags; and similarly the -{foo {{1x|bar}} bat}- case is "really":

<span><hidden-element>foo <span>bar</span> bat</hidden-element></span>

and again we're embedding 'hidden-element' inside the outer <span> tag (in the case of gallery, to prevent a hidden <p> tag from breaking the parse, and similarly wrt alternate language renderings for language variant markup).

Template-affected attributes are another case:

<div class="{{1x|foo}}">...</div>

wants to be "really":

<div class="<hidden-element>foo</hidden-element>">...</div>

where we hang the template information off the hidden-element, and have a uniform mechanism for recording the "real" tree structure (including data-mw and data-parsoid) associated with the class attribute. (We're currently expanding data-mw/data-parsoid as attributes and storing the shadom DOM as text in dataMw->attributes[..][1]['html'])

We should be able to name those hidden collapsed elements (we're using "about" tags for templates, perhaps we should be allocating about ids for embedded gallery captions and language variant markup) and then those collapsed elements should be able to embed data-mw and data-parsoid in a uniform manner (I'm proposing adding an 'about id' index inside the data-mw/data-parsoid data; though I'm not thrilled about using about as an extra index). The goal is to have a uniform mechanism for traversing into (and editing!) embedded content w/ data-mw and data-parsoid intact, without having to special case templates/template-affected attributes/extensions/language converter/gallery, as ContentUtils::shiftDSR does.

On reflection it seems like there are two separate issues here:

  1. the way we collapse wrappers leads to ambiguity; we should have a generic method specifically for making 'hidden' wrappers and reinflating them. The about mechanism and data-mw={"nested":...} is not a terrible way of doing this, but i'd be more comfortable if the nested property allowed including data from other attributes of the hidden wrappers as well. Perhaps:
<p about="#mwt1" data-mw={..., "nested":{"about":"mwt2", "data-parsoid": ..., "data-mw": {..., "nested":{"about":"mwt3"...}}>foo</p>
<p about="#mwt1" data-mw={"nested": {"about":"mwt2"}>bar</p>
<p about="#mwt1">bat</p>

(Note that as an optimization I'm not double-JSON encoding the embedded data-mw and data-parsoid.)

In the future we might want to simplify this code by simply creating the hidden wrappers in a pre-processing phase: using the 'nested' data-mw to create an actual '<parsoid-wrapper>' DOM element in the tree, and then teaching XMLSerializer to not actually emit these but instead to embed their contents in the first "real" child element inside them.

  1. The fact that attributes can't have DOM trees nested inside them. A number of things can be simplified to this case: template-affected attributes, of course, but also hidden media captions (just store them in a 'hidden-caption' attribute), language-converter markup (again, use a 'variants' attribute), etc. We can add a DOMDataUtils method called "getAttributeDOM(....)" and it should use a uniform mechanism to get a DOMFragment from the same document which shares the parent ID space and data-mw/data-parsoid bundles (ie getDataParsoid should work transparently on any node returned).

As a strawman:

public function getAttributeDOM(Node $node, string $name) {
   $dataMw = getDataMw($node);
   $value = $dataMw->attributes[$name] ?? null;
   if ($value) { return htmlString2domFragment($value); }
   if (!$node->hasAttribute($name)) { return null; }
   return $node->ownerDocument->createTextNode($node->getAttribute($value));
}

Then any traversal (like ContentUtils::shiftDSR) which wanted to be sure to traverse all embedded DOM would simply enumerate the attribute names and call getAttributeDOM() on them and recurse into them. It wouldn't have separate cases for figure-inline captions, template-affected attributes, language-converter, etc.

I was hoping to combine the mechanisms for #1 and #2 here -- the rough idea being that we could embed inline captions explicitly as DOM as long as there was a way to hide/collapse the hidden block elements, in the same way we hide the hidden wrapper elements -- but maybe that's a step too far.

Change 615323 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Preserve typeof on extension first encap wrapper nodes

https://gerrit.wikimedia.org/r/615323

Change 628944 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a10

https://gerrit.wikimedia.org/r/628944

Change 628944 merged by jenkins-bot:
[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.13.0-a10

https://gerrit.wikimedia.org/r/628944

Related: T87274: DOM nodes with multiple typeof values, T50772: Protect Parsoid-generated token-names, about-ids, typeofs from conflicts with user-provided names..

In T275082: Develop a spec for representing a DOM range in serialized Parsoid output the proposal is to have a way to represent a DOM *range*, both internally to Parsoid as well as externally as a serialization form. This is related to "issue #1" from the comment above, where we wanted to have a uniform way to represent "collapsed wrappers". However, the main difference is that T275082 ranges are *not* guaranteed to correspond to complete DOM subtrees (although maybe they should?), although they should still correspond to a contiguous set of nodes in an in-order traversal. So they can be represented as an about ID which is applied to a number of nodes, but adding a temporary <parsoid-wrapper> element to the internal DOM tree is probably not (quite) sufficient.

Change 607632 abandoned by Subramanya Sastry:

[mediawiki/services/parsoid@master] Support nested data-mw objects to avoid clobbering useful info

Reason:

Any proposal / solution for T275082 will address this more generally

https://gerrit.wikimedia.org/r/607632