
Develop a spec for representing a DOM range in serialized Parsoid output
Open, MediumPublic

Description

There are a number of instances where Parsoid needs to represent a DOM range. The obvious cases that Parsoid has worked with so far are the output of templates and extensions. Where the DOM ranges for a template and an extension overlap, there is a clear nesting (ex: extension output contains templates, or template output contains an extension), and in those cases Parsoid has simply privileged the outer range and suppressed information about the inner nested component.

Given this strategy, Parsoid has used a typeof attribute on the first element of the DOM range to indicate the type of range (mw:Transclusion, mw:Extension/*) and a unique about id that is assigned to all the elements of the DOM range.
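To make the current scheme concrete, here is a minimal sketch of how a client might recover ranges from the typeof/about marking described above. The sample markup, the "#mwt1"-style ids, and the collect_ranges helper are illustrative assumptions, not Parsoid's actual output or API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: real Parsoid output carries more attributes.
SAMPLE = """<body>
<p about="#mwt1" typeof="mw:Transclusion">first</p>
<span about="#mwt1">second</span>
<p>unrelated</p>
<div about="#mwt2" typeof="mw:Extension/ref">ext</div>
</body>"""


def collect_ranges(root):
    """Group elements by about id; the first element seen supplies the typeof."""
    ranges = {}
    for el in root.iter():
        about = el.get("about")
        if about is None:
            continue  # element is not part of any marked range
        entry = ranges.setdefault(about, {"typeof": el.get("typeof"), "tags": []})
        entry["tags"].append(el.tag)
    return ranges


ranges = collect_ranges(ET.fromstring(SAMPLE))
for about, info in ranges.items():
    print(about, info["typeof"], info["tags"])
```

Because each element carries a single about value, an element in this scheme can belong to only one range at a time.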

Going forward, we might have other use cases for DOM ranges (ex: annotations -- see T261181) and we might also want all DOM ranges to be extractable rather than arbitrarily picking the outermost one.

So, we need a different spec that lets a DOM node be part of multiple ranges, possibly of different types. The representation scheme for encoding these ranges should be space-efficient and intuitive, and should let clients easily extract the various DOM ranges and manipulate them in an error-free manner without a lot of complexity. Given these requirements, the typeof/about-id mechanism we have been using so far will not work.

We may also need to get feedback from existing Parsoid clients as part of developing this new spec.

Event Timeline

https://phabricator.wikimedia.org/T214241#6849806 contains some earlier discussion, framed at the time as an issue of collapsing wrapper elements.

One big spec question to settle: are the ranges guaranteed to be complete DOM subtrees (or forests)? Or just contiguous nodes in an in-order traversal?

Using a pseudo-element <parsoid-wrapper> just for visualization, are we talking about:

<parsoid-wrapper typeof="mw:Translate">
<p> some text</p>
<table>....</table>
<div> ... </div>
</parsoid-wrapper>

or do we need to represent:

<div>
foo!
<parsoid-wrapper>
bar <b>bat</b>
</div>
<div>
baz
</parsoid-wrapper>
quux
</div>

A somewhat related question is how non-element nodes like Text and Comment are marked, but the way we've been doing that is simply to add span wrappers when necessary. I.e.:

<div>
foo <parsoid-wrapper>bar</parsoid-wrapper> bat
</div>

gets serialized as a "real" span wrapper:

<div>
foo <span ....>bar</span> bat
</div>

while

<parsoid-wrapper>
<div>
foo bar bat
</div>
</parsoid-wrapper>

gets either collapsed into the existing <div> or has a new wrapper element of the appropriate type (another <div> here) added.

We are looking at DOM forests, not a selection of contiguous nodes during an in-order traversal. We don't want to go down that other route: to handle cases like that for templates, we expand the DOM range to span a DOM forest.
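As a rough illustration of what expanding a range to a DOM forest could mean, the sketch below widens a range whose endpoints sit in different subtrees to the run of sibling subtrees under their lowest common ancestor. This is only one plausible reading, not Parsoid's actual algorithm. The expand_to_forest helper and the sample ids are made up for this example, and it assumes start precedes end in document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the range starts inside the first div and ends inside the second.
DOC = ET.fromstring(
    "<body>"
    "<div id='d1'>foo! <b id='start'>bat</b></div>"
    "<div id='d2'><i id='end'>baz</i> quux</div>"
    "<p id='after'>later</p>"
    "</body>"
)


def ancestors(node, parent_of):
    """Return [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def expand_to_forest(root, start, end):
    """Return the sibling subtrees that together cover start..end."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    start_chain = ancestors(start, parent_of)
    end_chain = ancestors(end, parent_of)
    end_set = set(end_chain)
    # Lowest common ancestor: first node on start's chain that is also on end's chain.
    lca = next(node for node in start_chain if node in end_set)
    if lca is start or lca is end:
        return [lca]  # one endpoint contains the other: a single subtree suffices
    first = start_chain[start_chain.index(lca) - 1]  # child of lca containing start
    last = end_chain[end_chain.index(lca) - 1]       # child of lca containing end
    siblings = list(lca)
    return siblings[siblings.index(first):siblings.index(last) + 1]


start = DOC.find(".//*[@id='start']")
end = DOC.find(".//*[@id='end']")
forest = expand_to_forest(DOC, start, end)
print([el.get("id") for el in forest])
```

In this sample the expanded range covers both div elements in full, so content outside the original endpoints ("foo!" and "quux") ends up inside the range.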

As for non-element DOM nodes, there is no strong requirement to add / not-add span wrappers right now. During template wrapping, because of the specific solution we have made there for representing a DOM range, we add artificial span wrappers. But, for example, if we used an alternative representation (ex: meta-tags to start/end a range), we may not need span wrappers.
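For comparison, here is a sketch of how a meta-tag based representation might be consumed. The marker vocabulary is entirely invented for this example: the x:Range/Start and x:Range/End typeof values and the data-range attribute are not part of any existing Parsoid spec. It only shows that start/end markers let one node sit in more than one range.

```python
import xml.etree.ElementTree as ET

# Hypothetical markers: attribute names and typeof values are assumptions.
FRAGMENT = ET.fromstring(
    "<body>"
    "<meta typeof='x:Range/Start' data-range='r1'/>"
    "<p id='a'>one</p>"
    "<meta typeof='x:Range/Start' data-range='r2'/>"
    "<p id='b'>two</p>"
    "<meta typeof='x:Range/End' data-range='r1'/>"
    "<p id='c'>three</p>"
    "<meta typeof='x:Range/End' data-range='r2'/>"
    "</body>"
)


def extract_ranges(parent):
    """Map each range id to the ids of the sibling elements between its markers."""
    open_ranges = {}  # range id -> members collected so far
    finished = {}
    for child in parent:
        kind = child.get("typeof") if child.tag == "meta" else None
        if kind == "x:Range/Start":
            open_ranges[child.get("data-range")] = []
        elif kind == "x:Range/End":
            range_id = child.get("data-range")
            finished[range_id] = open_ranges.pop(range_id)
        else:
            # A content node belongs to every range that is currently open.
            for members in open_ranges.values():
                members.append(child.get("id"))
    return finished


marked = extract_ranges(FRAGMENT)
print(marked)
```

Here the element with id "b" is reported in both r1 and r2, which a single about attribute per element cannot express.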

ssastry triaged this task as Medium priority. Feb 22 2021, 11:15 PM