
Parsoid should provide top-level information about elements within templates
Open, Needs Triage, Public, Feature

Description

Currently Parsoid provides minimal information about the contents of a template. It gives comprehensive information about the name of the template and the parameters passed to it, but there's nothing about what Parsoid generated beyond the markup we're given -- and the spec is silent on how much we can trust that to contain Parsoid attributes.

My goal for this information is to support references defined or used inside templates, which VE currently cannot see. I've done some speculative work on extracting this information by assuming that the template's internal markup can be trusted. This somewhat works, but rests on some potentially fragile assumptions.

From the VisualEditor perspective, the data contained on nodes within another Parsoid node is a pain to parse, because we assume that we can iterate through the document and entirely discard a node once we've identified it for handling. We're not set up for nodes that might be identified-and-handled and then need a separate conversion pass for their contents.

I therefore suggest hoisting the data Parsoid has up onto the template node, as part of its data-mw attribute.

This is a current simplified infobox template:

{{Infobox|foo=This is a reference in the reflist<ref name="infobox-used"/>, and this is defined right here<ref name="infobox-defined">I am a referenced defined inside the reflist</ref>}}

As you can see, the parameter contains two <ref> usages: one reuses a named reference, the other defines one inline.

This turns into this markup:

<table class="toccolours tpl-infobox" about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"Infobox","href":"./Template:Infobox"},"params":{"foo":{"wt":"This is a reference in the reflist&lt;ref name=\"infobox-used\"/>, and this is defined right here&lt;ref name=\"infobox-defined\">I am a referenced defined inside the reflist&lt;/ref>"}},"i":0}}]}' id="mwAg">
<caption style="font-size: 125%;"><strong> SandboxReferences </strong></caption>

<tbody><tr><th>Foo</th><td>This is a reference in the reflist<sup about="#mwt2" class="mw-ref reference" id="cite_ref-infobox-used_1-0" rel="dc:references" typeof="mw:Extension/ref" data-mw='{"name":"ref","attrs":{"name":"infobox-used"}}'><a href="./SandboxReferences#cite_note-infobox-used-1" id="mwAw"><span class="mw-reflink-text" id="mwBA"><span class="cite-bracket" id="mwBQ">[</span>1<span class="cite-bracket" id="mwBg">]</span></span></a></sup>, and this is defined right here<sup about="#mwt3" class="mw-ref reference" id="cite_ref-infobox-defined_2-0" rel="dc:references" typeof="mw:Extension/ref" data-mw='{"name":"ref","attrs":{"name":"infobox-defined"},"body":{"id":"mw-reference-text-cite_note-infobox-defined-2"}}'><a href="./SandboxReferences#cite_note-infobox-defined-2" id="mwBw"><span class="mw-reflink-text" id="mwCA"><span class="cite-bracket" id="mwCQ">[</span>2<span class="cite-bracket" id="mwCg">]</span></span></a></sup></td></tr>

</tbody></table>

...and pulling out the data-mw for easier viewing:

{
  "parts": [
    {
      "template": {
        "target": {
          "wt": "Infobox",
          "href": "./Template:Infobox"
        },
        "params": {
          "foo": {
            "wt": "This is a reference in the reflist&lt;ref name=\"infobox-used\"/>, and this is defined right here&lt;ref name=\"infobox-defined\">I am a referenced defined inside the reflist&lt;/ref>"
          }
        },
        "i": 0
      }
    }
  ]
}

However, the generated <ref> markup inside the template output contains useful information. It could be pulled up into the parent's data-mw like this:

{
  "parts": [
    {
      "template": {
        "target": {
          "wt": "Infobox",
          "href": "./Template:Infobox"
        },
        "params": {
          "foo": {
            "wt": "This is a reference in the reflist&lt;ref name=\"infobox-used\"/>, and this is defined right here&lt;ref name=\"infobox-defined\">I am a referenced defined inside the reflist&lt;/ref>"
          }
        },
        "i": 0
      }
    }
  ],
  "contains": [
    {"name":"ref","attrs":{"name":"infobox-used"},"about":"#mwt2"},
    {"name":"ref","attrs":{"name":"infobox-defined"},"body":{"id":"mw-reference-text-cite_note-infobox-defined-2"},"about":"#mwt3"}
  ]
}
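
A minimal sketch of how such a hoisted structure could be derived from the markup above, using only the Python standard library. This is an illustration of the idea, not Parsoid code: the `contains` key is the proposal from this task, the names `ContainsCollector` and `hoist_contains` are made up for the example, and it assumes the transclusion is a single wrapper element whose nested extension nodes carry `typeof="mw:Extension/..."` and their own data-mw.

```python
import html
import json
from html.parser import HTMLParser

# HTML void elements never get an end tag, so they must not affect depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ContainsCollector(HTMLParser):
    """Collects data-mw from extension nodes inside the first transclusion."""

    def __init__(self):
        super().__init__()
        self.depth = 0             # number of currently open elements
        self.wrapper_depth = None  # depth at which the wrapper was opened
        self.wrapper_data = None   # parsed data-mw of the wrapper
        self.contains = []         # hoisted entries, in document order
        self.done = False          # True once the wrapper has been closed

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        types = (a.get("typeof") or "").split()
        if not self.done and "data-mw" in a:
            if self.wrapper_data is None:
                if "mw:Transclusion" in types:
                    self.wrapper_depth = self.depth
                    self.wrapper_data = json.loads(a["data-mw"])
            elif any(t.startswith("mw:Extension/") for t in types):
                # Copy the nested data-mw and record the node's about id.
                entry = json.loads(a["data-mw"])
                entry["about"] = a.get("about")
                self.contains.append(entry)
        if tag not in VOID:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        self.depth -= 1
        if self.wrapper_data is not None and self.depth == self.wrapper_depth:
            self.done = True  # the wrapper element has been closed


def hoist_contains(markup):
    """Return the wrapper's data-mw plus the hypothetical `contains` list."""
    parser = ContainsCollector()
    parser.feed(markup)
    parser.close()
    if parser.wrapper_data is None:
        return None
    return {**parser.wrapper_data, "contains": parser.contains}


def _attr(value):
    # Serialize a dict as an HTML-escaped JSON attribute value.
    return html.escape(json.dumps(value), quote=True)


# Reduced version of the infobox markup, built programmatically.
wrapper = {"parts": [{"template": {
    "target": {"wt": "Infobox", "href": "./Template:Infobox"},
    "params": {}, "i": 0}}]}
used = {"name": "ref", "attrs": {"name": "infobox-used"}}
defined = {"name": "ref", "attrs": {"name": "infobox-defined"},
           "body": {"id": "mw-reference-text-cite_note-infobox-defined-2"}}
SAMPLE = (
    f'<table about="#mwt1" typeof="mw:Transclusion" data-mw="{_attr(wrapper)}">'
    f'<tbody><tr><td>used<sup about="#mwt2" typeof="mw:Extension/ref" '
    f'data-mw="{_attr(used)}">[1]</sup><br>defined<sup about="#mwt3" '
    f'typeof="mw:Extension/ref" data-mw="{_attr(defined)}">[2]</sup>'
    '</td></tr></tbody></table>'
    f'<sup about="#mwt4" typeof="mw:Extension/ref" data-mw="{_attr(used)}">[1]</sup>'
)
result = hoist_contains(SAMPLE)
```

The trailing `#mwt4` reference sits outside the wrapper, so it is not hoisted. Real transclusions can span several sibling nodes sharing one about id, which this sketch does not handle.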

That'd be enough information for our current needs, and adding the about attributes to the data would make it trivial to extract more from the markup if needed, without requiring a full conversion pass.
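
To illustrate the consumer side, here is a rough sketch of reading references from the proposed structure without touching the template's inner markup. The `contains` key is the hypothetical addition from this task, and `list_refs` is an invented helper name, not a VE or Parsoid API.

```python
import json

# The proposed data-mw from above, with params trimmed for brevity.
PROPOSED = json.loads("""
{
  "parts": [
    {"template": {"target": {"wt": "Infobox", "href": "./Template:Infobox"},
                  "params": {}, "i": 0}}
  ],
  "contains": [
    {"name": "ref", "attrs": {"name": "infobox-used"}, "about": "#mwt2"},
    {"name": "ref", "attrs": {"name": "infobox-defined"},
     "body": {"id": "mw-reference-text-cite_note-infobox-defined-2"},
     "about": "#mwt3"}
  ]
}
""")


def list_refs(data_mw):
    """Summarize the <ref> entries hoisted onto a template node."""
    refs = []
    for entry in data_mw.get("contains", []):
        if entry.get("name") != "ref":
            continue  # some other extension; ignore in this sketch
        refs.append({
            "name": entry.get("attrs", {}).get("name"),
            "about": entry.get("about"),
            # A body means the reference content is defined at this usage.
            "defines_body": "body" in entry,
        })
    return refs


refs = list_refs(PROPOSED)
```

The about value in each summary is what a client would use to look up the corresponding node in the markup if it needed more than the hoisted data.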

The potential drawback is that the example above is very simple; in complicated template situations (e.g. the average enwiki Infobox) pulling all the data-mw up like this might cause a lot of duplication. This could either be accepted as the cost of the improvement, or be mitigated by exploiting the fact that the spec currently makes no guarantees about the contents of a template: strip the internal data-mw attributes, and specify that anyone who actually parses the contents may need to reconstruct them from the wrapper.

There are also open questions about how nested templates should be represented -- should everything be pulled up into the top-level element's contains, or should consumers expect to recurse? (I'd hope for the former, but the latter might be simpler to implement.)
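
As a rough illustration of the two options, the sketch below assumes a purely hypothetical recursive shape in which an entry may carry its own `contains`, and shows that a client could flatten it into a single top-level list in document order. Neither shape is specified anywhere; `flatten_contains` is an invented name.

```python
def flatten_contains(entries):
    """Flatten nested `contains` lists into one list, parents first."""
    flat = []
    for entry in entries:
        # Emit the entry itself without its nested list...
        flat.append({k: v for k, v in entry.items() if k != "contains"})
        # ...followed by everything nested beneath it.
        flat.extend(flatten_contains(entry.get("contains", [])))
    return flat


# Hypothetical recursive form: a nested template that contains a ref.
nested = [
    {"name": "ref", "attrs": {"name": "infobox-used"}, "about": "#mwt2"},
    {"name": "template", "about": "#mwt5", "contains": [
        {"name": "ref", "attrs": {"name": "inner"}, "about": "#mwt6"},
    ]},
]
flat = flatten_contains(nested)
```

If the recursive form were chosen, a helper like this is all a client would need to get back to the flat view, at the cost of every client having to write it.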

Event Timeline

Support for reference lists within templates would definitely also benefit from that. See T399937: [Epic] Known issues and workarounds with VE and {{reflist}}

tl;dr: There are already some hard-coded workarounds in VE to cover the different ways the {{reflist}} template could be built.

We discussed this during the Content-Transform-Team tech forum today. Our current position is that traversal of transcluded content is generally OK, and we have better documented the expectations clients should be able to rely on.

We expect that T419697: Parsoid should provide "template source" information for certain nested constructs will provide information about nested template sources where necessary; let us know if that would meet your needs.