Scraper: Parsoid attaches mw:Transclusion data in <ref> tag to unexpected DOM node
Open, Needs Triage | Public | BUG REPORT

Description

I believe I found a Parsoid issue. Here is a random page to illustrate the issue: https://en.wikipedia.org/wiki/Darreh_Shir,_Chaharmahal_and_Bakhtiari?useparsoid=1. The wikitext contains <ref>{{GEOnet3|-3769380|Darreh Shir}}</ref>. This creates the following HTML:

<span id="mw-reference-text-cite_note-1" class="mw-reference-text">
	<span about="#mwt18" typeof="mw:Transclusion" id="mwHg" data-mw="{&quot;parts&quot;:[{&quot;template&quot;:{&quot;target&quot;:{&quot;wt&quot;:&quot;GEOnet3&quot;,&quot;href&quot;:&quot;./Template:GEOnet3&quot;},&quot;params&quot;:{&quot;1&quot;:{&quot;wt&quot;:&quot;-3769380&quot;},&quot;2&quot;:{&quot;wt&quot;:&quot;Darreh Shir&quot;}},&quot;i&quot;:0}}]}">Darreh Shir can be found at </span>
	<a rel="mw:WikiLink" href="//en.wikipedia.org/wiki/GEOnet_Names_Server" title="GEOnet Names Server" about="#mwt18" id="mwHw">GEOnet Names Server</a>
	<span about="#mwt18" id="mwIA">, at </span>
	<a rel="mw:ExtLink nofollow" href="http://geonames.nga.mil/namesgaz/" about="#mwt18" class="external text" id="mwIQ">this link</a>
	<span about="#mwt18" id="mwIg">, by opening the Advanced Search box, entering "-3769380"  in the "Unique Feature Id" form, and clicking on "Search Database".</span>
</span>

Note where the data-mw that mentions the original template name GEOnet3 is attached. It's attached to the first <span>, as if only that is generated by the template. But this is not true. All <a> and <span> elements in this example are generated by the template. There is either some container element missing where the data-mw should be attached, or it should be attached to the outer <span> instead (however, that is generated by the Cite code).

Event Timeline

ssastry subscribed.

That is handled by the about-id assignment ("#mwt18"). Parsoid marks up a DOM forest (a contiguous list of DOM nodes) as "template-affected". Typically this is just the template output, but it sometimes also swallows up content from the top level when the template's output is not well-structured DOM.

We should probably fix this in our documentation at https://www.mediawiki.org/wiki/Specs/HTML/2.8.0#Regular_transclusions. At this time, clients implicitly understand this, but this Phab task shows that there is a gap in our spec documentation that should be made explicit.

In general, for any node with a typeof and an about id, you should walk the DOM forward while the about id matches; all of those DOM nodes effectively belong to the structure identified by the typeof and about id. However, typeof and data-mw are only attached to the first element in the list.
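
The walk described above might be sketched as follows. This is only an illustration, not Parsoid or client code: the function name transclusion_groups and the shortened fragment are made up for this example, and it assumes the fragment happens to be well-formed XML so that Python's standard library can parse it (real Parsoid output would need a proper HTML parser). It also only looks at element siblings, since ElementTree does not expose text nodes as separate siblings.

```python
import json
import xml.etree.ElementTree as ET

# Shortened, well-formed fragment modelled on the example in this task.
# The last <span> has no about id and stands in for unrelated content.
FRAGMENT = """
<span id="mw-reference-text-cite_note-1" class="mw-reference-text">
<span about="#mwt18" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"GEOnet3"}}}]}'>Darreh Shir can be found at </span>
<a rel="mw:WikiLink" about="#mwt18">GEOnet Names Server</a>
<span about="#mwt18">, at </span>
<a rel="mw:ExtLink nofollow" about="#mwt18">this link</a>
<span about="#mwt18">, by opening the Advanced Search box.</span>
<span>not part of the template</span>
</span>
"""


def transclusion_groups(parent):
    """Yield (data_mw, nodes) for each transclusion among parent's children.

    Hypothetical helper: starts at a child carrying typeof="mw:Transclusion"
    and collects the following siblings while their about id matches.
    """
    children = list(parent)
    i = 0
    while i < len(children):
        node = children[i]
        about = node.get("about")
        typeof = node.get("typeof", "")
        if about and "mw:Transclusion" in typeof.split():
            group = [node]
            j = i + 1
            # Walk forward while the about id matches.
            while j < len(children) and children[j].get("about") == about:
                group.append(children[j])
                j += 1
            # data-mw lives only on the first node of the group.
            yield json.loads(node.get("data-mw", "{}")), group
            i = j
        else:
            i += 1


root = ET.fromstring(FRAGMENT)
groups = list(transclusion_groups(root))
for data_mw, nodes in groups:
    target = data_mw["parts"][0]["template"]["target"]["wt"]
    print(target, [n.tag for n in nodes])
```

Applied to the fragment above, all five about="#mwt18" elements end up in one group whose data-mw names GEOnet3, and the trailing <span> is left out.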

There is one edge case here that we had forgotten about and will be fixing -- exposed by T363170 (even though it is known inside Parsoid).

thiemowmde reopened this task as Open. Edited Wed, May 15, 7:32 AM
thiemowmde added a project: Documentation.

I don't think I can find anything that explains the about attribute on the current https://www.mediawiki.org/wiki/Specs/HTML. Is this what you are referring to when you say there is a gap?

So far I was under the impression that an id like #mwt18 would refer to something somewhere else. How can I find the source when all I have is e.g. the second <span about="#mwt18">? Essentially, is "mwt18" ever declared anywhere? As I understand it so far the only thing I can do is to query all elements that have [about="#mwt18"] and either hope that the first one has the typeof and data-mw I'm looking for (is this a guarantee?) or use [about="#mwt18"][typeof] as my query.
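
For illustration, the two queries mentioned above could look like this with Python's standard library. This is a sketch under assumptions: the fragment is invented and well-formed XML, ElementTree's limited XPath stands in for CSS selectors, and the idea that the first match carries typeof and data-mw rests only on the explanation given earlier in this thread, not on a documented guarantee.

```python
import xml.etree.ElementTree as ET

# Invented fragment: two transclusions, the first spanning three elements.
FRAGMENT = """
<p>
<span about="#mwt18" typeof="mw:Transclusion" data-mw="{}">Darreh Shir can be found at </span>
<a about="#mwt18">GEOnet Names Server</a>
<span about="#mwt18">, at </span>
<span about="#mwt19" typeof="mw:Transclusion" data-mw="{}">other template</span>
</p>
"""

root = ET.fromstring(FRAGMENT)
about_id = "#mwt18"

# Rough equivalent of the CSS selector [about="#mwt18"]: every member.
members = root.findall(f".//*[@about='{about_id}']")

# Rough equivalent of [about="#mwt18"][typeof]: only the node with typeof.
heads = root.findall(f".//*[@about='{about_id}'][@typeof]")

print(len(members), len(heads), heads[0].get("typeof"))
```

In this fragment both approaches arrive at the same element: the single match of the second query is also the first match of the first query.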

I'm also curious what the "t" in "mwt" stands for?

I think it's fair to keep this ticket open and use it to solve this confusion, even if it's only a Documentation change. It's also a ticket we would like to track properly in our current sprint WMDE-TechWish-Sprint-2024-05-08.

thiemowmde renamed this task from Parsoid attaches mw:Transclusion data in <ref> tag to wrong DOM node to Scraper: Parsoid attaches mw:Transclusion data in <ref> tag to unexpected DOM node. Wed, May 15, 7:33 AM