Page MenuHomePhabricator

Update data-mw encoding for <ref> tags to point to the HTML content in the <references /> output rather than duplicating it
Closed, ResolvedPublic0 Estimated Story Points

Description

Consider this wikitext:

X <ref>This is a '''ref'''</ref>
<references/>

The parsed HTML for it is:

<p data-parsoid='{"dsr":[0,32,0,0]}'>X <span about="#mwt2" class="reference" id="cite_ref-1" rel="dc:references" typeof="mw:Extension/ref" data-parsoid="{&quot;src&quot;:&quot;&lt;ref>This is a '''ref'''&lt;/ref>&quot;,&quot;dsr&quot;:[2,32,5,6]}" data-mw='{"name":"ref","body":{"html":"This is a &lt;b data-parsoid=&#39;{\"dsr\":[17,26,3,3]}&#39;>ref&lt;/b>"},"attrs":{}}'><a href="#cite_note-1">[1]</a></span></p>
<ol class="references" typeof="mw:Extension/references" about="#mwt4" data-parsoid='{"src":"&lt;references/>","dsr":[33,46,2,2]}' data-mw='{"name":"references","attrs":{}}'><li about="#cite_note-1" id="cite_note-1"><span rel="mw:referencedBy"><a href="#cite_ref-1">↑</a></span> This is a <b>ref</b></li></ol>

The HTML for the <ref> tag is present in the <ol> section. So, we can reduce side of data-mw by having a new property in data-mw for <ref> tags that points to the DOM node (ex: a CSS selector would do the trick).

Doing this properly to completion requires the following:

(a) Identifying fixes to make sure that the HTML is fully identical. In this example above, data-parsoid is missing from the html emitted in the references section.
(b) Proposed change to the spec with an updated data-mw.
(c) Checking in with VE and others to make sure nothing will break with this change.
(d) Creating dependent tasks for clients like VE to fix things on their end and deploy backward compatibility fixes first so that they work with old and new HTML
(e) Deploy Parsoid change.

Event Timeline

ssastry raised the priority of this task from to Medium.
ssastry updated the task description. (Show Details)
ssastry subscribed.

Ed, Roan, and I had a very quick chat about this on Friday (Jan 31st) at the WMF offices. It looks like this could work for them. They do something equivalent on the VE end anyway. Gabriel and I had some chat about this earlier.

In terms of spec changes, there are a few things to keep in mind:
(a) We may need a wrapper around the reference text so that we can reference it directly as #<li-cite-id-here>.<wrapper-classname>. The Cite CSS patch adds a wrapper which can be used.
(b) Now that data-mw will no longer have a HTML field, we need to ensure that <ref name='x'>..</ref> and <ref name='x' /> continues to be distinguishable
(c) For creating new refs, clients should still be able to just return a 'html' field in the data-mw of the <ref> (as it is right now) without have to bother with selectors to a <references> section. This would be important for inline editing on mobile for example where they can just return the <ref> without having to bother with creating / modifying the <references> section html and returning it.

The required information is basically already in the Parsoid HTML, because the href attribute on the link inside each reference matches the ID of the li in the references list.

So I'll start writing a patch for VE support for this assuming an HTML spec like:

<span typeof="mw:Extension/ref" data-mw='{"name":"ref","reuse":true}'>
    <a href="#cite_note-1">[1]</a>
</span>
[...]
<ol typeof="mw:Extension/references">
    <li id="#cite_note-1">
        <span rel="mw:referencedBy">....</span>
        Content goes here
   </li>
</ol>

I feel like the ID naming scheme could do with being changed, but the above is a subset of what Parsoid already outputs, except for the "reuse":true part (which is equivalent to the presence/absence of "body":{"html":"..."} anyway). So I could already start developing against this and VE could already use ref contents from the references list without even needing a change in the Parsoid output, and then once that's done we could flip the switch for Parsoid.

How about "body": {"id": "cite-note-1"} instead of reuse: true?

@GWicke how does that distinguish between a reference definition and a re-use? I think we mentioned in our meeting that we don't really care about that information, as long as you guys preserve it in reasonable circumstances. At the moment we have a lot of code working out which reference was the defining reference.

Correction: we don't actually need the "reuse":true part at all, because with the new DOM spec VE won't have to care which reference defines the content; that'll be Parsoid's job.

How about "body": {"id": "cite-note-1"} instead of reuse: true?

I'd rather not duplicate the ID. Right now it's already set on the <a> tag, and should continue to be so clicking references in view mode will work. So why not just keep it there?

@Catrope, the id stuff is specific to Parsoid's cite implementation, and does not generalize to other extensions at all. I'd prefer to keep this generic if at all possible. The overhead of having the id string in data-mw is pretty minimal, especially with gzip.

So the current logic we implement in VE is described in ve.dm.MWReferenceNode.static.toDomElements

For newly created references we always place the contents in the first instance:

<ref name="foo">bar</ref><ref name="foo"/><ref name="foo"/>

When a reference originally had a body we always put the body back in that reference. If that reference is copied and pasted we make sure to strip the flag indicating it original had a body (contentsUsed in ve.dm.MWReferenceNode.prototype.getClonedElement), so we know not to add the body to the clones.

So, in terms of how to tackle this, my preference is as follows:

I would prefer that Parsoid does not yet (in this first step) deal with moves / deletions / etc. of refs, i.e. we should continue to have VE do what it has been doing when a reusable ref is deleted. So, we need a data-mw mechanism for VE to know what the primary reference is. So far, VE knew this by detecting the presence of data-mw.body.html .. .so, if Parsoid adds data-mw.body.id, VE can continue to figure out what the primary reference is. I imagine this will not be a big change in the VE code?

The reason I don't want to make a drastic change is because selser will break otherwise. Because, when primary named ref is deleted, the content has to move to another instance of that ref. which could be in an otherwise unmodified part of the DOM. However, plain DOM-Diff will not detect this (and break selser) unless the DOM is fixed up. So, we would need another pass (before DOM Diff runs) that fixes up the DOM as necessary. We can do that as a later step in Parsoid. At that point, we can relieve VE and other clients of having to fix up <ref>s when they are edited, unless of course, VE needs to do it for some other reason.

Does this sound reasonable? Or, am I missing something?

OK, that sounds reasonable. Turns out to be easier to code, as well.

T88650 has a specification and an implementation of backwards-compatible behavior in VE supporting data-mw.body.id.

Note that this implementation requires that the reflist output be changed to wrap the contents separately:

Parsoid currently outputs:

<li about="#cite_note-3" id="cite_note-3"><span rel="mw:referencedBy">[...]</span> Contents go here</li>

T88650 needs:

<li about="#cite_note-3" id="cite_note-3"><span rel="mw:referencedBy">[...]</span> <span id="mw-cite-foo">Contents go here</span></li>

Further notes:

  • mw-cite-foo as a naming scheme is just an example, I don't care what Parsoid ends up using
  • data-mw.body.id is looked up in the document, not in the reflist (there might be multiple reflists after all), so Parsoid would need to guarantee this ID is unique (wikitext input containing <span id="mw-cite-foo"> should not be able to break this)
  • Parsoid does not currently render missing reflists; it would need to start doing this for this scheme to work
    • The PHP parser used to render an error for missing reflists, but since https://gerrit.wikimedia.org/r/#/c/141583/ it renders missing reflists at the bottom of the page; so Parsoid implementing this would bring it closer to the PHP parser's behavior
    • Note that multiple groups can be missing a reflist, in which case multiple reflists need to be rendered

Note that this implementation requires that the reflist output be changed to wrap the contents separately:

Parsoid currently outputs:

<li about="#cite_note-3" id="cite_note-3"><span rel="mw:referencedBy">[...]</span> Contents go here</li>

T88650 needs:

<li about="#cite_note-3" id="cite_note-3"><span rel="mw:referencedBy">[...]</span> <span id="mw-cite-foo">Contents go here</span></li>

The new Parsoid HTML in the patch for T86782 already wraps the actual reference text in a <span class="reference-text">:

<li about="#cite_note-1" id="cite_note-1"><a href="#cite_ref-1" rel="mw:referencedBy"><span class="mw-linkback-text">↑ </span></a><span class="reference-text">foo boo</span></li>

It wouldn't be difficult to add an unique id there too.

Further notes:

  • mw-cite-foo as a naming scheme is just an example, I don't care what Parsoid ends up using

I've been trying to output ids identical to what the Cite.php extension produces… But this would be a new one, right?

  • data-mw.body.id is looked up in the document, not in the reflist (there might be multiple reflists after all), so Parsoid would need to guarantee this ID is unique (wikitext input containing <span id="mw-cite-foo"> should not be able to break this)

In the case of wikitext input with the same id (like <span id="mw-cite-foo">), should Parsoid just generate a new id?

  • Parsoid does not currently render missing reflists; it would need to start doing this for this scheme to work
    • The PHP parser used to render an error for missing reflists, but since https://gerrit.wikimedia.org/r/#/c/141583/ it renders missing reflists at the bottom of the page; so Parsoid implementing this would bring it closer to the PHP parser's behavior
    • Note that multiple groups can be missing a reflist, in which case multiple reflists need to be rendered

I've created T88660 for this. The patch you linked to only does so for the default references group, though. Testing at https://en.wikipedia.org/wiki/User:Marcoil/Tests/Cite confirms that refs in groups not explicitly invoked in a <references/> produce an error, not a new reflist.


A more general question about implementing this: Would it be worth to do so on top of the Cite CSS patch? That one already changes the produced HTML quite a bit…

@Catrope, so, there is another selser issue that I had not fully thought through and that just occured to me .. So, if a reference is edited in VE, the data-mw for the corresponding <ref> should somehow reflect a change, otherwise, selser will happily serialize the entire <ref> with the old wikitext. I am thinking what might be a good way to introduce an edit into that data-mw. Maybe add back the edited html? or change the id? @GWicke .. any smarter ideas?

@Catrope, so, there is another selser issue that I had not fully thought through and that just occured to me .. So, if a reference is edited in VE, the data-mw for the corresponding <ref> should somehow reflect a change, otherwise, selser will happily serialize the entire <ref> with the old wikitext. I am thinking what might be a good way to introduce an edit into that data-mw. Maybe add back the edited html? or change the id? @GWicke .. any smarter ideas?

In my current implementation, any changes are still written to data-mw.body.html, not to the <li>. So data-mw.body.id is only used as a read-only interface and only if data-mw.body.html isn't present. data-mw.body.html is still the only interface used for writing.

Does that make sense?

@Catrope, so, there is another selser issue that I had not fully thought through and that just occured to me .. So, if a reference is edited in VE, the data-mw for the corresponding <ref> should somehow reflect a change, otherwise, selser will happily serialize the entire <ref> with the old wikitext. I am thinking what might be a good way to introduce an edit into that data-mw. Maybe add back the edited html? or change the id? @GWicke .. any smarter ideas?

In my current implementation, any changes are still written to data-mw.body.html, not to the <li>. So data-mw.body.id is only used as a read-only interface and only if data-mw.body.html isn't present. data-mw.body.html is still the only interface used for writing.

Does that make sense?

Yes, that works.

Change 191593 had a related patch set uploaded (by Marcoil):
WIP: T88290: Only output <ref> contents in <references>

https://gerrit.wikimedia.org/r/191593

Patch-For-Review

In the (admittedly extreme) case of [[Barack_Obama]], the removal of duplicates reduces the HTML size by a little over 1M ;)

The current patch can't remove the duplication for <ref>s inside <references>, so the actual size reduction in [[Barack_Obama]] is lower: ~884K. I'll keep investigating to find a way to remove those cases too.

The current patch can't remove the duplication for <ref>s inside <references>, so the actual size reduction in [[Barack_Obama]] is lower: ~884K. I'll keep investigating to find a way to remove those cases too.

Latest patch fixes this case too, and improves size reductions to more than 1.1M.

Change 191593 merged by jenkins-bot:
T88290: Only output <ref> contents in <references>

https://gerrit.wikimedia.org/r/191593

Change 193417 had a related patch set uploaded (by Marcoil):
T88290: Keep separate content for multiple <ref>s with same name

https://gerrit.wikimedia.org/r/193417

The current patch creates a problem for pages that have multiple <ref>s with the same name but different content. For an example, see the <ref> named "DeVries2007" in https://en.wikipedia.org/wiki/History_of_weapons.

To fix this, the patch at https://gerrit.wikimedia.org/r/193417 checks if a <ref> has different content from the first one with the same name and in that case it outputs its contents in body.html as before.

So, this example

A <ref name="foo">Foo one</ref>
B <ref name="foo">Foo two</ref>
C <ref name="foo" />

would render the <ref>s as

A <span … data-mw='{"name":"ref","body":{"id":"mw-reference-text-cite_note-foo-1"},"attrs":{"name":"foo"}}'>…</span>
B <span … data-mw='{"name":"ref","body":{"html":"Foo two"},"attrs":{"name":"foo"}}'>…</span>
C <span … data-mw='{"name":"ref","attrs":{"name":"foo"}}'>…</span>

This preserves the different content for the 2nd <ref>.
@Catrope, would this work on the VE side?

Change 193417 merged by jenkins-bot:
T88290: Keep separate content for multiple <ref>s with same name

https://gerrit.wikimedia.org/r/193417

All patches now in production.