Unexpected DOM diff on a <ref> node after an edit
Closed, ResolvedPublic0 Estimated Story Points
Actions

Description

See bug report which is reproducible.

I followed the DOM dumping instructions on the debugging page to get the DOM in VE before/after a trivial edit.

I also fetched the data-parsoid for the tid from RESTBase and saved it locally (and edited it to add a "parsoid" wrapper that the Parsoid API expects).

I then ran the following script:

parse.js --html2wt --dump dom:post-dom-diff --selser --pbinfile /tmp/dp --oldhtmlfile /tmp/orig.html --oldtextfile /tmp/wt < /tmp/new.html

Examining the dumped DOM with dom-diff annotations, I indeed find that the <ref> node has marked as having its wrapper modified. I looked at the HTML files and there seems to be some diffs in the DOM nodes. I am stopping here and filing this bug report. I don't know whether this is a VE issue (not sure why the DOM node changed, even if the content is identical), or whether this is an edge case inside Parsoid in terms of its expectations around how <ref> nodes are represented.

Details

	Subject	Repo	Branch	Lines +/-
	Update VE core submodule to master (0217e5819)	mediawiki/extensions/VisualEditor	master	+1 -1
	Match entity encoding in HTML comments to Parsoid	VisualEditor/VisualEditor	master	+6 -7

Customize query in gerrit

Related Objects

Mentioned In: T249940: Audit whether DOMDiff code correctly handles encoded entities in embedded HTML strings in data-mw properly
T196025: Identify editing-client DOM transformations that Parsoid needs to account for during DOM diffs
Mentioned Here: rGVED0217e581966d: Merge "Match entity encoding in HTML comments to Parsoid"
T249940: Audit whether DOMDiff code correctly handles encoded entities in embedded HTML strings in data-mw properly
rGVED48f71d121b3f: Encode as few characters in comments as possible

Event Timeline

• ssastry created this task.Sep 5 2016, 4:10 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 5 2016, 4:10 AM

Separately, looks like we need a helper script to do some of these steps (finding the tid, fetching data-parsoid from RB, massaging it by adding the external wrapper) for ease of debugging (since things are a bit more involved compared to how they were a couple years back).

This is a "bug" in Parsoid's dom-diff code it appears.

Looks like VE is inlining the HTML for the ref into the data-mw blob. Parsoid then treats the change in data-mw as an edit of the <ref> (the change is that data-mw in orig HTML looks like { id: ... } whereas the new data-mw looks like { id: ..., html: ...})

Parsoid's DOM diff could be smarter about ignoring the inlining of the data-mw.html for the mw:Extension/ref span node if the HTML has not actually changed.

But, separately, why is VE inlining the HTML for unedited <ref>s? Is this a new change?

Jdforrester-WMF added a project: Cite.Sep 6 2016, 7:21 PM

Jdforrester-WMF set the point value for this task to 0.

Jdforrester-WMF moved this task from To Triage to External and Administrivia on the VisualEditor board.Sep 6 2016, 7:23 PM

The ideal fix here would be for VE not to inline the HTML for unedited <ref> nodes.

Otherwise, we'll have to do a rewrite of the data-mw equality code which will need special-case checks OR some amount of custom normalization for <ref> nodes for the existing code to work. https://github.com/wikimedia/parsoid/blob/master/lib/html2wt/DOMDiff.js#L65-L155 is the existing code.

But, meanwhile, I'll see if there is a simple normalization that I can do to preserve the cleanliness and maintainability of the existing code.

Separately, I am surprised that a lot more edits aren't being dirtied if VE does this kind of inlining routinely -- so, this leads me to suspect that this is either something new, or VE has hit some edge case that is causing this.

https://github.com/wikimedia/mediawiki-extensions-Cite/blob/master/modules/ve-cite/ve.dm.MWReferenceNode.js#L103

Ideally VE wouldn't have to work out where to put the reference body at all (i.e. first instance of the reference, unless explicitly defined otherwise in the wikitext) and Parsoid would handle all that for us.

Izno moved this task from Unsorted backlog to External on the Cite board.Sep 6 2016, 8:43 PM

I wonder if it is the "" in the wikitext that Parsoid represents as "" that is the source of the problem. Inside VE, once entities are decoded, the two strings might show up as different.

The VE code that is affected is https://github.com/wikimedia/mediawiki-extensions-Cite/blob/master/modules/ve-cite/ve.dm.MWReferenceNode.js#L167-L171

• Elitre subscribed.Sep 12 2016, 9:55 AM

*bump* @Esanders, see comment above. Can you take a look?

• ssastry triaged this task as Medium priority.Nov 17 2016, 10:57 PM

• ssastry moved this task from Needs Triage to html2wt on the Parsoid board.Dec 18 2017, 9:43 PM

Arlolra mentioned this in T196025: Identify editing-client DOM transformations that Parsoid needs to account for during DOM diffs.Jun 1 2018, 2:30 PM

• ssastry edited projects, added Parsoid-Edit-Support; removed Parsoid.Sep 20 2018, 4:28 PM

Aklapper edited projects, added Parsoid; removed Parsoid-Edit-Support.Feb 29 2020, 5:15 PM

In T144708#2612695, @ssastry wrote:

I wonder if it is the "" in the wikitext that Parsoid represents as "" that is the source of the problem. Inside VE, once entities are decoded, the two strings might show up as different.

The VE code that is affected is https://github.com/wikimedia/mediawiki-extensions-Cite/blob/master/modules/ve-cite/ve.dm.MWReferenceNode.js#L167-L171

In T144708#2804260, @ssastry wrote:

*bump* @Esanders, see comment above. Can you take a look?

@matmarex, can you investigate?

• ssastry changed the task status from Open to Stalled.Mar 10 2020, 12:25 AM

• ssastry moved this task from html2wt to Needs Investigation on the Parsoid board.

• ssastry claimed this task.Apr 6 2020, 4:52 PM

I am going to move this to Non-Parsoid Tasks column on the workboard till this is investigated on the VE end.

• ssastry moved this task from Needs Investigation to Non-Parsing-Team Tasks on the Parsoid board.Apr 6 2020, 9:42 PM

@ssastry Sorry, I didn't see your message in March…

Indeed, it appears that VE encodes the text of comments differently.

It seems that we should simply revert rGVED48f71d121b3f: Encode as few characters in comments as possible (https://gerrit.wikimedia.org/r/c/VisualEditor/VisualEditor/+/202711) to make them the same again.

Change 587363 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[VisualEditor/VisualEditor@master] Match entity encoding in HTML comments to Parsoid

https://gerrit.wikimedia.org/r/587363

gerritbot added a project: Patch-For-Review.Apr 7 2020, 10:21 PM

matmarex changed the task status from Stalled to Open.Apr 7 2020, 10:22 PM

matmarex moved this task from Needs Discussion / Investigation to Code Review on the Editing-team (Q3 2019-2020 Kanban Board) board.

It seems like the VE approach avoids encoding '-' in comments, which I imagine is quite common, and would make the wikitext comments a lot less readable. Shouldn't Parsoid be "fixed" instead?

I wrote the encoding code, and it was originally carefully the same in Parsoid and VE. The encoding was carefully designed to ensure that any valid string can be encoded as a comment, in exactly one way (to avoid exactly this sort of dirty diff problem). The parsing rules around dashes are pretty complicated (https://html.spec.whatwg.org/multipage/parsing.html#comment-state) and you have to consider also the parsing rules for comments in wikitext (which of course are similar but not identical).

In terms of what editors see, you are probably interested more in what the *wikitext* representation of this comment text looks like. That's not a VE patch, that's adjusting WTUtils::decodeComment in Parsoid. Rather long description in https://github.com/wikimedia/parsoid/blob/master/src/Utils/WTUtils.php#L631 and short summary is that we are already trying hard to avoid encoding dashes in wikitext.

In T144708#6039620, @Esanders wrote:

It seems like the VE approach avoids encoding '-' in comments, which I imagine is quite common, and would make the wikitext comments a lot less readable. Shouldn't Parsoid be "fixed" instead?

Independent of that, this is an interesting scenario. This is a html2html dirty diff when it goes through VE.

But, given what Ed flagged above, it seems like Parsoid clients should be insulated from knowing how entity encoding is done in Parsoid .. i.e. the html2html diffing should be done on decoded entities / comments and so, the specifics of how VE encodes its entities should not matter. Given that Parsoid already does diffing on the DOM, this is probably an edge case where in the recursive diffing on embedded HTML strings is somehow failing to properly decode entities.

So, maybe hold on to this patch and we'll see if we can figure this out.

• ssastry moved this task from Non-Parsing-Team Tasks to Needs Investigation on the Parsoid board.Apr 8 2020, 4:33 PM

No, VE should definitely fix its encoding; there are edge cases in HTML comment parsing which it doesn't account for and the thing its trying to do is broken anyway because the encoding in the wikitext is determined by Parsoid, not VE.

There may be improvements we can make in html diffing, but that's orthogonal.

For example, as described in https://github.com/wikimedia/parsoid/blob/master/src/Utils/WTUtils.php#L632, --!> is valid syntax to close an HTML comment. As it stands right now, if a user typed '--!>' as part of the text in a comment node in VE, VE would generate broken HTML to send back to Parsoid. The original comment encoding ensured that would not occur, and in fact it guaranteed that the user could type *anything they liked* as a comment node and it would get safely encoded as HTML, where it could be decoded in Parsoid back to *the literal text the user typed* and then properly encoded in wikitext as whatever we decided was the most pleasant way to do that. VE's job is to faithfully delivering the content the user typed to Parsoid, not to try to play with wikitext encodings.

I think we are effectively saying the same thing, but you also found a bug in VE's encoding. VE doesn't need to match Parsoid's encoding, just that it should properly encode HTML comments. If it turns out that it matches Parsoid's encoding, well and good. We should separately investigate if there is a problem in Parsoid's dom-diff algo when it comes to content embedded in data-mw attributes.

Well, there's no reason not to match Parsoid's encoding, and matching Parsoid's encoding would eliminate this bug. So I'm suggesting that the revert patch above should be applied to close this bug, then we can separately open a task for "more flexible dom-diff algorithm" if/when necessary. But that would (presumably) be a low priority issue once this regression is fixed.

In T144708#6044403, @cscott wrote:

Well, there's no reason not to match Parsoid's encoding, and matching Parsoid's encoding would eliminate this bug. So I'm suggesting that the revert patch above should be applied to close this bug, then we can separately open a task for "more flexible dom-diff algorithm" if/when necessary. But that would (presumably) be a low priority issue once this regression is fixed.

Works for me.

Change 587363 merged by jenkins-bot:
[VisualEditor/VisualEditor@master] Match entity encoding in HTML comments to Parsoid

https://gerrit.wikimedia.org/r/587363

Filed T249940: Audit whether DOMDiff code correctly handles encoded entities in embedded HTML strings in data-mw properly for Parsoid's DOMDiff code.

Restricted Application added a project: User-Ryasmeen. · View Herald TranscriptApr 10 2020, 5:38 PM

Change 588429 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (0217e5819)

https://gerrit.wikimedia.org/r/588429

Change 588429 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (0217e5819)