Page MenuHomePhabricator

Unexpected DOM diff on a <ref> node after an edit
Closed, ResolvedPublic0 Estimated Story Points

Description

See bug report which is reproducible.

I followed the DOM dumping instructions on the debugging page to get the DOM in VE before/after a trivial edit.

I also fetched the data-parsoid for the tid from RESTBase and saved it locally (and edited it to add a "parsoid" wrapper that the Parsoid API expects).

I then ran the following script:

parse.js --html2wt --dump dom:post-dom-diff --selser --pbinfile /tmp/dp --oldhtmlfile /tmp/orig.html --oldtextfile /tmp/wt < /tmp/new.html

Examining the dumped DOM with dom-diff annotations, I indeed find that the <ref> node has marked as having its wrapper modified. I looked at the HTML files and there seems to be some diffs in the DOM nodes. I am stopping here and filing this bug report. I don't know whether this is a VE issue (not sure why the DOM node changed, even if the content is identical), or whether this is an edge case inside Parsoid in terms of its expectations around how <ref> nodes are represented.

Event Timeline

Separately, looks like we need a helper script to do some of these steps (finding the tid, fetching data-parsoid from RB, massaging it by adding the external wrapper) for ease of debugging (since things are a bit more involved compared to how they were a couple years back).

This is a "bug" in Parsoid's dom-diff code it appears.

Looks like VE is inlining the HTML for the ref into the data-mw blob. Parsoid then treats the change in data-mw as an edit of the <ref> (the change is that data-mw in orig HTML looks like { id: ... } whereas the new data-mw looks like { id: ..., html: ...})

Parsoid's DOM diff could be smarter about ignoring the inlining of the data-mw.html for the mw:Extension/ref span node if the HTML has not actually changed.

But, separately, why is VE inlining the HTML for unedited <ref>s? Is this a new change?

Jdforrester-WMF set the point value for this task to 0.

The ideal fix here would be for VE not to inline the HTML for unedited <ref> nodes.

Otherwise, we'll have to do a rewrite of the data-mw equality code which will need special-case checks OR some amount of custom normalization for <ref> nodes for the existing code to work. https://github.com/wikimedia/parsoid/blob/master/lib/html2wt/DOMDiff.js#L65-L155 is the existing code.

But, meanwhile, I'll see if there is a simple normalization that I can do to preserve the cleanliness and maintainability of the existing code.

Separately, I am surprised that a lot more edits aren't being dirtied if VE does this kind of inlining routinely -- so, this leads me to suspect that this is either something new, or VE has hit some edge case that is causing this.

Ideally VE wouldn't have to work out where to put the reference body at all (i.e. first instance of the reference, unless explicitly defined otherwise in the wikitext) and Parsoid would handle all that for us.

I wonder if it is the "<!--|accessdate=2012-02-03 -->" in the wikitext that Parsoid represents as "<!--|accessdate=2012&#x2D;02&#x2D;03 -->" that is the source of the problem. Inside VE, once entities are decoded, the two strings might show up as different.

The VE code that is affected is https://github.com/wikimedia/mediawiki-extensions-Cite/blob/master/modules/ve-cite/ve.dm.MWReferenceNode.js#L167-L171

*bump* @Esanders, see comment above. Can you take a look?

I wonder if it is the "<!--|accessdate=2012-02-03 -->" in the wikitext that Parsoid represents as "<!--|accessdate=2012&#x2D;02&#x2D;03 -->" that is the source of the problem. Inside VE, once entities are decoded, the two strings might show up as different.

The VE code that is affected is https://github.com/wikimedia/mediawiki-extensions-Cite/blob/master/modules/ve-cite/ve.dm.MWReferenceNode.js#L167-L171

*bump* @Esanders, see comment above. Can you take a look?

@matmarex, can you investigate?

ssastry changed the task status from Open to Stalled.Mar 10 2020, 12:25 AM
ssastry moved this task from html2wt to Needs Investigation on the Parsoid board.

I am going to move this to Non-Parsoid Tasks column on the workboard till this is investigated on the VE end.

Change 587363 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[VisualEditor/VisualEditor@master] Match entity encoding in HTML comments to Parsoid

https://gerrit.wikimedia.org/r/587363

It seems like the VE approach avoids encoding '-' in comments, which I imagine is quite common, and would make the wikitext comments a lot less readable. Shouldn't Parsoid be "fixed" instead?

I wrote the encoding code, and it was originally carefully the same in Parsoid and VE. The encoding was carefully designed to ensure that any valid string can be encoded as a comment, in exactly one way (to avoid exactly this sort of dirty diff problem). The parsing rules around dashes are pretty complicated (https://html.spec.whatwg.org/multipage/parsing.html#comment-state) and you have to consider also the parsing rules for comments in wikitext (which of course are similar but not identical).

In terms of what editors see, you are probably interested more in what the *wikitext* representation of this comment text looks like. That's not a VE patch, that's adjusting WTUtils::decodeComment in Parsoid. Rather long description in https://github.com/wikimedia/parsoid/blob/master/src/Utils/WTUtils.php#L631 and short summary is that we are already trying hard to avoid encoding dashes in wikitext.

It seems like the VE approach avoids encoding '-' in comments, which I imagine is quite common, and would make the wikitext comments a lot less readable. Shouldn't Parsoid be "fixed" instead?

Independent of that, this is an interesting scenario. This is a html2html dirty diff when it goes through VE.

But, given what Ed flagged above, it seems like Parsoid clients should be insulated from knowing how entity encoding is done in Parsoid .. i.e. the html2html diffing should be done on decoded entities / comments and so, the specifics of how VE encodes its entities should not matter. Given that Parsoid already does diffing on the DOM, this is probably an edge case where in the recursive diffing on embedded HTML strings is somehow failing to properly decode entities.

So, maybe hold on to this patch and we'll see if we can figure this out.

No, VE should definitely fix its encoding; there are edge cases in HTML comment parsing which it doesn't account for and the thing its trying to do is broken anyway because the encoding in the wikitext is determined by Parsoid, not VE.

There may be improvements we can make in html diffing, but that's orthogonal.

For example, as described in https://github.com/wikimedia/parsoid/blob/master/src/Utils/WTUtils.php#L632, --!> is valid syntax to close an HTML comment. As it stands right now, if a user typed '--!>' as part of the text in a comment node in VE, VE would generate broken HTML to send back to Parsoid. The original comment encoding ensured that would not occur, and in fact it guaranteed that the user could type *anything they liked* as a comment node and it would get safely encoded as HTML, where it could be decoded in Parsoid back to *the literal text the user typed* and then properly encoded in wikitext as whatever we decided was the most pleasant way to do that. VE's job is to faithfully delivering the content the user typed to Parsoid, not to try to play with wikitext encodings.

I think we are effectively saying the same thing, but you also found a bug in VE's encoding. VE doesn't need to match Parsoid's encoding, just that it should properly encode HTML comments. If it turns out that it matches Parsoid's encoding, well and good. We should separately investigate if there is a problem in Parsoid's dom-diff algo when it comes to content embedded in data-mw attributes.

Well, there's no reason not to match Parsoid's encoding, and matching Parsoid's encoding would eliminate this bug. So I'm suggesting that the revert patch above should be applied to close this bug, then we can separately open a task for "more flexible dom-diff algorithm" if/when necessary. But that would (presumably) be a low priority issue once this regression is fixed.

Well, there's no reason not to match Parsoid's encoding, and matching Parsoid's encoding would eliminate this bug. So I'm suggesting that the revert patch above should be applied to close this bug, then we can separately open a task for "more flexible dom-diff algorithm" if/when necessary. But that would (presumably) be a low priority issue once this regression is fixed.

Works for me.

Change 587363 merged by jenkins-bot:
[VisualEditor/VisualEditor@master] Match entity encoding in HTML comments to Parsoid

https://gerrit.wikimedia.org/r/587363

Change 588429 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (0217e5819)

https://gerrit.wikimedia.org/r/588429

Change 588429 merged by jenkins-bot:
[mediawiki/extensions/VisualEditor@master] Update VE core submodule to master (0217e5819)

https://gerrit.wikimedia.org/r/588429