Page MenuHomePhabricator

Dirty diff on test.wikipedia.org
Closed, ResolvedPublic

Description

See https://test.wikipedia.org/w/index.php?title=Wikip%C3%A9dia%3AFus%C3%A3o%2FCentral_de_fus%C3%B5es%2FBiblioteca_virtual%3B_Biblioteca_digital&type=revision&diff=409033&oldid=295399

The 3rd line from the bottom shows a dirty diff where the attribute order got reversed.

To investigate:

  • Rule out a RESTBase issue wrt geting old HTML
  • Check if this is a DOM Diff issue
  • Why did normal html2wt switch around attributes?

Event Timeline

ssastry triaged this task as High priority.Nov 19 2019, 3:54 PM
ssastry created this task.

Okay, looks like this is readily reproducible with a simple wikitext string which is <font face="Gotic" size="3">foo</font>.

Same wikitext on enwiki and testwiki.

  1. Open page in VE
  2. Add a period at the end of the sentence.
  3. Click 'Publish Changes' and 'Review your changes' (No need to save)

You will see that on enwiki, you will see no dirty diffs. On testwiki, you will see dirty diffs (the attributes are reordered).

Here isa dump of Parsoid generated HTML from those 2 wikis with the <head> stripped. For test.wikipedia.org, I fetch from Parsoid/PHP. For en.wikipedia.org, I fetch from Parsoid/JS. As you see, the output is identical. More imortantly, you see that the <font> attributes are in original wikitext order.

ssastry@scandium:~$ curl -x scandium.eqiad.wmnet:80 "http://test.wikipedia.org/w/rest.php/test.wikipedia.org/v3/page/html/User:SSastry_(WMF)%2Fsandbox/409048"
<!DOCTYPE html>... skipped header ...<body data-parsoid='{"dsr":[0,60,0,0]}' lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><section data-mw-section-id="0" data-parsoid="{}"></section><section data-mw-section-id="1" data-parsoid="{}"><h2 id="Testing_T238665" data-parsoid='{"dsr":[0,21,2,2]}'>Testing T238665</h2>
<p data-parsoid='{"dsr":[22,60,0,0]}'><font face="Gotic" size="3" data-parsoid='{"stx":"html","dsr":[22,60,28,7]}'>foo</font></p></section></body></html>

ssastry@scandium:~$ curl "http://localhost:8142/en.wikipedia.org/v3/page/html/User:SSastry_(WMF)%2FVE_Test%2FSandbox/927022357"
<!DOCTYPE html>... skipped header ...<body data-parsoid='{"dsr":[0,60,0,0]}' lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><section data-mw-section-id="0" data-parsoid="{}"></section><section data-mw-section-id="1" data-parsoid="{}"><h2 id="Testing_T238665" data-parsoid='{"dsr":[0,21,2,2]}'>Testing T238665</h2>
<p data-parsoid='{"dsr":[22,60,0,0]}'><font face="Gotic" size="3" data-parsoid='{"stx":"html","dsr":[22,60,28,7]}'>foo</font></p></section></body></html>

Here is a dump of Parsoid-generated HML as fetched by RESTBase for these two pages. I've stripped out the header from both. They are identical as before. And, the <font> attributes are in the original wikitext order in RESTBase as well. So far so good.

[subbu@earth:~/work/wmf/parsoid] curl "https://test.wikipedia.org/api/rest_v1/page/html/User:SSastry_(WMF)%2Fsandbox"
<body id="mwAA" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><section data-mw-section-id="0" id="mwAQ"></section><section data-mw-section-id="1" id="mwAg"><h2 id="Testing_T238665">Testing T238665</h2>
<p id="mwAw"><font face="Gotic" size="3" id="mwBA">foo</font></p></section></body>

[subbu@earth:~/work/wmf/parsoid] curl "https://en.wikipedia.org/api/rest_v1/page/html/User:SSastry_(WMF)%2FVE_Test%2FSandbox"
<body id="mwAA" lang="en" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr"><section data-mw-section-id="0" id="mwAQ"></section><section data-mw-section-id="1" id="mwAg"><h2 id="Testing_T238665">Testing T238665</h2>
<p id="mwAw"><font face="Gotic" size="3" id="mwBA">foo</font></p></section></body>

And, now here is the dump of HTML that VE loads. (Obained by following instructions @ https://www.mediawiki.org/wiki/Parsoid/Debugging#Dumping_HTML_DOM_in_VE )
From en.wikipedia.org:

<body id="mwAA" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr" lang="en"><h2 id="Testing_T238665" data-mw-section-id="1">Testing T238665</h2>
<p id="mwAw"><font id="mwBA" size="3" face="Gotic">foo</font></p></body>

From test.wikipedia.org:

<body id="mwAA" class="mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output" dir="ltr" lang="en"><h2 id="Testing_T238665" data-mw-section-id="1">Testing T238665</h2>
<p id="mwAw"><font id="mwBA" size="3" face="Gotic">foo</font></p></body>

The VE HTML is identical in both cases as well. BUT, note that VE has reversed the order of the <font> attributes when it loads the DOM. So, normally that would have explained this bug Except, with en.wikipedia.org that uses Parsoid/JS, there is no dirty diff. The dirty diff only manifests on Parsoid/PHP. So, why / how does Parsoid/JS not dirty this page?

This transcript below simulates the effect of VE's reordering. And, mysteriously enough, Parsoid/JS hides the reordering whereas Parsoid/PHP does not. Okay, now to figure out whether this is an intentional Parsoid/JS feature or an accidental Parsoid/JS bug.

[subbu@earth:~/work/wmf/parsoid] cat /tmp/wt
== Testing T238665 ==
<font face="Gotic" size="3">foo</font>

[subbu@earth:~/work/wmf/parsoid] cat /tmp/test.js.html
<h2 id="Testing_T238665" data-parsoid='{"dsr":[0,21,2,2]}'>Testing T238665</h2>
<p data-parsoid='{"dsr":[22,60,0,0]}'><font face="Gotic" size="3" data-parsoid='{"stx":"html","dsr":[22,60,28,7]}'>foo</font></p>

[subbu@earth:~/work/wmf/parsoid] cat /tmp/test.edited.js.html
<h2 id="Testing_T238665" data-parsoid='{"dsr":[0,21,2,2]}'>Testing T238665</h2>
<p data-parsoid='{"dsr":[22,60,0,0]}'><font size="3" face="Gotic" data-parsoid='{"stx":"html","dsr":[22,60,28,7]}'>foo</font></p>

[subbu@earth:~/work/wmf/parsoid] node bin/parse.js --selser --oldtextfile /tmp/wt --oldhtmlfile /tmp/test.js.html < /tmp/test.edited.js.html
== Testing T238665 ==
<font face="Gotic" size="3">foo</font>

[subbu@earth:~/work/wmf/parsoid] php bin/parse.php --selser --oldtextfile /tmp/wt --oldhtmlfile /tmp/test.js.html < /tmp/test.edited.js.html 
== Testing T238665 ==
<font size="3" face="Gotic">foo</font>
ssastry added subscribers: DLynch, Esanders, matmarex.

@Esanders @matmarex @DLynch why does VE reorder attributes on the <font> tag? Or is this a more general thing that is happening with VE when it loads the Parsoid DOM? Is this accidental and easy to fix? Should Parsoid/PHP try to suppress this?

@Esanders @matmarex @DLynch why does VE reorder attributes on the <font> tag? Or is this a more general thing that is happening with VE when it loads the Parsoid DOM? Is this accidental and easy to fix? Should Parsoid/PHP try to suppress this?

Never mind. Independent of why VE reordered the attributes, the ordering doesn't matter wrt HTML semantics. So, Parsoid should ignore ordering differences when computing DOM diffs and serializing. I bet this is a difference between the PHP libxml DOM and JS domino.

So, indeed, Parsoid/JS disregards the ordering differences between the two whereas Parsoid/PHP doesn't. See DOM diff output below. So, this does indeed turn out to be a dom-diffing issue as suspected in the description.

[subbu@earth:~/work/wmf/parsoid] bin/domdiff.test.js /tmp/test.js.html /tmp/test.edited.js.html 
<body><h2 id="Testing_T238665" data-parsoid='{"dsr":[0,21,2,2]}'>Testing T238665</h2>
<p data-parsoid='{"dsr":[22,60,0,0]}'><font size="3" face="Gotic" data-parsoid='{"stx":"html","dsr":[22,60,28,7]}'>foo</font></p>
</body>

[subbu@earth:~/work/wmf/parsoid] php bin/domdiff.test.php /tmp/test.js.html /tmp/test.edited.js.html 
<body><h2 id="Testing_T238665" data-parsoid='{"dsr":[0,21,2,2],"tmp":{"isNew":false}}'>Testing T238665</h2>
<p data-parsoid='{"dsr":[22,60,0,0],"tmp":{"isNew":false}}' data-parsoid-diff='{"id":-1,"diff":["children-changed","subtree-changed"]}'><font size="3" face="Gotic" data-parsoid='{"stx":"html","dsr":[22,60,28,7],"tmp":{"isNew":false}}' data-parsoid-diff='{"id":-1,"diff":["modified-wrapper"]}'>foo</font></p>
</body>

So, to answer the questions in the description,

  • This is not a RESTBase issue
  • This is a difference in Parsoid/PHP and Parsoid/JS dom-diffing wrt reordered HTML attributes. VE reorders HTML attributes
  • html2wt (both Parsoid/JS & Parsoid/PHP) respects the string order of the attributes found in the HTML (as demonstrated below). The only reason Parsoid/JS doesn't dirty diff is because dom-diff seems to be normalizing attribute orders in Parsoid/JS but not Parsoid/PHP.
[subbu@earth:~/work/wmf/parsoid] echo "<font size="3" face="Gotic">x</font>" | php bin/parse.php --html2wt
<font size="3" face="Gotic">x</font>

[subbu@earth:~/work/wmf/parsoid] echo "<font size="3" face="Gotic">x</font>" | node bin/parse.js --html2wt
<font size="3" face="Gotic">x</font>

[subbu@earth:~/work/wmf/parsoid] echo "<font face="Gotic" size="3">x</font>" | php bin/parse.php --html2wt
<font face="Gotic" size="3">x</font>

[subbu@earth:~/work/wmf/parsoid] echo "<font face="Gotic" size="3">x</font>" | node bin/parse.js --html2wt
<font face="Gotic" size="3">x</font>

Looks like we have had code to ignore attribute order during DOM diffing since April 2013 ( first introduced in b5beb51e144e16b3cf1b89977e2bbd3a75b03bae ). So, this might just be a simple porting bug.

Change 551944 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] WIP: Use sort, not asort to sort the attribute key array

https://gerrit.wikimedia.org/r/551944

Change 551944 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Use sort, not asort to sort the attribute key array

https://gerrit.wikimedia.org/r/551944