
Tracking task for addressing HTML string diffs between Parsoid/JS & Parsoid/PHP
Closed, Resolved (Public)

Description

Based on initial testing (see T234697#5556685), I found a bunch of diffs.

Of these, the following just need to be normalized away:

  • Parsoid/JS emitting ,null,null in DSR in some cases where Parsoid/PHP emits ,0,0, plus related diffs. The test script currently normalizes these to hide the diffs (see T231570)
  • data-mw template params sorted differently in JS & PHP
  • Any attributes that are generated and aren't deterministic (certain title tags generated by extensions / templates, about ids in Parsoid which are currently normalized away)
  • On a Wikivoyage test page, Parsoid/PHP output fixes a <maplink> problem present in Parsoid/JS output. Parsoid/JS emitted ?'"UNIQ--maplink-00000000-QINU"'? whereas Parsoid/PHP emits a <maplink> tag. So, there is nothing to fix here, but we may want to normalize this away and treat it as an acceptable diff (<div class="magnify" title="Enlarge map"> is the wrapper that needs to be stripped out).
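
A minimal sketch of what the first and third normalizations in the list above could look like in a Node-based comparison script. The function names are hypothetical, this is not the code in the actual test script, and it assumes the single-quoted data-parsoid serialization seen in Parsoid output, where the JSON double quotes are not entity-escaped. Treating a null DSR width as 0 is itself an assumption made only for the purpose of comparison.

```javascript
// Hypothetical helpers for illustration only.

// Rewrite null open/close widths (positions 2 and 3 of a dsr array) to 0,
// so that [s,e,null,null] and [s,e,0,0] compare as equal strings.
function normalizeDsr(html) {
  return html.replace(/"dsr":\[([^\]]*)\]/g, (match, body) => {
    const parts = body.split(',').map((v, i) => {
      const t = v.trim();
      return i >= 2 && t === 'null' ? '0' : t;
    });
    return `"dsr":[${parts.join(',')}]`;
  });
}

// Replace generated about ids (#mwt1, #mwt23, ...) with a fixed placeholder.
function normalizeAboutIds(html) {
  return html.replace(/about="#mwt\d+"/g, 'about="#mwtX"');
}
```

Both helpers would be applied to the JS and the PHP output before diffing, so that only the remaining differences are reported.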

The following need to be addressed in some form:

  • <head> diffs - addressed by https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/521320 -- blocked on review
  • <body> missing lang=".." attribute -- blocked on @cscott finishing up the language variant code
  • <video> source tags use https in Parsoid/JS and http in Parsoid/PHP. @cscott is handling this
  • {{DEFAULTSORT:....}} renders as a <meta> tag in Parsoid/JS and <span typeof="mw:Transclusion ..> in Parsoid/PHP (See T235004)

None of these seem big, but they need to be addressed. I will file individual tasks for each instance.

Event Timeline

ssastry triaged this task as Medium priority. Oct 8 2019, 5:50 PM
  • <video> source tags use https in Parsoid/JS and http in Parsoid/PHP

This is an artefact of the testing process. The test requested http URLs, so the Parsoid/PHP output had http media URLs. Parsoid/JS, however, is configured to issue https requests to the API, so we get back https media URLs. So, this is not a real diff. We should issue https requests. But, if we cannot issue https requests internally (from RESTBase to Parsoid), we will need a new config setting, set to the MediaWiki protocol setting, that DataAccess uses independent of what PROTO_CURRENT is.

data-mw template params sorted differently in JS & PHP

See output below. So, Parsoid/PHP uses the parameter order found in the source transclusion, but Parsoid/JS doesn't preserve the ordering of keys in the object. Not sure if there is anything we should do OR just normalize this during testing.

[subbu@earth:~/work/wmf/parsoid] echo "{{foo|x|y|z|abcd=pqrs|xyz=ier|p}}" | php bin/parse.php --body_only
<p about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"dsr":[0,33,0,0],"pi":[[{"k":"1"},{"k":"2"},{"k":"3"},{"k":"abcd","named":true},{"k":"xyz","named":true},{"k":"4"}]]}' data-mw='{"parts":[{"template":{"target":{"wt":"foo","href":"./Template:Foo"},"params":{"1":{"wt":"x"},"2":{"wt":"y"},"3":{"wt":"z"},"abcd":{"wt":"pqrs"},"xyz":{"wt":"ier"},"4":{"wt":"p"}},"i":0}}]}'>Sample template content.
</p>

[subbu@earth:~/work/wmf/parsoid] echo "{{foo|x|y|z|abcd=pqrs|xyz=ier|p}}" | node bin/parse.js --body_only
<p about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"dsr":[0,33,0,0],"pi":[[{"k":"1"},{"k":"2"},{"k":"3"},{"k":"abcd","named":true},{"k":"xyz","named":true},{"k":"4"}]]}' data-mw='{"parts":[{"template":{"target":{"wt":"foo","href":"./Template:Foo"},"params":{"1":{"wt":"x"},"2":{"wt":"y"},"3":{"wt":"z"},"4":{"wt":"p"},"abcd":{"wt":"pqrs"},"xyz":{"wt":"ier"}},"i":0}}]}'>Sample template content.
</p>
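
If this ends up being normalized during testing, one option is to compare the data-mw JSON in a canonical key order rather than as raw strings. The sketch below is an assumption about how that could be done and is not taken from the diffing script; `sortKeys` and `canonicalJson` are hypothetical names. Note that JavaScript objects always enumerate integer-like keys ("1", "2", "4") first in ascending order, ahead of other string keys, which matches the Parsoid/JS output above.

```javascript
// Hypothetical helpers for illustration only.

// Recursively rebuild a parsed JSON value so that object keys are inserted
// in sorted order, giving a deterministic serialization for both outputs.
function sortKeys(value) {
  if (Array.isArray(value)) {
    return value.map(sortKeys);
  }
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const key of Object.keys(value).sort()) {
      out[key] = sortKeys(value[key]);
    }
    return out;
  }
  return value;
}

// Parse a JSON attribute value and re-serialize it in canonical key order.
function canonicalJson(jsonText) {
  return JSON.stringify(sortKeys(JSON.parse(jsonText)));
}
```

Running `canonicalJson` over the data-mw attribute from each parser gives identical strings when the two differ only in parameter order.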

{{DEFAULTSORT:....}} renders as a <meta> tag in Parsoid/JS and <span typeof="mw:Transclusion ..> in Parsoid/PHP

Filed T235004: {{DEFAULTSORT: ... }} renders differently in Parsoid/PHP compared to Parsoid/JS for this.

ssastry renamed this task from Placeholder task for HTML string diffs between Parsoid/JS & Parsoid/PHP to Tracking task for addressing HTML string diffs between Parsoid/JS & Parsoid/PHP. Oct 8 2019, 9:39 PM
ssastry updated the task description.
ssastry added a subscriber: cscott.

More results. So, we will need a few more normalizations.

ssastry@scandium:/srv/deployment/parsoid/deploy/src/bin$ node diff.html.js it.wikipedia.org Luna
it.wikipedia.org:Luna: NO HTML DIFFS FOUND!
ssastry@scandium:/srv/deployment/parsoid/deploy/src/bin$ node diff.html.js de.wikipedia.org Bretagne
DIFFS FOR de.wikipedia.org:Bretagne
----- JS:[261374, 261544] -----
<ol class="mw-references references" typeof="mw:Extension/references" about="#mwtX" data-parsoid='{"dsr":[58908,58922,14,0]}' data-mw='{"name":"references","attrs":{}}'>
 
+++++ PHP:[261374, 261617] +++++
<div class="mw-references-wrap mw-references-columns" typeof="mw:Extension/references" about="#mwtX" data-parsoid='{"dsr":[58908,58922,14,0]}' data-mw='{"name":"references","attrs":{}}'>
<ol class="mw-references references" data-parsoid="{}">
 
----- JS:[276504, 276504] -----
 
+++++ PHP:[276577, 276584] +++++
</div>
 
ssastry@scandium:/srv/deployment/parsoid/deploy/src/bin$ node diff.html.js en.wikipedia.org Berlin
DIFFS FOR en.wikipedia.org:Berlin
----- JS:[39420, 39556] -----
<span about="#mwtX">
 
</span>
<link rel="mw:PageProp/Category" href="./Category:Pages_using_the_Kartographer_extension" about="#mwtX"/>
 
+++++ PHP:[39420, 39420] +++++
 
ssastry@scandium:/srv/deployment/parsoid/deploy/src/bin$ node diff.html.js en.wikipedia.org Barack_Obama
DIFFS FOR en.wikipedia.org:Barack_Obama
----- JS:[328930, 329321] -----
<video poster="//upload.wikimedia.org/wikipedia/commons/thumb/a/a5/20090124_WeeklyAddress.ogv/220px-seek%3D63-20090124_WeeklyAddress.ogv.jpg" controls="" preload="none" height="124" width="220" resource="./File:20090124_WeeklyAddress.ogv" data-parsoid='{"a":{"height":"124","width":"220","resource":"./File:20090124_WeeklyAddress.ogv"},"sa":{"resource":"File:20090124 WeeklyAddress.ogv"}}'>
 
+++++ PHP:[328930, 329312] +++++
<video poster="//upload.wikimedia.org/wikipedia/commons/thumb/a/a5/20090124_WeeklyAddress.ogv/220px--20090124_WeeklyAddress.ogv.jpg" controls="" preload="none" height="124" width="220" resource="./File:20090124_WeeklyAddress.ogv" data-parsoid='{"a":{"height":"124","width":"220","resource":"./File:20090124_WeeklyAddress.ogv"},"sa":{"resource":"File:20090124 WeeklyAddress.ogv"}}'>

Change 542227 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] First pass creating a Parsoid/JS & Parsoid/PHP HTML diffing script

https://gerrit.wikimedia.org/r/542227

Tracking categories emitted by templates are handled differently in Parsoid/PHP and Parsoid/JS. A number of tracking categories found in Parsoid/JS output aren't found in Parsoid/PHP output (e.g. Pages using Kartographer extension, Pages using Maps, Pages with script errors, etc.). Parsoid/PHP output seems to match core parser output here. So, maybe Parsoid/JS is missing some config setting.

And in some cases, Parsoid/PHP emits a hatnote instead of the tracking category (see below) since it thinks we are in preview mode. So, maybe Parsoid/PHP is missing some config setting as well.

ssastry@scandium:/srv/deployment/parsoid/deploy/src/bin$ node diff.html.js diff.yaml en.wikipedia.org Aam_Aadmi_Party
DIFFS FOR en.wikipedia.org:Aam_Aadmi_Party
----- JS:[30194, 30573] -----
<link rel="mw:PageProp/Category" href="./Category:Pages_using_infobox_Indian_political_party_with_unknown_parameters#no_statesAam%20Aadmi%20Party" about="#mwtX" data-parsoid='{"a":{"href":"./Category:Pages_using_infobox_Indian_political_party_with_unknown_parameters"},"sa":{"href":"Category:Pages using infobox Indian political party with unknown parameters"},"stx":"piped"}'/>

+++++ PHP:[30194, 30563] +++++
<div class="hatnote" style="color:red" about="#mwtX" data-parsoid='{"stx":"html"}'>
<strong>Warning:
</strong> Page using 
<a rel="mw:WikiLink" href="./Template:Infobox_Indian_political_party" title="Template:Infobox Indian political party">Template:Infobox Indian political party
</a> with unknown parameter "no_states" (this message is shown only in preview).
</div>

These diffs should be normalized away

ssastry@scandium:/srv/deployment/parsoid/deploy/src/bin$ node diff.html.js diff.yaml fr.wikipedia.org Paris
DIFFS FOR fr.wikipedia.org:Paris
...
----- JS:[94991, 95143] -----
<li class="gallerybox" style="width: 477.3333333333333px;" data-parsoid="{}">
<div class="thumb" style="width: 475.3333333333333px;" data-parsoid="{}">

+++++ PHP:[94895, 95043] +++++
<li class="gallerybox" style="width: 477.33333333333px;" data-parsoid="{}">
<div class="thumb" style="width: 475.33333333333px;" data-parsoid="{}">

There's a phab task to round these, in both core/legacy and Parsoid: T229594: Eliminate fractional pixel widths in Gallery packed-* modes.
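
Until that rounding lands, the gallery width diffs above could be hidden on the test side by rounding fractional pixel values before comparing. This is a sketch under that assumption, not the approach in T229594 and not code from the diffing script; `roundPixelWidths` and its default of 2 decimal places are hypothetical choices.

```javascript
// Hypothetical helper for illustration only.

// Round every fractional "<number>px" value to a fixed number of decimals,
// so that float serialization differences between JS and PHP disappear.
function roundPixelWidths(html, digits = 2) {
  return html.replace(/(\d+\.\d+)px/g, (match, num) => {
    return `${Number(num).toFixed(digits)}px`;
  });
}
```

Integer pixel values are left untouched because the pattern only matches numbers with a fractional part.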

Change 542227 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Create a Parsoid/JS & Parsoid/PHP HTML diffing script

https://gerrit.wikimedia.org/r/542227