Investigate source of non-determinism in rt
Closed, Resolved (Public)

Description

Try node bin/roundtrip-test.js --prefix dewiki Pentaquark

@ssastry says,

I expect it is because some element/citation/ref id is different in the second round, which throws off the diffing code.

Event Timeline

Arlolra triaged this task as Medium priority. Nov 23 2016, 6:51 PM
{{Literatur|arxiv=1507.03414|Titel=Observation of J/ψp resonances consistent with pentaquark states in <math>\Lambda_b^0 \to J/\psi K^- p</math> decays|Jahr=2015-07-13|Sprache=en|Sammelwerk=Phys. Rev. Lett.| Band= 115 |Seiten= 072001}}

The "Literatur" template invokes "Modul:Vorlage:Literatur", which requires "Modul:Zitation", which uses Titel for string interpolation in Zitation.COinS. I'm not sure whose job it is to remove the strip markers, but that leads to,

"&amp;rft.atitle=Observation+of+J%2F%CF%88p+resonances+consistent+with+pentaquark+states+in+%7F%27%22%60UNIQ--math-00000003-QINU%60%22%27%7F+decays"

and the "UNIQ--math-00000003-QINU" varies between parse requests.

I'm not sure whose job it is to remove the strip markers

T133477#2234972 says the module should be calling mw.text.killMarkers
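
Until the module calls mw.text.killMarkers, the test side can mask the markers before comparing. A minimal sketch of that idea follows; it is not the code from the Parsoid patch. The helper name and the STRIP-MARKER placeholder are hypothetical, the marker shape is inferred from the sample above, and the second input is a made-up value standing in for a later parse.

```javascript
// Hypothetical helper: mask leaked strip markers so two parses compare equal.
// Covers the raw form (\x7f'"`UNIQ--...-QINU`"'\x7f) and the percent-encoded
// form that shows up in the COinS string.
function maskStripMarkers(html) {
  const raw = /\x7f'"`UNIQ--[\w-]*?-QINU`"'\x7f/g;
  const encoded = /%7F%27%22%60UNIQ--[\w-]*?-QINU%60%22%27%7F/gi;
  return html.replace(raw, 'STRIP-MARKER').replace(encoded, 'STRIP-MARKER');
}

// First value is from the output above; the second is an assumed later parse.
const firstParse = 'in+%7F%27%22%60UNIQ--math-00000003-QINU%60%22%27%7F+decays';
const secondParse = 'in+%7F%27%22%60UNIQ--math-0000000A-QINU%60%22%27%7F+decays';
console.log(maskStripMarkers(firstParse) === maskStripMarkers(secondParse));
```

Masking only hides the symptom in the comparison; the marker still leaks into the rendered COinS metadata.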

Change 422345 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Eliminate a source of indeterminacy from leaked strip markers

https://gerrit.wikimedia.org/r/422345

Change 422350 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Normalize away unnecessary attributes in data-mw.html too

https://gerrit.wikimedia.org/r/422350

Change 422345 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Eliminate a source of indeterminacy from leaked strip markers

https://gerrit.wikimedia.org/r/422345

Change 422350 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Normalize away unnecessary attributes in data-mw.html too

https://gerrit.wikimedia.org/r/422350

For https://parsoid-rt-tests.wikimedia.org/resultFlagNew/16ced345bee068b73f8e129dd66a8dea3bee0cc3/ce4ffddfd42db404ee1f405a5abb583f5a732d61/dewiki/Robin%20Thicke

The template {{Zukunft|2018|3}} on dewiki probably uses the {{CURRENTTIMESTAMP}} variable to produce the sort key(?) for this category, [[Kategorie:Wikipedia:Veraltet nach März 2018| 20180328203041]], which Parsoid outputs as,

<link rel="mw:PageProp/Category" href="./Kategorie:Wikipedia:Veraltet_nach_März_2018#%2020180328203041" data-parsoid='{"stx":"piped","a":{"href":"./Kategorie:Wikipedia:Veraltet_nach_März_2018"},"sa":{"href":"Kategorie:Wikipedia:Veraltet nach März 2018"},"dsr":[0,63,null,null]}'/>

Whenever there's this variability in the content, roundtrip-test.js will automatically fail the "quick" semantic diff test,
https://github.com/wikimedia/parsoid/blob/master/bin/roundtrip-test.js#L407

after which we have to rely on simplediff to give us comparable ranges to test. Unfortunately, the ranges it picks are not always well aligned. For example, an extra newline here produces two diffs, each of which is trivially semantically different.

wt1 "|style=vertical-align:top|\n"
wt2 "|style=vertical-align:top|\n'''[[Goldene Schallplatte|Platin-Schallplatte]]'''\n"
@@ -1,0 +1,1 @@
+<p><b><a href="Goldene_Schallplatte" title="Goldene Schallplatte">Platin-Schallplatte</a></b></p>
wt1 "'''[[Goldene Schallplatte|Platin-Schallplatte]]'''\n"
wt2 ""
@@ -1,1 +1,0 @@
-<p><b><a href="Goldene_Schallplatte" title="Goldene Schallplatte">Platin-Schallplatte</a></b></p>

Maybe consider using a word diff?
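
To make the word diff idea concrete, here is a rough sketch of a token-level diff built on a longest-common-subsequence table. The function names are hypothetical, this is not what simplediff or roundtrip-test.js does, and the two inputs are an assumed reconstruction of the "extra newline" case rather than the real wikitext.

```javascript
// Split into words and single whitespace characters, keeping the whitespace.
function tokenize(text) {
  return text.split(/(\s)/).filter((t) => t !== '');
}

// Hypothetical word-level diff: returns ['-', token] / ['+', token] pairs.
function wordDiff(oldText, newText) {
  const a = tokenize(oldText);
  const b = tokenize(newText);
  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  const changes = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) { i++; j++; }
    else if (lcs[i + 1][j] >= lcs[i][j + 1]) { changes.push(['-', a[i++]]); }
    else { changes.push(['+', b[j++]]); }
  }
  while (i < a.length) changes.push(['-', a[i++]]);
  while (j < b.length) changes.push(['+', b[j++]]);
  return changes;
}

// Assumed inputs: the second has one extra newline before the bold link.
const wtOld = "|style=vertical-align:top|\n'''[[Goldene Schallplatte|Platin-Schallplatte]]'''\n";
const wtNew = "|style=vertical-align:top|\n\n'''[[Goldene Schallplatte|Platin-Schallplatte]]'''\n";
console.log(JSON.stringify(wordDiff(wtOld, wtNew)));
```

With tokens this fine, the extra newline shows up as a single inserted token instead of a whole line added in one range and removed in another. The table is quadratic in the number of tokens, so this would need a cheaper algorithm for full pages.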

http://localhost:8000/es.wikipedia.org/v3/page/html/Cleveland_Cavaliers/106401442 has duplicate template arguments, which get normalized in the first roundtrip,

-
|propietario     = 
|propietario     = [[Dan Gilbert]]
+
|propietario     =[[Dan Gilbert]]

which means that only the first time around will we output the category,

[ '-',
  [ '<span> </span><link rel="mw:PageProp/Category" href="./Categoría:Wikipedia:Páginas_con_plantillas_con_argumentos_duplicados"/>\n' ] ],

Change 422583 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Follow up to e034960 w/ the &apos; variant

https://gerrit.wikimedia.org/r/422583

https://en.wikipedia.org/wiki/Brazil_at_the_2016_Summer_Olympics uses a template that invokes Module:Sports_table, which has the following,

				-- Now define the identifier for this
				note_id = '"table_note_'..team_code_ii..rand_val..'"' -- Add random end for unique ID if more tables are present on article (which might otherwise share an ID)
				note_id_list[team_code_ii] = note_id

where rand_val is defined as,

	-- Random value used for uniqueness
	math.randomseed( os.clock() * 10^8 )
	local rand_val = math.random()

which results in cite ids like,

[ '-',
  [ '<td>8<sup class="mw-ref" id="cite_ref-table_hth_GER0.75888191385143_67-0" rel="dc:references" typeof="mw:Extension/ref" data-mw=\'{"name":"ref","body":{"id":"mw-reference-text-cite_note-table_hth_GER0.75888191385143-67"},"attrs":{"group":"lower-alpha","name":"table_hth_GER0.75888191385143"}}\'><a href="./Brazil_at_the_2016_Summer_Olympics#cite_note-table_hth_GER0.75888191385143-67" data-mw-group="lower-alpha"><span class="mw-reflink-text">[lower-alpha 1]</span></a></sup></td>\n' ] ],
[ '+',
  [ '<td>8<sup class="mw-ref" id="cite_ref-table_hth_GER0.36878909374065_67-0" rel="dc:references" typeof="mw:Extension/ref" data-mw=\'{"name":"ref","body":{"id":"mw-reference-text-cite_note-table_hth_GER0.36878909374065-67"},"attrs":{"group":"lower-alpha","name":"table_hth_GER0.36878909374065"}}\'><a href="./Brazil_at_the_2016_Summer_Olympics#cite_note-table_hth_GER0.36878909374065-67" data-mw-group="lower-alpha"><span class="mw-reflink-text">[lower-alpha 1]</span></a></sup></td>\n' ] ],
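
One way to keep these from showing up as diffs would be to drop the random suffix before comparing. A minimal sketch, with a hypothetical helper name; it assumes the name starts with "table_" and that the suffix is a decimal fraction with at least six digits, as in the two samples above. Nothing in this thread says Parsoid does this.

```javascript
// Hypothetical normalizer: remove the math.random() suffix that
// Module:Sports_table appends to note names, so ids from two parses match.
function stripRandomNoteSuffix(html) {
  // "table_" + name (lazy), then a decimal such as 0.75888191385143
  return html.replace(/(table_\w+?)0\.\d{6,}/g, '$1');
}

// Both ids are taken from the diff output above.
const idBefore = 'id="cite_ref-table_hth_GER0.75888191385143_67-0"';
const idAfter = 'id="cite_ref-table_hth_GER0.36878909374065_67-0"';
console.log(stripRandomNoteSuffix(idBefore));
console.log(stripRandomNoteSuffix(idBefore) === stripRandomNoteSuffix(idAfter));
```

A pattern this loose could also swallow a legitimate decimal that directly follows a "table_" name, so it is only suitable for the comparison step, not for output.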

Change 422583 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Follow up to e034960 w/ the &apos; variant

https://gerrit.wikimedia.org/r/422583

The imagemap extension looks like it uses timestamps in various places,

<div class="noresize" typeof="mw:Extension/imagemap" data-mw=\'{"name":"imagemap","attrs":{},"body":{"extsrc":"\\nՊատկեր:Italy location map.svg|300px|Պեսկարա (Իտալիա)\\nrect 0 0 0 0 [[##]]\\ndesc none\\n"}}\'><map name="ImageMap_1_890608218" id="ImageMap_1_890608218"> <area href="##" shape="rect" coords="0,0,0,0" alt="##" title="##"/></map><img alt="Պեսկարա (Իտալիա)" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Italy_location_map.svg/300px-Italy_location_map.svg.png" width="300" height="377" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/be/Italy_location_map.svg/450px-Italy_location_map.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/be/Italy_location_map.svg/600px-Italy_location_map.svg.png 2x" data-file-width="1034" data-file-height="1299" usemap="#ImageMap_1_890608218"/></div>
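
As a rough illustration of normalizing this away, the sketch below drops the trailing number from ImageMap ids. The helper name is hypothetical, it assumes that only the last numeric component varies between parses, and it is not the code from any Parsoid change. The second input uses a made-up number for a later parse.

```javascript
// Hypothetical normalizer: keep "ImageMap_<n>" and drop the trailing number,
// which is assumed to be the part that changes between parses.
function normalizeImageMapIds(html) {
  return html.replace(/\bImageMap_(\d+)_\d+\b/g, 'ImageMap_$1');
}

// First value is from the output above; the second number is an assumption.
const mapFirst = '<map name="ImageMap_1_890608218" id="ImageMap_1_890608218"></map><img usemap="#ImageMap_1_890608218"/>';
const mapSecond = '<map name="ImageMap_1_123456789" id="ImageMap_1_123456789"></map><img usemap="#ImageMap_1_123456789"/>';
console.log(normalizeImageMapIds(mapFirst));
console.log(normalizeImageMapIds(mapFirst) === normalizeImageMapIds(mapSecond));
```

The name, id and usemap attributes are rewritten together, so the map and the image still point at each other after normalizing.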

Change 423038 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Eliminate variability from imagemap extension attributes

https://gerrit.wikimedia.org/r/423038

For posterity, these diffs were taken with,

diff --git a/bin/roundtrip-test.js b/bin/roundtrip-test.js
index f164566b..239d0706 100755
--- a/bin/roundtrip-test.js
+++ b/bin/roundtrip-test.js
@@ -416,6 +416,8 @@ var checkIfSignificant = function(offsets, data) {
                                });
                        }
                        return results;
+               } else {
+                       console.log(Diff.diffLines(normalizedOld, normalizedNew))
                }
        }

Change 423038 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Eliminate variability from imagemap extension attributes

https://gerrit.wikimedia.org/r/423038

Going to consider this round of investigation resolved. We can always reopen when this rears its ugly head again.