Template data-parsoid "spc" property differences
Closed, Declined (Public)

Description

On jawiki:ダーク・アイランド%2F堕ちた楽園

----- JS:[52, 1783] -----
<table class="infobox " style="width:22em; width:20em" about="#mwtX" typeof="mw:Transclusion" data-parsoid='{"dsr":[0,368],"pi":[[{"k":"1"},{"k":"作品名","named":true,"spc":[" "," "," ","\n"]},{"k":"原題","named":true,"spc":[" "," "," ","\n"]},{"k":"画像","named":true,"spc":[" "," ",""," \n"]},{"k":"画像サイズ","named":true,"spc":[" "," ",""," \n"]},{"k":"画像解説","named":true,"spc":[" "," ",""," \n"]},{"k":"監督","named":true,"spc":[" "," "," ","\n"]},{"k":"製作","named":true,"spc":[" "," "," ","\n"]},{"k":"脚本","named":true,"spc":[" "," "," "," \n"]},{"k":"出演者","named":true,"spc":[" "," "," ","\n"]},{"k":"音楽","named":true,"spc":[" "," ",""," \n"]},{"k":"撮影","named":true,"spc":[" "," ",""," \n"]},{"k":"配給","named":true,"spc":[" "," ",""," \n"]},{"k":"公開","named":true,"spc":[" "," "," ","\n"]},{"k":"上映時間","named":true,"spc":[" "," "," ","\n"]},{"k":"製作国","named":true,"spc":[" "," "," ","\n"]},{"k":"言語","named":true,"spc":[" "," "," ","\n"]},{"k":"制作費","named":true,"spc":[" "," ",""," \n"]},{"k":"興行収入","named":true,"spc":[" "," ",""," \n"]},{"k":"前作","named":true,"spc":[" "," ",""," \n"]}]],"stx":"html"}' data-mw='{"parts":[{"template":{"target":{"wt":"Infobox Film","href":"./Template:Infobox_Film"},"params":{"1":{"wt":"\n"},"作品名":{"wt":"ダーク・アイランド/堕ちた楽園"},"原題":{"wt":"Dark Tide"},"画像":{"wt":""},"画像サイズ":{"wt":""},"画像解説":{"wt":""},"監督":{"wt":"[[ルカ・ベルコヴィッチ]]"},"製作":{"wt":"ロバート・L・レヴィ&lt;br />ピーター・J・エイブラハム"},"脚本":{"wt":"ピーター・J・エイブラハム&lt;br />サム・バーナード"},"出演者":{"wt":"[[ブリジット・バーコ]]&lt;br />[[クリス・サランドン]]&lt;br />[[リチャード・タイソン]]"},"音楽":{"wt":""},"撮影":{"wt":""},"配給":{"wt":""},"公開":{"wt":"{{flagicon|JPN}} 劇場未公開"},"上映時間":{"wt":"94分"},"製作国":{"wt":"{{USA}}"},"言語":{"wt":"[[英語]]"},"制作費":{"wt":""},"興行収入":{"wt":""},"前作":{"wt":""}},"i":0}}]}'>

+++++ PHP:[52, 1783] +++++
<table class="infobox " style="width:22em; width:20em" about="#mwtX" typeof="mw:Transclusion" data-parsoid='{"dsr":[0,368],"pi":[[{"k":"1"},{"k":"作品名","named":true,"spc":[" "," "," ","\n"]},{"k":"原題","named":true,"spc":[" "," "," ","\n"]},{"k":"画像","named":true,"spc":[" "," ",""," \n"]},{"k":"画像サイズ","named":true,"spc":[" "," ",""," \n"]},{"k":"画像解説","named":true,"spc":[" "," ",""," \n"]},{"k":"監督","named":true,"spc":[" "," "," ","\n"]},{"k":"製作","named":true,"spc":[" "," "," ","\n"]},{"k":"脚本","named":true,"spc":[" "," "," ","\n"]},{"k":"出演者","named":true,"spc":[" "," "," ","\n"]},{"k":"音楽","named":true,"spc":[" "," ",""," \n"]},{"k":"撮影","named":true,"spc":[" "," ",""," \n"]},{"k":"配給","named":true,"spc":[" "," ",""," \n"]},{"k":"公開","named":true,"spc":[" "," "," ","\n"]},{"k":"上映時間","named":true,"spc":[" "," "," ","\n"]},{"k":"製作国","named":true,"spc":[" "," "," ","\n"]},{"k":"言語","named":true,"spc":[" "," "," ","\n"]},{"k":"制作費","named":true,"spc":[" "," ",""," \n"]},{"k":"興行収入","named":true,"spc":[" "," ",""," \n"]},{"k":"前作","named":true,"spc":[" "," ",""," \n"]}]],"stx":"html"}' data-mw='{"parts":[{"template":{"target":{"wt":"Infobox Film","href":"./Template:Infobox_Film"},"params":{"1":{"wt":"\n"},"作品名":{"wt":"ダーク・アイランド/堕ちた楽園"},"原題":{"wt":"Dark Tide"},"画像":{"wt":""},"画像サイズ":{"wt":""},"画像解説":{"wt":""},"監督":{"wt":"[[ルカ・ベルコヴィッチ]]"},"製作":{"wt":"ロバート・L・レヴィ&lt;br />ピーター・J・エイブラハム"},"脚本":{"wt":"ピーター・J・エイブラハム&lt;br />サム・バーナード "},"出演者":{"wt":"[[ブリジット・バーコ]]&lt;br />[[クリス・サランドン]]&lt;br />[[リチャード・タイソン]]"},"音楽":{"wt":""},"撮影":{"wt":""},"配給":{"wt":""},"公開":{"wt":"{{flagicon|JPN}} 劇場未公開"},"上映時間":{"wt":"94分"},"製作国":{"wt":"{{USA}}"},"言語":{"wt":"[[英語]]"},"制作費":{"wt":""},"興行収入":{"wt":""},"前作":{"wt":""}},"i":0}}]}'>

On frwiki:Last_Week_Tonight_with_John_Oliver:

----- JS:[319924, 320868] -----
<tr style="text-align: center; background:#F2F2F2" about="#mwtX" typeof="mw:Transclusion" data-parsoid='{"dsr":[53304,53610],"pi":[[{"k":"titre","named":true,"spc":["","","","\n"]},{"k":"div 1","named":true,"spc":["","","","   \n"]},{"k":"div 4","named":true,"spc":["","",""," \n"]},{"k":"diffusion originale","named":true,"spc":[""," "," ","\n"]},{"k":"résumé","named":true,"spc":["","","","\n"]},{"k":"ligne séparatrice","named":true,"spc":["","","","\n"]}]],"stx":"html"}' data-mw='{"parts":[{"template":{"target":{"wt":"Liste des épisodes\n","href":"./Modèle:Liste_des_épisodes"},"params":{"titre":{"wt":"Épisode 98"},"div 1":{"wt":""},"div 4":{"wt":""},"diffusion originale":{"wt":"{{Date début|2017|04|16}}"},"résumé":{"wt":"Segment: [[Élection présidentielle française de 2017]], [[Sean Spicer]], [[Positions politiques de Donald Trump]], {{Lien|langue=en|fr=2017 Nangarhar airstrike}}"},"ligne séparatrice":{"wt":"333333"}},"i":0}}]}'>

+++++ PHP:[319924, 320868] +++++
<tr style="text-align: center; background:#F2F2F2" about="#mwtX" typeof="mw:Transclusion" data-parsoid='{"dsr":[53304,53610],"pi":[[{"k":"titre","named":true,"spc":["","","","\n"]},{"k":"div 1","named":true,"spc":["","","  ","\n"]},{"k":"div 4","named":true,"spc":["","",""," \n"]},{"k":"diffusion originale","named":true,"spc":[""," "," ","\n"]},{"k":"résumé","named":true,"spc":["","","","\n"]},{"k":"ligne séparatrice","named":true,"spc":["","","","\n"]}]],"stx":"html"}' data-mw='{"parts":[{"template":{"target":{"wt":"Liste des épisodes\n","href":"./Modèle:Liste_des_épisodes"},"params":{"titre":{"wt":"Épisode 98"},"div 1":{"wt":" "},"div 4":{"wt":""},"diffusion originale":{"wt":"{{Date début|2017|04|16}}"},"résumé":{"wt":"Segment: [[Élection présidentielle française de 2017]], [[Sean Spicer]], [[Positions politiques de Donald Trump]], {{Lien|langue=en|fr=2017 Nangarhar airstrike}}"},"ligne séparatrice":{"wt":"333333"}},"i":0}}]}'>

Event Timeline

ssastry triaged this task as Medium priority.Oct 16 2019, 1:39 PM
ssastry created this task.

The jawiki infobox example might be simpler to debug: it is at the top of the page, and I can reproduce it locally by running it through parse.js and parse.php. Likely a difference in trimming.

Okay .. so, this is a case of Unicode spaces and regexp differences between JS and PHP. We have a number of PORT-FIXMEs in the codebase about this exact scenario, but not in the TemplateHandler, where this is manifesting.

I am spacing out (hah!) about what the right behavior ought to be .. i.e. whether the Unicode space should be counted as whitespace (as in JS) or not (as in PHP).

The diff isn't a big deal in that this property ('spc') exists to reproduce spaces when going from HTML -> WT in cases where selective serialization doesn't apply.

As it turns out, in the JS case the Unicode space is treated as whitespace and ends up in the 'spc' attribute. In the PHP case, it isn't treated as whitespace and becomes part of the wikitext arg.
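
To make the difference concrete, here is a minimal JS sketch of the two splits. It assumes the trailing character is U+3000 (ideographic space); the task doesn't say which Unicode space is actually on the page. The ASCII-only character class is a stand-in for how a byte-oriented PCRE `\s` (no `/u` flag) behaves, and `splitValue` is a hypothetical helper, not the TemplateHandler code.

```javascript
// Hypothetical helper: split a raw template-arg value into
// leading whitespace, value, and trailing whitespace,
// using the given whitespace character class.
function splitValue(raw, wsClass) {
  const re = new RegExp(`^(${wsClass}*)([^]*?)(${wsClass}*)$`);
  const m = raw.match(re);
  return { lead: m[1], value: m[2], trail: m[3] };
}

const JS_WS = "\\s";                   // JS \s includes Unicode whitespace
const ASCII_WS = "[ \\t\\n\\r\\f\\v]"; // assumption: approximates PCRE \s without /u

// Assumed input: a value followed by U+3000 and a newline.
const raw = " サム・バーナード\u3000\n";

const jsLike = splitValue(raw, JS_WS);
const phpLike = splitValue(raw, ASCII_WS);

// JS-like split: U+3000 lands in the trailing whitespace.
console.log(JSON.stringify(jsLike));
// PHP-like split: U+3000 stays attached to the value.
console.log(JSON.stringify(phpLike));
```

Under these assumptions, the first split puts the Unicode space into the trailing 'spc' slot and the second keeps it as the last character of the arg, which matches the shape of the jawiki diff.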

So, this won't really matter in terms of html2wt diffs except when this param is edited.

But, I suppose the simplest strategy would be to mimic JS output here, since it seems reasonable to treat the unicode trailing space not as part of the arg, but as readability whitespace between lines.

That said, PHP's trim won't trim Unicode whitespace ( https://www.php.net/manual/en/function.trim.php ) ... so matching Unicode spaces in the "spc" capturing regexp isn't good if trim won't then strip them from the wikitext arg. Hmm ...
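
A small sketch of that trimming mismatch, written in JS for comparison. `phpLikeTrim` is a hypothetical emulation of PHP's default trim character list as given in the PHP manual (space, `\n`, `\r`, `\t`, `\v`, `\0`); it is not Parsoid code, and the U+3000 in the sample value is an assumption.

```javascript
// Hypothetical emulation of PHP trim() with its default character list.
function phpLikeTrim(s) {
  return s.replace(/^[ \n\r\t\v\0]+|[ \n\r\t\v\0]+$/g, "");
}

// Assumed sample: an arg value with a trailing U+3000.
const arg = "サム・バーナード\u3000";

// String.prototype.trim strips Unicode whitespace, including U+3000.
const jsTrimmed = arg.trim();
// The ASCII-only trim leaves U+3000 in place.
const phpTrimmed = phpLikeTrim(arg);

console.log(JSON.stringify(jsTrimmed));
console.log(JSON.stringify(phpTrimmed));
```

So if the "spc" regexp were widened to capture Unicode spaces while the arg itself is still trimmed with an ASCII-only trim, the two would disagree about where the value ends.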

Which of these (from your second example) is actually correct?

{"k":"div 1","named":true,"spc":["","","","   \n"]},
{"k":"div 1","named":true,"spc":["","","  ","\n"]},

Either the whitespace appears both before *and* after div 1 in the wikitext (PHP) or it only appears after div 1 (JS). One of those has to be wrong, and we should fix it.

Look at https://fr.wikipedia.org/w/index.php?title=Last_Week_Tonight_with_John_Oliver&action=edit&section=3 and Episode 98. It is something like |div1= \n. Both interpretations are correct. Offsets 2 and 3 are for whitespace around the arg-value. In this case, if you treat the arg as '', then you can treat it either how JS treats it or how PHP treats it. Makes no material difference to the round tripping for which this exists.
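
As a rough illustration of why either split round-trips: the sketch below reassembles the `div 1` param from both outputs. The layout `spc = [before key, after key, before value, after value]` is inferred from the data in the description, `serializeParam` is a hypothetical helper rather than Parsoid's serializer, and the U+00A0 is an assumption (the task does not confirm which character, if any, is non-ASCII).

```javascript
// Hypothetical helper: rebuild one named template param from its
// key, its four 'spc' slots, and its wikitext value.
function serializeParam(k, spc, wt) {
  return "|" + spc[0] + k + spc[1] + "=" + spc[2] + wt + spc[3];
}

// JS-style output: empty value, all whitespace in the trailing slot.
// Assumed: the third whitespace character is U+00A0.
const fromJS = serializeParam("div 1", ["", "", "", "  \u00a0\n"], "");

// PHP-style output: two leading spaces, the odd character as the value.
const fromPHP = serializeParam("div 1", ["", "", "  ", "\n"], "\u00a0");

// Both reassemble to the same wikitext.
console.log(JSON.stringify(fromJS));
console.log(JSON.stringify(fromPHP));
console.log(fromJS === fromPHP);
```

The two only diverge once the value is edited, since the edited value replaces different parts of the original string in each case.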

Ah. Is the value always '' in cases where the diff appears? It seemed like you were saying that in some cases the unicode space wasn't being properly (?) stripped from the value. (That seems to be the case in T235684, for example.)

The Unicode space issue is from the Japanese infobox example. Look at the infobox wikitext on that wiki page and inspect the trailing space of the parameter in question.

As for the frwiki example (something like |div1= \n, as described earlier): I'm not sure why the diff showed up in that case. Maybe there is some Unicode char lurking in there and I didn't look closely.

[subbu@earth:~] node
> s = "d=   \n";
> s.match(/^(\s*)[^]*?(\s*)$/)
[ 'd=   \n', '', '   \n', index: 0, input: 'd=   \n' ]

[subbu@earth:~] psysh
Psy Shell v0.9.9 (PHP 7.2.19-0ubuntu0.18.04.2 — cli) by Justin Hileman
>>> $s = "d=   \n";
>>> preg_match( '/^(\s*)[\s\S]*?(\s*)$/D', $s, $valueSpaceMatch )
>>> $valueSpaceMatch
=> [
     "d=   \n",
     "",
     "   \n",
   ]
>>>
ssastry lowered the priority of this task from Medium to Low.Oct 25 2019, 2:12 PM

My hunch here is that the Parsoid/PHP behavior is likely to mimic the core parser's whitespace handling for template args.

Parsoid/JS is long gone and we aren't going to worry about this difference anymore.