id and fallback id differences
Closed, Resolved (Public)

Description

From zhwiki:驻中华人民共和国外交机构列表

----- JS:[328632, 328797] -----
<h3 id="中国大陆" data-parsoid='{"dsr":[15562,15577,3,3]}'>
<span id=".E4.B8.AD.E5.9B.BD.E5.A4.A7.E9.99.86" typeof="mw:FallbackId" data-parsoid='{"dsr":[15565,15565]}'>

+++++ PHP:[328632, 328799] +++++
<h3 id="_中国大陆" data-parsoid='{"dsr":[15562,15577,3,3]}'>
<span id="_.E4.B8.AD.E5.9B.BD.E5.A4.A7.E9.99.86" typeof="mw:FallbackId" data-parsoid='{"dsr":[15565,15565]}'>

.....

----- JS:[828388, 828537] -----
<h3 id="香港_2" data-parsoid='{"dsr":[33732,33745,3,3]}'>
<span id=".E9.A6.99.E6.B8.AF_2" typeof="mw:FallbackId" data-parsoid='{"dsr":[33735,33735]}'>

+++++ PHP:[828404, 828555] +++++
<h3 id="_香港_2" data-parsoid='{"dsr":[33732,33745,3,3]}'>
<span id="_.E9.A6.99.E6.B8.AF_2" typeof="mw:FallbackId" data-parsoid='{"dsr":[33735,33735]}'>

... and on and on ....

Event Timeline

ssastry triaged this task as Medium priority. Oct 16 2019, 5:36 PM
ssastry created this task.
ssastry updated the task description.

The regular id attribute seems to differ as well -- looks like we're not doing whitespace stripping on the left-hand side appropriately.

cscott renamed this task from Fallback id differences to id and fallback id differences. Oct 16 2019, 5:54 PM

Minimum repro:

$ echo '==={{CHNML}}}===' | php bin/parse.php --domain zh.wikipedia.org --body_only
<h3 id="_中国大陆}" data-parsoid='{"dsr":[0,16,3,3]}'><span id="_.E4.B8.AD.E5.9B.BD.E5.A4.A7.E9.99.86.7D" typeof="mw:FallbackId" data-parsoid='{"dsr":[3,3,null,null]}'></span><span class="flagicon" about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"stx":"html","dsr":[3,12,null,null],"pi":[[]]}' data-mw='{"parts":[{"template":{"target":{"wt":"CHNML","href":"./Template:CHNML"},"params":{},"i":0}}]}'><figure-inline class="mw-image-border" typeof="mw:Image"><span><img alt="" resource="./File:Flag_of_the_People's_Republic_of_China.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/22px-Flag_of_the_People%27s_Republic_of_China.svg.png" data-file-width="900" data-file-height="600" data-file-type="drawing" height="15" width="22" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/44px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/33px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x"/></span></figure-inline><span typeof="mw:Entity"> </span></span><a rel="mw:WikiLink" href="./中国大陆" title="中国大陆" about="#mwt1" data-parsoid='{"stx":"piped","a":{"href":"./中国大陆"},"sa":{"href":"中国大陆"}}'>中国大陆</a>}</h3>

-vs-

$ echo '==={{CHNML}}}===' | bin/parse.js --domain zh.wikipedia.org --body_only
<h3 id="中国大陆}" data-parsoid='{"dsr":[0,16,3,3]}'><span id=".E4.B8.AD.E5.9B.BD.E5.A4.A7.E9.99.86.7D" typeof="mw:FallbackId" data-parsoid='{"dsr":[3,3]}'></span><span class="flagicon" about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"stx":"html","dsr":[3,12,null,null],"pi":[[]]}' data-mw='{"parts":[{"template":{"target":{"wt":"CHNML","href":"./Template:CHNML"},"params":{},"i":0}}]}'><figure-inline class="mw-image-border" typeof="mw:Image"><span><img alt="" resource="./File:Flag_of_the_People's_Republic_of_China.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/22px-Flag_of_the_People%27s_Republic_of_China.svg.png" data-file-width="900" data-file-height="600" data-file-type="drawing" height="15" width="22"/></span></figure-inline><span typeof="mw:Entity"> </span></span><a rel="mw:WikiLink" href="./中国大陆" title="中国大陆" about="#mwt1" data-parsoid='{"stx":"piped","a":{"href":"./中国大陆"},"sa":{"href":"中国大陆"}}'>中国大陆</a>}</h3>

The inline image in the heading seems to be part of the issue here.

Ok, tracked it down to:

Sanitizer.normalizeSectionIdWhiteSpace = function(id) {
	return id.replace(/[ _]+/g, ' ').trim();
};

vs

public static function normalizeSectionIdWhiteSpace( string $id ): string {
	return trim( preg_replace( '/[ _]+/', ' ', $id ) );
}

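and JS and PHP having different definitions of trim: JS String.prototype.trim() strips Unicode whitespace, including U+00A0 (no-break space), whereas PHP trim() by default strips only space, \t, \n, \r, \0 and \x0B. The heading text here starts with the &nbsp; that follows the flag icon, so JS drops it and PHP keeps it. The legacy parser emits: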

<h3><span id="_.E4.B8.AD.E5.9B.BD.E5.A4.A7.E9.99.86"></span><span class="mw-headline" id="_中国大陆"><span class="flagicon"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/22px-Flag_of_the_People%27s_Republic_of_China.svg.png" decoding="async" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/33px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/44px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x" data-file-width="900" data-file-height="600" height="15" width="22">&nbsp;</span><a href="/wiki/%E4%B8%AD%E5%9B%BD%E5%A4%A7%E9%99%86" title="中國大陸">中國大陸</a></span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=%E9%A9%BB%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%A4%96%E4%BA%A4%E6%9C%BA%E6%9E%84%E5%88%97%E8%A1%A8&amp;veaction=edit&amp;section=10" class="mw-editsection-visualeditor" title="Edit section: &nbsp;中國大陸">edit</a><span class="mw-editsection-divider"> | </span><a href="https://zh.wikipedia.org/w/index.php?title=%E9%A9%BB%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%A4%96%E4%BA%A4%E6%9C%BA%E6%9E%84%E5%88%97%E8%A1%A8&amp;section=10&amp;veaction=editsource" title="Edit section: &nbsp;中國大陸">edit source</a><span class="mw-editsection-bracket">]</span></span></h3>

for https://zh.wikipedia.org/wiki/%E9%A9%BB%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%A4%96%E4%BA%A4%E6%9C%BA%E6%9E%84%E5%88%97%E8%A1%A8#_%E4%B8%AD%E5%9B%BD%E5%A4%A7%E9%99%86 so I suspect PHP is "right" here and JS should be fixed to eliminate the HTML diff.

Or else the legacy parser should be fixed to use a better trim... but that would break existing anchors.

Okay, this is yet one more instance where Parsoid/PHP matches core parser output. Not sure we should try to fix Parsoid/JS.

It's easy enough in this instance, and if it reduces noise in the HTML diffs, it increases our confidence in deploying Parsoid/PHP, so I think it's still worthwhile.

(I could have sworn I wrote a "trim like PHP" function already in Parsoid/JS somewhere....)
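For illustration, a minimal sketch of what a "trim like PHP" helper could look like in JS. It assumes the leading character of the heading text is U+00A0 (the &nbsp; after the flag icon). The name trimLikePhp and the sample string are hypothetical, and this is not necessarily what the actual patch does:

```javascript
// Hypothetical helper (not taken from the Parsoid source): strip only the
// characters that PHP's trim() removes by default, i.e. space, \t, \n, \r,
// NUL and vertical tab. Unlike String.prototype.trim(), U+00A0 is kept.
function trimLikePhp(s) {
	return s.replace(/^[ \t\n\r\0\x0B]+|[ \t\n\r\0\x0B]+$/g, '');
}

// Same whitespace collapsing as the JS snippet quoted above, but with the
// PHP-style trim swapped in for String.prototype.trim().
function normalizeSectionIdWhiteSpace(id) {
	return trimLikePhp(id.replace(/[ _]+/g, ' '));
}

// Sample input (an assumption): no-break space, heading text, trailing space.
const heading = '\u00A0中国大陆 ';

// Built-in trim() removes the leading U+00A0 as well as the trailing space.
const jsTrimmed = heading.replace(/[ _]+/g, ' ').trim();

// The PHP-style variant removes the trailing space but keeps the U+00A0.
const phpTrimmed = normalizeSectionIdWhiteSpace(heading);

console.log(jsTrimmed.length, phpTrimmed.length);
```

With the PHP-style trim the leading no-break space survives normalization, which is consistent with the extra leading character in the PHP and legacy parser ids above.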

Change 543720 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Fix id attribute trim for section headings in Parsoid/JS

https://gerrit.wikimedia.org/r/543720

Change 543720 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Fix id attribute trim for section headings in Parsoid/JS

https://gerrit.wikimedia.org/r/543720