id and fallback id differences
Closed, Resolved (Public)

Description

From zhwiki:驻中华人民共和国外交机构列表

----- JS:[328632, 328797] -----
<h3 id="中国大陆" data-parsoid='{"dsr":[15562,15577,3,3]}'>
<span id=".E4.B8.AD.E5.9B.BD.E5.A4.A7.E9.99.86" typeof="mw:FallbackId" data-parsoid='{"dsr":[15565,15565]}'>

+++++ PHP:[328632, 328799] +++++
<h3 id="_中国大陆" data-parsoid='{"dsr":[15562,15577,3,3]}'>
<span id="_.E4.B8.AD.E5.9B.BD.E5.A4.A7.E9.99.86" typeof="mw:FallbackId" data-parsoid='{"dsr":[15565,15565]}'>

.....

----- JS:[828388, 828537] -----
<h3 id="香港_2" data-parsoid='{"dsr":[33732,33745,3,3]}'>
<span id=".E9.A6.99.E6.B8.AF_2" typeof="mw:FallbackId" data-parsoid='{"dsr":[33735,33735]}'>

+++++ PHP:[828404, 828555] +++++
<h3 id="_香港_2" data-parsoid='{"dsr":[33732,33745,3,3]}'>
<span id="_.E9.A6.99.E6.B8.AF_2" typeof="mw:FallbackId" data-parsoid='{"dsr":[33735,33735]}'>

... and on and on ....
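
For reading the diffs above: the `mw:FallbackId` value looks like a dot-encoded copy of the regular id, so the stray leading `_` carries over from one to the other. The sketch below reproduces that relationship for the ids quoted in this task. It is an approximation written for illustration, not the Sanitizer code; the function name `legacyDotEncode` and the exact set of characters left unescaped are assumptions.

```javascript
// Hypothetical helper: approximates the legacy "dot" encoding seen in the
// fallback ids above. Each UTF-8 byte outside a small safe set becomes
// ".XX" (uppercase hex); spaces become underscores first.
function legacyDotEncode(id) {
	const bytes = new TextEncoder().encode(id.replace(/ /g, '_'));
	let out = '';
	for (const b of bytes) {
		const ch = String.fromCharCode(b);
		// Assumed safe set: ASCII letters, digits, and _ . : -
		if (/[A-Za-z0-9_.:-]/.test(ch)) {
			out += ch;
		} else {
			out += '.' + b.toString(16).toUpperCase().padStart(2, '0');
		}
	}
	return out;
}

console.log(legacyDotEncode('中国大陆'));  // JS-side id from the first diff
console.log(legacyDotEncode('_中国大陆')); // PHP-side id from the first diff
```

With the ids from the first diff as input, the two results differ only by the leading `_`, which matches the JS and PHP spans shown above.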

Event Timeline

ssastry triaged this task as Medium priority.
ssastry updated the task description.

The regular id attribute seems to differ as well. It looks like we're not stripping whitespace on the left-hand side appropriately.

cscott renamed this task from Fallback id differences to id and fallback id differences. Oct 16 2019, 5:54 PM

Minimum repro:

$ echo '==={{CHNML}}}===' | php bin/parse.php --domain zh.wikipedia.org --body_only
<h3 id="_中国大陆}" data-parsoid='{"dsr":[0,16,3,3]}'><span id="_.E4.B8.AD.E5.9B.BD.E5.A4.A7.E9.99.86.7D" typeof="mw:FallbackId" data-parsoid='{"dsr":[3,3,null,null]}'></span><span class="flagicon" about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"stx":"html","dsr":[3,12,null,null],"pi":[[]]}' data-mw='{"parts":[{"template":{"target":{"wt":"CHNML","href":"./Template:CHNML"},"params":{},"i":0}}]}'><figure-inline class="mw-image-border" typeof="mw:Image"><span><img alt="" resource="./File:Flag_of_the_People's_Republic_of_China.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/22px-Flag_of_the_People%27s_Republic_of_China.svg.png" data-file-width="900" data-file-height="600" data-file-type="drawing" height="15" width="22" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/44px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/33px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x"/></span></figure-inline><span typeof="mw:Entity"> </span></span><a rel="mw:WikiLink" href="./中国大陆" title="中国大陆" about="#mwt1" data-parsoid='{"stx":"piped","a":{"href":"./中国大陆"},"sa":{"href":"中国大陆"}}'>中国大陆</a>}</h3>

-vs-

$ echo '==={{CHNML}}}===' | bin/parse.js --domain zh.wikipedia.org --body_only
<h3 id="中国大陆}" data-parsoid='{"dsr":[0,16,3,3]}'><span id=".E4.B8.AD.E5.9B.BD.E5.A4.A7.E9.99.86.7D" typeof="mw:FallbackId" data-parsoid='{"dsr":[3,3]}'></span><span class="flagicon" about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"stx":"html","dsr":[3,12,null,null],"pi":[[]]}' data-mw='{"parts":[{"template":{"target":{"wt":"CHNML","href":"./Template:CHNML"},"params":{},"i":0}}]}'><figure-inline class="mw-image-border" typeof="mw:Image"><span><img alt="" resource="./File:Flag_of_the_People's_Republic_of_China.svg" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/22px-Flag_of_the_People%27s_Republic_of_China.svg.png" data-file-width="900" data-file-height="600" data-file-type="drawing" height="15" width="22"/></span></figure-inline><span typeof="mw:Entity"> </span></span><a rel="mw:WikiLink" href="./中国大陆" title="中国大陆" about="#mwt1" data-parsoid='{"stx":"piped","a":{"href":"./中国大陆"},"sa":{"href":"中国大陆"}}'>中国大陆</a>}</h3>

The inline image in the heading seems to be part of the issue here.

Ok, tracked it down to:

Sanitizer.normalizeSectionIdWhiteSpace = function(id) {
	return id.replace(/[ _]+/g, ' ').trim();
};

vs

public static function normalizeSectionIdWhiteSpace( string $id ): string {
	return trim( preg_replace( '/[ _]+/', ' ', $id ) );
}

and JS and PHP having different definitions for trim. The legacy parser emits:

<h3><span id="_.E4.B8.AD.E5.9B.BD.E5.A4.A7.E9.99.86"></span><span class="mw-headline" id="_中国大陆"><span class="flagicon"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/22px-Flag_of_the_People%27s_Republic_of_China.svg.png" decoding="async" class="thumbborder" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/33px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/44px-Flag_of_the_People%27s_Republic_of_China.svg.png 2x" data-file-width="900" data-file-height="600" height="15" width="22">&nbsp;</span><a href="/wiki/%E4%B8%AD%E5%9B%BD%E5%A4%A7%E9%99%86" title="中國大陸">中國大陸</a></span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=%E9%A9%BB%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%A4%96%E4%BA%A4%E6%9C%BA%E6%9E%84%E5%88%97%E8%A1%A8&amp;veaction=edit&amp;section=10" class="mw-editsection-visualeditor" title="Edit section: &nbsp;中國大陸">edit</a><span class="mw-editsection-divider"> | </span><a href="https://zh.wikipedia.org/w/index.php?title=%E9%A9%BB%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%A4%96%E4%BA%A4%E6%9C%BA%E6%9E%84%E5%88%97%E8%A1%A8&amp;section=10&amp;veaction=editsource" title="Edit section: &nbsp;中國大陸">edit source</a><span class="mw-editsection-bracket">]</span></span></h3>

for https://zh.wikipedia.org/wiki/%E9%A9%BB%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%A4%96%E4%BA%A4%E6%9C%BA%E6%9E%84%E5%88%97%E8%A1%A8#_%E4%B8%AD%E5%9B%BD%E5%A4%A7%E9%99%86 so I suspect PHP is "right" here and JS should be fixed to eliminate the HTML diff.

Or else the legacy parser should be fixed to use a better trim... but that would break existing anchors.
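
To make the trim difference concrete: JS `String.prototype.trim()` strips all Unicode whitespace, including U+00A0 (non-breaking space), while PHP `trim()` by default strips only space, `\t`, `\n`, `\r`, NUL and vertical tab. The sketch below assumes the leading character in the heading text is the non-breaking space from the flag template (the legacy output shows `&nbsp;` there). `phpTrim` and the two wrapper names are hypothetical, and this is not the code from the eventual patch.

```javascript
// Hypothetical helper: strips only the bytes PHP's trim() removes by default
// (space, \t, \n, \r, NUL, vertical tab). U+00A0 is deliberately left alone.
function phpTrim(s) {
	return s.replace(/^[ \t\n\r\0\x0B]+|[ \t\n\r\0\x0B]+$/g, '');
}

// Current Parsoid/JS behaviour, as quoted above.
function normalizeWithJsTrim(id) {
	return id.replace(/[ _]+/g, ' ').trim();
}

// Same normalization, but trimming the way PHP does.
function normalizeWithPhpTrim(id) {
	return phpTrim(id.replace(/[ _]+/g, ' '));
}

// Assumed input: heading text with a leading non-breaking space.
const heading = '\u00A0中国大陆';
console.log(normalizeWithJsTrim(heading).length);  // nbsp removed
console.log(normalizeWithPhpTrim(heading).length); // nbsp kept
```

Under that assumption the JS version drops the leading character and the PHP-style version keeps it, which is consistent with the extra leading `_` in the PHP and legacy ids. The step that turns the surviving character into `_` is not shown in this task.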

Okay, this is yet one more instance where Parsoid/PHP matches core parser output. Not sure we should try to fix Parsoid/JS.

It's easy enough in this instance, and if it reduces noise in the HTML diffs, it increases our confidence in deploying Parsoid/PHP, so I think it's still worthwhile.

(I could have sworn I wrote a "trim like PHP" function already in Parsoid/JS somewhere....)

Change 543720 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Fix id attribute trim for section headings in Parsoid/JS

https://gerrit.wikimedia.org/r/543720

Change 543720 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Fix id attribute trim for section headings in Parsoid/JS

https://gerrit.wikimedia.org/r/543720