Link prefix differences between Parsoid/JS & Parsoid/PHP
Closed, Resolved · Public

Description

For kawiki:იოსებ_სტალინი

----- JS:[50736, 50894] -----
</a>–
<a rel="mw:WikiLink" href="./1923" title="1923" data-parsoid='{"a":{"href":"./1923"},"dsr":[11350,11358,2,2],"sa":{"href":"1923"},"stx":"simple"}'>1923

+++++ PHP:[50789, 50960] +++++
</a>
<a rel="mw:WikiLink" href="./1923" title="1923" data-parsoid='{"a":{"href":"./1923"},"dsr":[11349,11358,3,2],"prefix":"–","sa":{"href":"1923"},"stx":"simple"}'>–1923

Event Timeline

ssastry triaged this task as Normal priority. Oct 15 2019, 9:09 PM
ssastry created this task.
Restricted Application added a subscriber: Aklapper. Oct 15 2019, 9:09 PM
ssastry closed this task as Invalid. Oct 15 2019, 9:10 PM

Turns out Parsoid/JS is buggy: https://ka.wikipedia.org/wiki/იოსებ_სტალინი shows that the "–" is indeed a link prefix on that page.

https://ar.wikipedia.org/wiki/%D8%B1%D8%A7%D9%86%D8%A8%D9%8A%D8%B1_%D9%83%D8%A7%D8%A8%D9%88%D8%B1 is another instance, where "Nominated-" is a link prefix in Parsoid/PHP but not in Parsoid/JS. Again, Parsoid/JS is the buggy one here.

ssastry reopened this task as Open. Oct 17 2019, 3:21 PM
ssastry lowered the priority of this task from Normal to Low.
ssastry moved this task from Backlog to Bugs on the Parsoid-PHP board.

This seems to be the next biggest source of diffs. Parsoid/JS looks to be at fault on all of these wikis; a simple siteconfig issue may be what is causing this.

ssastry moved this task from Bugs to Parsoid/JS bugs on the Parsoid-PHP board. Oct 17 2019, 3:27 PM
cscott added a subscriber: cscott. Oct 19 2019, 2:59 AM

Link prefix/trail on arwiki is:

"linkprefixcharset": "a-zA-Z\\x{80}-\\x{10ffff}",
"linkprefix": "/^((?>.*[^a-zA-Z\\x{80}-\\x{10ffff}]|))(.+)$/sDu",
"linktrail": "/^([a-z\u0621-\u064a]+)(.*)$/sDu",

Link prefix/trail on kawiki is:

"linkprefixcharset": "a-zA-Z\\x{80}-\\x{10ffff}",
"linkprefix": "/^((?>.*[^a-zA-Z\\x{80}-\\x{10ffff}]|))(.+)$/sDu",
"linktrail": "/^([a-z\u10d0\u10d1\u10d2\u10d3\u10d4\u10d5\u10d6\u10d7\u10d8\u10d9\u10da\u10db\u10dc\u10dd\u10de\u10df\u10e0\u10e1\u10e2\u10e3\u10e4\u10e5\u10e6\u10e7\u10e8\u10e9\u10ea\u10eb\u10ec\u10ed\u10ee\u10ef\u10f0\u201c\u00bb]+)(.*)$/sDu"

My suspicion would be that we're not properly capturing the regexp modifiers: u (PCRE_UTF8) changes the behavior of "."; the s (PCRE_DOTALL) modifier exists in JS, but maybe we're not propagating it; D (PCRE_DOLLAR_ENDONLY) is JavaScript's default behavior anyway.
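To make the modifier question concrete, here is a minimal sketch of carrying PHP-style modifiers over to JavaScript flags. The helper name splitPcreLiteral is hypothetical and this is not Parsoid's actual code; it only handles the three modifiers discussed above and is tried on the arwiki linktrail value.

```javascript
// Hypothetical helper (an assumption for illustration, not Parsoid's code):
// split a PHP-style "/pattern/modifiers" string and keep only the
// modifiers that need to be passed to the JavaScript RegExp constructor.
function splitPcreLiteral(literal) {
  const lastSlash = literal.lastIndexOf('/');
  if (literal[0] !== '/' || lastSlash <= 0) {
    throw new Error('expected a /pattern/modifiers string');
  }
  const source = literal.slice(1, lastSlash);
  const modifiers = literal.slice(lastSlash + 1);
  let flags = '';
  for (const m of modifiers) {
    if (m === 's' || m === 'u') {
      flags += m; // dotAll and unicode exist as JS flags with the same letter
    } else if (m === 'D') {
      // PCRE_DOLLAR_ENDONLY: without the m flag, JS `$` already matches
      // only at the very end of the input, so nothing is carried over.
    } else {
      throw new Error('unhandled PCRE modifier: ' + m);
    }
  }
  return { source, flags };
}

// The arwiki linktrail value after JSON decoding.
const { source, flags } = splitPcreLiteral('/^([a-z\\u0621-\\u064a]+)(.*)$/sDu');
const linkTrail = new RegExp(source, flags);
console.log(flags, linkTrail.exec('abc def').slice(1));
```

Throwing on unknown modifiers is a deliberate choice in this sketch: silently dropping one would hide exactly the kind of mismatch suspected here.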

The prefix character from kawiki is \u2013; the one from arwiki is \u2014. Both of these are in the link prefix range \x{80}-\x{10ffff}, and even if we're not propagating the u modifier across, the range would be parsed as [\u0080-\uDBFF\uDFFF], so \u2013/\u2014 ought to be in that range.
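The range claim can be checked directly. The sketch below writes the character class by hand in JavaScript syntax (an assumption: it is not generated from the siteconfig) and tests both dashes with and without the u flag.

```javascript
// Without the u flag, U+10FFFF is the surrogate pair \uDBFF\uDFFF, so the
// class becomes the range \u0080-\uDBFF plus the single code unit \uDFFF.
const withoutU = /[a-zA-Z\u0080-\uDBFF\uDFFF]/;
// With the u flag, the full code point range can be written directly.
const withU = /[a-zA-Z\u{80}-\u{10ffff}]/u;

const enDash = '\u2013'; // prefix character seen on kawiki
const emDash = '\u2014'; // prefix character seen on arwiki

for (const ch of [enDash, emDash]) {
  console.log(ch.codePointAt(0).toString(16), withoutU.test(ch), withU.test(ch));
}
```

Both dashes match in either mode, so a missing u modifier alone would not explain the diff.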

*BUT* JavaScript doesn't actually parse the \x{....} syntax:

> r=/[\u0061\x62\x{63}]/;
/[\u0061\x62\x{63}]/
> r.test('a')
true
> r.test('b')
true
> r.test('c')
false
> r.test('x')
true

So that's the problem.
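One way to work around this, sketched under assumptions, is to rewrite the PCRE \x{...} escapes as \u{...} and compile with the u flag. The helper pcreHexEscapesToJs is hypothetical and is not the code from the patch below; it also does not guard against an escaped backslash directly before the x.

```javascript
// Hypothetical helper (an assumption, not the merged patch): rewrite PCRE
// \x{...} escapes as \u{...}, which JavaScript accepts only with the u flag.
// Limitation: a literal "\\" followed by "x{...}" would be rewritten too.
function pcreHexEscapesToJs(pcreSource) {
  return pcreSource.replace(/\\x\{([0-9a-fA-F]+)\}/g, '\\u{$1}');
}

// Value of "linkprefixcharset" after JSON decoding.
const linkPrefixCharset = 'a-zA-Z\\x{80}-\\x{10ffff}';
const converted = pcreHexEscapesToJs(linkPrefixCharset);
const prefixChar = new RegExp('[' + converted + ']', 'u');

console.log(converted);                 // a-zA-Z\u{80}-\u{10ffff}
console.log(prefixChar.test('\u2013')); // true: en dash from kawiki
console.log(prefixChar.test('\u2014')); // true: em dash from arwiki
console.log(prefixChar.test(' '));      // false: space is outside the charset
```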

Change 544967 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Handle unicode escapes higher than \uFFFF properly in link prefix/trail

https://gerrit.wikimedia.org/r/544967

Change 544967 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Handle unicode escapes higher than \uFFFF properly in link prefix/trail

https://gerrit.wikimedia.org/r/544967

ssastry closed this task as Resolved. Oct 22 2019, 4:28 AM
ssastry assigned this task to cscott.