Link prefix differences between Parsoid/JS & Parsoid/PHP
Closed, Resolved; Public

Description

For kawiki:იოსებ_სტალინი

----- JS:[50736, 50894] -----
</a>–
<a rel="mw:WikiLink" href="./1923" title="1923" data-parsoid='{"a":{"href":"./1923"},"dsr":[11350,11358,2,2],"sa":{"href":"1923"},"stx":"simple"}'>1923

+++++ PHP:[50789, 50960] +++++
</a>
<a rel="mw:WikiLink" href="./1923" title="1923" data-parsoid='{"a":{"href":"./1923"},"dsr":[11349,11358,3,2],"prefix":"–","sa":{"href":"1923"},"stx":"simple"}'>–1923

Event Timeline

ssastry created this task.

Turns out Parsoid/JS is the buggy one: https://ka.wikipedia.org/wiki/იოსებ_სტალინი shows that the "–" is indeed a link prefix on that page.

https://ar.wikipedia.org/wiki/%D8%B1%D8%A7%D9%86%D8%A8%D9%8A%D8%B1_%D9%83%D8%A7%D8%A8%D9%88%D8%B1 is another instance where "Nominated-" is a link prefix in Parsoid/PHP but not in Parsoid/JS. Again, Parsoid/JS is the buggy one here.

ssastry lowered the priority of this task from Medium to Low.
ssastry moved this task from Backlog to Bugs, Notices, Crashers on the Parsoid-PHP board.

This seems to be the next biggest source of diffs. It looks like Parsoid/JS is at fault on all of these wikis; a simple siteconfig issue may be causing this.

Link prefix/trail on arwiki is:

"linkprefixcharset": "a-zA-Z\\x{80}-\\x{10ffff}",
"linkprefix": "/^((?>.*[^a-zA-Z\\x{80}-\\x{10ffff}]|))(.+)$/sDu",
"linktrail": "/^([a-z\u0621-\u064a]+)(.*)$/sDu",

Link prefix/trail on kawiki is:

"linkprefixcharset": "a-zA-Z\\x{80}-\\x{10ffff}",
"linkprefix": "/^((?>.*[^a-zA-Z\\x{80}-\\x{10ffff}]|))(.+)$/sDu",
"linktrail": "/^([a-z\u10d0\u10d1\u10d2\u10d3\u10d4\u10d5\u10d6\u10d7\u10d8\u10d9\u10da\u10db\u10dc\u10dd\u10de\u10df\u10e0\u10e1\u10e2\u10e3\u10e4\u10e5\u10e6\u10e7\u10e8\u10e9\u10ea\u10eb\u10ec\u10ed\u10ee\u10ef\u10f0\u201c\u00bb]+)(.*)$/sDu"

My suspicion is that we're not properly capturing the regexp modifiers: the 'u' (PCRE_UTF8) modifier changes the behavior of '.'; the 's' (PCRE_DOTALL) modifier exists in JS, but maybe we're not propagating it; and the 'D' (PCRE_DOLLAR_ENDONLY) modifier is JavaScript's default behavior anyway.
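To make the modifier question concrete, here is a minimal sketch of carrying the three modifiers from the siteconfig over to JavaScript flags. The helper names `splitPhpRegex` and `pcreModifiersToJsFlags` are hypothetical and are not Parsoid's actual code; only 's', 'D' and 'u' are handled, because those are the only modifiers that appear above.

```javascript
// Hypothetical helpers for illustration only, not Parsoid's implementation.

// Split a PHP-style "/pattern/modifiers" string into its two parts.
function splitPhpRegex(source) {
  const match = /^\/(.*)\/([A-Za-z]*)$/s.exec(source);
  if (!match) {
    throw new Error('Expected a "/pattern/modifiers" string');
  }
  return { pattern: match[1], modifiers: match[2] };
}

// Map the PCRE modifiers seen in the siteconfig onto JS RegExp flags.
function pcreModifiersToJsFlags(modifiers) {
  let flags = '';
  for (const modifier of modifiers) {
    switch (modifier) {
      case 's': flags += 's'; break; // PCRE_DOTALL -> dotAll
      case 'u': flags += 'u'; break; // UTF-8 mode -> unicode flag
      case 'D': break;               // `$` already matches only at the very end without `m`
      default: throw new Error(`Unhandled PCRE modifier: ${modifier}`);
    }
  }
  return flags;
}

// The arwiki linktrail pattern from above, written with \u escapes.
const { pattern, modifiers } = splitPhpRegex('/^([a-z\\u0621-\\u064a]+)(.*)$/sDu');
const linkTrail = new RegExp(pattern, pcreModifiersToJsFlags(modifiers));
console.log(linkTrail.flags);                          // su
console.log(linkTrail.exec('s of the text').slice(1)); // [ 's', ' of the text' ]
```

Throwing on any other modifier is a deliberate choice in this sketch: silently dropping one is exactly the kind of mismatch suspected here.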

The prefix character from kawiki is \u2013; the one from arwiki is \u2014. Both of these are in the link prefix range of \x{80}-\x{10ffff}, and even if we're not propagating the u modifier across, the class would be parsed as [\x80-\uDBFF\uDFFF], and \u2013/\u2014 ought to be in that range.
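The range argument can be checked directly. This is a small sketch with hand-written JavaScript classes (the variable names are mine), assuming the escapes had been translated into a syntax JavaScript accepts:

```javascript
// With the `u` flag the class covers every code point from U+0080 to U+10FFFF.
const withUnicodeFlag = /^[a-zA-Z\u{80}-\u{10ffff}]$/u;

// Without it, U+10FFFF is the surrogate pair \uDBFF\uDFFF, so the class
// becomes the range \x80-\uDBFF plus the single code unit \uDFFF.
const withoutUnicodeFlag = /^[a-zA-Z\x80-\uDBFF\uDFFF]$/;

for (const ch of ['\u2013', '\u2014']) {
  console.log(
    ch.codePointAt(0).toString(16),
    withUnicodeFlag.test(ch),
    withoutUnicodeFlag.test(ch)
  );
}
// 2013 true true
// 2014 true true
```

So a missing u flag on its own would not explain the dashes being rejected as link prefixes.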

*BUT* JavaScript doesn't actually parse the \x{....} syntax:

> r=/[\u0061\x62\x{63}]/;
/[\u0061\x62\x{63}]/
> r.test('a')
true
> r.test('b')
true
> r.test('c')
false
> r.test('x')
true

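So that's the problem: without the u flag, JavaScript reads \x{63} inside the class as the literal characters x, {, 6, 3 and }, so the class matches 'x' but not 'c'.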
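One way to work around this is to rewrite the PCRE-style escapes before compiling the pattern. The sketch below is only an illustration of that idea: `pcreHexEscapesToJs` is a hypothetical helper and is not taken from the Gerrit change. It assumes the resulting pattern is always compiled with the u flag, since \u{...} is only recognized in that mode.

```javascript
// Hypothetical helper, not the code from the Gerrit change.
// Rewrites PCRE-style \x{H...} escapes as \u{H...}, which JavaScript only
// understands when the RegExp is compiled with the `u` flag.
function pcreHexEscapesToJs(pattern) {
  // The first alternative consumes an escaped backslash, so the "x{63}" in
  // "\\x{63}" (a literal backslash followed by plain text) is left untouched.
  return pattern.replace(
    /\\(?:\\|x\{([0-9A-Fa-f]+)\})/g,
    (whole, hex) => (hex === undefined ? whole : `\\u{${hex}}`)
  );
}

// The character class from the REPL session above, after conversion.
const fixed = new RegExp(pcreHexEscapesToJs('[\\u0061\\x62\\x{63}]'), 'u');
console.log(fixed.test('c'), fixed.test('x')); // true false

// The linkprefixcharset shared by arwiki and kawiki.
const prefixClass = new RegExp(
  `[${pcreHexEscapesToJs('a-zA-Z\\x{80}-\\x{10ffff}')}]`,
  'u'
);
console.log(prefixClass.test('\u2013'), prefixClass.test('\u2014')); // true true
```

With the escapes rewritten, both dashes fall inside the prefix class, which matches what Parsoid/PHP reports for these pages.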

Change 544967 had a related patch set uploaded (by C. Scott Ananian; owner: C. Scott Ananian):
[mediawiki/services/parsoid@master] Handle unicode escapes higher than \uFFFF properly in link prefix/trail

https://gerrit.wikimedia.org/r/544967

Change 544967 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Handle unicode escapes higher than \uFFFF properly in link prefix/trail

https://gerrit.wikimedia.org/r/544967

ssastry assigned this task to cscott.