
Linktrail&prefix is wrongly applied to CJK characters
Closed, Resolved, Public

Description

Wikitext:

[[汉]]字

Parsoid:

<p data-parsoid="{&quot;dsr&quot;:[0,6]}"><a rel="mw:WikiLink" href="./汉" data-parsoid="{&quot;tsr&quot;:[0,6],&quot;bsp&quot;:[0,6],&quot;a&quot;:{&quot;href&quot;:&quot;./汉&quot;},&quot;sa&quot;:{&quot;href&quot;:&quot;汉&quot;},&quot;stx&quot;:&quot;simple&quot;,&quot;tail&quot;:&quot;字&quot;,&quot;dsr&quot;:[0,6]}">汉字</a></p>

PHP Parser https://www.mediawiki.org/w/index.php?title=Project:Sandbox&oldid=594926:

<p><a href="/w/index.php?title=%E6%B1%89&amp;action=edit&amp;redlink=1" class="new" title="汉 (page does not exist)">汉</a>字</p>


Version: unspecified
Severity: normal

Details

Reference
bz41151

Event Timeline

bzimport raised the priority of this task from to Low.
bzimport set Reference to bz41151.
liangent created this task. Oct 18 2012, 7:18 AM

We don't support the localized link trail regexp yet and default to the English one, so in this case the trail is not matched as such.

Currently our focus is to make Parsoid safe for the English Wikipedia first, so that we can release it as a demo in December. After that release, the plan is to shift the focus to C++, which will also enable us to call back into PHP. That should allow us to reuse the existing localizations and message systems, so we don't spend too much time reinventing the wheel in a JavaScript prototype.

[09:19] <liangent> $linkTrail = '/^([a-z]+)(.*)$/sD'; is in MessagesEn.php
[09:20] <liangent> this shouldn't include CJK characters, right?
[09:20] <gwicke> I would think so
[09:20] <liangent> but parsoid includes CJK chars in linktrail..
[09:21] <gwicke> interesting- I guess we approximate the regexp to something more liberal right now
[09:22] <gwicke> there is no i18n support yet, so we don't use the localized regexps
[09:23] <gwicke> we currently have tail:( ![A-Z \t(),.:\n\r-] tc:text_char { return tc } )*
[09:24] <gwicke> I think the idea was to be very liberal about tails in the tokenizer, and to convert/validate based on language in token stream transforms
[09:24] <gwicke> invalid tails can then be converted back to a text token
[09:25] <gwicke> the A-Z might be a bit fishy in that context though..
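
To make the difference discussed in the chat concrete, here is a minimal JavaScript sketch. The first regexp is the $linkTrail value quoted above from MessagesEn.php. The second is only a rough approximation of the tokenizer's tail rule: it assumes text_char accepts any remaining character, which is an assumption because its definition is not shown in this thread. The function names are hypothetical.

```javascript
// English link trail as quoted from MessagesEn.php. PHP's /sD flags map to
// JavaScript's `s` flag; without the `m` flag, `$` already matches only at
// the very end of the input.
const englishLinkTrail = /^([a-z]+)(.*)$/s;

// Approximation of: tail:( ![A-Z \t(),.:\n\r-] tc:text_char { return tc } )*
// ASSUMPTION: text_char is treated as "any character" here.
const liberalTail = /^[^A-Z \t(),.:\n\r-]*/;

// Split the text following a link using the English regexp.
function splitTrailEnglish(text) {
  const m = englishLinkTrail.exec(text);
  return m ? { trail: m[1], rest: m[2] } : { trail: '', rest: text };
}

// Split the text following a link using the liberal negative character class.
function splitTrailLiberal(text) {
  const trail = liberalTail.exec(text)[0]; // always matches, possibly empty
  return { trail, rest: text.slice(trail.length) };
}

console.log(splitTrailEnglish('字')); // no trail: 字 is not in [a-z]
console.log(splitTrailLiberal('字')); // 字 is taken as the trail
```

Under these assumptions the English regexp leaves 字 outside the link, matching the PHP parser output in the description, while the liberal class pulls it into the link text.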

This specific example is working now as we have extended our negative char class to include something very close to the union of the complements of per-language character classes. This is a bit of a departure from default MediaWiki behavior, but might not be noticeable in practice.

We'd get consistent link trail behavior across languages if this works ok. If it does not, we'd have to revert to traditional per-language regexps.

Liangent, how well does the new regexp work for Chinese?

(In reply to comment #3)

Liangent, how well does the new regexp work for Chinese?

[[中]]cjk[[文]]

Here "cjk" gets treated as a link trail, which is usually not wanted. I believe there are some real-world use cases on zhwiki, as we usually don't put a space between Chinese text and embedded English words.

We have switched to use per-language link trail (and prefix) regexps now with https://gerrit.wikimedia.org/r/#/c/48589/. Our HTML form defaults to English settings, but testing on a test page from the Chinese Wikipedia has a good chance of working. Can you verify that the test cases above are now fixed?
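
A per-language lookup with an English default might look roughly like the following sketch. It is not Parsoid's actual code: the table, the function names, and in particular the zh entry (an assumed "no trail" pattern) are hypothetical and used only for illustration. Only the en regexp is quoted in this thread.

```javascript
// Hypothetical per-language table. Only the `en` entry comes from this
// thread (MessagesEn.php); the `zh` entry is an ASSUMED pattern that never
// captures a trail.
const linkTrailByLang = new Map([
  ['en', /^([a-z]+)(.*)$/s],
  ['zh', /^()(.*)$/s],
]);

// Return the regexp for a language, falling back to English when the
// language is unknown.
function getLinkTrail(lang) {
  return linkTrailByLang.get(lang) ?? linkTrailByLang.get('en');
}

// Split the text that follows a link into its trail and the remainder.
function splitTrail(lang, textAfterLink) {
  const m = getLinkTrail(lang).exec(textAfterLink);
  return m ? { trail: m[1], rest: m[2] } : { trail: '', rest: textAfterLink };
}

console.log(splitTrail('en', 'cjk')); // "cjk" becomes the trail
console.log(splitTrail('zh', 'cjk')); // trail stays empty
console.log(splitTrail('zh', '字'));  // trail stays empty
```

With a table like this, the text after [[中]] in [[中]]cjk[[文]] would stay outside the link for zh, while English pages keep the familiar behavior.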

(In reply to comment #5)

We have switched to use per-language link trail (and prefix) regexps now with
https://gerrit.wikimedia.org/r/#/c/48589/. Our HTML form defaults to English
settings, but testing on a test page from the Chinese Wikipedia has a good
chance of working. Can you verify that the test cases above are now fixed?

Can you make that HTML form accept an extra "language" option?

We could do so, but it is pretty low priority for us. Note that you can also point the web service to a page in the user namespace like this:

http://parsoid.wmflabs.org/zh/User:Liangent/Test

(In reply to comment #7)

We could do so, but it is pretty low priority for us. Note that you can also
point the web service to a page in the user namespace like this:

http://parsoid.wmflabs.org/zh/User:Liangent/Test

Now linkprefix is applied incorrectly. Still the same use case: [[中]]cjk[[文]]

Expected: [[中]]cjk[[文]]

Actual: [[中]][[文|cjk文]]
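
The Expected and Actual results above can be reproduced with a small sketch of link prefix handling. The prefix pattern and the function name are hypothetical and not taken from MediaWiki or Parsoid. The sketch only assumes that, when a prefix pattern is active, letters directly before a link are pulled into the link text.

```javascript
// HYPOTHETICAL prefix pattern: Latin letters directly before a link are
// captured as the link prefix. Not quoted from MediaWiki or Parsoid.
const latinLinkPrefix = /^(.*?)([a-z]+)$/s;

// Render the text preceding a link plus the link itself. When a prefix
// pattern is given and matches, the prefix moves into the link text.
function applyLinkPrefix(textBefore, target, prefixPattern) {
  const m = prefixPattern ? prefixPattern.exec(textBefore) : null;
  if (!m) {
    return `${textBefore}[[${target}]]`;
  }
  return `${m[1]}[[${target}|${m[2]}${target}]]`;
}

// With the prefix pattern active, "cjk" is absorbed into the second link.
console.log(applyLinkPrefix('[[中]]cjk', '文', latinLinkPrefix));
// With no prefix pattern, the wikitext is left as written.
console.log(applyLinkPrefix('[[中]]cjk', '文', null));
```

The first call yields the Actual form and the second the Expected form, which suggests the fix is about whether a prefix pattern should apply for this language at all, not about the pattern's details.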

There's still a patch in Gerrit that depends on a core patch. It's not testable on zhwiki until the core patch is deployed. I guess that will happen with wmf11, which is (at most) four weeks away from reaching zhwiki. That patch may already fix this.

If someone could test it locally, it's at https://gerrit.wikimedia.org/r/50814 and the core patch is in the latest mediawiki (from git).

cscott added a comment. Oct 2 2013, 6:47 PM

This should be fixed now. The core patch has been merged for some time. Liangent, can you retest?

This seems fixed, but I found bug 54891 when testing it. I'm not sure whether that should be considered a bug on the VisualEditor side or in the Parsoid serializer.