
VisualEditor: ve.dm.SurfaceFragment.wordBoundaryPattern treats non-lower-ASCII word characters as boundaries
Closed, Resolved (Public)

Description

See http://inimino.org/~inimino/blog/javascript_cset for some work in this area.


Version: unspecified
Severity: major

Details

Reference
bz44085

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 1:22 AM
bzimport set Reference to bz44085.

Bit of clarification:

When the user clicks the link button in the toolbar without having selected any text, we expand the selection in both directions from the cursor position to cover the word the cursor is in, make that a link, then show the link inspector. The code that expands the selection to a full word is in ve.dm.SurfaceFragment, and it apparently treats non-ASCII characters as word boundaries. The practical bug is that if you put the cursor in the middle of "Möckernbrücke" (or "égalité", if you prefer French) and click the link button, only "ckernbr" (or "galit", respectively) is selected and linkified. Obviously this is a problem for i18n in languages using an extended Latin alphabet like German, French and Polish, but it's a total nightmare for languages with non-Latin scripts like Russian, Hebrew and Japanese, where every letter counts as a boundary.
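
To make the failure concrete, here is a minimal sketch of the expansion step. `expandToWord` and `asciiWordChar` are hypothetical names, and the ASCII-only character class is an assumption about how the bug arises, not a copy of the code in ve.dm.SurfaceFragment.

```javascript
// Hypothetical helper (not the real VisualEditor code): grow a collapsed
// cursor at `offset` outwards while the neighbouring characters match
// `wordCharPattern`, and return the text that would be selected.
function expandToWord( text, offset, wordCharPattern ) {
    let start = offset;
    let end = offset;
    while ( start > 0 && wordCharPattern.test( text[ start - 1 ] ) ) {
        start--;
    }
    while ( end < text.length && wordCharPattern.test( text[ end ] ) ) {
        end++;
    }
    return text.slice( start, end );
}

// Assumed for illustration: only ASCII letters, digits and "_" count as
// word characters, so "ö" and "ü" stop the expansion.
const asciiWordChar = /^[A-Za-z0-9_]$/;

const station = 'Station Möckernbrücke';
const cursorInWord = station.indexOf( 'kern' );

console.log( expandToWord( station, cursorInWord, asciiWordChar ) ); // "ckernbr"
```

With this pattern every accented letter behaves like whitespace, which is why the selection stops at "ö" on the left and "ü" on the right.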

Actually, written Chinese and Japanese don't mark word boundaries at all. The only way to detect them is with a dictionary. We'll need a special case for these languages so we don't end up selecting entire sentences.
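
One possible shape for that special case is a guard that skips word expansion when the cursor sits next to a character from one of these scripts. This is only a sketch: `shouldExpandToWord` is a hypothetical name, the script list is an assumed and incomplete one, and the `\p{Script=…}` escapes assume an ES2018 engine.

```javascript
// Assumed, incomplete list of scripts written without spaces between words.
const unspacedScript = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/u;

// Hypothetical guard: return false when the character on either side of
// the cursor belongs to one of the scripts above, so the caller can leave
// the selection collapsed instead of swallowing a whole sentence.
// Limitation: indexing by code unit means non-BMP Han characters are missed.
function shouldExpandToWord( text, offset ) {
    const before = offset > 0 ? text[ offset - 1 ] : '';
    const after = offset < text.length ? text[ offset ] : '';
    return !unspacedScript.test( before ) && !unspacedScript.test( after );
}

console.log( shouldExpandToWord( 'Möckernbrücke', 4 ) );
console.log( shouldExpandToWord( '私は東京に行きます', 3 ) );
```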

http://xregexp.com/ has Unicode character class support. We may be able to pick out the data we need from it instead of using the whole library.

To begin with, here is a patch that adds some test structure and fixes what we already have: https://gerrit.wikimedia.org/r/#/c/53564

dchan added a comment. Mar 13 2013, 4:31 PM

If you're going to do lexicon-based word boundary detection in Chinese, maybe you could use a word list stored in a client-side Bloom filter.
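
For readers unfamiliar with the idea, a Bloom filter answers "is this word possibly in the list?" using a fixed-size bit array: it can give false positives but never misses a word that was added. The sketch below is illustrative only; the class, the FNV-1a-style hash over UTF-16 code units, the sizes and the three sample words are all assumptions, not part of any proposed patch.

```javascript
class BloomFilter {
    constructor( bitCount, hashCount ) {
        this.bitCount = bitCount;
        this.hashCount = hashCount;
        this.bits = new Uint8Array( Math.ceil( bitCount / 8 ) );
    }

    // FNV-1a-style 32-bit hash over UTF-16 code units, varied by a seed.
    static hash( word, seed ) {
        let h = ( 0x811c9dc5 ^ seed ) >>> 0;
        for ( let i = 0; i < word.length; i++ ) {
            h ^= word.charCodeAt( i );
            h = Math.imul( h, 0x01000193 ) >>> 0;
        }
        return h;
    }

    // Double hashing: derive hashCount bit positions from two base hashes.
    positions( word ) {
        const h1 = BloomFilter.hash( word, 0 );
        const h2 = ( BloomFilter.hash( word, 0x9e3779b9 ) | 1 ) >>> 0;
        const result = [];
        for ( let i = 0; i < this.hashCount; i++ ) {
            result.push( ( h1 + i * h2 ) % this.bitCount );
        }
        return result;
    }

    add( word ) {
        for ( const pos of this.positions( word ) ) {
            this.bits[ pos >>> 3 ] |= 1 << ( pos & 7 );
        }
    }

    // true means "possibly present"; false means "definitely absent".
    mightContain( word ) {
        return this.positions( word ).every(
            ( pos ) => ( this.bits[ pos >>> 3 ] & ( 1 << ( pos & 7 ) ) ) !== 0
        );
    }
}

const lexicon = new BloomFilter( 1024, 3 );
for ( const word of [ '中国', '北京', '大学' ] ) {
    lexicon.add( word );
}
console.log( lexicon.mightContain( '北京' ) ); // true: added words are never missed
```

The trade-off is size against accuracy: a smaller bit array ships faster to the client but reports more non-words as possible words.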

I don't know if it's as much of a problem in Japanese; you could probably use (?<=\P{Han})(?=\p{Han}) as a good start (i.e. there is a word break be.
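
The pattern above uses the `\p{Han}` shorthand found in some other regex engines. In JavaScript the equivalent would be spelled `\p{Script=Han}` with the `u` flag, and lookbehind needs an ES2018 engine, neither of which was available when this was written. A sketch under those assumptions (`splitBeforeHan` is a hypothetical name):

```javascript
// Zero-width match wherever a non-Han character is followed by a Han one.
const beforeHan = /(?<=\P{Script=Han})(?=\p{Script=Han})/u;

// Hypothetical helper: cut the text at every such position.
function splitBeforeHan( text ) {
    return text.split( beforeHan );
}

console.log( splitBeforeHan( '私は東京に行きます' ) );
// [ '私は', '東京に', '行きます' ]
```

The resulting chunks are only a rough heuristic: each one starts at a kanji and keeps the kana that follow it, so particles stay attached to the preceding word.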

As an incremental improvement I've expanded the letters and numbers groups to their Unicode categories: https://gerrit.wikimedia.org/r/#/c/53583/
We still need to think about which punctuation categories to add.
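
As an illustration of what widening to the letter and number categories buys, here is a sketch using the `\p{L}` and `\p{N}` property escapes of modern JavaScript. That syntax is an assumption for readability; it did not exist in browsers at the time, so the actual patch cannot have been written this way. `wordAt` is a hypothetical helper.

```javascript
// Runs of Unicode letters (L*) and numbers (N*).
const wordRun = /[\p{L}\p{N}]+/gu;

// Hypothetical helper: return the run of word characters that contains
// or touches the cursor offset, or '' if there is none.
function wordAt( text, offset ) {
    for ( const match of text.matchAll( wordRun ) ) {
        const start = match.index;
        const end = start + match[ 0 ].length;
        if ( offset >= start && offset <= end ) {
            return match[ 0 ];
        }
    }
    return '';
}

console.log( wordAt( 'Station Möckernbrücke', 12 ) ); // "Möckernbrücke"
console.log( wordAt( 'Привет мир', 2 ) );             // "Привет"
console.log( wordAt( "l'égalité", 4 ) );              // "égalité"
```

The last call shows why punctuation still needs a decision: the apostrophe is in a punctuation category, so it splits "l'égalité" in two.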

The Unicode standard has a fair amount to say on the matter. Ideally we would implement their standard.

http://www.unicode.org/reports/tr29/#Word_Boundaries
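
Current JavaScript runtimes offer `Intl.Segmenter`, whose word granularity is typically implemented on top of these rules with locale tailoring. It postdates this discussion, so the following is only a sketch of what relying on it could look like, assuming a runtime that ships `Intl.Segmenter`; `wordAtCursor` is a hypothetical helper name.

```javascript
// Hypothetical helper: return the word-like segment containing the
// code unit at `offset`, or '' when the cursor is on whitespace,
// punctuation, or outside the text.
function wordAtCursor( text, offset, locale = 'en' ) {
    const segmenter = new Intl.Segmenter( locale, { granularity: 'word' } );
    const segment = segmenter.segment( text ).containing( offset );
    // containing() returns undefined when offset is out of range.
    if ( !segment || !segment.isWordLike ) {
        return '';
    }
    return segment.segment;
}

console.log( wordAtCursor( 'Station Möckernbrücke', 12 ) );
```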

Like this: https://gerrit.wikimedia.org/r/#/c/54480 (well, apart from non-BMP characters...)
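
The non-BMP caveat comes from JavaScript strings being indexed by UTF-16 code unit: a character above U+FFFF occupies two units, and each half on its own is not a letter. A small sketch of the difference follows; `nextOffset` is a hypothetical helper, and the `u`-flag property escape assumes an ES2018 engine.

```javascript
const singleLetter = /^\p{L}$/u;

// U+10348 GOTHIC LETTER HWAIR lies outside the BMP, so it takes two code units.
const gothicSample = 'a\u{10348}b';

// Hypothetical helper: advance by one whole code point instead of one code unit.
function nextOffset( str, offset ) {
    const codePoint = str.codePointAt( offset );
    return offset + ( codePoint !== undefined && codePoint > 0xFFFF ? 2 : 1 );
}

console.log( gothicSample.length );                       // 4
console.log( singleLetter.test( gothicSample[ 1 ] ) );    // false: lone high surrogate
console.log( singleLetter.test(
    String.fromCodePoint( gothicSample.codePointAt( 1 ) )
) );                                                      // true
console.log( nextOffset( gothicSample, 1 ) );             // 3
```

Any boundary scan that steps one code unit at a time will therefore see a break in the middle of such a character unless it reads whole code points.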