
Word breaking rules used in VisualEditor do not work optimally in Japanese
Open, Low, Public

Description

Japanese text almost never uses spaces, so VisualEditor falls back to other general word separation rules. However, there's a flaw in the way these general rules apply to Japanese. Take, for example, the word «大学», which means "university" and is composed of two Chinese characters that individually mean "big" («大») and "study" («学») in Japanese. When you place your cursor at the end of this word and click the link button, only the second character is selected as the link text. Japanese users feel both should be selected.

I talked to @HaithamS about this, and he said he found this behavior annoying as well. His proposed solution was that when looking for a word in Japanese, the word should be the longest continuous run of characters from any one character set (hiragana, katakana, or Chinese characters). This is essentially the extra word separation rule that CLDR adds for Japanese.
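To make the proposal concrete, here is a minimal sketch of the "longest character set run" rule. It is not VisualEditor's code: the function names `script_of` and `longest_run` are hypothetical, and the script classification only covers a few common Unicode blocks.

```python
def script_of(ch: str) -> str:
    """Classify a character by script using a few common Unicode blocks."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "kanji"
    return "other"


def longest_run(text: str, cursor: int) -> tuple[int, int]:
    """Return (start, end) of the same-script run around the cursor.

    `cursor` is an offset between characters; the character just before
    it decides which script the run belongs to.
    """
    if not 0 < cursor <= len(text):
        return (cursor, cursor)
    script = script_of(text[cursor - 1])
    if script == "other":
        return (cursor, cursor)
    # Extend backwards while the preceding character shares the script.
    start = cursor - 1
    while start > 0 and script_of(text[start - 1]) == script:
        start -= 1
    # Extend forwards while the following character shares the script.
    end = cursor
    while end < len(text) and script_of(text[end]) == script:
        end += 1
    return (start, end)


start, end = longest_run("私は大学に行く", 4)  # cursor just after 学
print("私は大学に行く"[start:end])  # 大学
```

With the cursor at the end of 大学, this selects both characters rather than only 学.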

Is there a way we can address this without resorting to project-specific or language-specific word separation rules? Possibly not; for example, this rule for dealing with Chinese characters probably doesn't apply in Mandarin or Cantonese. If not, would it be crazy to have language-specific rules, with the language of a piece of text determined by its language annotation or, failing that, the project language?

Event Timeline

nshahquinn-wmf raised the priority of this task from to Low.
nshahquinn-wmf updated the task description.
nshahquinn-wmf renamed this task from Word break rules used to determine link text in VisualEditor do not work optimally in Japanese to Word breaking rules used in VisualEditor do not work optimally in Japanese. Jun 10 2015, 12:33 AM
nshahquinn-wmf set Security to None.

I totally agree the current behaviour's flawed for Japanese (and Chinese too). But as you say, the proposed rule from Japanese CLDR probably cannot be implemented in a language-independent way, as it would cause problems if it applied to Chinese text.

Is "longest character set run" actually a good rule for Japanese? Or just the least-bad simple rule? Browsing jawiki randomly, I see many long blocks of kanji that don't seem to form a single word (or sometimes even a coherent phrase), e.g.: "東京都新宿区早稲田町", "外国為替証拠金取引", "異常事態発生". I can't read the runs of kana so I can't comment on those :-)

I note that ICU uses a unified CJK dictionary and dynamic programming algorithms to break Chinese character runs, thus avoiding the need for language specificity. That might be challenging to do efficiently on the client, though :-/
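For illustration only, the general idea of dictionary-driven breaking with dynamic programming can be sketched as below. This is not ICU's actual algorithm or data: `TOY_DICTIONARY`, the cost values, and the `segment` function are all assumptions made up for the example.

```python
# Toy dictionary (hypothetical); a real one would hold far more entries,
# typically with frequency information.
TOY_DICTIONARY = {"東京", "東京都", "新宿", "新宿区", "都", "区"}


def segment(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Split text into a minimum-cost sequence of pieces.

    best[i] holds (cost, previous_index) for the cheapest split of text[:i].
    A dictionary word costs 1; an unknown single character costs 10.
    """
    inf = float("inf")
    best = [(inf, -1)] * (len(text) + 1)
    best[0] = (0, -1)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in dictionary:
                cost = 1
            elif len(piece) == 1:
                cost = 10  # fallback so every position stays reachable
            else:
                continue
            total = best[start][0] + cost
            if total < best[end][0]:
                best[end] = (total, start)
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    i = len(text)
    while i > 0:
        start = best[i][1]
        pieces.append(text[start:i])
        i = start
    return pieces[::-1]


print(segment("東京都新宿区", TOY_DICTIONARY))  # ['東京都', '新宿区']
```

The inner loop is bounded by `max_len`, so the cost grows linearly with the text length; the expensive part on the client would more likely be shipping and loading a realistic dictionary.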

Language-specific word breaking rules wouldn't be crazy, but it's worth considering how much benefit the work would bring. If this is for links in particular, would it be better to do UI work to make it easier to change the extent of the link?

Is "longest character set run" actually a good rule for Japanese? Or just the least-bad simple rule?

It's not really a good rule. A lot of words are made of a kanji character followed by hiragana characters, like 書く (kaku, write). If you put the cursor after く then press the link button, VE should select 書く, not just く. It should also select 書く if you put the cursor after 書.

Other words are followed by particles, which don't belong to the word but don't belong to the following text either. For example, take the sentence 鳥がたくさん. This sentence is made from 鳥 (tori, bird), が (ga, the subject-marking particle), and たくさん (takusan, a lot). The sentence means "There are a lot of birds". If you put the cursor after 鳥 then VE should select just 鳥; if you put the cursor after が then VE should select just が, or maybe 鳥が; and if you put the cursor after た, く, さ, or ん then it should select たくさん. Selecting がたくさん would be a mistake, as that isn't a word.

There are also many compound words that have both kanji and hiragana in the middle, for example 歩き出す (arukidasu, to start walking). This whole word should be selected, whether you put the cursor after 歩, き, 出, or す.

You're also correct about long kanji runs. These don't necessarily form a single word, and you can also join together prefixes, suffixes, and whole words to make new words. For example, you can have 東京 (Toukyou, Tokyo), 東京都 (Toukyou-to, Tokyo City), and 東京都民 (Toukyou-tomin, Tokyo City residents). Whether 東京都民 is one word or two depends on your point of view.

Katakana, thankfully, will generally adhere to the "longest character set run" rule.

A mildly better rule than "longest character set run" might be "longest run of kanji followed by longest run of hiragana, or else longest character set run". However, neither of these will be accurate for a significant portion of the input. To get good results you would probably need to use a dictionary.
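A rough sketch of the "kanji run followed by hiragana run, or else longest character set run" heuristic, using a regular expression over a few common Unicode blocks. The names `WORD_RE` and `word_at` are hypothetical, and this is only meant to show where the rule succeeds and where it fails on the examples above.

```python
import re

# One "word" under this heuristic: a kanji run plus any trailing hiragana,
# or else a plain run of hiragana or katakana.
WORD_RE = re.compile(
    r"[\u4E00-\u9FFF]+[\u3040-\u309F]*"  # kanji run followed by hiragana run
    r"|[\u3040-\u309F]+"                 # hiragana-only run
    r"|[\u30A0-\u30FF]+"                 # katakana run
)


def word_at(text: str, cursor: int) -> str:
    """Return the heuristic word that ends at or contains the cursor offset."""
    for match in WORD_RE.finditer(text):
        if match.start() < cursor <= match.end():
            return match.group()
    return ""


print(word_at("書く", 2))          # 書く (correct)
print(word_at("鳥がたくさん", 1))  # 鳥がたくさん (too much: should be 鳥)
print(word_at("歩き出す", 3))      # 出す (too little: should be 歩き出す)
```

The second and third cases show the kinds of input where a character-class rule alone falls short, which is why a dictionary is probably needed for good results.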