Maniphest T101917

Word breaking rules used in VisualEditor do not work optimally in Japanese
Open, Low, Public

Assigned To: None

Authored By: nshahquinn-wmf, Jun 10 2015, 12:32 AM

Description

Japanese text almost never uses spaces, so VisualEditor falls back to other, more general word separation rules. However, these general rules handle Japanese poorly. Take, for example, the word «大学», which means "university" and is composed of two Chinese characters which individually in Japanese mean "big" («大») and "study" («学»). When you place your cursor at the end of this word and click the link button, only the second character is selected as the link text. Japanese users expect both characters to be selected.

I talked to @HaithamS about this, and he said he found this behavior annoying as well. His proposed solution was that when looking for a word in Japanese, the word should be the longest continuous run of characters from any one character set (hiragana, katakana, or Chinese characters). This is essentially the extra word separation rule that CLDR adds for Japanese.
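As a rough illustration of the "longest run from one character set" rule described above, here is a minimal Python sketch. It is not VisualEditor's code; the function names `script_of` and `script_run_at` are hypothetical, and the code point ranges are a simplification (they ignore half-width katakana, CJK extension blocks, and similar cases).

```python
from typing import Tuple


def script_of(ch: str) -> str:
    """Classify a character into a coarse script bucket (simplified ranges)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    return "other"


def script_run_at(text: str, cursor: int) -> Tuple[int, int]:
    """Return (start, end) of the same-script run around the cursor.

    The cursor is an offset between characters; the character just
    before it decides which script the run belongs to.
    """
    if not 0 < cursor <= len(text):
        return (cursor, cursor)
    script = script_of(text[cursor - 1])
    if script == "other":
        return (cursor, cursor)
    # Extend backwards while the script stays the same.
    start = cursor - 1
    while start > 0 and script_of(text[start - 1]) == script:
        start -= 1
    # Extend forwards while the script stays the same.
    end = cursor
    while end < len(text) and script_of(text[end]) == script:
        end += 1
    return (start, end)


text = "私は大学に行く"
start, end = script_run_at(text, 4)  # cursor placed right after 学
print(text[start:end])
```

With the cursor after «学», this sketch selects both characters of «大学» rather than only the second one.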

Is there a way we can address this without resorting to project-specific or language-specific word separation rules? Possibly not; for example, this rule for dealing with Chinese characters probably doesn't apply in Mandarin or Cantonese. If not, would it be crazy to have language-specific rules, with the language of a piece of text determined by its language annotation or, failing that, the project language?
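To make the language-specific option above concrete, here is a small Python sketch of how a rule could be chosen from a text's language annotation, falling back to the project language. Everything here is an assumption for illustration: `pick_breaker`, `BREAKERS`, and both placeholder rules are hypothetical and do not correspond to any existing VisualEditor API.

```python
from typing import Callable, Dict, Optional, Tuple

# A breaker maps (text, cursor offset) to the (start, end) of the "word".
Breaker = Callable[[str, int], Tuple[int, int]]


def default_breaker(text: str, cursor: int) -> Tuple[int, int]:
    """Placeholder generic rule: expand over non-space characters."""
    start = cursor
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    end = cursor
    while end < len(text) and not text[end].isspace():
        end += 1
    return (start, end)


def japanese_breaker(text: str, cursor: int) -> Tuple[int, int]:
    """Stand-in for a Japanese-specific rule (here: the preceding character)."""
    return (max(cursor - 1, 0), cursor)


# Hypothetical registry keyed by primary language subtag.
BREAKERS: Dict[str, Breaker] = {"ja": japanese_breaker}


def pick_breaker(annotation_lang: Optional[str], project_lang: str) -> Breaker:
    """Prefer the text's language annotation, else the project language."""
    lang = annotation_lang or project_lang
    primary = lang.split("-")[0].lower()
    return BREAKERS.get(primary, default_breaker)


# Unannotated text on a Japanese project gets the Japanese rule;
# text annotated as Chinese does not, even on a Japanese project.
print(pick_breaker(None, "ja").__name__)
print(pick_breaker("zh-Hant", "ja").__name__)
```

The point of the lookup order is that Chinese text quoted on a Japanese project would not pick up the Japanese rule, provided it carries a language annotation.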

Related Objects

Mentioned In: T128060: VisualEditor makes it easy to create partially linked words, when the user expects a fully linked one
T108566: Non-finalised input is added to the document when using the Anthy Japanese IME with VisualEditor
T109818: Make VisualEditor good enough to use with all common IMEs for Japanese

Event Timeline

nshahquinn-wmf created this task. Jun 10 2015, 12:32 AM

nshahquinn-wmf raised the priority of this task from to Low.

nshahquinn-wmf updated the task description. (Show Details)

nshahquinn-wmf added projects: VisualEditor, VisualEditor-ContentLanguage.

nshahquinn-wmf added subscribers: nshahquinn-wmf, dchan, HaithamS.

Restricted Application added a subscriber: Aklapper. Jun 10 2015, 12:32 AM

nshahquinn-wmf renamed this task from "Word break rules used to determine link text in VisualEditor do not work optimally in Japanese" to "Word breaking rules used in VisualEditor do not work optimally in Japanese". Jun 10 2015, 12:33 AM

nshahquinn-wmf set Security to None.

I totally agree the current behaviour's flawed for Japanese (and Chinese too). But as you say, the proposed rule from Japanese CLDR probably cannot be implemented in a language-independent way, as it would cause problems if it applied to Chinese text.

Is "longest character set run" actually a good rule for Japanese? Or just the least-bad simple rule? Browsing jawiki randomly, I see many long blocks of kanji that don't seem to form a single word (or sometimes even a coherent phrase), e.g.: "東京都新宿区早稲田町", "外国為替証拠金取引", "異常事態発生". I can't read the runs of kana so I can't comment on those :-)

I note that ICU uses a unified CJK dictionary and dynamic programming algorithms to break Chinese character runs, thus avoiding the need for language specificity. That might be challenging to do efficiently on the client, though :-/
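To show the general shape of dictionary-based breaking with dynamic programming, here is a toy Python sketch. It is not ICU's implementation: the dictionary entries, the cost values, and the names `TOY_DICT`, `UNKNOWN_CHAR_COST`, and `segment` are all made up for illustration. A real dictionary would be far larger, which is where the client-side efficiency concern comes from.

```python
from typing import Dict, List

# Hypothetical toy dictionary: word -> cost (lower means more likely).
TOY_DICT: Dict[str, float] = {
    "東京": 1.0,
    "東京都": 1.5,
    "新宿": 1.0,
    "新宿区": 1.5,
    "都": 2.0,
    "区": 2.0,
    "異常": 1.0,
    "事態": 1.0,
    "発生": 1.0,
}
# Assumed penalty for a character not covered by any dictionary word.
UNKNOWN_CHAR_COST = 10.0


def segment(text: str, dictionary: Dict[str, float] = TOY_DICT) -> List[str]:
    """Split text into the lowest-cost sequence of dictionary words."""
    n = len(text)
    max_len = max(map(len, dictionary), default=1)
    best = [0.0] + [float("inf")] * n  # best[i]: cheapest cost of text[:i]
    back = [0] * (n + 1)               # back[i]: start of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in dictionary:
                cost = dictionary[piece]
            elif end - start == 1:
                cost = UNKNOWN_CHAR_COST  # fall back to a single character
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("異常事態発生"))
print(segment("東京都新宿区"))
```

With this toy dictionary, "異常事態発生" splits into three two-character words and "東京都新宿区" into 東京都 and 新宿区, so a cursor inside a long kanji run would map to one of those pieces rather than to the whole run.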

Language-specific word breaking rules wouldn't be crazy, but it's worth considering how much benefit the work would bring. If this is for links in particular, would it be better to do UI work to make it easier to change the extent of the link?

Elitre subscribed. Jun 14 2015, 5:52 PM

Jdforrester-WMF moved this task from To Triage to Freezer on the VisualEditor board. Jun 18 2015, 12:55 AM

In T101917#1352218, @dchan wrote:

Is "longest character set run" actually a good rule for Japanese? Or just the least-bad simple rule?

It's not really a good rule. A lot of words are made of a kanji character followed by hiragana characters, like 書く (kaku, write). If you put the cursor after く then press the link button, VE should select 書く, not just く. It should also select 書く if you put the cursor after 書.

Other words are followed by particles, which don't belong to the word but don't belong to the following text either. For example, take the sentence 鳥がたくさん. This sentence is made from 鳥 (tori, bird), が (ga, a particle that marks the subject), and たくさん (takusan, a lot). The sentence means "There are a lot of birds". If you put the cursor after 鳥 then VE should select just 鳥; if you put the cursor after が then VE should select just が, or maybe 鳥が; and if you put the cursor after た, く, さ, or ん then it should select たくさん. Selecting がたくさん would be a mistake, as that isn't a word.

There are also many compound words that have both kanji and hiragana in the middle, for example 歩き出す (arukidasu, to start walking). This whole word should be selected, whether you put the cursor after 歩, き, 出, or す.

You're also correct about long kanji runs. These don't necessarily form a single word, and you can also join together prefixes, suffixes, and whole words to make new words. For example, you can have 東京 (Toukyou, Tokyo), 東京都 (Toukyou-to, Tokyo City), and 東京都民 (Toukyou-tomin, Tokyo City residents). Whether 東京都民 is one word or two depends on your point of view.

Katakana, thankfully, will generally adhere to the "longest character set run" rule.

A mildly better rule than "longest character set run" might be "longest run of kanji followed by longest run of hiragana, or else longest character set run". However, neither of these will be accurate for a significant portion of the input. To get good results you would probably need to use a dictionary.
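The "kanji run followed by hiragana run, or else longest character set run" rule could be sketched in Python roughly as follows. The names `script_of`, `run_bounds`, and `kanji_kana_word_at` are hypothetical, the code point ranges are simplified, and this is only one possible reading of the rule.

```python
from typing import Tuple


def script_of(ch: str) -> str:
    """Classify a character into a coarse script bucket (simplified ranges)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    return "other"


def run_bounds(text: str, index: int) -> Tuple[int, int]:
    """Return (start, end) of the same-script run containing text[index]."""
    script = script_of(text[index])
    start = index
    while start > 0 and script_of(text[start - 1]) == script:
        start -= 1
    end = index + 1
    while end < len(text) and script_of(text[end]) == script:
        end += 1
    return (start, end)


def kanji_kana_word_at(text: str, cursor: int) -> Tuple[int, int]:
    """Select a kanji run plus the hiragana run after it, else a plain run."""
    if not 0 < cursor <= len(text):
        return (cursor, cursor)
    script = script_of(text[cursor - 1])
    if script == "other":
        return (cursor, cursor)
    start, end = run_bounds(text, cursor - 1)
    if script == "han" and end < len(text) and script_of(text[end]) == "hiragana":
        # Cursor in the kanji part: also take the hiragana that follows.
        _, end = run_bounds(text, end)
    elif script == "hiragana" and start > 0 and script_of(text[start - 1]) == "han":
        # Cursor in the hiragana part: also take the kanji that precedes.
        start, _ = run_bounds(text, start - 1)
    return (start, end)


for text, cursor in [("書く", 1), ("書く", 2), ("歩き出す", 1), ("鳥がたくさん", 3)]:
    start, end = kanji_kana_word_at(text, cursor)
    print(text, cursor, text[start:end])
```

This handles 書く from either cursor position, but it still returns only 歩き for 歩き出す and the whole of 鳥がたくさん when the cursor is after た, which matches the point that neither simple rule is accurate for much of the input.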

Trizek-WMF mentioned this in T109818: Make VisualEditor good enough to use with all common IMEs for Japanese. Aug 28 2015, 8:58 AM

Trizek-WMF added a parent task: T109818: Make VisualEditor good enough to use with all common IMEs for Japanese. Aug 31 2015, 4:59 PM

MrStradivarius mentioned this in T108566: Non-finalised input is added to the document when using the Anthy Japanese IME with VisualEditor. Sep 24 2015, 3:25 PM

Jdforrester-WMF removed a parent task: T109818: Make VisualEditor good enough to use with all common IMEs for Japanese. Dec 8 2015, 6:40 PM

Elitre mentioned this in T128060: VisualEditor makes it easy to create partially linked words, when the user expects a fully linked one. Mar 11 2016, 12:48 PM

nshahquinn-wmf added a project: I18n. Jan 22 2018, 9:38 PM
