
Add Japanese segmentation to cxserver
Closed, Resolved (Public)

Description

Japanese uses non-Western sentence separators, so it needs its own segmentation algorithm.

AFAIK, its needs are almost the same as those of Hindi, but instead of Danda ("।") it needs an ideographic full stop ("。"). This is what I made in https://gerrit.wikimedia.org/r/#/c/189984/ .

Niklas also suggests considering "「", "」" and "．". AFAIK, "「" and "」" are like quotation marks and not sentence separators. As for "．", I'll need to consult somebody who knows Japanese; my hunch is that it's not needed.
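To illustrate the idea, here is a minimal sketch in Python. It is not the code from the Gerrit change; the `TERMINATORS` table and the `split_sentences` helper are hypothetical names made up for this example. It assumes that a sentence ends right after a single language-specific terminator character, and that the only difference between Hindi and Japanese is which character that is.

```python
import re

# Hypothetical per-language terminator table (illustration only).
TERMINATORS = {
    "hi": "\u0964",  # Danda "।"
    "ja": "\u3002",  # ideographic full stop "。"
}


def split_sentences(text: str, lang: str) -> list[str]:
    """Split text into sentences, keeping each terminator with its sentence."""
    terminator = re.escape(TERMINATORS[lang])
    # One sentence = a run of non-terminator characters plus an optional terminator.
    pieces = re.findall(rf"[^{terminator}]+{terminator}?", text)
    return [piece.strip() for piece in pieces if piece.strip()]


print(split_sentences("これはペンです。あれは本です。", "ja"))
# ['これはペンです。', 'あれは本です。']
```

Under this assumption, supporting another language with the same sentence structure only takes a new entry in the table.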

Event Timeline

Amire80 raised the priority of this task from to Medium.
Amire80 updated the task description.
Amire80 added subscribers: Amire80, dchan, Jsahleen, santhosh.

Change 189984 had a related patch set uploaded (by Amire80):
Add Japanese segmentation

https://gerrit.wikimedia.org/r/189984

Patch-For-Review

Change 189984 merged by jenkins-bot:
Add Japanese segmentation

https://gerrit.wikimedia.org/r/189984

A comment from a native Japanese speaker: Generally, the ideographic full stop "。" is the equivalent of a Latin-script full stop and is the most common sentence separator. (The same goes for Chinese.)

"「" and "」" are indeed the equivalents of quotation marks.

The fullwidth full stop "．" is also a valid way to end a sentence in horizontal writing mode for Japanese, but it is uncommon outside certain technical documents. The distinction is a manual-of-style matter. Japanese Wikipedia uses the ideographic full stop "。", as does common usage in general. Although one cannot rule out the possibility of encountering the fullwidth form, I would say that it would be extremely rare.
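A rough Python sketch of how these points could translate into a splitter: "。" always ends a sentence, "．" (U+FF0E) does so only when explicitly enabled, and "「" / "」" never do. The function name `split_japanese` and the `include_fullwidth` flag are hypothetical, chosen for this example only; this is not the merged cxserver code, and it does not try to handle a full stop that appears inside a quotation.

```python
import re

IDEOGRAPHIC_FULL_STOP = "\u3002"  # "。"
FULLWIDTH_FULL_STOP = "\uff0e"    # "．"


def split_japanese(text: str, include_fullwidth: bool = False) -> list[str]:
    """Split Japanese text after "。", and after "．" only if requested."""
    terminators = IDEOGRAPHIC_FULL_STOP
    if include_fullwidth:
        terminators += FULLWIDTH_FULL_STOP
    # "「" and "」" are not in the terminator set, so they never split a sentence.
    pattern = rf"[^{terminators}]+[{terminators}]?"
    return [piece.strip() for piece in re.findall(pattern, text) if piece.strip()]


quoted = "「こんにちは」と言った。元気です。"
print(split_japanese(quoted))
# ['「こんにちは」と言った。', '元気です。']

mixed = "これはテストです．次の文です。"
print(split_japanese(mixed))
# ['これはテストです．次の文です。']
print(split_japanese(mixed, include_fullwidth=True))
# ['これはテストです．', '次の文です。']
```

Leaving the fullwidth form off by default matches the observation that it is extremely rare, while the flag covers the technical documents that do use it.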

Amire80 claimed this task.

Thanks for the comments, @Asahiko.

This was actually done long ago, and I forgot to close it :)