
Tokenization of "word" things for CJK
Open, Lowest, Public

Description

Figure out a useful tokenization strategy for CJK languages.

Event Timeline

Halfak created this task.Sep 2 2015, 1:56 PM
Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak moved this task to Active on the Scoring-platform-team (Current) board.
Halfak added a subscriber: Halfak.
Restricted Application added a subscriber: Aklapper.Sep 2 2015, 1:56 PM
Liuxinyu970226 added a subscriber: Liuxinyu970226.
nyxtom added a subscriber: nyxtom.Nov 2 2015, 6:09 PM

Perhaps consider a Hidden Markov Model implementation. I believe Lucene 3.0 uses this approach for its CJK tokenization.
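For concreteness, here's a minimal sketch of what an HMM segmenter could look like in Python, using the common B/M/E/S character-tagging scheme and Viterbi decoding. All the probabilities below are toy placeholders; real values would be estimated from a segmented corpus.

```python
import math

# States: B(egin), M(iddle), E(nd) of a multi-character word, S(ingle-char word).
STATES = "BMES"

# Toy parameters -- in practice these would be estimated from a segmented corpus.
START_P = {"B": 0.6, "M": 0.0, "E": 0.0, "S": 0.4}
TRANS_P = {
    "B": {"B": 0.0, "M": 0.3, "E": 0.7, "S": 0.0},
    "M": {"B": 0.0, "M": 0.3, "E": 0.7, "S": 0.0},
    "E": {"B": 0.6, "M": 0.0, "E": 0.0, "S": 0.4},
    "S": {"B": 0.6, "M": 0.0, "E": 0.0, "S": 0.4},
}

def emit_p(state, char):
    """Placeholder emission model; a real one maps (state, char) to the
    probability of seeing `char` in that word position."""
    return 1.0 / len(STATES)

def log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(text):
    """Most likely B/M/E/S tag sequence for `text`."""
    scores = {s: log(START_P[s]) + log(emit_p(s, text[0])) for s in STATES}
    paths = {s: [s] for s in STATES}
    for char in text[1:]:
        new_scores, new_paths = {}, {}
        for s in STATES:
            best_score, best_prev = max(
                (scores[p] + log(TRANS_P[p][s]) + log(emit_p(s, char)), p)
                for p in STATES)
            new_scores[s] = best_score
            new_paths[s] = paths[best_prev] + [s]
        scores, paths = new_scores, new_paths
    # A valid segmentation must end on a word boundary (E or S).
    return paths[max("ES", key=lambda s: scores[s])]

def segment(text):
    """Cut `text` into words at E/S tag boundaries."""
    words, start = [], 0
    for i, tag in enumerate(viterbi(text)):
        if tag in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    return words
```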

Halfak added a comment.Nov 2 2015, 6:31 PM

Know where we could find one of those in Python? I suppose we could also build our own if we had a sufficiently comprehensive set of words to learn the transition probabilities from.
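For what it's worth, given a pre-segmented corpus the parameters fall out directly from counting. A rough sketch, assuming the corpus is an iterable of sentences, each a list of already-segmented words:

```python
from collections import Counter, defaultdict

def word_tags(word):
    """Map a word to B/M/E/S tags, one per character."""
    if len(word) == 1:
        return "S"
    return "B" + "M" * (len(word) - 2) + "E"

def estimate(corpus):
    """Count tag-transition and character-emission frequencies from
    segmented sentences; return normalized probability tables."""
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sentence in corpus:
        prev = None
        for word in sentence:
            for char, tag in zip(word, word_tags(word)):
                emit[tag][char] += 1
                if prev is not None:
                    trans[prev][tag] += 1
                prev = tag

    def normalize(table):
        return {k: {x: n / sum(c.values()) for x, n in c.items()}
                for k, c in table.items()}

    return normalize(trans), normalize(emit)

# e.g. estimate([["中国", "人"], ["我", "爱", "北京"]])
```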

nyxtom added a comment.Nov 2 2015, 7:00 PM

At first glance, I would say we could use treebanks such as https://catalog.ldc.upenn.edu/LDC2013T21 for Chinese; I'm not sure about the other languages. Alternatively, there's http://cjklib.org/0.3/, which may be worth looking into as a starting point.

nyxtom added a comment.Nov 3 2015, 4:35 AM

Thinking about this more, though, you have to consider that ambiguity of meaning when segmenting "words" can hurt information retrieval.

https://www.hathitrust.org/blogs/large-scale-search/multilingual-issues-part-1-word-segmentation

That being said, I came across an article about using Wikipedia as a resource for n-gram mutual-information word segmentation of Chinese. The method could potentially be applied to other languages.

http://www.cs.otago.ac.nz/homepages/andrew/papers/2009-9.pdf
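The core of that idea is easy to sketch: compute pointwise mutual information for adjacent character pairs over a large corpus, then cut wherever the PMI drops below a threshold. A rough illustration in Python; the corpus, counts, and threshold here are stand-ins for values you'd derive from a Wikipedia dump:

```python
import math
from collections import Counter

def train_pmi(corpus_text):
    """Build a PMI function from character unigram/bigram counts."""
    unigrams = Counter(corpus_text)
    bigrams = Counter(corpus_text[i:i + 2]
                      for i in range(len(corpus_text) - 1))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

    def pmi(a, b):
        p_ab = bigrams[a + b] / n_bi
        if p_ab == 0:
            return float("-inf")
        return math.log(p_ab / ((unigrams[a] / n_uni) *
                                (unigrams[b] / n_uni)))

    return pmi

def segment(text, pmi, threshold=0.0):
    """Cut `text` between characters whose PMI falls below `threshold`."""
    words, start = [], 0
    for i in range(len(text) - 1):
        if pmi(text[i], text[i + 1]) < threshold:
            words.append(text[start:i + 1])
            start = i + 1
    words.append(text[start:])
    return words
```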

nyxtom added a comment.Nov 3 2015, 4:43 AM

Slightly unrelated but I thought this was interesting:

http://batterseapower.github.io/pinyin-toolkit/

And it leverages cjklib :)

Halfak moved this task from Untriaged to Ideas on the Scoring-platform-team board.

We can probably use character n-grams in hashing vectorization to capture this type of signal. That might be easier than explicitly splitting words. See T128087.

Then again, splitting words would be good for dictionary lookups.
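For reference, this is roughly what the n-gram hashing idea looks like with scikit-learn's HashingVectorizer; the n-gram range and hash width below are illustrative, not tuned:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Character 2-3 grams sidestep word segmentation entirely: every
# overlapping character window becomes a feature, hashed into a
# fixed-width sparse vector with no vocabulary to store.
vectorizer = HashingVectorizer(
    analyzer="char",      # operate on characters, not whitespace tokens
    ngram_range=(2, 3),   # overlapping 2- and 3-character windows
    n_features=2 ** 18,   # fixed hash space
)

X = vectorizer.transform(["中文维基百科是维基百科协作计划的中文版本"])
print(X.shape)  # (1, 262144) sparse matrix
```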

Halfak triaged this task as Lowest priority.Aug 18 2016, 2:43 PM
Restricted Application added a project: artificial-intelligence.Jun 3 2017, 3:54 AM
Viztor added a subscriber: Viztor.Jun 18 2019, 8:20 AM