nyxtom (Thomas Holloway)
User

Projects

User does not belong to any projects.

User Details

User Since
Nov 2 2015, 6:06 PM (299 w, 2 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
Nyxtom [ Global Accounts ]

Recent Activity

Nov 3 2015

nyxtom added a comment to T111179: Tokenization of "word" things for CJK.

Slightly unrelated but I thought this was interesting:

Nov 3 2015, 4:43 AM · Machine-Learning-Team (Active Tasks), Chinese-Sites, artificial-intelligence, revscoring
nyxtom added a comment to T111179: Tokenization of "word" things for CJK.

Thinking about this more, though, you have to consider that ambiguity in how text is segmented into "words" can hurt information retrieval quality.
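
To make the ambiguity concrete, here is a minimal sketch that enumerates every way a string can be split into words from a small dictionary. The function name `segmentations`, the `VOCAB` set, and the sample string are illustrative assumptions, not part of revscoring or any library mentioned in the task.

```python
def segmentations(text, vocab):
    """Return every way to split `text` into words drawn from `vocab`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            # Keep this prefix and segment the remainder recursively.
            for tail in segmentations(text[end:], vocab):
                results.append([head] + tail)
    return results


# Hypothetical toy vocabulary, chosen only to trigger an ambiguity.
VOCAB = {"研究", "研究生", "生命", "命", "起源"}

for words in segmentations("研究生命起源", VOCAB):
    print(" / ".join(words))
```

Both candidate splits are valid against the dictionary, but they index different terms ("研究" and "生命" versus "研究生" and "命"), so a query for one reading can miss documents segmented the other way.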

Nov 3 2015, 4:35 AM · Machine-Learning-Team (Active Tasks), Chinese-Sites, artificial-intelligence, revscoring

Nov 2 2015

nyxtom added a comment to T111179: Tokenization of "word" things for CJK.

At first glance, I would say we could use treebanks such as https://catalog.ldc.upenn.edu/LDC2013T21 for Chinese; I'm not sure about the other languages. Alternatively, http://cjklib.org/0.3/ may be worth looking into as a starting point.

Nov 2 2015, 7:01 PM · Machine-Learning-Team (Active Tasks), Chinese-Sites, artificial-intelligence, revscoring
nyxtom added a comment to T111179: Tokenization of "word" things for CJK.

Perhaps consider a Hidden Markov Model implementation. I believe Lucene 3.0 uses this approach for its CJK tokenization.
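
As a rough sketch of what an HMM-based segmenter could look like, the code below tags each character as B, M, E, or S (begin, middle, or end of a multi-character word, or a single-character word) and decodes the most likely tag sequence with the Viterbi algorithm. This is not Lucene's implementation. The probability tables `START`, `TRANS`, and `EMIT`, the `FLOOR` smoothing value, and the function names are all hand-picked assumptions; a real model would estimate the parameters from a segmented corpus.

```python
import math

STATES = "BMES"  # Begin / Middle / End of a word, or Single-character word

# Hypothetical toy parameters, hand-picked for illustration only.
START = {"B": 0.6, "S": 0.4}            # a text cannot start on M or E
TRANS = {
    "B": {"M": 0.3, "E": 0.7},
    "M": {"M": 0.3, "E": 0.7},
    "E": {"B": 0.6, "S": 0.4},
    "S": {"B": 0.6, "S": 0.4},
}
EMIT = {
    "B": {"研": 0.4, "生": 0.3, "起": 0.3},
    "M": {"究": 1.0},
    "E": {"究": 0.4, "命": 0.3, "源": 0.3},
    "S": {"生": 0.1, "命": 0.1, "的": 0.8},
}
FLOOR = 1e-6  # assumed smoothing for unseen (state, character) pairs


def log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(text):
    """Return the most likely BMES tag string for `text`."""
    if not text:
        return ""
    # Each column maps state -> (best log-probability, previous state).
    best = [{s: (log(START.get(s, 0)) + log(EMIT[s].get(text[0], FLOOR)), None)
             for s in STATES}]
    for ch in text[1:]:
        column = {}
        for s in STATES:
            emit = log(EMIT[s].get(ch, FLOOR))
            # Disallowed transitions get probability 0, i.e. log = -inf.
            column[s] = max(
                (best[-1][p][0] + log(TRANS[p].get(s, 0)) + emit, p)
                for p in STATES
            )
        best.append(column)
    # A well-formed segmentation must end on E or S.
    state = max("ES", key=lambda s: best[-1][s][0])
    tags = [state]
    for column in reversed(best[1:]):
        state = column[state][1]  # follow the back-pointer
        tags.append(state)
    return "".join(reversed(tags))


def segment(text):
    """Split `text` into words at every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(text, viterbi(text)):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    return words


print(segment("研究生命起源"))
```

With these toy numbers the decoder prefers 研究 / 生命 / 起源, but that only reflects the hand-picked tables; different parameters would give a different split.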

Nov 2 2015, 6:09 PM · Machine-Learning-Team (Active Tasks), Chinese-Sites, artificial-intelligence, revscoring