Nov 3 2015
Slightly unrelated, but I thought this was interesting:
Thinking about this more, though, you have to consider that ambiguity when segmenting "words" can lead to poor information retrieval: if a query is segmented differently from the indexed text, the terms simply won't match.
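To make that concrete, here's a toy sketch: greedy forward vs. backward maximum matching over a tiny dictionary can produce two different, equally plausible segmentations of the same string. The dictionary entries, string, and max word length below are illustrative assumptions, not from any real lexicon:

```python
# Toy dictionary; entries are assumptions for illustration only.
DICT = {"研究", "研究生", "生命", "命"}
MAX_LEN = 3  # longest dictionary entry, in characters

def forward_max_match(text):
    """Greedy left-to-right longest-match segmentation."""
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single char.
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in DICT or j == i + 1:
                out.append(text[i:j])
                i = j
                break
    return out

def backward_max_match(text):
    """Greedy right-to-left longest-match segmentation."""
    out, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - MAX_LEN), j):
            if text[i:j] in DICT or i == j - 1:
                out.insert(0, text[i:j])
                j = i
                break
    return out

print(forward_max_match("研究生命"))   # → ['研究生', '命']
print(backward_max_match("研究生命"))  # → ['研究', '生命']
```

Both outputs are valid dictionary segmentations, so an index built one way and a query segmented the other way would fail to match.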
Nov 2 2015
At first glance, I would say we could use treebanks such as https://catalog.ldc.upenn.edu/LDC2013T21 for Chinese; I'm not sure about the other languages. Alternatively, there's http://cjklib.org/0.3/, which may be worth looking into as a starting point.
Perhaps consider a Hidden Markov Model implementation; I believe Lucene 3.0 uses this approach for its CJK tokenization.
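As a rough sketch of the HMM idea (not Lucene's actual implementation): character-based segmentation can be framed as tagging each character B/M/E/S (begin/middle/end of a word, or single-char word) and decoding with Viterbi. Every probability below is a made-up toy value, and the emission table covers only this one example sentence:

```python
import math

STATES = ["B", "M", "E", "S"]  # Begin / Middle / End of word, Single-char word

# Toy start and transition probabilities (assumptions, not trained values).
start_p = {"B": 0.6, "M": 0.0, "E": 0.0, "S": 0.4}
trans_p = {
    "B": {"B": 0.0, "M": 0.3, "E": 0.7, "S": 0.0},
    "M": {"B": 0.0, "M": 0.3, "E": 0.7, "S": 0.0},
    "E": {"B": 0.5, "M": 0.0, "E": 0.0, "S": 0.5},
    "S": {"B": 0.5, "M": 0.0, "E": 0.0, "S": 0.5},
}

def emit_p(state, ch, table):
    # Back off to a small constant for unseen (state, char) pairs.
    return table.get(state, {}).get(ch, 1e-6)

def viterbi(text, emit_table):
    """Most likely BMES tag sequence under the toy HMM (log-space)."""
    V = [{}]
    path = {}
    for s in STATES:
        V[0][s] = math.log(start_p[s] or 1e-12) + math.log(emit_p(s, text[0], emit_table))
        path[s] = [s]
    for t in range(1, len(text)):
        V.append({})
        new_path = {}
        for s in STATES:
            score, best_prev = max(
                (V[t - 1][ps]
                 + math.log(trans_p[ps][s] or 1e-12)
                 + math.log(emit_p(s, text[t], emit_table)), ps)
                for ps in STATES
            )
            V[t][s] = score
            new_path[s] = path[best_prev] + [s]
        path = new_path
    last = max(STATES, key=lambda s: V[-1][s])
    return path[last]

def segment(text, emit_table):
    """Cut the text at every E or S tag."""
    words, cur = [], ""
    for ch, tag in zip(text, viterbi(text, emit_table)):
        cur += ch
        if tag in ("E", "S"):
            words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words

# Hand-crafted emissions biasing "北京" toward a two-char word.
emit_table = {"B": {"北": 0.9}, "E": {"京": 0.9}, "S": {"我": 0.9, "爱": 0.9}}
print(segment("我爱北京", emit_table))  # → ['我', '爱', '北京']
```

In practice the start, transition, and emission tables would be estimated from a segmented corpus (e.g. a treebank like the one above) rather than written by hand.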