Figure out a useful tokenization strategy for CJK languages.
|Status|Assignee|Task|
|---|---|---|
|Open|None|T227094 Update RC Filters for new ORES capacities (July, 2019)|
|Resolved|SBisson|T225561 Update ORES thresholds for nlwiki|
|Open|None|T223273 Update srwiki thresholds for goodfaith model|
|Resolved|SBisson|T225562 Deploy ORES filters for zhwiki|
|Open|None|T225563 Deploy ORES filters for jawiki|
|Resolved|Halfak|T224484 ORES deployment: Early June|
|Resolved|Halfak|T224481 Train/test zhwiki editquality models|
|Resolved|Halfak|T223382 Improvements to ORES localization and support|
|Resolved|Halfak|T109366 Chinese language utilities|
|Open|None|T111178 Generate stopwords for CJK languages|
|Open|None|T111179 Tokenization of "word" things for CJK|
Thinking about this more, though, we have to consider that ambiguity of meaning when segmenting "words" can lead to poor information-retrieval results.
That said, I came across an article about using Wikipedia as a corpus for n-gram mutual-information word segmentation of Chinese. The method could potentially be applied to other languages.
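To make the idea concrete, here is a minimal sketch of mutual-information-based segmentation, assuming the simplest variant: compute pointwise mutual information (PMI) for adjacent character pairs from a corpus, then split the text wherever the PMI of a pair falls below a threshold. The function names, the corpus, and the threshold are all illustrative, not taken from the article.

```python
import math
from collections import Counter

def train_pmi(corpus):
    """Build a PMI function from character unigram and adjacent-bigram counts."""
    uni = Counter()
    bi = Counter()
    for text in corpus:
        uni.update(text)
        bi.update(zip(text, text[1:]))
    n_uni = sum(uni.values())
    n_bi = sum(bi.values())

    def pmi(a, b):
        # PMI(a, b) = log( P(ab) / (P(a) * P(b)) ); -inf for unseen pairs.
        if (a, b) not in bi:
            return float("-inf")
        p_ab = bi[(a, b)] / n_bi
        return math.log(p_ab / ((uni[a] / n_uni) * (uni[b] / n_uni)))

    return pmi

def segment(text, pmi, threshold=0.0):
    """Split text at boundaries where adjacent characters have low PMI."""
    if not text:
        return []
    words = [text[0]]
    for a, b in zip(text, text[1:]):
        if pmi(a, b) >= threshold:
            words[-1] += b   # high PMI: characters likely form one word
        else:
            words.append(b)  # low PMI: start a new word at this boundary
    return words
```

In practice the corpus would be Wikipedia article text in the target language, and the threshold would be tuned on a small hand-segmented sample; higher-order n-grams would likely be needed for real Chinese text, where two-character windows miss longer words.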