Split text into words or similar linguistic units with an acceptable level of accuracy.
== Background
[x] Literature Review
[x] Past tools survey ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/17|issue 17]])
[x] Split languages into whitespace-delimited and non-whitespace-delimited languages ([[https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Spaceless_Writing_Systems_and_Wiki-Projects|source]])
== Whitespace-delimited languages
[x] Implement basic rule-based word segmenter ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/19|Issue 19]])
[] Build and compile the additional rules required for these languages (contractions, abbreviations, different punctuation schemes, etc.) ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/22|Issue 22]], [[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/25|Issue 25]])
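
To make the kind of rules in scope here concrete, a minimal sketch of a rule-based segmenter for whitespace-delimited text follows. It is not the implementation from the linked issues. The abbreviation list, the regular expression, and the name `segment` are all hypothetical and chosen for illustration only.

```python
import re

# Hypothetical, illustrative abbreviation list; a real list would be
# compiled per language.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

# A token is either a run of word characters, optionally joined by internal
# apostrophes or hyphens (keeps contractions such as "doesn't" whole), or a
# single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:['’\-]\w+)*|[^\w\s]")


def segment(text):
    """Split whitespace-delimited text into word and punctuation tokens."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            # Known abbreviation: keep the trailing period attached.
            tokens.append(chunk)
        else:
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens


print(segment("Dr. Smith doesn't like well-known tricks, e.g. this one."))
```

The abbreviation check only fires when the whole whitespace-separated chunk matches, so an abbreviation wrapped in brackets or quotes would still be split. Cases like that are the sort of additional rule this item is meant to collect.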
== Non-whitespace-delimited languages
[x] Set up sentencepiece training environment ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/13|issue 13]])
[x] Train a single sentencepiece model covering all non-whitespace-delimited languages ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/18|Issue 18]])
[] Train different sentencepiece models for each language family
[x] Evaluate sentencepiece approach ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/21|issue 21]])
== Performance / Evaluation
[] Evaluation datasets ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/21|Issue 21]])
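
One possible way to score a segmenter against such a dataset is span-level precision, recall, and F1. The sketch below is an assumption about how "acceptable accuracy" could be measured, not a metric this task has settled on. The helper names `spans` and `prf` are hypothetical, and the example sentence is made up for illustration.

```python
def spans(tokens):
    """Map a token list to a set of (start, end) character offsets,
    counted over the concatenation of the tokens (whitespace ignored)."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out


def prf(gold_tokens, pred_tokens):
    """Span-level precision, recall and F1.

    Assumes both token lists concatenate to the same underlying string.
    """
    gold, pred = spans(gold_tokens), spans(pred_tokens)
    tp = len(gold & pred)  # tokens whose boundaries match exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# Made-up example: the prediction over-splits the last gold token.
gold = ["今日", "は", "晴れ"]
pred = ["今日", "は", "晴", "れ"]
print(prf(gold, pred))
```

Comparing character spans rather than token strings means a token only counts as correct when both of its boundaries match the reference, which works the same way for whitespace-delimited and non-whitespace-delimited languages.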