Split text into words or similar linguistic units with an acceptable level of accuracy.
== Background
[x] Literature Review
[x] Past tools survey
[x] Classify languages as whitespace-delimited or non-whitespace-delimited
== Whitespace-delimited languages
[x] Implement basic rule-based word segmenter ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/19|Issue 19]])
[ ] Build and compile the additional rules these languages require (contractions, abbreviations, differing punctuation schemes, etc.) ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/22|Issue 22]])
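To make the rule-based approach concrete, here is a minimal illustrative sketch of the kind of segmenter the items above describe. It is not the Issue 19 implementation; the specific rules (apostrophe contractions, dotted abbreviations) are assumptions chosen as examples of the rule categories named in Issue 22.

```python
import re

# Illustrative sketch only: a minimal rule-based word segmenter for
# whitespace-delimited languages. The rule set here is an assumption,
# not the rules from the wiki-nlp-tools repository.
TOKEN_RE = re.compile(
    r"(?:\w\.)+"            # dotted abbreviations such as e.g. or U.S.
    r"|\w+(?:['’]\w+)*"     # words, including apostrophe contractions
    r"|[^\w\s]",            # any other single punctuation mark
    re.UNICODE,
)

def segment(text: str) -> list[str]:
    """Return the word-level tokens found in text, in order."""
    return TOKEN_RE.findall(text)
```

The dotted-abbreviation rule is tried before the plain-word rule so that "e.g." is kept as one token rather than split into "e", ".", "g", ".".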
== Non-whitespace-delimited languages
[ ] Set up the sentencepiece training environment ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/18|Issue 18]])
[ ] Train a single sentencepiece model covering all non-whitespace-delimited languages
[ ] Train a separate sentencepiece model for each language family
[ ] Evaluate the sentencepiece approach
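A training run for the items above might look like the following sketch using the sentencepiece Python API. The file paths, vocabulary size, coverage, and model type are placeholder assumptions, not settings chosen by this project.

```python
import sentencepiece as spm

# Assumption-laden sketch: train a unigram sentencepiece model on a raw
# text dump, one sentence per line. All hyperparameters below are
# placeholders, not the project's actual configuration.
spm.SentencePieceTrainer.train(
    input="corpus.txt",           # hypothetical training corpus
    model_prefix="segmenter",     # writes segmenter.model / segmenter.vocab
    vocab_size=8000,
    character_coverage=0.9995,    # high coverage matters for CJK scripts
    model_type="unigram",
)

# Segment new text with the trained model.
sp = spm.SentencePieceProcessor(model_file="segmenter.model")
pieces = sp.encode("some unsegmented text", out_type=str)
```

For the single-model vs. per-family comparison, the same invocation would be repeated with the input corpus restricted to each language family.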
== Performance / Evaluation
[ ] Build evaluation datasets ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/21|Issue 21]])