Split text into words or similar linguistic units with an acceptable level of accuracy.
== Background
[x] Literature Review
[x] Past tools survey
[x] Classify languages as whitespace-delimited or non-whitespace-delimited
== Whitespace-delimited languages
[x] Implement basic rule-based word segmenter ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/19|Issue 19]])
[ ] Build and compile the additional rules these languages require (contractions, abbreviations, differing punctuation schemes, etc.) ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/22|Issue 22]])
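To make the rule-based approach concrete, here is a minimal illustrative sketch of the kind of segmenter the items above describe. It is not the Issue 19 implementation; the specific rules (apostrophe contractions, dotted abbreviations) are assumptions chosen as examples of the rule categories named in Issue 22.

```python
import re

# Illustrative sketch only: a minimal rule-based word segmenter for
# whitespace-delimited languages. The rule set here is an assumption,
# not the rules from the wiki-nlp-tools repository.
TOKEN_RE = re.compile(
    r"(?:\w\.)+"            # dotted abbreviations such as e.g. or U.S.
    r"|\w+(?:['’]\w+)*"     # words, including apostrophe contractions
    r"|[^\w\s]",            # any other single punctuation mark
    re.UNICODE,
)

def segment(text: str) -> list[str]:
    """Return the word-level tokens found in text, in order."""
    return TOKEN_RE.findall(text)
```

The dotted-abbreviation rule is tried before the plain-word rule so that "e.g." is kept as one token rather than split into "e", ".", "g", ".".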
== Non-whitespace-delimited languages
[ ] Set up the sentencepiece training environment ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/18|Issue 18]])
[ ] Train a single sentencepiece model covering all non-whitespace-delimited languages
[ ] Train a separate sentencepiece model for each language family
[ ] Evaluate the sentencepiece approach
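A training run for the items above might look like the following sketch using the sentencepiece Python API. The file paths, vocabulary size, coverage, and model type are placeholder assumptions, not settings chosen by this project.

```python
import sentencepiece as spm

# Assumption-laden sketch: train a unigram sentencepiece model on a raw
# text dump, one sentence per line. All hyperparameters below are
# placeholders, not the project's actual configuration.
spm.SentencePieceTrainer.train(
    input="corpus.txt",           # hypothetical training corpus
    model_prefix="segmenter",     # writes segmenter.model / segmenter.vocab
    vocab_size=8000,
    character_coverage=0.9995,    # high coverage matters for CJK scripts
    model_type="unigram",
)

# Segment new text with the trained model.
sp = spm.SentencePieceProcessor(model_file="segmenter.model")
pieces = sp.encode("some unsegmented text", out_type=str)
```

For the single-model vs. per-family comparison, the same invocation would be repeated with the input corpus restricted to each language family.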
== Performance / Evaluation
[ ] Build evaluation datasets ([[https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/issues/21|Issue 21]])