- Identify languages that do not use whitespace-based word tokenization
- Collect a corpus for unsupervised training
- Set up the SentencePiece training environment
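
As a rough illustration of the first step only, the sketch below flags sample texts that contain almost no whitespace. It is a heuristic under stated assumptions, not the method used in this task: the function names, the 0.05 threshold, and the sample sentences are all hypothetical, and a real survey would need far larger samples per language.

```python
def whitespace_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are whitespace."""
    if not text:
        return 0.0
    return sum(ch.isspace() for ch in text) / len(text)


def looks_non_whitespace_delimited(text: str, threshold: float = 0.05) -> bool:
    """Heuristic: flag non-empty text whose whitespace ratio is below `threshold`.

    The threshold is an assumed value chosen for illustration, not a tuned one.
    """
    return bool(text) and whitespace_ratio(text) < threshold


# Hypothetical sample sentences keyed by language code.
samples = {
    "en": "The quick brown fox jumps over the lazy dog.",
    "ja": "吾輩は猫である。名前はまだ無い。",
    "th": "ภาษาไทยไม่ใช้ช่องว่างคั่นระหว่างคำ",
}

# Languages whose sample looks like it is not whitespace-delimited.
flagged = [code for code, text in samples.items()
           if looks_non_whitespace_delimited(text)]
print(flagged)
```

Languages flagged this way would be the candidates for the corpus collection and SentencePiece training steps listed above.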
Description
Status | Assigned | Task
---|---|---
Resolved | Appledora | T316941 NLP Tools for Content Gaps
Resolved | Appledora | T328264 NLP Tools: Word Tokenization
Declined | Appledora | T328267 Word Tokenization: Non-whitespace languages
Declined | Appledora | T328269 Sentencepiece: Language Family Wise training
Declined | Appledora | T328270 Sentencepiece: all non-whitespace languages