Word Tokenization: Non-whitespace languages
Closed, Declined (Public)

Description

  • Identify languages that do not use whitespace-based word tokenization
  • Collect a corpus for unsupervised training
  • Set up a sentencepiece training environment
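
For the first item, a rough screening heuristic could look like the sketch below. It is not from this task: the function names and the threshold of 12 characters are hypothetical, and it assumes that text in a language without whitespace word boundaries yields unusually long "tokens" when split on whitespace. It only gives a useful signal on samples long enough to contain full sentences.

```python
def mean_token_length(text: str) -> float:
    """Average length, in characters, of whitespace-separated tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


def looks_non_whitespace_delimited(text: str, threshold: float = 12.0) -> bool:
    """Flag a sample whose whitespace-split tokens are suspiciously long.

    The threshold is an assumed value for illustration, not a tuned one.
    """
    return mean_token_length(text) > threshold


english_sample = "The quick brown fox jumps over the lazy dog"
chinese_sample = "我今天去图书馆看书然后回家吃饭睡觉了"

print(looks_non_whitespace_delimited(english_sample))
print(looks_non_whitespace_delimited(chinese_sample))
```

A heuristic like this would only be a first pass; languages flagged by it would still need manual review before collecting a corpus for them.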