
NLP Tools: Word Tokenization
Closed, Resolved (Public)

Description

Split text into words or similar linguistic units with an acceptable level of accuracy.

Background

  • Literature Review
  • Past tools survey (issue 17)
  • Split languages into whitespace-delimited and non-whitespace-delimited languages (Source)

Whitespace-delimited languages

  • Implement a basic rule-based word segmenter (Issue 19); a sketch follows this list.
  • Build and compile the additional rules these languages require (contractions, abbreviations, different punctuation schemes, etc.) (Issue 22, Issue 25)
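
A minimal sketch of how the basic segmenter and the extra rules could fit together, assuming Python and a regex-based approach; the token pattern, the abbreviation list, and the function name are illustrative assumptions, not the implementation tracked in the issues above:

```python
import re

# Illustrative abbreviation list; the real rule sets (Issue 22, Issue 25)
# would be compiled per language.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "dr.", "mr."}

# Word-or-punctuation pattern: keeps simple contractions ("don't") together
# and splits every other punctuation mark off as its own token.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]", re.UNICODE)

def tokenize(text: str) -> list[str]:
    """Split whitespace-delimited text into word and punctuation tokens."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)  # keep known abbreviations intact
        else:
            tokens.extend(TOKEN_RE.findall(chunk))
    return tokens

print(tokenize("Don't stop, e.g. at abbreviations!"))
# ["Don't", 'stop', ',', 'e.g.', 'at', 'abbreviations', '!']
```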

Non-whitespace-delimited languages

  • Set up the sentencepiece training environment (issue 13)
  • Train a single sentencepiece model covering all non-whitespace-delimited languages (Issue 18); a sketch follows this list.
  • Train separate sentencepiece models for each language family
  • Evaluate the sentencepiece approach (issue 21)
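
As a rough illustration of the training setup, using the standard sentencepiece Python API; the corpus path, vocab size, and other hyperparameters are placeholders rather than the settings used in the task:

```python
import sentencepiece as spm

# Train one model on pooled text from all non-whitespace-delimited languages
# (the single-model variant). Input file and hyperparameters are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus_all_non_whitespace.txt",  # one sentence per line
    model_prefix="nonws_all",
    vocab_size=32000,
    model_type="unigram",       # sentencepiece's default algorithm
    character_coverage=0.9995,  # common setting for scripts with large inventories
)

# Load the trained model and segment unseen text.
sp = spm.SentencePieceProcessor(model_file="nonws_all.model")
print(sp.encode("これはテストです", out_type=str))
# e.g. ['▁これは', 'テスト', 'です'] (actual pieces depend on the training data)
```

The per-language-family variant would repeat the same call once per family, with a family-specific corpus and model_prefix.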

Misc

  • Treat numbers like punctuation (issue 25); one possible reading is sketched below.
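
Assuming the regex-style segmenter sketched earlier, "treat numbers like punctuation" could mean matching digit runs as standalone tokens, exactly like punctuation marks. The pattern below is an illustrative assumption, not the agreed rule:

```python
import re

# Numbers (including decimals) are split off as standalone tokens,
# the same way punctuation is; a possible reading of issue 25.
TOKEN_RE = re.compile(
    r"\d+(?:[.,]\d+)*"            # digit runs, incl. 3.14 or 1,000
    r"|[^\W\d_]+(?:'[^\W\d_]+)?"  # words without digits, keeping contractions
    r"|[^\w\s]",                  # any single punctuation mark
    re.UNICODE,
)

print(TOKEN_RE.findall("Room 42 costs $19.99 today."))
# ['Room', '42', 'costs', '$', '19.99', 'today', '.']
```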

Performance / Evaluation
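
One common metric for the evaluation planned in issue 21 is boundary-level F1 against a gold segmentation; a minimal sketch, where the metric choice and helper names are assumptions rather than the task's agreed evaluation plan:

```python
def boundaries(tokens: list[str]) -> set[int]:
    """Character offsets where token boundaries fall in the concatenated text."""
    offsets, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        offsets.add(pos)
    return offsets

def boundary_f1(pred: list[str], gold: list[str]) -> float:
    """Boundary-level F1, a standard score for word segmentation quality."""
    p, g = boundaries(pred), boundaries(gold)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

# Both tokenizations must concatenate to the same string (true for
# non-whitespace-delimited text, where no whitespace is dropped).
gold = ["これ", "は", "テスト", "です"]
pred = ["これは", "テスト", "です"]
print(round(boundary_f1(pred, gold), 3))  # 0.857
```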

Event Timeline

Isaac renamed this task from Word Tokenization to NLP Tools: Word Tokenization. Feb 20 2023, 6:08 PM
Isaac updated the task description.
Appledora updated the task description.