
NLP Tools: Sentence Tokenization
Closed, Resolved · Public

Description

Tokenize sentences with an acceptable level of accuracy across different wikiproject languages. The approach will be largely language-agnostic.

Background

  • Literature Review
  • Past tools survey (Issue 3)

Rule-based

  • Build a global list of Unicode sentence terminators and use regex to split sentences (Issue 4)
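A minimal sketch of this rule-based approach, assuming a small hand-picked subset of Unicode sentence terminators (the real global list would be compiled from Unicode data across scripts):

```python
import re

# Hypothetical subset of a global Unicode terminator list: ASCII . ! ?,
# Devanagari danda, Arabic question mark, CJK full stop, full-width ! and ?.
TERMINATORS = ".!?\u0964\u061F\u3002\uFF01\uFF1F"

# Split wherever a terminator is immediately followed by whitespace.
SPLIT_RE = re.compile("(?<=[" + re.escape(TERMINATORS) + r"])\s+")

def split_sentences(text):
    return [s for s in SPLIT_RE.split(text.strip()) if s]
```

The fixed-width lookbehind keeps the terminator attached to its sentence rather than consuming it, which is why the split pattern matches only the whitespace.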

Statistical

  • Scrape likely abbreviations from all Wiktionary projects (Issue 10)
  • Filter those abbreviations based on Wikipedia occurrences (Issue 10)
  • Incorporate language-specific abbreviation lists into sentence tokenization code (Issue 5)
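One way the abbreviation lists could plug into the tokenizer is to split naively first and then merge back any split that follows a known abbreviation. The `ABBREVIATIONS` set below is a tiny hypothetical stand-in for the per-language lists scraped from Wiktionary and filtered by Wikipedia frequency:

```python
import re

# Hypothetical abbreviation list; the real lists are per-language and
# built from Wiktionary/Wikipedia data (Issue 10).
ABBREVIATIONS = {"dr.", "mr.", "e.g.", "etc.", "vs."}

def split_with_abbreviations(text, abbreviations=ABBREVIATIONS):
    """Split on terminator + whitespace, then undo splits after abbreviations."""
    candidates = re.split(r"(?<=[.!?])\s+", text)
    sentences = []
    for chunk in candidates:
        last = sentences[-1].split()[-1].lower() if sentences else ""
        if sentences and last in abbreviations:
            sentences[-1] += " " + chunk  # false split after an abbreviation
        else:
            sentences.append(chunk)
    return sentences
```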

Additional

  • Build and compile additional rules as required (contractions, abbreviations, different punctuation schemes, etc.) (Issue 12, Issue 14, Issue 7, Issue 9)
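As an illustration of one such extra rule, a terminator may be followed by a closing quote or bracket before the whitespace. A sketch handling that case (the closer set here is a hypothetical sample; each lookbehind branch must be fixed-width for Python's `re`):

```python
import re

SPLIT_RE = re.compile(
    r"(?<=[.!?][\"'”’)\]])\s+"   # terminator then a closing quote/bracket
    r"|(?<=[.!?])\s+"            # bare terminator
)

def split_quoted(text):
    return [s for s in SPLIT_RE.split(text) if s]
```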

Performance / Evaluation

  • Initial basic benchmark of how sentence tokenization does on different edge cases (Issue 8)
  • Add unit tests (Issue 15)
  • Refactor code to help simplify / optimize performance (Issue 20)
  • Fuller evaluation of how the addition of abbreviation handling impacts performance (Issue 11)
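A benchmark of edge cases can be a simple table of (input, expected sentences) pairs scored by exact match. The cases below are hypothetical examples of the kind an edge-case benchmark might include, scored against a naive baseline splitter:

```python
import re

def naive_split(text):
    """Baseline: split after ., !, or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)

# Hypothetical edge cases; each pair is (input text, expected sentences).
EDGE_CASES = [
    ("One. Two.", ["One.", "Two."]),
    ("See e.g. the docs.", ["See e.g. the docs."]),  # abbreviation
    ("Wait... really?", ["Wait...", "really?"]),     # ellipsis
]

def accuracy(splitter, cases):
    """Fraction of cases where the splitter matches the expectation exactly."""
    return sum(splitter(t) == e for t, e in cases) / len(cases)
```

The baseline fails the abbreviation case (it splits after "e.g."), which is exactly the gap the abbreviation-handling work is meant to close and what the fuller evaluation would measure.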

Event Timeline

Isaac renamed this task from Sentence Tokenization to NLP Tools: Sentence Tokenization. Feb 20 2023, 5:19 PM
Appledora updated the task description.