Tokenize sentences with an acceptable level of accuracy across the different Wikiproject languages. The tokenizer will be largely language-agnostic.
Background
- Literature Review
- Past tools survey (Issue 3)
Rule-based
- Build a global list of Unicode sentence terminators and use regex to split sentences (Issue 4)
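The rule-based step above could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the terminator set here is a small hand-picked sample, whereas the real global list would be compiled from Unicode data (Issue 4).

```python
import re

# Illustrative subset of Unicode sentence terminators; the full list would
# be built from Unicode's Sentence_Terminal property (Issue 4).
SENTENCE_TERMINATORS = "".join([
    ".", "!", "?",   # Latin scripts
    "\u3002",        # 。 ideographic full stop (Chinese, Japanese)
    "\u0964",        # । devanagari danda (Hindi)
    "\u06D4",        # ۔ Arabic full stop (Urdu)
    "\u0589",        # ։ Armenian full stop
])

# Split after a terminator that is followed by whitespace.
SPLIT_RE = re.compile(r"(?<=[{}])\s+".format(re.escape(SENTENCE_TERMINATORS)))

def split_sentences(text: str) -> list[str]:
    """Naive rule-based split; abbreviations are not yet handled."""
    return [s for s in SPLIT_RE.split(text.strip()) if s]
```

Note that requiring whitespace after the terminator already fails for scripts like Chinese that put no space between sentences, which is one reason additional per-script rules are needed later.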
Statistical
- Scrape likely abbreviations from all Wiktionary projects (Issue 10)
- Filter those abbreviations based on Wikipedia occurrences (Issue 10)
- Incorporate language-specific abbreviation lists into sentence tokenization code (Issue 5)
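One way the abbreviation lists could feed into the tokenizer is to split on terminators first and then re-join any split whose left side ends in a known abbreviation. The lists and language codes below are hypothetical stand-ins for the scraped-and-filtered data described above:

```python
import re

# Hypothetical per-language abbreviation sets; the real lists would come
# from the Wiktionary scrape filtered by Wikipedia occurrences (Issue 10).
ABBREVIATIONS = {
    "en": {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e"},
    "de": {"z.b", "bzw", "usw", "dr"},
}

CANDIDATE_RE = re.compile(r"(?<=[.!?])\s+")

def split_with_abbreviations(text: str, lang: str = "en") -> list[str]:
    """Split on terminators, then merge back pieces whose previous piece
    ends in a known abbreviation, so 'Dr. Smith arrived.' stays whole."""
    abbrevs = ABBREVIATIONS.get(lang, set())
    pieces = [p for p in CANDIDATE_RE.split(text.strip()) if p]
    sentences: list[str] = []
    for piece in pieces:
        if sentences:
            prev_last = sentences[-1].rsplit(maxsplit=1)[-1]
            if prev_last.rstrip(".").lower() in abbrevs:
                sentences[-1] += " " + piece
                continue
        sentences.append(piece)
    return sentences
```

Stripping trailing periods before the lookup lets one entry such as "dr" match both "Dr." mid-sentence and "Dr" bare.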
Additional
- Build and compile the additional rules required (contractions, abbreviations, different punctuation schemes, etc.) (Issue 12, Issue 14, Issue 7, Issue 9)
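As one example of a punctuation-scheme rule, scripts such as Chinese and Japanese place no space after a sentence terminator, so the split must fire on a zero-width boundary rather than on whitespace. A minimal sketch (the terminator set is illustrative):

```python
import re

# Full-width terminators used in CJK text: 。 ！ ？ (no following space).
# Zero-width lookbehind split requires Python 3.7+.
CJK_SPLIT_RE = re.compile(r"(?<=[\u3002\uFF01\uFF1F])")

def split_cjk(text: str) -> list[str]:
    """Split after each full-width terminator, keeping the terminator."""
    return [s for s in CJK_SPLIT_RE.split(text) if s]
```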