The Language Team recently introduced sentencex, a library for sentence splitting available in both Python and JavaScript.
See more in sentencex: Empowering NLP with Multilingual Sentence Extraction.
This task involves investigating the feasibility of using sentencex to power the sentence-splitting functionality that features like T347643 depend on.
Decisions to be made
- Will the Editing Team leverage sentencex to power the sentence-splitting functionality that Edit Check features like T347643 depend on?
Decision: The Editing Team will not use sentencex for the time being as we are not ready to commit to a sentence boundary API. We can revisit this in future as circumstances change.
Open questions
- 1. What requirements must any sentence-splitting approach meet in order to support features like T347643?
Sentence splitting is inherently heuristic (probabilistic) around ambiguous terminators, such as the English full stop (period), which can mark the end of a sentence but also an abbreviation. Handling that ambiguity may require input from outside the strictly algorithmic code: for instance, the UI design may surface the ambiguity so that the user can resolve it, and for a particular use case we might prefer to split by default (or, indeed, not to split by default). We need the freedom to make these decisions iteratively. In short, our requirements for the sentence-splitting code are still fluid, so keeping the code within Edit Check, where we can change it quickly, seems best for now.
- 2. What – if any – adjustments/improvements would need to be made to sentencex to meet the requirements named through "1."?
N/A
- 3. To what extent are the "adjustments/improvements" identified through "2." feasible to make?
N/A
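Illustration for question 1: the ambiguity around the full stop, and the choice between a default of splitting and a default of not splitting, can be sketched as below. This is a minimal sketch only. It is neither sentencex nor Edit Check code; the names `split_sentences`, `split_when_ambiguous` and `ABBREVIATIONS` are hypothetical, and the abbreviation list is deliberately tiny.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for
# illustration; a real list would be per-language and much longer).
ABBREVIATIONS = {"dr", "mr", "mrs", "e.g", "i.e", "etc"}

# A candidate boundary: a terminator followed by whitespace.
_TERMINATOR = re.compile(r"([.!?])\s+")


def split_sentences(text, split_when_ambiguous=True):
    """Split text at terminators.

    A full stop that follows a known abbreviation is treated as ambiguous;
    `split_when_ambiguous` is the per-use-case default for that situation.
    """
    sentences = []
    start = 0
    for match in _TERMINATOR.finditer(text):
        if match.group(1) == ".":
            # Look at the word immediately before the full stop.
            words = text[start:match.start()].split()
            preceding = words[-1].lower() if words else ""
            if preceding in ABBREVIATIONS and not split_when_ambiguous:
                continue  # Ambiguous, and the default is "do not split".
        sentences.append(text[start:match.end(1)])
        start = match.end()
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "She met Dr. Smith. He left."
print(split_sentences(text, split_when_ambiguous=True))
print(split_sentences(text, split_when_ambiguous=False))
```

With this input, a default of splitting breaks the text after "Dr.", while a default of not splitting keeps "She met Dr. Smith." together. For a text such as "We sell apples, pears, etc. Prices vary." the opposite default gives the better result, which is why the choice may need to vary per use case or be surfaced to the user.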