
Expand sentence segmentation system
Closed, Resolved (Public)

Description

In T338292 (Add sentence segmenter feature), we added a sentence segmentation system to MinT. It works as follows:

  • Use a global list of sentence terminator characters (sourced from Unicode) to find candidate sentence boundaries.
  • Make sure those boundaries do not end with an abbreviation. This requires an abbreviation detection system, which is language-specific.
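
The two steps above can be sketched roughly as follows. This is only an illustration, not the MinT or sentencex implementation: the function name `split_sentences`, the short `TERMINATORS` string, and the `ABBREVIATIONS` table are all hypothetical, and the real terminator list taken from Unicode is much longer.

```python
import re

# Assumption: a tiny illustrative subset of sentence terminators.
# The real list is sourced from Unicode and is much longer.
TERMINATORS = ".!?。！？।։"

# Assumption: hypothetical per-language abbreviation sets,
# stored lowercase and without the trailing period.
ABBREVIATIONS = {
    "en": {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e", "vs"},
    "de": {"z.b", "bzw", "usw", "dr", "nr"},
}

TERMINATOR_RUN = re.compile(f"[{re.escape(TERMINATORS)}]+")


def split_sentences(language: str, text: str) -> list[str]:
    """Split text at terminator characters, skipping known abbreviations."""
    abbreviations = ABBREVIATIONS.get(language, set())
    sentences = []
    start = 0
    for match in TERMINATOR_RUN.finditer(text):
        end = match.end()
        # Step 1: a run of terminators is a candidate boundary only when it
        # is followed by whitespace or by the end of the text.
        if end < len(text) and not text[end].isspace():
            continue
        # Step 2: reject the candidate if the word before a single period
        # is a known abbreviation for this language.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if match.group() == "." and last_word in abbreviations:
            continue
        sentences.append(text[start:end].strip())
        start = end
    # Keep any trailing text that has no terminator.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("en", "Dr. Smith arrived. He was late!"))
```

A language without an entry in the table falls back to an empty abbreviation set, so only the global terminator step applies to it. Adding a language then means supplying its abbreviation list.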

The current code base has abbreviation detection logic for en and ml. It needs to be expanded to more languages, at least to the top 10 source languages we see in Content Translation.

Finding the most commonly used abbreviations in a language is not difficult. For example, see https://en.wikipedia.org/wiki/List_of_German_abbreviations
The wiki-nlp-tools library also has a collection of abbreviations.

Result

Detailed blog post: sentencex: Empowering NLP with Multilingual Sentence Extraction

sentencex Python library

sentencex JavaScript library

Event Timeline

Pginer-WMF updated the task description.
Pginer-WMF updated the task description.