In {T338292} we added a sentence segmentation system to MinT. It works as follows:
* Find candidate sentence boundaries using a global list of sentence terminator characters (sourced from Unicode).
* Make sure those boundaries do not fall at the end of an abbreviation. This requires an abbreviation detection system, which is language-specific.
The current code base has abbreviation detection logic for en and ml. It needs to be expanded to more languages, at least to the top 10 source languages we see in Content Translation.
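The two steps above can be sketched roughly as follows. This is a minimal illustration, not the actual MinT code: the names `TERMINATORS`, `ABBREVIATIONS` and `segment` are hypothetical, the terminator string is a tiny subset of the Unicode-sourced list, and the abbreviation sets are small samples. It also assumes a terminator is followed by whitespace or the end of the text, which does not hold for every script.

```python
import re

# Assumption: a tiny illustrative subset; the real list is sourced from Unicode.
TERMINATORS = ".!?"

# Assumption: hypothetical per-language samples, not the data used in MinT.
ABBREVIATIONS = {
    "en": {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e."},
    "de": {"z.B.", "usw.", "bzw.", "Dr."},
}

# A candidate boundary is a run of terminators followed by whitespace or end of text.
BOUNDARY = re.compile(r"[" + re.escape(TERMINATORS) + r"]+(?=\s|$)")


def segment(text: str, lang: str) -> list[str]:
    """Split text into sentences, skipping boundaries that end an abbreviation."""
    abbreviations = ABBREVIATIONS.get(lang, set())
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        candidate = text[start:end]
        # The last whitespace-separated token includes the terminator itself.
        last_token = candidate.split()[-1]
        if last_token in abbreviations:
            # Language-specific check: this is an abbreviation, not a sentence end.
            continue
        sentences.append(candidate.strip())
        start = end
    # Keep any trailing text that has no terminator.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(segment("Dr. Smith arrived. He was late.", "en"))
print(segment("Das gilt z.B. hier. Ende.", "de"))
```

With a language that has no abbreviation set, the same input is split after "Dr.", which is why per-language lists are needed. One known limitation of this sketch is that an abbreviation that genuinely ends a sentence is also skipped.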
Finding the most commonly used abbreviations in a language is not difficult. For example, see https://en.wikipedia.org/wiki/List_of_German_abbreviations
The wiki-nlp-tools library also has a [[ https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/blob/main/src/mwtokenizer/assets/dict_abbr_filtered.json | collection ]] of abbreviations.