
[wiki-nlp-tools] Split sentences on newline characters?
Open, Needs Triage, Public

Description

Currently, we don't seem to split on the newline character. It may be fine to assume that users split text into paragraphs before sentence tokenization, but I was surprised to get very long sentences for disambiguation pages, where individual bullet points are separated only by “\n”. Do we want to consider treating the newline as a sentence-ending punctuation symbol?

One argument / nuance of this: our benchmarking data has some "sentences" from list articles that span many lines. Even if we don't fix this in the core code, we may want to adjust our benchmark generation code, both to demonstrate best practice and to make the logs more readable 😄
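
For illustration, here is a rough sketch of what a newline pre-split in the benchmark generation code could look like. It does not use the actual wiki-nlp-tools API: `naive_sentence_split` is a hypothetical stand-in for the real sentence tokenizer, and `split_on_newlines_first` is a made-up helper name.

```python
import re
from typing import Callable, Iterable, List


def naive_sentence_split(text: str) -> List[str]:
    """Hypothetical stand-in for the real sentence tokenizer.

    Splits after '.', '!' or '?' when followed by whitespace.
    It does not treat '\n' as a sentence boundary by itself.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_on_newlines_first(
    text: str,
    sentence_split: Callable[[str], Iterable[str]] = naive_sentence_split,
) -> List[str]:
    """Split text into lines first, then sentence-tokenize each line."""
    sentences: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between paragraphs
        sentences.extend(sentence_split(line))
    return sentences


# Disambiguation-style text where bullets are separated only by "\n".
example = (
    "Mercury may refer to:\n"
    "* Mercury (planet), the closest planet to the Sun\n"
    "* Mercury (element), a chemical element\n"
)

without_presplit = naive_sentence_split(example)
with_presplit = split_on_newlines_first(example)
```

With this example, the stand-in tokenizer alone returns the whole list as one "sentence", while the pre-split version returns one item per bullet. The real tokenizer could be passed in through the `sentence_split` argument, assuming it accepts a string and yields sentences.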