Full sentences may appear inside parentheses. What is the consensus regarding those?
Nazia: We also have to consider sentences inside quotations, e.g.: She turned to him, 'This is great. ' she said.
which currently gets split as: "She turned to him, 'This is great. " and "' she said. "
Another example (fo, Faroese), which currently is not split at all: ['Pamela Ferguson professari í Dundee fróðskaparsetrinum ger vart við, at “tíðindafólk tykjast at ganga á markinum, um tey almannagera myndir o.s.fr. av fólki undir illgruna.” Hetta gevur teimum fleiri smá støð at goyma seg fyri rándjórum.']
The splitter does not recognize the sentence boundary here: ...fólki undir illgruna."
Isaac:
My intuition / thoughts:
- Both of these would have similar technical implementations: track the start of a quote/parenthesis and don't close out the sentence until you reach either the end of the quote/parenthesis or some max number of characters (for situations with broken syntax). This would probably look like:
  - Inspect each potential sentence and search for an open quote or parenthesis.
    - If found: inspect the following sentences for the close, or until the character limit is met.
      - If close found: merge the sentences.
- For both, it's probably also reasonable to break the sentence inside the quote/parentheses in many circumstances, so it's not clear that this would improve things much, and it might not be worth the additional complexity/latency/bugs it could introduce. The initial check in the pseudocode above would be a relatively simple regex, but it probably still requires two additional passes over every sentence (one for quotes, one for parentheses), plus code to do the follow-up searches and merge when the conditions are met.
- Altogether, I rate it as a very low-priority issue, but one worth considering if we start to see bugs of this nature appear in our more exhaustive testing/evaluation.
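To make the merge steps above concrete, here is a rough Python sketch. It is not the splitter's actual code: `merge_unclosed`, `MAX_MERGED_CHARS` and the 300-character cutoff are hypothetical names and values, and it assumes the upstream splitter hands over fragments that keep their trailing whitespace (as in the example output above). It checks quotes and parentheses in one pass instead of two.

```python
import re
from typing import List

MAX_MERGED_CHARS = 300  # assumed cutoff for broken syntax; would need tuning


def _quote_count(text: str, q: str) -> int:
    # Count a straight quote only when it is not word-internal, so the
    # apostrophe in "don't" is ignored. Plural possessives ("dogs' bones")
    # would still be miscounted; this is a known limitation of the sketch.
    return len(re.findall(rf"(?<!\w){q}|{q}(?!\w)", text))


def _is_open(text: str) -> bool:
    """True if text leaves a parenthesis or quotation unclosed."""
    if text.count("(") > text.count(")"):
        return True
    if text.count("“") > text.count("”"):
        return True
    # Straight quotes open and close with the same character: odd means open.
    return any(_quote_count(text, q) % 2 == 1 for q in ("'", '"'))


def merge_unclosed(fragments: List[str], max_chars: int = MAX_MERGED_CHARS) -> List[str]:
    """Merge fragments that were split inside a quote or parenthesis."""
    merged = []
    i = 0
    while i < len(fragments):
        current = fragments[i]
        j = i + 1
        if _is_open(current):
            candidate = current
            # Look ahead for the close, stopping at the character limit.
            while j < len(fragments) and len(candidate) + len(fragments[j]) <= max_chars:
                candidate += fragments[j]
                j += 1
                if not _is_open(candidate):
                    break
            if _is_open(candidate):
                j = i + 1  # no close found within the limit: keep the original split
            else:
                current = candidate
        merged.append(current)
        i = j
    return merged


fragments = ["She turned to him, 'This is great. ", "' she said. "]
print(merge_unclosed(fragments))
# ["She turned to him, 'This is great. ' she said. "]
```

Note that when no close is found before the limit, the fragments are left exactly as the splitter produced them, so broken syntax cannot make the output worse than it is today.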
I think the latter example is the more important one too, where a full stop is missed because it's followed by a quotation mark or parenthesis. Though I think it's generally much better to miss splitting a sentence (false negative) than to split one in the wrong place (false positive).
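For the missed full stop, one possible direction is to let the boundary pattern swallow any closing quotes/brackets that follow the punctuation. The sketch below is only an illustration under assumptions: `split_after_closers` is a hypothetical helper, not the existing splitter, and the "next character must be uppercase or an unambiguous opener" rule is an assumed heuristic chosen to lean towards false negatives.

```python
import re
from typing import List

# Sentence-final punctuation, optional closing quotes/brackets, then whitespace.
_CANDIDATE = re.compile(r"""[.!?]["'”’)\]]*\s+""")

# Characters that unambiguously open a new span. Straight quotes are left out
# on purpose: they may just as well be a closing quote (assumed heuristic).
_OPENERS = "“‘(["


def split_after_closers(text: str) -> List[str]:
    """Split after '.', '!' or '?', including any trailing closers."""
    sentences = []
    start = 0
    for m in _CANDIDATE.finditer(text):
        nxt = text[m.end():m.end() + 1]
        # Only split when the next sentence clearly starts; this keeps
        # abbreviations such as "o.s.fr. av" in one piece.
        if nxt and (nxt.isupper() or nxt in _OPENERS):
            sentences.append(text[start:m.end()])
            start = m.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences


text = "um tey almannagera myndir o.s.fr. av fólki undir illgruna.” Hetta gevur teimum fleiri smá støð."
for sentence in split_after_closers(text):
    print(repr(sentence))
```

With this rule the Faroese example splits after `illgruna.”`, while "She turned to him, 'This is great. ' she said." stays in one piece, which matches the preference for a missed split over a wrong one.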