
[wiki-nlp-tools] Sentence Tokenization: Keeping track of parentheses and quotations
Open, Needs Triage, Public

Description

Full sentences may appear inside parentheses. What is the consensus on how to handle those?

Nazia: We also have to consider sentences inside quotations, e.g.: She turned to him, 'This is great. ' she said.

This currently gets split as: "She turned to him, 'This is great. " and "' she said. "

Another example (fo): ['Pamela Ferguson professari í Dundee fróðskaparsetrinum ger vart við, at “tíðindafólk tykjast at ganga á markinum, um tey almannagera myndir o.s.fr. av fólki undir illgruna.” Hetta gevur teimum fleiri smá støð at goyma seg fyri rándjórum.'] is currently not split at all.

The tokenizer does not recognize the sentence boundary here: ...fólki undir illgruna."

Isaac:
My intuition / thoughts:

  • Both of these would have similar technical implementations: track the start of a quote/parenthesis and don't close out the sentence until you reach either the end of the quote/parenthesis or some max number of characters (for situations with broken syntax). This would probably look like:
    • Inspect each potential sentence, searching for an open quote or parenthesis.
      • If found: inspect the following sentences for the close, stopping once the character limit is met.
        • If close found: merge the sentences
  • For both, it's probably also reasonable to break the sentence within the quote/parentheses in many circumstances, so it's not clear that this would improve things by much, and it might not be worth the additional complexity/latency/bugs it could introduce. The initial check in the pseudocode above would be a relatively simple regex, but the approach probably still requires two additional passes over every sentence (one for quotes; one for parentheses), plus code to do the follow-up searches and merge when the conditions are met.
  • Altogether, I rate it as a very low-priority issue, but one worth considering if we start to see bugs of this nature appear in our more exhaustive testing/evaluation.
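
For reference, the merge pass outlined in the bullets above could be sketched roughly as follows. This is only an illustration, not the wiki-nlp-tools implementation: the names `is_unbalanced` and `merge_open_spans`, the delimiter sets, and the default `max_chars` of 500 are all assumptions, and the input is assumed to be a list of sentence strings that still carry their trailing whitespace.

```python
import re

# Assumed delimiter pairs; a real tokenizer would need per-language lists.
PAIRED = {"(": ")", "[": "]", "“": "”", "«": "»"}
# Straight quotes open and close with the same character, so count parity.
# A single quote between two word characters is treated as an apostrophe.
SYMMETRIC = re.compile(r'"|(?<!\w)\'|\'(?!\w)')


def is_unbalanced(text):
    """Return True if text has an opener without a matching closer."""
    for opener, closer in PAIRED.items():
        if text.count(opener) > text.count(closer):
            return True
    return len(SYMMETRIC.findall(text)) % 2 == 1


def merge_open_spans(sentences, max_chars=500):
    """Merge a sentence with its successors until its quote/parenthesis closes.

    If no close is found within max_chars (hypothetical limit for broken
    syntax), the original split is kept unchanged.
    """
    merged = []
    i = 0
    while i < len(sentences):
        current = sentences[i]
        j = i + 1
        if is_unbalanced(current):
            candidate = current
            while j < len(sentences) and len(candidate) + len(sentences[j]) <= max_chars:
                candidate += sentences[j]
                j += 1
                if not is_unbalanced(candidate):
                    current = candidate  # close found: accept the merge
                    break
            else:
                j = i + 1  # no close within the limit: keep the original split
        merged.append(current)
        i = j
    return merged


parts = ["She turned to him, 'This is great. ", "' she said. "]
print(merge_open_spans(parts))
```

Run on the split from the description, this rejoins the two fragments into one sentence. The apostrophe heuristic is deliberately crude (a plural possessive such as `dogs'` would be miscounted as a quote), which is part of why the extra complexity may not pay off.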

I think the latter example is the more important one too: the case where a full stop is missed because it's followed by a quotation mark or parenthesis. That said, I think it's generally much better to miss splitting a sentence (false negative) than to split one in the wrong place (false positive).
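
A minimal sketch of how the missed full stop might be handled: allow a run of closing quotes or brackets between the terminator and the whitespace, and only split when the next character is uppercase. The `BOUNDARY` pattern, the `split_sentences` name, and the uppercase check are assumptions for illustration; the heuristic ignores abbreviations such as "Dr." and will not work for scripts without letter case.

```python
import re

# Sentence terminator, then any run of closing quotes/brackets, then whitespace.
BOUNDARY = re.compile(r'[.!?]["\'”’»)\]]*\s+')


def split_sentences(text):
    """Split after a terminator (plus closers) when an uppercase letter follows."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        # Conservative check: prefer a missed split over a wrong one.
        if end < len(text) and text[end].isupper():
            sentences.append(text[start:end])
            start = end
    if start < len(text):
        sentences.append(text[start:])
    return sentences


text = (
    "Pamela Ferguson professari í Dundee fróðskaparsetrinum ger vart við, at "
    "“tíðindafólk tykjast at ganga á markinum, um tey almannagera myndir "
    "o.s.fr. av fólki undir illgruna.” Hetta gevur teimum fleiri smá støð "
    "at goyma seg fyri rándjórum."
)
for sentence in split_sentences(text):
    print(repr(sentence))
```

On the Faroese example this splits once, after the closing quotation mark, and leaves `o.s.fr.` alone because a lowercase letter follows. The quoted example from the description stays in one piece, since nothing uppercase follows `great. `, which matches the preference for false negatives over false positives.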

Event Timeline

MGerlach renamed this task from Sentence Tokenization: Keeping track of parentheses and quotations to [wiki-nlp-tools] Sentence Tokenization: Keeping track of parentheses and quotations. Mar 24 2026, 2:07 PM