Investigate feasibility of leveraging the sentencex library within Edit Check
Closed, Resolved, Public

Description

The Language Team recently introduced sentencex, a library for sentence splitting available in both Python and JavaScript.

See more in sentencex: Empowering NLP with Multilingual Sentence Extraction.

This task involves investigating the feasibility of using sentencex to power the sentence splitting functionality that features like T347643 depend on.

Decisions to be made

  • Will the Editing Team leverage sentencex to power the sentence splitting functionality that Edit Check features like T347643 depend on?

Decision: The Editing Team will not use sentencex for the time being as we are not ready to commit to a sentence boundary API. We can revisit this in future as circumstances change.

Open questions

  • 1. What requirements must any sentence splitting approach meet in order to support features like T347643?

Sentence splitting is inherently heuristic (probabilistic) around ambiguous terminators (such as English full stop/period, which can denote a sentence end but also an abbreviation). For our solution to handle the ambiguity, there may be a contribution from outside the strictly algorithmic code; for instance, UI design may surface the concept for the user to disambiguate, and for a particular use case, we might prefer a default of splitting (or indeed a default of not splitting). We need the freedom to make these decisions iteratively. In short, our requirements from the sentence splitting code are still fluid, and so the agility of keeping the code within Edit Check seems best for now.

  • 2. What, if any, adjustments/improvements would need to be made to sentencex to meet the requirements named in "1."?

N/A

  • 3. To what extent are the "adjustments/improvements" identified through "2." feasible to make?

N/A

Event Timeline

I've taken a look at sentencex-js specifically, which I understand was ported from the original sentencex, written in Python. It looks good! There is certainly significant overlap with the Editing Team's UnicodeJS sentencebreak code.

Sentencex contains language-specific abbreviation lists for just under 30 languages[1], used to heuristically guess that a FULL STOP (.) does not end a sentence (for example, it would correctly identify that there is no sentence boundary in "Dr. Curie"). We don't currently have that info in UnicodeJS and maintaining such lists would be a burden for us. We'd previously envisaged making this something each wiki community could maintain for their language-specific needs.
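
To make the abbreviation heuristic concrete, here is a minimal Python sketch. It is not the sentencex or UnicodeJS implementation: the function name split_sentences, the ABBREVIATIONS table and its entries are all hypothetical, and the splitting rule is deliberately naive.

```python
import re

# Hypothetical toy list; real per-language lists are much longer.
ABBREVIATIONS = {"en": {"dr", "mr", "mrs", "prof", "etc"}}


def split_sentences(text, lang="en"):
    """Split after '.', '!' or '?' followed by whitespace, except when a
    lone full stop follows a known abbreviation."""
    abbreviations = ABBREVIATIONS.get(lang, set())
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", text):
        # Word immediately before the terminator, within the current sentence.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if match.group().strip() == "." and last_word in abbreviations:
            continue  # guess: abbreviation, not a sentence end
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Curie won two Nobel Prizes. She was born in Warsaw."))
# ['Dr. Curie won two Nobel Prizes.', 'She was born in Warsaw.']
```

Note that the guess is silent: the caller cannot tell that a possible break after "Dr." was considered and rejected.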

Some issues would arise with us using sentencex:

  • The segment method only returns a list of sentences: it does not provide a Unicode TR29-style analysis of the punctuation at the sentence boundary, which we need if we want to support wikis with different reference placement conventions. (On the other hand, there is specific provision for identifying numbered references so they don't interfere with segmentation.)
  • There's no indication which sentence breaks are certain, and which ones are guesses. We think this information would be helpful for heuristics (e.g. when counting sentences we could count ambiguous breaks as having less weight).
  • Sometimes it would be helpful to have a maximal segmentation (including all ambiguous breaks) or a minimal segmentation (only including fairly unambiguous breaks), depending on whether the consequences are worse for false positives or false negatives in the particular use case.

We should discuss with the Language Team whether they would consider expanding the API to return this information. Perhaps there could be a new method segmentDetailed that returns the full details, and then segment could just become a wrapper around segmentDetailed that drops the details and returns the sentence list (as now). But even if we don't use sentencex in full, we could still potentially import the abbreviation lists.
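
As a rough illustration of the proposal above, the sketch below shows one possible shape for such an API, written in Python for brevity. Everything in it is an assumption: segment_detailed, the Boundary fields, the include_ambiguous flag, weighted_sentence_count and the toy abbreviation list are invented for illustration and are not part of sentencex.

```python
import re
from dataclasses import dataclass

ABBREVIATIONS = {"dr", "mr", "prof", "etc"}  # hypothetical toy list


@dataclass
class Boundary:
    offset: int      # index just past the terminator and trailing whitespace
    terminator: str  # punctuation found at the boundary
    ambiguous: bool  # True when the break is only a guess


def segment_detailed(text):
    """Return every candidate boundary, flagging the uncertain ones."""
    boundaries = []
    for match in re.finditer(r"(\w*)([.!?]+)(?:\s+|$)", text):
        word, terminator = match.group(1).lower(), match.group(2)
        # Toy rule: a lone full stop after a known abbreviation is ambiguous.
        ambiguous = terminator == "." and word in ABBREVIATIONS
        boundaries.append(Boundary(match.end(), terminator, ambiguous))
    return boundaries


def segment(text, include_ambiguous=False):
    """Wrapper that drops the details and returns only the sentence list.
    include_ambiguous=False gives a minimal segmentation, True a maximal one."""
    sentences, start = [], 0
    for boundary in segment_detailed(text):
        if boundary.ambiguous and not include_ambiguous:
            continue
        sentences.append(text[start:boundary.offset].strip())
        start = boundary.offset
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def weighted_sentence_count(text, ambiguous_weight=0.5):
    """Count sentences, giving ambiguous breaks less weight."""
    ambiguous = sum(1 for b in segment_detailed(text) if b.ambiguous)
    return len(segment(text)) + ambiguous_weight * ambiguous


text = "Dr. Curie won. She left."
print(segment(text))                          # ['Dr. Curie won.', 'She left.']
print(segment(text, include_ambiguous=True))  # ['Dr.', 'Curie won.', 'She left.']
print(weighted_sentence_count(text))          # 2.5
```

With this shape, a caller that fears false positives can use the minimal segmentation, one that fears false negatives can use the maximal one, and the terminator field leaves room for the boundary punctuation analysis mentioned in the first bullet.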

[1] am ar bg bn ca da de el en es fi fr gu hi hy ja kk kn ml mr my nl or pa pl pt ru sk ta te. Most of these have abbreviation lists. A few only specify fallback languages (e.g. Catalan -> Spanish) or custom punctuation (e.g. ';' ends a Greek sentence). One of them is empty and I believe unneeded (Japanese doesn't have ambiguous sentence terminators).
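
For readers unfamiliar with this kind of per-language data, the following Python sketch shows how fallback languages and custom punctuation might be resolved. The table layout, the function names and the Spanish abbreviation entries are hypothetical; only the Catalan -> Spanish fallback and the Greek ';' terminator come from the note above.

```python
FALLBACKS = {"ca": "es"}                     # Catalan falls back to Spanish
EXTRA_TERMINATORS = {"el": {";"}}            # ';' ends a Greek sentence
ABBREVIATIONS = {"es": {"sr", "sra", "dr"}}  # hypothetical entries


def resolve_abbreviations(lang):
    """Follow the fallback chain until a language with a list is found."""
    seen = set()
    while lang is not None and lang not in seen:
        if lang in ABBREVIATIONS:
            return ABBREVIATIONS[lang]
        seen.add(lang)  # guard against cyclic fallbacks
        lang = FALLBACKS.get(lang)
    return set()


def terminators(lang):
    """Default terminators plus any language-specific additions."""
    return {".", "!", "?"} | EXTRA_TERMINATORS.get(lang, set())


print(sorted(resolve_abbreviations("ca")))  # ['dr', 'sr', 'sra']
print(";" in terminators("el"))             # True
```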

We have decided the Editing Team will not use sentencex for the time being as we are not ready to commit to a sentence boundary API. We can revisit this in future as circumstances change.

This is because we are currently undecided on the best way to handle ambiguity. Sentence splitting is inherently heuristic (probabilistic) around ambiguous terminators (such as English full stop/period, which can denote a sentence end but also an abbreviation). For our solution to handle the ambiguity, there may be a contribution from outside the strictly algorithmic code; for instance, UI design may surface the concept for the user to disambiguate, and for a particular use case, we might prefer a default of splitting (or indeed a default of not splitting).

In short, our requirements from the sentence splitting code are still fluid, and so the agility of keeping the code within Edit Check seems best for now.

dchan updated the task description.