
Investigate sentence splitting
Open, High · Public · Spike

Description

This task covers identifying the technical approaches we will consider for splitting an arbitrary range of text/content into discrete sentences.

Decision(s) to be made

  • 1. What technical approach, if any, is accurate and reliable enough for the Editing Team to depend on for identifying discrete sentences, so that Edit Check can "automatically place a reference at the end of a sentence."

Investigation output

Per what @DLynch proposed in T324363#8561900, we will strive to make a prototype (or series of prototypes depending on how many approaches seem viable) that we can use in Patch demo to evaluate how effective a given approach is at splitting arbitrary content into discrete sentences.

Findings

  • Sentence splitting in Thai is not currently supported.
    • Reason: Thai does not use punctuation to mark the end of a sentence, which the current sentence-splitting approach depends on.

Open question(s)

  • 1. What languages will a given sentence splitting approach need to work in for us to consider it viable?

Done

  • Next steps are documented for all Decision(s) to be made
  • Answers to all Open question(s) are documented
  • Next steps are identified for all Findings (should there be any)


Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Dec 2 2022, 8:18 PM
Restricted Application added a subscriber: Aklapper.

Following up on a discussion with @ppelberg: I recently helped start a project to do sentence tokenization in (ideally all) Wikipedia languages. The project is written in Python, but there might be some reusable pieces. Always happy to talk. The quick summary:

  • We are doing our best to compile a list of sentence-ending punctuation that covers all Wikipedia languages
  • We then have a very simple sentence segmenter that essentially looks for the presence of those punctuation marks, with some caveats like ignoring decimal points (a rough sketch of this idea follows after the links below)
  • We are currently working on identifying how big an issue things like abbreviations are (we know they're very prevalent in, e.g., German Wikipedia) and whether we can devise a simple solution for detecting and skipping over them.

Project: https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools
Our current list of full-stop punctuation: https://gitlab.wikimedia.org/repos/research/wiki-nlp-tools/-/blob/main/src/wikinlptools/config/symbols.py#L324
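
As a rough illustration of that simple segmenter idea (this is not the wiki-nlp-tools code itself; the punctuation set below is deliberately tiny and abbreviations are intentionally left unhandled):

```
import re

# Deliberately small, illustrative set of sentence-ending punctuation; the
# real list in symbols.py covers many more scripts.
FULL_STOPS = ".!?。؟।॥"

_TERMINATOR = re.compile(rf"[{re.escape(FULL_STOPS)}]+")

def split_sentences(text):
    """Naive punctuation-based splitter: no abbreviation handling."""
    sentences, start = [], 0
    for m in _TERMINATOR.finditer(text):
        # Skip a full stop acting as a decimal point (digit on both sides).
        if (m.start() > 0 and text[m.start() - 1].isdigit()
                and m.end() < len(text) and text[m.end()].isdigit()):
            continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return [s for s in sentences if s]

print(split_sentences("Pi is about 3.14. It is irrational! これは文です。"))
# ['Pi is about 3.14.', 'It is irrational!', 'これは文です。']
```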

A complexity to bear in mind is that we're not splitting text, we're splitting either wikitext or HTML. This might cause some challenges around finding valid insertion points.
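
To illustrate the point (a toy example, not a proposed approach): a boundary found in the rendered text does not sit at the same offset in the underlying wikitext/HTML, so splitter output has to be mapped back through the markup to find a valid insertion point.

```
import re

html = '<p>Cats are mammals.<ref name="a"/> They purr.</p>'

# Crude tag stripping for illustration only; real code would walk the DOM.
plain = re.sub(r"<[^>]+>", "", html)

boundary = plain.index(".") + 1   # end of the first sentence in plain text
print(plain[:boundary])           # Cats are mammals.
print(html[:boundary])            # <p>Cats are mamma  -- same offset, wrong place
```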

Given that sentence splitting is (probably) going to need a server-side component, it might make sense for us to make a prototype tool that can be thrown onto patchdemo and output some sort of visualization of an article and how it'd be split, so we can get an idea of how viable this is for placement.

  1. What languages will a given sentence splitting approach need to work in for us to consider it viable?

While a number of languages have unique sentence-ending punctuation, the list of sentence-ending punctuation I shared earlier hopefully captures almost all of those, which makes this pretty feasible for almost all Wikipedia languages. The major language-specific caveats we're currently aware of:

  • German Wikipedia uses a lot of abbreviations, which lead simple methods to split the text far too often.
  • Thai doesn't use sentence-ending punctuation, so the only straightforward option there, short of developing a custom segmentation approach, is to split by paragraph. I haven't talked with the Search team to see whether they do anything special there that could be borrowed.
matmarex added subscribers: dchan, matmarex.

Meeting notes from today:

ppelberg renamed this task from [edit check] Investigate sentence splitting to Investigate sentence splitting. · Feb 6 2023, 5:46 PM

We're currently envisaging two uses for sentence segmentation in Edit Check:

  1. Counting sentences added. Used to decide whether to trigger the reference check.
  2. Suggesting sentence boundaries as locations to add a new citation.

As @Isaac pointed out, sentence segmentation is script dependent. For some scripts/languages, there are unambiguous sentence terminators, such as U+3002 (。) IDEOGRAPHIC FULL STOP in Chinese or Japanese, or U+0964 (।) DEVANAGARI DANDA in Indic languages. For other scripts, sentence terminators can be ambiguous, e.g. in Latin, Cyrillic etc. the character U+002E (.) FULL STOP is used to end sentences but also for abbreviations, decimals etc.

Unicode TR29 (https://www.unicode.org/reports/tr29/#Sentence_Boundaries), as mentioned by @cscott earlier, gives a language-independent algorithm that is generally quite accurate, but for some scripts is not 100% accurate. It describes how to extend the algorithm using language-specific lexical data, e.g. to identify whether a Latin FULL STOP character is part of a common abbreviation and therefore unlikely to be acting as a sentence terminator. The ICU library (with PHP bindings) implements the TR29 rules, augmented with lexical data. But note that for sentence segmentation, there's not actually very much data there: it's just abbreviation lists for seven languages, de en es fr it pt ru, totalling a few hundred abbreviations (see https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr ). ICU contains a lot of data for other types of segmentation (e.g. what is in effect a list of Chinese words), but not that much for sentence segmentation. Using the lexical data improves sentence segmentation accuracy, but not to 100%.
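
For a quick feel of what ICU's TR29 implementation produces, here is a minimal sketch using the PyICU Python bindings (purely for illustration; the server-side option discussed here would go through PHP's ICU bindings instead). Note that applying ICU's per-language abbreviation suppressions may require the filtered break iterator rather than the plain sentence instance, depending on the ICU version and API.

```
from icu import BreakIterator, Locale

text = "He arrived at 3 p.m. yesterday. It rained! Did it?"

bi = BreakIterator.createSentenceInstance(Locale("en_US"))
bi.setText(text)

start = bi.first()
for end in bi:                  # iterating yields successive boundary offsets
    print(repr(text[start:end]))
    start = end
```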

So implementing this with server-side ICU is a possibility. However, I think this is not the best option, because of the particular requirements for Edit Check, which I'll explain.

Edit Check requirements

For counting sentences, we do not need 100% accuracy: a confident answer could be counted as 1 and a tentative answer as, say, 0.5. For suggesting sentence boundaries, we also do not need 100% accuracy, but it is better to have a false positive (offering an unsuitable location that the user can simply ignore) than a false negative (not offering the correct location, making it difficult for the user to place the citation there).
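
As a concrete (hypothetical) reading of the counting idea: confident boundaries count as 1, tentative ones as 0.5, and the reference check triggers once the total crosses some threshold.

```
def added_sentence_count(boundaries):
    """boundaries: (offset, confident) pairs from the segmenter; a tentative
    boundary (e.g. a FULL STOP that might end an abbreviation) counts 0.5."""
    return sum(1.0 if confident else 0.5 for _, confident in boundaries)

# Hypothetical threshold for triggering the reference check.
if added_sentence_count([(42, True), (97, False)]) >= 1.0:
    print("trigger reference check")
```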

Once we have the sentence boundary, we do need to support per-language configuration of where exactly citations go at the end of a sentence, relative to closing parentheses, sentence terminators and whitespace. This is because different wikis have different conventions, and it would be disruptive if an automatic tool did not respect them. Here are some examples, where I've marked the citation placement with "█".

enwiki: … barely registered in my mind."█
eswiki: … y manifestó «profunda gratitud y gran humildad».█
frwiki: … vous faites appel à Roy »█.
jawiki: …(生まれはニャンザ州ラチュオニョ県カニャディアン村█)。
kowiki: … 생각하는 어떤 것을 선택할 수도 있다.█
ruwiki: … на берегах реки много парков и набережных█.
zhwiki: … 1532年法國作家弗朗索瓦·拉伯雷的《巨人傳》█。

You can see we have to support a number of different conventions about whether the citation goes before or after the sentence terminator, close parenthesis, whitespace etc. To support this, we'd really want to use Unicode character data to identify parentheses, quotation marks, whitespace and sentence terminators from all scripts.
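
As a hedged sketch of what such per-wiki configuration and Unicode-driven placement could look like (the convention names and mapping below are simply read off the examples above, not an existing config format), scanning backwards from the end of a segmented sentence yields the candidate insertion points:

```
import unicodedata

TERMINATORS = set(".!?。؟।॥")

def is_closing(ch):
    # Closing brackets (category Pe) and final quotation marks (Pf);
    # the plain ASCII double quote is category Po, so include it explicitly.
    return unicodedata.category(ch) in ("Pe", "Pf") or ch == '"'

def end_positions(sentence):
    """Candidate citation-insertion offsets at the end of one sentence."""
    after_space = len(sentence)
    i = after_space
    while i > 0 and sentence[i - 1].isspace():
        i -= 1
    after_closing = i                  # after closing brackets/quotes
    while i > 0 and is_closing(sentence[i - 1]):
        i -= 1
    after_terminator = i               # between terminator and closing punctuation
    while i > 0 and sentence[i - 1] in TERMINATORS:
        i -= 1
    before_terminator = i
    return {"before_terminator": before_terminator,
            "after_terminator": after_terminator,
            "after_closing": after_closing,
            "after_space": after_space}

# Hypothetical per-wiki conventions, read off the examples above.
PLACEMENT = {"enwiki": "after_closing", "ruwiki": "before_terminator"}

s = '… barely registered in my mind."'
pos = end_positions(s)[PLACEMENT["enwiki"]]
print(s[:pos] + "[1]" + s[pos:])   # … barely registered in my mind."[1]
```

This simple backward scan assumes the terminator comes before any closing punctuation and trailing whitespace; the frwiki and jawiki examples above show other orderings that a real implementation would also need to handle (the Japanese case even calls for a position before the TR29 boundary, as noted further down).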

Therefore I recommend we don't use ICU, but implement this in our own code instead. The Edit Check code will need direct access to the Unicode character data in any case, in order to place references correctly. (This is similar in essence to the Unicode data we already make available in UnicodeJS to find word boundaries.) Given this data, it is a small step to implement the TR29 rules ourselves, which would give us more flexibility to tune the algorithm. For example, we'd prefer false positives to false negatives, and we may not be satisfied with the trade-off built into ICU. Additionally, we could collect and use our own lexical data for abbreviations in languages other than the seven supported by ICU (de en es fr it pt ru). The custom segmenter can also be client-side code, giving us the flexibility to perform sentence segmentation without a server round-trip. This may make more types of UX innovation possible in the future.

Change 893832 had a related patch set uploaded (by Divec; author: Divec):

[unicodejs@master] WIP: sentencebreak

https://gerrit.wikimedia.org/r/893832

Proof-of-concept patch set to surface sentence segmentation in VisualEditor: https://gerrit.wikimedia.org/r/c/VisualEditor/VisualEditor/+/961095

Patchdemo instance of the above patch set (thanks @DLynch): https://patchdemo.wmflabs.org/wikis/4b31a139ad/w/index.php?title=Douglas_Adams&veaction=edit

Press Enter in a paragraph to detect sentence boundaries. The following characters are added around the sentence boundary:

⓿ immediately before the sentence terminator (e.g. ? or ! or . or 。)
❶ after the sentence terminator, but before any immediately following closing punctuation (e.g. brackets or quotes)
❷ after closing brackets or quotes, but before trailing whitespace
❸ after trailing whitespace

Sometimes the sentence boundary is marked by an ambiguous terminator (e.g. the Latin FULL STOP). Such terminators have dual use for another purpose (e.g. the Latin FULL STOP is used for abbreviations). Therefore the algorithm cannot be 100% certain this is actually a sentence boundary. This demo denotes the ambiguity by inserting alternative lighter characters ⓪①②③.

Examining the code's behaviour on the citation placement examples above, we can see that positions ⓿❶❷❸ would be sufficient for all those examples except the Japanese close bracket, for which the citation placement comes before the point where a Unicode TR29 sentence boundary would begin.

(Screenshot: image.png)

In case it helps, the Language team had a very similar requirement for our machine translation service (MinT) and for CX (cxserver). We just published our sentence segmentation library in Python and JavaScript. It also clusters references along with the previous sentence. It is designed to support a large number of languages, and custom rules per language are possible by design.

JS library demo https://santhoshtr.github.io/sentencex-js/
NPM package https://www.npmjs.com/package/sentencex
Python package https://pypi.org/project/sentencex
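
For reference, a minimal usage sketch of the Python package, assuming a segment(language, text) entry point as described in the package documentation (treat the exact API as an assumption and check the README):

```
from sentencex import segment

text = "The James Webb Space Telescope (JWST) is a space telescope. It was launched in 2021."
for sentence in segment("en", text):
    print(sentence)
```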