
Evaluate reliability of sentence splitting approach
Open, High, Public, Spike

Description

T331080 will make it possible for the Editing Team to observe how the sentence splitting work we've done in T324363 performs on Wikipedia articles published on production wikis.

This task covers the work of evaluating that performance.

Decisions to be made

Answering the "Open questions" below will help the Editing Team decide the following...

  • D1: What – if any – revisions will the Editing Team need to make to the assumption that "Edit Check will automatically be able to place references at the end of a sentence that someone has added"?
  • D2: What – if any – adjustments will the Editing Team need to make to the sentence splitting approach T324363 implements before we can be confident offering volunteers the ability to specify what qualifies as new content in terms of sentences rather than characters, as is currently implemented?

Open Questions

  • 1. Can the approach T324363 implements effectively find the end of sentences across languages?
  • 2. How do things "look" in ambiguous cases where there is a possible, but not certain, end of a sentence?
    • Note: knowing the answer to the above will help us to identify potential patterns that unify/explain cases where the sentence splitting/detection is not working as expected.
    • As noted in T324363#9200149, the current approach denotes sentence boundaries by inserting alternative lighter characters ⓪①②③.
  • 3. How might we make it easy to know how many sentences the sentence splitting algorithm "thought" a particular edit added so that we can compare it to the actual number of sentences that were added?
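One way to approach question 3 is to record, per edit, the count the splitter reported next to a hand count and then tally the disagreements. The following is a minimal sketch of that bookkeeping only. `EditSample`, `summarize`, and the sample values are hypothetical names and data made up for illustration; nothing here is taken from the T324363 implementation or from the edit tag data.

```python
from dataclasses import dataclass


@dataclass
class EditSample:
    rev_id: int      # hypothetical revision identifier
    predicted: int   # sentences the splitter reported for the edit
    actual: int      # sentences a human reviewer counted in the edit


def summarize(samples):
    """Return (exact_matches, over_counts, under_counts) for the samples."""
    exact = sum(1 for s in samples if s.predicted == s.actual)
    over = sum(1 for s in samples if s.predicted > s.actual)
    under = sum(1 for s in samples if s.predicted < s.actual)
    return exact, over, under


# Made-up values, only to show the shape of the comparison.
samples = [
    EditSample(rev_id=1, predicted=2, actual=2),
    EditSample(rev_id=2, predicted=3, actual=1),
    EditSample(rev_id=3, predicted=1, actual=2),
]
print(summarize(samples))
```

Keeping over-counts and under-counts separate may help with question 2: over-counts would point toward false-positive splits (for example after abbreviations), while under-counts would point toward sentence-ending punctuation the splitter does not recognize.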

Requirements

Use the edit tag T347644 introduces in a range of languages and document the cases where the sentence detection method T324363 implements fails to detect the end of a sentence and/or inaccurately "counts" the number of sentence(s) a given edit adds.

  • Where "range of languages" in this context means one language from each of the following categories:
    • Notable linguistic features
      • Bangla (unique full-stop punctuation)
      • Armenian (unique full-stop punctuation)
      • German + Hungarian (prevalence of abbreviations, noun capitalization)
    • Orientations
      • LTR language
        • .
      • RTL language
        • .
    • Scripts/Alphabets
      • Arabic
        • Arabic
      • Chinese logographies
        • .
      • Cyrillic
        • .
      • Indic
        • .
      • Kanji
        • Chinese
        • Japanese
      • Latin
        • .
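To make the "unique full-stop punctuation" entries concrete, here is a rough sketch of a counter that treats the Bangla danda (U+0964), the Armenian full stop (U+0589), and the ideographic full stop (U+3002) as terminators alongside ASCII punctuation. The terminator set, the `count_sentences` helper, and the sample strings are assumptions made for illustration; this is not the approach T324363 implements, and the sample sentences are simple constructed examples rather than text from any wiki.

```python
import re

# Assumed terminator set for illustration; not taken from T324363.
TERMINATORS = (
    ".!?"
    "\u0964"  # DEVANAGARI DANDA, used as the Bangla full stop
    "\u0589"  # ARMENIAN FULL STOP
    "\u3002"  # IDEOGRAPHIC FULL STOP
)
_CLASS = re.escape(TERMINATORS)
# A "sentence" here is a run of non-terminators followed by 1+ terminators.
_SENTENCE = re.compile(f"[^{_CLASS}]+[{_CLASS}]+")


def count_sentences(text: str) -> int:
    """Count runs of text that end in at least one terminator character."""
    return len(_SENTENCE.findall(text))


# Constructed two-sentence samples, one per script.
SAMPLES = {
    "bn": "আমি ভাত খাই\u0964 সে বই পড়ে\u0964",
    "hy": "Ես տուն եմ գնում\u0589 Նա գիրք է կարդում\u0589",
    "zh": "我喜欢读书\u3002他喜欢写字\u3002",
}
for lang, text in SAMPLES.items():
    print(lang, count_sentences(text))
```

Note that trailing text without a terminator is not counted by this sketch, and that it would over-count on abbreviations; it is only meant to show how a missing terminator character leads to an under-count for a whole language.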

Findings

TBD

Event Timeline

ppelberg created this task.

I like the set of languages/scripts you already have for evaluation. I know you're already aware that it will fail for Thai given the lack of explicit punctuation there. A few suggested inclusions:

  • German (because that's where we see the most abbreviations -- i.e. likely false-positives for sentence splits).
  • Bangla and Armenian are other languages that have unique full stop punctuation that we've missed in the past and are worth checking as well.

I think these are good suggestions. I'd also suggest adding an Indic script.

Agreed German would be sensible — also because each noun has a capital letter, making it less likely a period in the middle of a sentence will be followed by a lower-case letter.
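The interaction between abbreviations and noun capitalization can be illustrated with a small sketch. It assumes a common heuristic (split after a period when whitespace and an uppercase letter follow) and a tiny hand-written abbreviation list; both are assumptions for illustration and are not claimed to be what T324363 does. The function names and the sample sentence are likewise made up.

```python
import re

# Assumed heuristic: a period, then whitespace, then an uppercase letter.
BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-ZÄÖÜ])")

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"Dr.", "Nr.", "ca.", "bzw."}

TEXT = "Er traf Dr. Müller am Bahnhof. Danach ging er nach Hause."


def split_capital_heuristic(text):
    """Split wherever the uppercase-after-period heuristic fires."""
    return BOUNDARY.split(text)


def split_with_abbreviations(text):
    """Same heuristic, but re-join pieces that end in a known abbreviation."""
    sentences, current = [], []
    for piece in BOUNDARY.split(text):
        tokens = piece.split()
        if not tokens:
            continue  # skip empty pieces, e.g. for empty input
        current.append(piece)
        if tokens[-1] not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_capital_heuristic(TEXT))
print(split_with_abbreviations(TEXT))
```

Because "Müller" is capitalized, the uppercase cue does not rule out the split after "Dr.", so the first function breaks the sentence in the middle; the second only avoids this because "Dr." happens to be on the list. Re-joining with a single space also normalizes the original whitespace, which a real implementation would likely want to preserve.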

Chinese / Japanese "。" (U+3002 IDEOGRAPHIC FULL STOP) is unambiguous, but it'd be useful to check we can handle citation placement rules for zhwiki and others.

I'm not worried about RTL support for the logic of sentence splitting per se, but the visual effects of bidirectionality might raise interesting UI issues for citation placement, and it makes sense to surface these early too, so I agree we should test on an RTL wiki.

@Isaac + @dchan: thank you for sharing the above; I've updated the requirements in the task description to reflect the suggestions you've made in T331686#8683919 and T331686#8702108.

Note: I know the current layout of the "Requirements" section isn't the easiest to parse. I plan to come back and turn it into a more legible table.

  • German (because that's where we see the most abbreviations -- i.e. likely false-positives for sentence splits).

A similar issue is Hungarian where the full date format includes full stops followed by spaces. Not sure how common that is in other languages.
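As a rough illustration of the Hungarian date issue, the sketch below counts candidate boundaries (a full stop followed by whitespace) in a constructed two-sentence example and separately counts those where the full stop directly follows a digit. The `candidate_boundaries` helper, both regular expressions, and the sample text are assumptions made for this example only.

```python
import re

CANDIDATE = re.compile(r"\.\s+")       # any full stop followed by whitespace
AFTER_DIGITS = re.compile(r"\d\.\s+")  # full stop directly after a digit

# Constructed sample: two sentences, the first containing a full date.
SAMPLE = "A cikk 2023. március 15. óta létezik. Azóta bővült."


def candidate_boundaries(text):
    """Return (all candidates, candidates where the period follows a digit)."""
    return len(CANDIDATE.findall(text)), len(AFTER_DIGITS.findall(text))


print(candidate_boundaries(SAMPLE))
```

In this sample only one of the candidates is a real sentence boundary; the ones that follow a digit come from the date. Counting digit-adjacent candidates per wiki could be one cheap way to estimate how often this pattern shows up in other languages.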

Great spot, @Tgr; task description updated.

@dchan: have you already completed the review this task is "asking" for? If yes, can you please share the results of that investigation? If not, can you please work on that?

I ask the above thinking ahead to T338907 and wanting to be sure we're confident in the technical approach we've taken prior to deployment.