T331080 will make it possible for the Editing Team to observe how the sentence splitting work we've done in T324363 performs on Wikipedia articles published on production wikis.
This task involves the work of evaluating said "performance."
Decisions to be made
Answering the "Open questions" below will help the Editing Team decide the following...
- D1: What – if any – revisions will the Editing Team need to make to the assumption that "Edit Check will automatically be able to place references at the end of a sentence that someone has added"?
- D2: What – if any – adjustments will the Editing Team need to make to the sentence splitting approach T324363 implements before we can be confident offering volunteers the ability to specify what qualifies as new content in terms of sentences rather than characters, as is currently implemented?
Open Questions
- 1. Can the approach T324363 implements reliably find the ends of sentences across languages?
- 2. How do things "look" in ambiguous cases where there is a possible, but not certain, end of a sentence?
- Note: knowing the answer to the above will help us identify patterns that unify/explain the cases where sentence splitting/detection is not working as expected.
- As noted in T324363#9200149, the current approach denotes sentence boundaries by inserting alternative lighter characters ⓪①②③.
- 3. How might we make it easy to know how many sentences the sentence splitting algorithm "thought" a particular edit added so that we can compare it to the actual number of sentences that were added?
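One possible way to approach question 3 is sketched below. It is only an illustration, not the method T324363 implements: it assumes the splitter's output is available as a string annotated with the ⓪①②③ markers mentioned above, and that each marker character corresponds to exactly one detected sentence boundary. The names `BOUNDARY_MARKERS`, `count_marked_boundaries`, and `compare_counts` are hypothetical.

```python
# Assumption: each of these characters marks one detected sentence boundary.
BOUNDARY_MARKERS = "⓪①②③"


def count_marked_boundaries(annotated_text: str) -> int:
    """Count how many boundary markers appear in the annotated text."""
    return sum(1 for ch in annotated_text if ch in BOUNDARY_MARKERS)


def compare_counts(annotated_text: str, actual_sentences: int) -> dict:
    """Compare the detected sentence count with a manually counted one."""
    detected = count_marked_boundaries(annotated_text)
    return {
        "detected": detected,
        "actual": actual_sentences,
        # Positive: the splitter over-counted; negative: it under-counted.
        "difference": detected - actual_sentences,
    }
```

A reviewer could run `compare_counts` on each tagged edit, entering the actual sentence count by hand, and then collect the edits where `difference` is non-zero as candidates for closer inspection.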
Requirements
Use the edit tag T347644 introduces across a range of languages, and document the cases where the sentence detection method T324363 implements fails to detect the end of a sentence and/or miscounts the number of sentences a given edit adds.
- Where "range of languages" in this context means one language for each of the following:
- Notable linguistic features
- Orientations
- LTR language
- .
- RTL language
- .
- LTR language
- Scripts/Alphabets
- Arabic
- Arabic
- Chinese logographies
- .
- Cyrillic
- .
- Indic
- .
- Kanji
- Chinese
- Japanese
- Latin
- .
- Arabic
Findings
TBD