Vision
- Researchers could start with a Wikipedia article (wikitext or HTML), strip the syntax to leave just paragraphs of plaintext, and then tokenize those paragraphs into sentences and words for input into models (see the sketch after this list).
- This would be language-agnostic – i.e. the library would work equally well regardless of Wikipedia language.
- Each component would be a Python library that is easily configurable but provides good performance out-of-the-box.
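A minimal sketch of that flow, assuming mwparserfromhell for the syntax-stripping step and naive regex-based sentence/word splitting; the function names and regexes here are illustrative and not the API of any of the libraries discussed below.

```python
# Illustrative pipeline only: wikitext -> plaintext -> sentences -> words.
import re
import mwparserfromhell

def wikitext_to_plaintext(wikitext):
    """Strip wiki syntax, leaving paragraphs of plain text."""
    return mwparserfromhell.parse(wikitext).strip_code()

def to_sentences(paragraph):
    """Naive sentence split on terminal punctuation; a real tokenizer
    needs per-language rules (different stops, scripts without spaces)."""
    return [s.strip() for s in re.split(r"(?<=[.!?।。])\s+", paragraph) if s.strip()]

def to_words(sentence):
    """Naive Unicode word split; only works for whitespace-delimited languages."""
    return re.findall(r"\w+", sentence, flags=re.UNICODE)

wikitext = "'''Django''' is a [[Python (programming language)|Python]] web framework. It is free."
plaintext = wikitext_to_plaintext(wikitext)
for sentence in to_sentences(plaintext):
    print(sentence, to_words(sentence))
```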
Current state
- We have a strong start on both wikitext -> plaintext (edit types library) and HTML -> plaintext (HTML dumps library). Both can still be improved and streamlined, but they provide the basic functionality.
- We have explored various approaches to sentence/word tokenization, but our code is scattered across projects, still has known gaps, and has not been well-tested across languages (one possible consolidated interface is sketched below).
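For the consolidated tokenizer, one plausible shape is a class with sensible defaults and per-language overrides, in line with the configurable-but-good-defaults goal above. Everything below (class name, parameters, abbreviation handling) is hypothetical, not an existing API.

```python
import re

DEFAULT_TERMINATORS = ".!?।。？！"

class SentenceTokenizer:
    def __init__(self, lang="en", terminators=DEFAULT_TERMINATORS, abbreviations=None):
        self.lang = lang
        self.terminators = terminators
        # Abbreviations that should not end a sentence ("e.g.", "Dr.", ...).
        self.abbreviations = set(abbreviations or [])

    def tokenize(self, text):
        # Split after a terminator followed by whitespace, then re-join any
        # split that was triggered by a known abbreviation.
        pattern = r"(?<=[{}])\s+".format(re.escape(self.terminators))
        chunks = [c for c in re.split(pattern, text) if c]
        sentences = []
        for chunk in chunks:
            if sentences and sentences[-1].split()[-1] in self.abbreviations:
                sentences[-1] = sentences[-1] + " " + chunk
            else:
                sentences.append(chunk)
        return sentences

tok = SentenceTokenizer(lang="en", abbreviations={"e.g.", "Dr."})
print(tok.tokenize("Dr. Smith arrived. He was late."))
# ['Dr. Smith arrived.', 'He was late.']
```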
Applications:
- Structured Tasks
  - Copy-edit: needs the article split into individual sentences to feed into the model.
  - Add-a-link: split articles into sentences to capture the appropriate context for each word in the model; split sentences into words to know which tokens to evaluate for links.
  - Citation-needed: needs the article split into individual sentences to feed into the citation-needed model (see the sketch below for how these tasks consume tokenizer output).
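All three tasks consume the same per-sentence view of the article; add-a-link additionally needs the word tokens within each sentence. A hedged illustration of that wiring, with the model calls left as hypothetical placeholders:

```python
def sentence_contexts(plaintext, sentence_tokenizer, word_tokenizer):
    # Each sentence becomes one model input; each word within it is a
    # candidate token (e.g. for link recommendation).
    for sentence in sentence_tokenizer(plaintext):
        yield sentence, word_tokenizer(sentence)

# Example wiring with the naive tokenizers sketched earlier; the model
# objects below are hypothetical stand-ins for the task-specific models.
# for sentence, words in sentence_contexts(plaintext, to_sentences, to_words):
#     needs_citation = citation_needed_model.predict(sentence)
#     link_scores = {w: link_model.score(w, context=sentence) for w in words}
```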
- Metrics / Analysis
  - Edit types: summarize what was changed by an edit on Wikipedia – e.g., the number of sentences/words added.
  - Readability: extract sentences to identify the number of entities per sentence as a proxy for readability.
  - Quality model: the number of sentences is a better proxy for the amount of content than byte count (see the sketch below).
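For these metrics, the common pattern is counting sentences/words in a revision, or the difference between two revisions. A rough sketch under the tokenizer assumptions above; this is not the edit-types library's actual output format:

```python
def text_stats(plaintext, sentence_tokenizer, word_tokenizer):
    sentences = sentence_tokenizer(plaintext)
    words = [w for s in sentences for w in word_tokenizer(s)]
    return {"sentences": len(sentences), "words": len(words)}

def edit_summary(old_text, new_text, sentence_tokenizer, word_tokenizer):
    # Signed difference in counts between two revisions,
    # e.g. {"sentences": 2, "words": 31} for an expansion.
    old = text_stats(old_text, sentence_tokenizer, word_tokenizer)
    new = text_stats(new_text, sentence_tokenizer, word_tokenizer)
    return {k: new[k] - old[k] for k in old}
```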
- Extraction
  - TextExtracts Extension: return the first k sentences of an article.
  - HTML Dumps: extract plaintext from article HTML; eventually it might be nice to feed the output into the sentence tokenizer for even more control.
  - Vandalism detection: feature generation would benefit from word tokenization and likely sentence tokenization as well (see the sketches below).
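Simple illustrations of the extraction-style uses, assuming the tokenizers above; neither reflects how TextExtracts or the vandalism-detection features are actually implemented:

```python
from collections import Counter

def lead_extract(plaintext, k, sentence_tokenizer):
    # First k sentences of an article, in the spirit of TextExtracts.
    return " ".join(sentence_tokenizer(plaintext)[:k])

def word_features(plaintext, word_tokenizer):
    # Bag-of-words counts as a stand-in for vandalism-detection features.
    return Counter(w.lower() for w in word_tokenizer(plaintext))
```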