NLP Tools for Content Gaps
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Isaac
	Sep 2 2022, 3:41 PM

Description

Vision

Researchers could start with a Wikipedia article (wikitext or HTML), strip syntax to leave just paragraphs of plaintext, and then further tokenize these sentences into sentences and words for input into models.
This would be language-agnostic – i.e. the library would work equally well regardless of Wikipedia language.
Each component is a Python library that is easily configurable but provides good default performance out-of-the-box

Current state

We have a strong start on both wikitext -> plaintext (edit types library) and HTML -> plaintext (HTML dumps library). Both can see continued improvement / streamlining but provide the basic functionality.
We have explored various approaches for sentence/word tokenization but our code is scattered across projects, still has known gaps, and has not been well-tested in many languages.

Applications:

Structured Tasks
- Copy-edit: need to split article into individual sentences to feed into model.
- Add-a-link: split articles into sentence to capture appropriate context for each word in the model. Split sentences into words to know which tokens to evaluate for links.
- Citation-needed: need to split article into individual sentences to feed into citation-needed model.
Metrics / Analysis
- Edit types: summarize what was changed by an edit on Wikipedia – e.g., # of sentences/words added.
- Readability: extract sentences to identify the number of entities per sentence as a proxy for readability.
- Quality model: number of sentences as a better proxy for amount of content than bytes.
Extraction
- TextExtracts Extension: return first k sentences in an article.
- HTML Dumps: extract plaintext from article HTML – eventually might be nice to feed output into sentence tokenizer for even more control.
- Vandalism detection: feature generation would benefit from word tokenization and likely sentence tokenization as well.

Related Objects
Search...

Status	Assigned	Task
Resolved	Appledora	T316941 NLP Tools for Content Gaps
Resolved	Appledora	T328260 NLP Tools: Sentence Tokenization
Resolved	Appledora	T328261 Rule-based Sentence Tokenization
Resolved	Appledora	T328263 Consider abbreviations while sentence splitting
Declined	Appledora	T328272 Sentence Tokenization: Evaluation Pipeline
Resolved	Appledora	T328264 NLP Tools: Word Tokenization
Resolved	Appledora	T328265 Word Tokenization: White-spaced language Tokenization
Declined	Appledora	T328267 Word Tokenization: Non-whitespace languages
Declined	Appledora	T328269 Sentencepiece: Language Family Wise training
Declined	Appledora	T328270 Sentencepiece: all non-whitespace languages

Event Timeline

Known challenges:

Sentence tokenization:
- Fully language-inclusive list of sentence-ending punctuation? Can we check each language edition to see if any seem to have very long sentences or punctuation characters that show up commonly that we're not capturing?
- Abbreviations: can we devise a means of automatically building lists of common abbreviations in a language?
Word tokenization:
- Strategy for languages that don't use whitespace to separate words: can we train standard tokenizers – e.g., sentencepiece – that outperform bigrams without much additional overhead? Perhaps use Mediawiki fallback languages as language families?
Evaluation:
- Can we build a good starter test suite for tokenization?
- Are there other automated ways we can test the quality of our tokenization – e.g., add-a-link performance?

Claiming for now until I can pass off to NT

Fleshed out known challenges list and set plan for work:

Setup
- Gitlab repo
- License, basic README / CI, basic folder structure
Sentence Tokenization
- Build list of sentence-ending punctuation (full stops)
- Account for exceptions – e.g., abbreviations, decimal points, etc.
Word Tokenization
- Strategy for whitespace-delimited languages
- Strategy for non-whitespace-delimited languages
Evaluation
- Unit tests
- Automatic evaluation
Wikitext/HTML -> Plaintext
- HTML Dumps
Miscellanea
- Stemming?
- Demonstrating use-cases for library
- Blogpost / sharing out with Research community and inviting feedback
- Emergent items

Kick-off meeting the week of Sept 26!

Isaac reassigned this task from Isaac to Appledora.Sep 27 2022, 2:30 PM

update : [30.09.2022]

Setup the basic project structure on gitlab. PR#2
Started analysis on pre-established NLP packages (NLTK, Gensim, Spacy) PR#3

update: [07.10.2022]

Building a report on Sentence Tokenization link
Renewed focus on memory-footprint and compute- cost

update: [14.10.2022]

Finished the review on standard tools of Sentence Tokenization doc
Informal Report on Milestones link
Collecting resources for rule-based segmentation
Isaac created a meta page

Isaac mentioned this in T321086: Requesting access to analytics-privatedata-users for appledora.Oct 18 2022, 2:23 PM

Isaac moved this task from FY2022-23-Research-July-September to FY2022-23-Research-October-December on the Research board.Oct 19 2022, 5:30 PM

Isaac edited projects, added Research (FY2022-23-Research-October-December); removed Research (FY2022-23-Research-July-September).

update: [18.10.2022] and [25.10.2022]

compiled list of unicode sentence terminators
built a benchmark sample for four langauges ( EN ES DE AR)
implemented the naive rule-based sentence segmenter
collected dataset for testing and future supervised training

update: [04. 11.2022]

Server Onboarding
Building a deterministic Benchmark Module and dataset development
Going through the example pySpark notebook by Martin and other walkthrough documentations by Isaac

update:[11.11.2022]

Curated list of abbreviations for all languages with a wiktionary project.
Working on integrating the abbreviation search as a replacement scheme.

update: [18.11.2022] and [27.11.2022]

Implemented abbreviation replacement scheme
performance analysis of segmentation before and after abbreviation post-processing.
Implement a filtration scheme for the wiktionary abbreviations
Performance analysis of abbreviation filtration, across a range of frequency ratio threshold

update: [06.12.2022]

Addressed reviews on the abbreviation filtering scheme
discussion on sentence segmentation evaluation datasets
wrote algorithm for abbreviation filtering

update: [13.12.2022]

Adapted Martin's code on wikitext processing for abbreviation filtering
Moved to statmachine for running simulations using pyspark
Started literature review + background study on Word Tokenization

update: [20.12.2022]

Mostly spent time trying to get familiarized with pyspark
Updated the Word tokenization literature review with more information on existing opensource tools
Discovered some additional edge-cases on sentence tokenizations (e.g: parenthesis and quotation tracking)

update: [3.01.2023]

Finished adapting abbreviation filtering code for each wiki project
Got started on word tokenization
Addressed reviews of an open MR

update: [9.01.2023] and [17.03.2023]

Informal review
Started cleaning up issues on GitLab, to make them more verbose
Debugged some pyspark issues related to resource requirements + working with distributed files
Grouping languages by cluster (delimiter + fallback wise)
Started annotating FLORES101 dataset for sentence segmentation task.

update: [24.01.2023]

Updated abbreviation pipeline to consider minimum word frequency
Moved tasks to phabricator to establish hierarchical structure
Built the pipeline for sentencepiece corpus generation

update: [01.02.2023]

Uploaded language wise filtered abbreviations lists with an MR
Trained sentencepiece on sample group of languages
Created new MR for sentencepiece scripts
More gitlab issue-cleanup

update: [07.02.2023]

Adapted to use the abbreviation lists from a pickle file
Minor modifications in the notebook
Implemented a rule-based word tokenization method for whitespace-delimited languages (resources from this paper)
Some issue clean-up