
NLP Tools for Content Gaps
Closed, Resolved · Public



  • Researchers could start with a Wikipedia article (wikitext or HTML), strip the syntax to leave just paragraphs of plaintext, and then tokenize that text into sentences and words for input into models.
  • This would be language-agnostic – i.e. the library would work equally well regardless of Wikipedia language.
  • Each component is a Python library that is easily configurable but provides good default performance out-of-the-box
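As a rough illustration of the wikitext -> plaintext step, a toy stripper might look like the following (a minimal sketch with hand-rolled regexes, not the actual edit types or HTML dumps libraries; real markup needs a proper parser):

```python
import re

def wikitext_to_plaintext(wikitext: str) -> str:
    """Toy wikitext stripper: handles links, bold/italic marks, and
    simple (non-nested) templates; real markup needs a proper parser."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                              # '''bold''' / ''italic''
    return text

print(wikitext_to_plaintext("'''Paris''' is the [[capital city|capital]] of [[France]]."))
# → Paris is the capital of France.
```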

Current state

  • We have a strong start on both wikitext -> plaintext (edit types library) and HTML -> plaintext (HTML dumps library). Both can see continued improvement / streamlining but provide the basic functionality.
  • We have explored various approaches for sentence/word tokenization but our code is scattered across projects, still has known gaps, and has not been well-tested in many languages.


  • Structured Tasks
    • Copy-edit: need to split article into individual sentences to feed into model.
    • Add-a-link: split articles into sentences to capture appropriate context for each word in the model. Split sentences into words to know which tokens to evaluate for links.
    • Citation-needed: need to split article into individual sentences to feed into citation-needed model.
  • Metrics / Analysis
    • Edit types: summarize what was changed by an edit on Wikipedia – e.g., # of sentences/words added.
    • Readability: extract sentences to identify the number of entities per sentence as a proxy for readability.
    • Quality model: number of sentences as a better proxy for amount of content than bytes.
  • Extraction
    • TextExtracts Extension: return first k sentences in an article.
    • HTML Dumps: extract plaintext from article HTML – eventually might be nice to feed output into sentence tokenizer for even more control.
    • Vandalism detection: feature generation would benefit from word tokenization and likely sentence tokenization as well.

Event Timeline

Known challenges:

  • Sentence tokenization:
    • Fully language-inclusive list of sentence-ending punctuation? Can we check each language edition for unusually long sentences or commonly occurring punctuation characters that we're not capturing?
    • Abbreviations: can we devise a means of automatically building lists of common abbreviations in a language?
  • Word tokenization:
    • Strategy for languages that don't use whitespace to separate words: can we train standard tokenizers – e.g., sentencepiece – that outperform bigrams without much additional overhead? Perhaps use Mediawiki fallback languages as language families?
  • Evaluation:
    • Can we build a good starter test suite for tokenization?
    • Are there other automated ways we can test the quality of our tokenization – e.g., add-a-link performance?
Isaac moved this task from Backlog to FY2022-23-Research-July-September on the Research board.

Claiming for now until I can pass off to NT

Fleshed out known challenges list and set plan for work:

  1. Setup
    • Gitlab repo
    • License, basic README / CI, basic folder structure
  2. Sentence Tokenization
    • Build list of sentence-ending punctuation (full stops)
    • Account for exceptions – e.g., abbreviations, decimal points, etc.
  3. Word Tokenization
    • Strategy for whitespace-delimited languages
    • Strategy for non-whitespace-delimited languages
  4. Evaluation
    • Unit tests
    • Automatic evaluation
  5. Wikitext/HTML -> Plaintext
    • HTML Dumps
  6. Miscellanea
    • Stemming?
    • Demonstrating use-cases for library
    • Blogpost / sharing out with Research community and inviting feedback
    • Emergent items

Kick-off meeting the week of Sept 26!

update: [30.09.2022]

  1. Set up the basic project structure on GitLab. PR#2
  2. Started analysis of pre-established NLP packages (NLTK, Gensim, spaCy). PR#3

update: [07.10.2022]

  1. Building a report on Sentence Tokenization link
  2. Renewed focus on memory footprint and compute cost

update: [14.10.2022]

  1. Finished the review of standard sentence tokenization tools: doc
  2. Informal Report on Milestones link
  3. Collecting resources for rule-based segmentation
  4. Isaac created a meta page

update: [18.10.2022] and [25.10.2022]

  1. Compiled a list of Unicode sentence terminators
  2. Built a benchmark sample for four languages (EN, ES, DE, AR)
  3. Implemented the naive rule-based sentence segmenter
  4. Collected datasets for testing and future supervised training
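The naive rule-based segmenter can be sketched roughly as follows (the terminator set here is a small illustrative subset, not the full compiled Unicode list):

```python
import re

# Illustrative subset of Unicode sentence terminators: Latin full stop,
# exclamation/question marks, Arabic question mark, Devanagari danda,
# CJK full stop, horizontal ellipsis.
TERMINATORS = ".!?؟।。…"

def naive_sentence_split(text: str) -> list[str]:
    """Split on a terminator followed by whitespace; no abbreviation handling."""
    pattern = rf"(?<=[{re.escape(TERMINATORS)}])\s+"
    return [s for s in re.split(pattern, text) if s]

print(naive_sentence_split("Hello world! How are you? Fine."))
# → ['Hello world!', 'How are you?', 'Fine.']
```

Note the obvious gaps this leaves: abbreviations, decimal points, and scripts that place no whitespace after the terminator.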

update: [04.11.2022]

  1. Server Onboarding
  2. Building a deterministic Benchmark Module and dataset development
  3. Going through the example pySpark notebook by Martin and other walkthrough documentations by Isaac


  1. Curated list of abbreviations for all languages with a Wiktionary project.
  2. Working on integrating the abbreviation search as a replacement scheme.

update: [18.11.2022] and [27.11.2022]

  1. Implemented the abbreviation replacement scheme
  2. Performance analysis of segmentation before and after abbreviation post-processing
  3. Implemented a filtration scheme for the Wiktionary abbreviations
  4. Performance analysis of abbreviation filtration across a range of frequency-ratio thresholds
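The abbreviation replacement scheme isn't spelled out here, but one common approach is to mask abbreviation periods before splitting and restore them afterward. A minimal sketch, with hypothetical list entries standing in for the Wiktionary-derived ones:

```python
import re

# Hypothetical entries; the real lists are curated from Wiktionary.
ABBREVIATIONS = {"Dr.", "e.g.", "etc."}
PLACEHOLDER = "\u0000"  # sentinel unlikely to appear in article text

def split_with_abbreviations(text: str) -> list[str]:
    # Mask the periods of known abbreviations so they don't end sentences.
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.replace(".", PLACEHOLDER))
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Restore the masked periods.
    return [s.replace(PLACEHOLDER, ".") for s in sentences]

print(split_with_abbreviations("Dr. Smith arrived. She left early."))
# → ['Dr. Smith arrived.', 'She left early.']
```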

update: [06.12.2022]

  1. Addressed reviews on the abbreviation filtering scheme
  2. discussion on sentence segmentation evaluation datasets
  3. wrote algorithm for abbreviation filtering

update: [13.12.2022]

  1. Adapted Martin's code on wikitext processing for abbreviation filtering
  2. Moved to the stat machines for running simulations using pyspark
  3. Started literature review + background study on Word Tokenization

update: [20.12.2022]

  1. Mostly spent time trying to get familiarized with pyspark
  2. Updated the Word tokenization literature review with more information on existing opensource tools
  3. Discovered some additional edge-cases in sentence tokenization (e.g., parenthesis and quotation tracking)

update: [3.01.2023]

  1. Finished adapting abbreviation filtering code for each wiki project
  2. Got started on word tokenization
  3. Addressed reviews of an open MR

update: [9.01.2023] and [17.01.2023]

  1. Informal review
  2. Started cleaning up issues on GitLab, to make them more verbose
  3. Debugged some pyspark issues related to resource requirements + working with distributed files
  4. Grouping languages by cluster (delimiter + fallback wise)
  5. Started annotating FLORES101 dataset for sentence segmentation task.

update: [24.01.2023]

  1. Updated abbreviation pipeline to consider minimum word frequency
  2. Moved tasks to phabricator to establish hierarchical structure
  3. Built the pipeline for sentencepiece corpus generation

update: [01.02.2023]

  1. Uploaded language wise filtered abbreviations lists with an MR
  2. Trained sentencepiece on sample group of languages
  3. Created new MR for sentencepiece scripts
  4. More gitlab issue-cleanup

update: [07.02.2023]

  1. Adapted to use the abbreviation lists from a pickle file
  2. Minor modifications in the notebook
  3. Implemented a rule-based word tokenization method for whitespace-delimited languages (resources from this paper)
  4. Some issue clean-up
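A rule-based word tokenizer for whitespace-delimited languages can be sketched as a single regex pass that splits on whitespace and peels punctuation off as separate tokens (an assumption-laden toy, not the method from the cited paper):

```python
import re

def word_tokenize(sentence: str) -> list[str]:
    """Naive rule-based tokenizer for whitespace-delimited languages:
    keep internal apostrophes/hyphens inside words, emit other
    punctuation as standalone tokens."""
    return re.findall(r"\w+(?:['’-]\w+)*|[^\w\s]", sentence)

print(word_tokenize("It's a well-known fact, isn't it?"))
# → ["It's", 'a', 'well-known', 'fact', ',', "isn't", 'it', '?']
```

Since `\w` is Unicode-aware in Python, the same pattern carries over to other whitespace-delimited scripts without modification.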

update: [15.02.2023]

  1. Using a JSON of the abbreviations files
  2. Implemented character-level word tokenization scheme
  3. Minor CI/CD reconfiguration

update: [21.02.2023] - [28.02.2023]

  1. Restructured tokenizer class
  2. Addressed reviews on optimizing the word-tokenization schemes
  3. Reorganized

update: [07.03.2023]

  1. Identified newer issues stemming from the tokenization class implementations
  2. Set up repo for packaging/easy installation
  3. Added testing modules

update: [14.03.2023]

  1. Merged remaining MRs on packaging and code restructuring.
  2. Started working on NWS word tokenization with sentencepiece
  3. Corpus collection for SPC tokenizer and training

update: [21.03.2023]

  1. Added MR on SPC integration with training and corpus collection scripts
  2. Adapted the test suite to the new repo structure
  3. Updated the ground truth dataset format for evaluation

update: [28.03.2023]

  1. Added test modules for NWS sentence tokenization.
  2. Integrated abbreviations in SPC tokenization
  3. Defined the unsupervised word tokenization performance evaluation scheme

update: [03.04.2023] - [11.04.2023]

  1. Informal quarterly review
  2. Annotated sentence evaluation dataset for Bangla
  3. Addressed reviews on earlier MRs
  4. Identified some more issues

update: [18.04.2023]

  1. Pushed MR on NWS word tokenization evaluation
  2. Calculated BN-EN split alignment
  3. Working on upgrading spark version

update: [25.04.2023]

  1. Addressed reviews on the MR
  2. Additional alignment stats for BN-EN splits
  3. Upgraded existing notebooks to spark3

update: [02.05.2023]

  1. 3-way alignment between BN/EN/DE splits
  2. Wrote scripts for building a Wikipedia benchmark dataset for word tokenization

update: [09.05.2023]

  1. Addressed reviews on current MR
  2. Created new benchmarking dataset for sentence tokenization evaluation
  3. Scripts on wikilink parsing

update: [16.05.2023]

  1. Completed MR on Word Tokenization Evaluation dataset generation
  2. FLORES language name alignment sheet

update: [23.05.2023]

  1. Word tokenization benchmarking error logging integration
  2. Fixed stat-machines related errors + upgraded notebooks
  3. Identified newer edge-cases for sentence tokenization

update: [30.05.2023]

  1. Sentence Evaluation Dataset expansion
  2. Fixed recall greater than 1 error
  3. Clustered NWS language scripts
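A recall-greater-than-1 bug typically comes from counting duplicate matches; computing boundary precision/recall over sets avoids it. A minimal sketch (assuming segmentations are compared as sets of boundary character offsets):

```python
def boundary_prf(predicted: set[int], gold: set[int]) -> tuple[float, float, float]:
    """Precision/recall/F1 over sentence-boundary character offsets.
    Set intersection caps true positives, keeping recall <= 1 even if
    the segmenter emits duplicate boundaries."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = boundary_prf({10, 25, 40}, {10, 25, 50})
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.67 0.67 0.67
```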

update: [06.06.2023]

  1. Shifted from a sentence-level to a paragraph-level dataset due to bad segmentation errors
  2. Trained individual + combined + clusterwise SPC models

update: [13.06.2023]

  1. Benchmarking output and log analysis
  2. Closed remaining MRs
  3. Discussing packaging decisions

update: [21.06.2023]

  1. Exploratory notebook generation
  2. Test PyPI uploads
  3. Listed newer low priority issues

update: [26.06.2023 - 27.06.2023]

  1. Final presentation preparations, feedback and discussions
  2. Package released on PyPI
  3. Wrapped up with a presentation to the research team!