
A system for releasing data dumps from a classifier detecting unsourced sentences in Wikipedia
Closed, Resolved · Public


The Wikimedia Foundation Research team has been working on a project titled Identification of Unsourced Statements. The research project aimed to create a "Citation Needed" machine learning model able to automatically classify statements on Wikipedia as requiring a citation. We would now like to use this model to provide data dumps to the Wikimedia community for use in developing tools, bots, and other systems for improving the encyclopedia’s reliability.

On the English Wikipedia more than 300,000 statements have been manually flagged as requiring a citation, but many more are left untagged. These tags help the community to identify areas of the project requiring improvement, and signal to readers that information should be independently verified. Wikimedia tools such as Citation Hunt use these Citation Needed tags to provide microtasks to users interested in improving Wikipedia’s reliability, highlighting these unsourced statements and asking users to find a relevant citation. Releasing public, updated data about which sentences need citations in Wikipedia can be incredibly useful to augment the potential of such tools, as well as foster the creation of new community tools that leverage micro-tasks to improve Wikipedia’s reliability. This data can be further enriched by discovering more sentences missing citations using the machine learning model developed by the Research team.

The “Citation Needed” model takes as input a sentence from a Wikipedia article and its section title, and outputs a “citation needed” score reflecting whether the sentence should have a citation. The task of this internship is to scale this process up. The candidate will design an end-to-end system that automatically parses the periodic Wikipedia XML dumps to extract unsourced sentences and their section titles, classifies these sentences using the machine learning model to detect which of them are actually missing citations, and releases the output in the form of periodic data dumps.
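The pipeline described above could be sketched roughly as follows. This is a minimal, self-contained illustration, not the project's implementation: the real system processes full dump files and calls the Research team's trained model, whereas `score_citation_needed` here is a stand-in heuristic and `CITATION_THRESHOLD` is an assumed cut-off.

```python
# Hypothetical sketch of the dump-parsing pipeline: read a MediaWiki
# XML fragment, split article text into sentences per section, skip
# sentences that already carry a <ref> tag, and score the rest.
import re
import xml.etree.ElementTree as ET

CITATION_THRESHOLD = 0.5  # assumed cut-off, not from the project


def score_citation_needed(sentence, section):
    # Stand-in for the Research team's model: crudely flag sentences
    # making a definite factual claim as likely needing a citation.
    return 0.9 if re.search(r"\b(is|was|were)\b", sentence) else 0.1


def extract_unsourced(dump_xml):
    """Yield (title, section, sentence, score) for unsourced sentences."""
    root = ET.fromstring(dump_xml)
    for page in root.iter("page"):
        title = page.findtext("title")
        text = page.findtext("revision/text") or ""
        section = "MAIN"
        for line in text.splitlines():
            heading = re.match(r"==+\s*(.*?)\s*==+", line)
            if heading:
                section = heading.group(1)
                continue
            for sentence in re.split(r"(?<=[.!?])\s+", line):
                # Sentences already followed by a reference are skipped.
                if sentence and "<ref" not in sentence:
                    score = score_citation_needed(sentence, section)
                    if score >= CITATION_THRESHOLD:
                        yield title, section, sentence, score


dump = """<mediawiki>
  <page>
    <title>Example</title>
    <revision><text>== History ==
The town was founded in 1850. It has a museum.&lt;ref&gt;src&lt;/ref&gt;</text></revision>
  </page>
</mediawiki>"""

rows = list(extract_unsourced(dump))
```

In a production version the XML would be streamed rather than loaded whole, and the heuristic scorer would be replaced by the actual classifier.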

After creating a system capable of generating periodic data dumps, further work could include writing documentation, publicising the data, and supporting integration into tools like Citation Hunt.


@Surlycyborg / eggpi on GitHub / ggp on IRC


Event Timeline

srishakatux changed the visibility from "Public (No Login Required)" to "acl*outreachy-mentors (Project)". Sep 26 2019, 11:11 PM
srishakatux changed the edit policy from "All Users" to "acl*outreachy-mentors (Project)".
srishakatux subscribed.

(we will make this project public on Oct 1 as per Outreachy's guidelines)

srishakatux changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)". Oct 1 2019, 5:58 PM
srishakatux changed the edit policy from "acl*outreachy-mentors (Project)" to "All Users".

(contribution period is now open)

Hi, I am Jyoti from India. I am a BCA student. I am new to the open source world, and I am participating in Outreachy. I have a keen interest in machine learning and natural language processing, and I have good skills with Python. Could you tell me how I can contribute to this machine learning project for Wikimedia? Thanks, Jyoti

Hi @jyox007 :) You can find the tasks to get involved in this project listed above, below the task description.

Please note that the Outreachy deadline is November 5th! We've not had many people finalise their application yet so please make sure you do that if you want to be considered!

You can record your contribution and make a final application at

Final reminder! The application deadline is in ~6 hours.

leila subscribed.

@Miriam I have temporarily assigned this task to you so we know on our end who owns the task. Please feel free to reassign to the right person and/or take any other steps needed.

Miriam moved this task from Backlog to In Progress on the Research board.

Weekly update

  • Modified the input pipeline and the format written to the database.
  • Worked on a script to ingest data into Citation Hunt.
  • Created pull requests for items 1 and 2 for Guilherme to review

Week 1-8 Summary

For Citation Detective

  • A design specification for the project
  • First script & PR merged to the Github repository
  • Ran Citation Detective on ~9k articles
  • Modified input pipelines and the format written to the database

For Citation Hunt

  • Set up a local Citation Hunt and studied its workflows
  • Worked on scripts to ingest data into Citation Hunt
  • Added a simple UI to highlight sentences detected as lacking citations

Week 9-10

  • Refined code in aiko-citationhunt
  • Added unit tests to aiko-citationhunt
  • Ran Citation Detective on ~60k articles

Week 11-12

  • Resolved data quality issues (broken sentences, etc.)
  • Worked on multiprocess version of Citation Detective
  • Implemented article filtering/ran on 100k articles

In the last week of the internship, I've been working on:

  • Creating a crontab to update the database periodically
  • Writing an announcement email
  • Documentation: Meta page, README on GitHub
  • Updating aiko-citationhunt to use the final, public citationdetective_p
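The periodic database update mentioned above could be wired up with a crontab entry along these lines. The schedule, script path, and log location here are illustrative assumptions, not the project's actual configuration:

```shell
# Hypothetical crontab entry: rerun Citation Detective at 03:00 on the
# first of each month, after new XML dumps are typically available.
# Paths, flags, and schedule are placeholders, not the real setup.
0 3 1 * * /usr/bin/python3 /srv/citation-detective/run.py --update-db >> /var/log/citation-detective.log 2>&1
```

Redirecting both stdout and stderr to a log file makes failed runs easier to diagnose, since cron itself gives little feedback.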