The Wikimedia Foundation Research team has been working on a project titled Identification of Unsourced Statements. The research project aimed to create a "Citation Needed" machine learning model able to automatically classify statements on Wikipedia as requiring a citation. We would now like to use this model to provide data dumps to the Wikimedia community for use in developing tools, bots, and other systems for improving the encyclopedia’s reliability.
On the English Wikipedia, more than 300,000 statements have been manually flagged as requiring a citation, but many more are left untagged. These tags help the community identify areas of the project requiring improvement, and signal to readers that information should be independently verified. Wikimedia tools such as Citation Hunt use these Citation Needed tags to provide microtasks to users interested in improving Wikipedia’s reliability, highlighting these unsourced statements and asking users to find a relevant citation. Releasing public, regularly updated data about which sentences in Wikipedia need citations would extend what such tools can do and foster the creation of new community tools that use microtasks to improve Wikipedia’s reliability. This data can be further enriched by using the machine learning model developed by the Research team to discover more sentences that are missing citations.
The “Citation Needed” model takes as input a sentence from a Wikipedia article and its section title, and gives as output a “citation needed” score reflecting whether the sentence should have a citation or not. The task of this internship is to scale this process up. The candidate will design an end-to-end system that automatically parses the periodic Wikipedia XML dumps to extract unsourced sentences and their section titles, classifies these sentences using the machine learning model to detect which of them are actually missing citations, and releases the output in the form of periodic data dumps.
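To make the shape of such a system concrete, the steps described above (parse the dump, extract unsourced sentences with their section titles, score them, write the output) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Research team's implementation: the function names, the JSON field names, the "MAIN_SECTION" label for the lead, and the 0.5 threshold are all hypothetical, the regular expressions are a crude stand-in for a real wikitext parser, and `score` is a placeholder for the actual “Citation Needed” model, which is not reproduced here.

```python
import json
import re
import xml.etree.ElementTree as ET

REF_MARK = "\x00"  # placeholder left where a <ref> tag used to be
# Matches self-closing <ref ... /> tags and <ref ...>...</ref> pairs.
REF = re.compile(r"<ref\b[^>]*?/>|<ref\b[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
# Matches wikitext headings such as "== History ==" or "=== Early years ===".
HEADING = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)
# Naive sentence boundary: whitespace after ., !, ? or a reference marker.
SENTENCE_END = re.compile(r"(?<=[.!?\x00])\s+")


def iter_pages(xml_stream):
    """Stream (page title, wikitext) pairs out of a MediaWiki XML export."""
    title, text = None, ""
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        name = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            elem.clear()  # release the finished page's children


def iter_sections(wikitext, lead_title="MAIN_SECTION"):
    """Yield (section title, section body); text before the first heading is the lead."""
    pos, current = 0, lead_title
    for match in HEADING.finditer(wikitext):
        yield current, wikitext[pos:match.start()]
        current, pos = match.group(2), match.end()
    yield current, wikitext[pos:]


def iter_unsourced_sentences(section_body):
    """Yield sentences that carry no <ref> tag, using a naive sentence splitter."""
    marked = REF.sub(REF_MARK, section_body)
    for chunk in SENTENCE_END.split(marked):
        sentence = chunk.strip()
        if sentence and REF_MARK not in sentence:
            yield sentence


def build_dump(xml_stream, out_stream, score, threshold=0.5):
    """Write one JSON line per unsourced sentence scoring at or above the threshold.

    `score(sentence, section_title)` is a stand-in for the real model (assumption).
    Returns the number of records written.
    """
    written = 0
    for title, wikitext in iter_pages(xml_stream):
        for section, body in iter_sections(wikitext):
            for sentence in iter_unsourced_sentences(body):
                value = score(sentence, section)
                if value >= threshold:
                    record = {"page": title, "section": section,
                              "sentence": sentence, "score": value}
                    out_stream.write(json.dumps(record) + "\n")
                    written += 1
    return written
```

Streaming the XML with `iterparse` keeps memory use bounded on large dumps, and passing the scorer in as a callable keeps the extraction logic independent of whichever model version is used. A production version would likely need a dedicated wikitext parser to handle templates, lists, tables, and abbreviations that this sketch ignores.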
After creating a system capable of generating periodic data dumps, further work could include writing documentation, publicising the data, and supporting integration into tools like Citation Hunt.