
A system for releasing data dumps from a classifier detecting unsourced sentences in Wikipedia
Closed, Resolved · Public


The Wikimedia Foundation Research team has been working on a project titled Identification of Unsourced Statements. The research project aimed to create a "Citation Needed" machine learning model able to automatically classify statements on Wikipedia as requiring a citation. We would now like to use this model to provide data dumps to the Wikimedia community for use in developing tools, bots, and other systems for improving the encyclopedia’s reliability.

On the English Wikipedia more than 300,000 statements have been manually flagged as requiring a citation, but many more are left untagged. These tags help the community to identify areas of the project requiring improvement, and signal to readers that information should be independently verified. Wikimedia tools such as Citation Hunt use these Citation Needed tags to provide microtasks to users interested in improving Wikipedia’s reliability, highlighting these unsourced statements and asking users to find a relevant citation. Releasing public, updated data about which sentences need citations in Wikipedia can be incredibly useful to augment the potential of such tools, as well as foster the creation of new community tools that leverage micro-tasks to improve Wikipedia’s reliability. This data can be further enriched by discovering more sentences missing citations using the machine learning model developed by the Research team.

The “Citation Needed” model takes as input a sentence from a Wikipedia article and its section title, and outputs a “citation needed” score reflecting whether the sentence should have a citation. The task of this internship is to scale this process up. The candidate will design an end-to-end system that automatically parses the periodic Wikipedia XML dumps to extract unsourced sentences and their section titles, classifies these sentences using the machine learning model to detect which of them are actually missing citations, and releases the output in the form of periodic data dumps.
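The pipeline described above could be sketched roughly as follows. This is a minimal, self-contained illustration, not the project's implementation: the real system processes full dump files and calls the Research team's trained model, whereas `score_citation_needed` here is a stand-in heuristic and `CITATION_THRESHOLD` is an assumed cut-off.

```python
# Hypothetical sketch of the dump-parsing pipeline: read a MediaWiki
# XML fragment, split article text into sentences per section, skip
# sentences that already carry a <ref> tag, and score the rest.
import re
import xml.etree.ElementTree as ET

CITATION_THRESHOLD = 0.5  # assumed cut-off, not from the project


def score_citation_needed(sentence, section):
    # Stand-in for the Research team's model: crudely flag sentences
    # making a definite factual claim as likely needing a citation.
    return 0.9 if re.search(r"\b(is|was|were)\b", sentence) else 0.1


def extract_unsourced(dump_xml):
    """Yield (title, section, sentence, score) for unsourced sentences."""
    root = ET.fromstring(dump_xml)
    for page in root.iter("page"):
        title = page.findtext("title")
        text = page.findtext("revision/text") or ""
        section = "MAIN"
        for line in text.splitlines():
            heading = re.match(r"==+\s*(.*?)\s*==+", line)
            if heading:
                section = heading.group(1)
                continue
            for sentence in re.split(r"(?<=[.!?])\s+", line):
                # Sentences already followed by a reference are skipped.
                if sentence and "<ref" not in sentence:
                    score = score_citation_needed(sentence, section)
                    if score >= CITATION_THRESHOLD:
                        yield title, section, sentence, score


dump = """<mediawiki>
  <page>
    <title>Example</title>
    <revision><text>== History ==
The town was founded in 1850. It has a museum.&lt;ref&gt;src&lt;/ref&gt;</text></revision>
  </page>
</mediawiki>"""

rows = list(extract_unsourced(dump))
```

In a production version the XML would be streamed rather than loaded whole, and the heuristic scorer would be replaced by the actual classifier.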

After creating a system capable of generating periodic data dumps, further work could include writing documentation, publicising the data, and supporting integration into tools like Citation Hunt.


@Surlycyborg / eggpi on GitHub / ggp on IRC


Event Timeline

srishakatux changed the visibility from "Public (No Login Required)" to "acl*outreachy-mentors (Project)". Sep 26 2019, 11:11 PM
srishakatux changed the edit policy from "All Users" to "acl*outreachy-mentors (Project)".
srishakatux subscribed.

(we will make this project public on Oct 1 as per Outreachy's guidelines)

srishakatux changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)". Oct 1 2019, 5:58 PM
srishakatux changed the edit policy from "acl*outreachy-mentors (Project)" to "All Users".

(contribution period is now open)

Hi, I am Jyoti from India. I am a BCA student. I am new to the open source world, and I am participating in Outreachy. I have a keen interest in machine learning and natural language processing, and I have good skills with Python. Could you tell me how I can contribute to this machine learning project for Wikimedia? Thanks, Jyoti

Hi @jyox007 :) You can find the tasks to get involved in this project listed above, below the task description.

Please note that the Outreachy deadline is November 5th! We've not had many people finalise their application yet so please make sure you do that if you want to be considered!

You can record your contribution and make a final application at

Final reminder! The application deadline is in ~6 hours.

leila subscribed.

@Miriam I have temporarily assigned this task to you so we know on our end who owns the task. Please feel free to reassign to the right person and/or take any other steps needed.

Miriam moved this task from Backlog to In Progress on the Research board.

Weekly update

  • Modified the input pipeline and the format written to the database.
  • Worked on a script to ingest data into Citation Hunt.
  • Created pull requests for items 1 and 2 for Guilherme to review

Week 1-8 Summary

For Citation Detective

  • A design specification for the project
  • First script & PR merged to the Github repository
  • Ran Citation Detective on ~9k articles
  • Modified input pipelines and the format written to the database

For Citation Hunt

  • Set up a local Citation Hunt and studied its workflows
  • Worked on scripts to ingest data into Citation Hunt
  • Added a simple UI to highlight sentences detected as lacking citations

Week 9-10

  • Refined code in aiko-citationhunt
  • Added unit tests to aiko-citationhunt
  • Ran Citation Detective on ~60k articles

Week 11-12

  • Resolved data quality issues (broken sentences, etc.)
  • Worked on multiprocess version of Citation Detective
  • Implemented article filtering/ran on 100k articles

In the last week of the internship, I've been working on:

  • Creating a crontab to update the database periodically
  • Writing an announcement email
  • Documentation: Meta page, README on GitHub
  • Updating aiko-citationhunt to use the final, public citationdetective_p
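The periodic database update mentioned above could be wired up with a crontab entry along these lines. The schedule, script path, and log location here are illustrative assumptions, not the project's actual configuration:

```shell
# Hypothetical crontab entry: rerun Citation Detective at 03:00 on the
# first of each month, after new XML dumps are typically available.
# Paths, flags, and schedule are placeholders, not the real setup.
0 3 1 * * /usr/bin/python3 /srv/citation-detective/run.py --update-db >> /var/log/citation-detective.log 2>&1
```

Redirecting both stdout and stderr to a log file makes failed runs easier to diagnose, since cron itself gives little feedback.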