A System for releasing periodic data dumps from the citation needed model
Closed, Declined, Public

Assigned To

None

Authored By

	Unit-ade
	Nov 5 2019, 3:17 PM

Description

Profile Information

Name: Tinuade Adeleke
IRC nickname: Unit-Ade
Web Profile: about.me/tinuade
Resume (optional)
Location: Nigeria
Typical working hours: 9:00am - 5:00pm GMT+1

Synopsis

Wikimedia tools such as Citation Hunt use Citation Needed tags to offer microtasks to users interested in improving Wikipedia’s reliability: they highlight unsourced statements and ask users to find a relevant citation. Releasing public, regularly updated data about which sentences in Wikipedia need citations would extend what such tools can do and encourage new community tools that use microtasks to improve Wikipedia’s reliability. This data can be further enriched by discovering more sentences that are missing citations using the machine learning model developed by the Research team.
The “Citation Needed” model takes as input a sentence from a Wikipedia article and its section title, and outputs a “citation needed” score reflecting whether the sentence should have a citation.

We want to scale this process up by designing an end-to-end system that automatically parses the contents of Wikipedia articles to extract unsourced sentences and their section titles, classifies these sentences with the machine learning model to detect which of them are actually missing citations, and releases the output in the form of periodic data dumps.

A cron job periodically calls the MediaWiki API to fetch articles from Wikipedia data dumps. A parser then extracts the sentences in each article alongside their section titles and checks for unsourced sentences. The result of this stage is passed to the Citation Needed model for classification. Each time the job runs, the sentences found to need a citation are saved as a data dump. As a stretch goal, we can also provide APIs that allow this data to be accessed.
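The classify-and-dump stage described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the record fields (`title`, `section`, `sentence`, `has_citation`), the file naming scheme, the 0.5 threshold, and `placeholder_score` are all hypothetical, and `placeholder_score` merely stands in for the real Citation Needed model so the example runs on its own.

```python
import json
from datetime import date
from pathlib import Path


def placeholder_score(sentence, section):
    """Hypothetical stand-in for the Citation Needed model (returns a score in [0, 1])."""
    # Illustration only: pretend sentences containing digits are more likely to need a citation.
    return 0.9 if any(ch.isdigit() for ch in sentence) else 0.2


def write_dump(records, out_dir, run_date, score_fn=placeholder_score, threshold=0.5):
    """Score unsourced sentences and write those at or above the threshold to a dated JSON Lines file."""
    out_path = Path(out_dir) / f"citation-needed-{run_date.isoformat()}.jsonl"
    with out_path.open("w", encoding="utf-8") as fh:
        for rec in records:
            if rec["has_citation"]:
                continue  # already sourced, so there is nothing to classify
            score = score_fn(rec["sentence"], rec["section"])
            if score >= threshold:
                row = {
                    "title": rec["title"],
                    "section": rec["section"],
                    "sentence": rec["sentence"],
                    "score": score,
                }
                fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return out_path
```

A cron job would call `write_dump` once per run with that day's date, so each run leaves behind one dated file that downstream tools can pick up.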

@Miriam @samwalton

Timeline

Dec 3 - Dec 7: Community bonding period. Communicating with mentors on refining the project proposal, finalizing deadlines and setting milestones. Studying existing tools. Planning the design of the tool. Adding and structuring the corresponding tasks in Phabricator.
Dec 8 - Dec 15: Work with the MediaWiki API to access articles and get a clear understanding of the API
Dec 15 - Dec 22: Start work on the article parser. Be able to extract section titles from articles
Dec 23 - Dec 29: Continue work on the parser; identify unsourced sentences in sections
Dec 30 - Jan 5: Integrate the parser with the Citation Needed model for testing
Jan 5 - Jan 12: Work on storing the results from the Citation Needed model as data dumps
Jan 13 - Jan 26: Develop unit tests
Jan 27 - Feb 10: Evaluate performance and modify the system. Make improvements based on the feedback received, and find and document useful features that can be added
Feb 11 - Mar 3: Bug fixes, writing documentation, and updating the appropriate guides. Code cleanup for submission.

Participation

I will work on a separate git branch and upload code to the forked repository almost daily, creating a pull request whenever a complete feature is done.

I will be online on IRC during my working hours (9am to 5pm GMT+1) to collaborate with my mentors.

Communication on tasks will be through comments on the project's subtasks created on Phabricator.

Weekly reports will be published on my Meta-Wiki user page.

I will publish a summary of each task on my blog at the end of its period in the timeline above.

I will keep an open mind to learn and achieve the best results.

I also hope to create a community for the Wikimedia Foundation here in my sphere of influence, where people can learn and contribute to open source.

About Me

I'm a fresh graduate of Obafemi Awolowo University, Ile-Ife. I heard about this program from a friend. For the duration of the program, Outreachy would be my first priority since I won't have any other commitments.

Contributing to open source can be a rewarding way to learn, teach, and build experience in just about any skill you can imagine. I want that one opportunity that opens up several others: to improve on existing skills, interact with a greater perspective, meet and work with people interested in similar things, learn people skills, build public artifacts, and grow a reputation, amongst many others.

More interestingly, I would be contributing, in my own little way, something that would make the world's largest free encyclopedia richer in content.

Past Experience

I have had experience with Python, chatbots, APIs, and databases (SQL and NoSQL) while working on the following projects:

Facebook Tour Chatbot https://github.com/tinumide/Facebook-bot
A chatbot that helps find tourist centres in your location

Facebook Quotebot https://github.com/tinumide/Quote_bot
A chatbot that helps find quotes on a particular subject, e.g. quotes on love

Inventedu https://invent-edu.herokuapp.com/
Invent Edu is a skill-sharing web application where learners meet with tutors. I wrote the back end of this web application using Flask

TwitterCloud https://github.com/tinumide/TwiterCloud
A Python package that creates a word cloud from your Twitter timeline

InventOne inventone.ng
I interned at InventOne, where I also worked on improving the features of a proprietary software product.

I love to participate in hackathons and have taken part in a few over the years, where I demonstrated critical thinking:
IEEE Extreme 2019
Microsoft Leap Hackathon 2019
NIBSS Hackathon 2018
Girls with Grits 2017
During the contribution period with Wikimedia on the project topic "A system for releasing data dumps from a classifier detecting unsourced sentences in Wikipedia", I was given three tasks to complete:
T233709 Onboarding Task: I got familiar with the machine learning models for Citation Need
I read the documentation about the Research Project and became familiar with the codebase for the machine learning models, as well as with basic notions and functions of the Keras library for Python
T234519 Your first task:
I classified sample statements using Citation Needed Models
T234606 Your second task:

In this task, I exercised simple parsing of a Wikipedia article and classifying some of its sentences.

I wrote a script in Python that:

1- Receives as input the title of an English Wikipedia article.
2- Retrieves the text of that article from the MediaWiki API. If using Python, consider using python-mwapi for this.
3- Identifies individual sentences within that text, along with the corresponding section titles. If using Python, mwparserfromhell can help you work with wiki markup.
4- Runs those sentences through the model to classify them.
5- Outputs the sentences, one per line, sorted by score given by the model.

https://github.com/tinumide/wikimedia_task_2
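
Steps 3 to 5 above can be illustrated with the rough, standard-library-only sketch below. It is not the code in the linked repository: the regular expressions are a simplified stand-in for a real wiki markup parser such as mwparserfromhell, `MAIN_SECTION` is an assumed label for text before the first heading, and `demo_score` is a hypothetical placeholder for the Citation Needed model.

```python
import re

# "== Title ==" style headings; the same run of "=" must open and close the line.
HEADING = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)
# Self-closing <ref ... /> tags and paired <ref>...</ref> tags.
REF = re.compile(r"<ref[^>]*?/>|<ref[^>]*>.*?</ref>", re.DOTALL)


def split_sentences(text):
    """Drop <ref> tags, then split on whitespace that follows ., ! or ?."""
    text = REF.sub("", text)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentences_with_sections(wikitext, lead_title="MAIN_SECTION"):
    """Return (section_title, sentence) pairs in document order."""
    # split() yields [lead_text, "==", title1, text1, "==", title2, text2, ...]
    parts = HEADING.split(wikitext)
    sections = [(lead_title, parts[0])]
    for i in range(1, len(parts), 3):
        sections.append((parts[i + 1], parts[i + 2]))
    return [(title, s) for title, body in sections for s in split_sentences(body)]


def rank_sentences(wikitext, score_fn):
    """Score every sentence and sort by score, highest first."""
    scored = [(score_fn(sentence, section), section, sentence)
              for section, sentence in sentences_with_sections(wikitext)]
    return sorted(scored, key=lambda row: row[0], reverse=True)


def demo_score(sentence, section):
    """Hypothetical placeholder for the model: digits raise the score."""
    return 0.9 if any(ch.isdigit() for ch in sentence) else 0.2


if __name__ == "__main__":
    sample = ("Lagos is a city. It is large.<ref>Src</ref>\n"
              "== History ==\nIt was founded in 1472. Trade grew.\n")
    for score, section, sentence in rank_sentences(sample, demo_score):
        print(f"{score:.2f}\t{section}\t{sentence}")
```

Real article text contains templates, links, and abbreviations that this naive splitting would mishandle, which is why a dedicated parser is the better choice in practice.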

Any Other Info

Add any other relevant information such as UI mockups, references to related projects, a link to your proof of concept code, etc

Related Objects

Mentioned Here: T233709: Onboarding Task: getting familiar with the machine learning models for Citation Need
T234519: Your first task: classify sample statements using Citation Needed Models
T234606: Your second task: classify statements within an article

Event Timeline

Unit-ade created this task. Nov 5 2019, 3:17 PM

Restricted Application added a subscriber: Aklapper. Nov 5 2019, 3:17 PM

Unit-ade updated the task description. Nov 5 2019, 3:38 PM

Unit-ade updated the task description. Nov 5 2019, 3:40 PM

Unit-ade updated the task description.

Unit-ade updated the task description. Nov 5 2019, 3:43 PM

Assuming that this is related to Outreachy.

Thanks for creating a proposal! As we are past the deadline, if you would like us to consider your proposal for review, please move it to the submitted column. Thank you!

Thank you @srishakatux . I'm not sure of how to move it to the submitted column. Please kindly direct me

@Unit-ade: Feel free to use the Add Action... → Move on Workboard dropdown above the field to add a new comment. Thanks!

Kiranofans moved this task from Backlog to Proposals Accepted on the Outreachy (Round 19) board. Nov 6 2019, 6:25 PM

Kiranofans moved this task from Proposals Accepted to Backlog on the Outreachy (Round 19) board. Nov 7 2019, 5:46 AM

(this proposal was not accepted)

srishakatux closed this task as Declined. Dec 18 2019, 12:29 AM
