
Outreachy Proposal for Accuracy Review of Wikipedias
Closed, Invalid · Public



Name: Priyanka Mandikal
IRC: prnk28
Web Page / Blog / Microblog / Portfolio:
Location: Goa, India
Typical working hours: 14:00 hrs to 24:00 hrs (IST)


Create a mediawiki-utilities bot to find articles in given categories, category trees, and lists. For each such article, find passages containing (1) facts and statistics that are likely to have become out of date and have not been updated in a given number of years, and optionally (2) phrases that are likely unclear. Record the location and text of those passages either on the page in question using templates, on a bookkeeping page with the page names as headings, and/or in a database local to the bot.
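As a rough illustration of the first step, the standard MediaWiki API's `list=categorymembers` query can enumerate the articles in a given category. The actual bot would use mediawiki-utilities; the injected `fetch` callable and the pagination handling below are illustrative assumptions, not project code:

```python
def iter_category_members(fetch, category):
    """Yield page titles in `category` from the MediaWiki API.

    `fetch(params)` must return the decoded JSON of an api.php request
    for the given params; it is injected so the continuation logic can
    be tested without network access.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        cont = data.get("continue")
        if cont is None:
            return
        # The API signals more results via a 'continue' block that is
        # merged into the next request's parameters.
        params = {**params, **cont}
```

Category trees would need a recursive walk over subcategories on top of this; that is omitted here for brevity.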

Use a customizable array of keywords, regular expressions, and measures of text comprehensibility (or, optionally, the DELPH-IN LOGON parser [ ]) to find such passages for review. Use an algorithm at least as good as the one in T89763#1066043 to pre-compute the age of each word in an article (to avoid the move and blanking issues described in, e.g., [ ]) before processing each article of interest.
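A minimal sketch of such a keyword/regex pass, with raw sentence length standing in for a real comprehensibility measure (the patterns and the 60-word threshold are illustrative placeholders, not the algorithm from T89763#1066043):

```python
import re

# Illustrative patterns only; the proposal calls for a customizable array.
STALE_PATTERNS = [
    re.compile(r"\bas of (19|20)\d\d\b", re.IGNORECASE),
    re.compile(r"\bcurrently\b", re.IGNORECASE),
    re.compile(r"\brecently\b", re.IGNORECASE),
]

def flag_passages(text, max_sentence_words=60):
    """Return (reason, sentence) pairs for sentences that match a
    stale-fact pattern or exceed `max_sentence_words` words (a crude
    proxy for an unclear passage)."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if any(p.search(sentence) for p in STALE_PATTERNS):
            flagged.append(("possibly outdated", sentence))
        elif len(sentence.split()) > max_sentence_words:
            flagged.append(("possibly unclear", sentence))
    return flagged
```

In the real bot the pattern list would be read from a configuration page so reviewers can tune it without code changes.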

Present flagged passages to one or more subscribed reviewers. Update the source template, if any, with the reviewer(s)' response, but keep the original text as part of the template. When reviewers disagree, update the template, if any, to reflect that fact, and present the question to a third reviewer to break the tie.
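The tie-breaking logic described above could be sketched as follows; the status strings and the vote representation are hypothetical, not part of any existing template schema:

```python
def review_status(votes, tie_breaker=None):
    """Decide a flagged passage's status from reviewer votes.

    `votes` maps reviewer name -> True (still accurate) / False (outdated).
    On a tie the passage is escalated to `tie_breaker` if one is given,
    otherwise it is left marked as disputed.
    """
    yes = sum(1 for v in votes.values() if v)
    no = len(votes) - yes
    if yes > no:
        return "confirmed-accurate"
    if no > yes:
        return "confirmed-outdated"
    if tie_breaker is not None:
        return "escalated-to:" + tie_breaker
    return "disputed"
```

The original passage text would be kept alongside this status inside the template, as the proposal specifies.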

Wiki page:

Primary mentor: @Jsalsman
Co-mentors: @Maribelacosta and @FaFlo

I have completed the following microtasks:

  1. I have outlined the steps for running the code for authorship and edit histories in the following wiki page:

wiki:Accuracy Review

  2. I wrote a program that extracts the revision history of numeric data and statistics from wiki pages. The code and output files can be viewed here.
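For example, a simple pass that pulls each number out of a revision's wikitext together with a little surrounding context, so that values can be compared across revisions, might look like this (the regex and the context window are illustrative choices, not the microtask code itself):

```python
import re

# Matches integers with optional thousands separators and decimals,
# e.g. "12,345" or "3.14".
NUMBER = re.compile(r"\b\d[\d,]*(?:\.\d+)?\b")

def numeric_facts(wikitext, context=30):
    """Return (number, snippet) pairs for every number in `wikitext`,
    where `snippet` is the number plus up to `context` characters of
    surrounding text on each side."""
    facts = []
    for m in NUMBER.finditer(wikitext):
        start = max(0, m.start() - context)
        facts.append((m.group(), wikitext[start:m.end() + context].strip()))
    return facts
```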

Tasks: Document the updated schema. Improve candidate passage queue management, the reviewer workflow, the reviewer reputation database, and reporting. Include double-blinded identity and action codings for the reviewer reputation database.


7/12/15 – 13/12/15: Explore the code for authorship and edit interactions and use it as inspiration for extracting revision dates
14/12/15 – 20/12/15: Create a data scraper to pull a few sample articles from the XML dump for analysis
21/12/15 – 3/1/16: Modify the file to pre-compute the age of each word in an article
4/1/16 – 10/1/16: Develop an algorithm for extracting questionable content and controversial statements
11/1/16 – 17/1/16: Improve the previous algorithm to also extract student editors' work
18/1/16 – 24/1/16: Pull out sentences that contain more than 60 words
25/1/16 – 31/1/16: Integrate all the code written so far to create the candidate passage queue and run all test cases
1/2/16 – 7/2/16: Present flagged passages to one or more subscribed reviewers and update the template based on their responses
8/2/16 – 14/2/16: Introduce a third reviewer in case of a tie and verify against the reviewer-reputation database; send reviewed results to the Wikipedia editor community
16/2/16 – 24/2/16: College exams
25/2/16 – 28/2/16: Run a final test to catch and fix bugs
29/2/16 – 7/3/16: Document all the work done and complete the project
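The word-age pre-computation step (21/12/15 – 3/1/16) could start from a naive diff-based attribution like the sketch below, which carries each word's introducing revision forward through `difflib` diffs. This is only an illustration, not the algorithm referenced in T89763#1066043, and it does not yet handle the move and blanking issues:

```python
import difflib

def word_ages(revisions):
    """Attribute an age to each word of the latest revision's text.

    `revisions` is a list of (timestamp, text) tuples in chronological
    order. A word keeps its original timestamp as long as diffs show it
    unchanged; inserted or replaced words get the timestamp of the
    revision that introduced them. Returns a list of (word, timestamp).
    """
    if not revisions:
        return []
    ts0, text0 = revisions[0]
    ages = [(w, ts0) for w in text0.split()]
    for ts, text in revisions[1:]:
        new_words = text.split()
        old_words = [w for w, _ in ages]
        sm = difflib.SequenceMatcher(a=old_words, b=new_words,
                                     autojunk=False)
        new_ages = []
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":
                new_ages.extend(ages[i1:i2])   # unchanged: keep old age
            else:
                new_ages.extend((w, ts) for w in new_words[j1:j2])
        ages = new_ages
    return ages
```

Passages whose words are all older than the chosen cutoff would then be candidates for the stale-fact queue.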

I will be in constant contact with my mentors and will ask for help when needed. I will also remain active on the IRC channels, and I have created a personal blog that I will update regularly to reflect my progress.

About me

University: BITS Pilani, India
Major: Computer Science (BE) and Physics (Int. MSc)
Degree level: 4th year undergrad
Graduation year: 2017

Are you eligible?: Yes

Other commitments: I will be completely free during the winter break from December to the end of January. Regarding coursework, I will be taking 12 credits (the maximum a student can take is 30), which will take around 12 hours a week, so I will dedicate 50 hours per week to the Outreachy internship without fail.

How I heard about this program: From past interns from my university

Why is this project important to me: I am currently taking a data mining course in college and have become deeply interested in extracting information from data. This led me to take the online Data Science course (CS 109) offered by Harvard University, and I am absolutely loving it. I hope to apply what I learn from these courses in this project and to learn more along the way. Moreover, I have my eye on pursuing higher education in related areas, which makes this project important to me not only as an internship experience but as a learning process in itself. I hope to both contribute and gain a lot through this project.

Past Experience

Please describe your experience with any other FOSS projects as a user and as a contributor:
I am a user of the Linux operating system and I absolutely love it; I admire the flexibility it offers its users. And of course, Wikipedia itself is a great inspiration for venturing into open-source software. I have contributed to it as part of this application and hope to contribute much more through the internship.

Please describe any relevant projects that you have worked on previously and what knowledge you gained from working on them (Provide Links):

  1. I built a data scraper and explorer in Python for the IMDb top 10,000 movies and explored some interesting properties of the data based on ratings, genres, release year, etc.

  2. I have completed an R programming course and an exploratory data analysis course offered by Johns Hopkins University on Coursera. My certificates can be viewed at the following links:

  3. I am taking the CS 109 online course in Data Science offered by Harvard University. I am learning a lot through this course about cleaning data, various data analysis libraries, machine learning, and data visualization.
  4. I am currently using machine learning algorithms to predict protein-protein interaction sites. I used logistic regression, random forest, AdaBoost, and SVM classifiers, and applied grid search to find the best classifier by MCC score.
  5. I am also working on a data mining project that predicts the click-through rate (CTR) of users who view advertisements on a range of mobile phones via various apps. I'll be implementing the decision-tree and clustering algorithms in Hadoop using the MapReduce paradigm.

Event Timeline

prnk28 claimed this task.
prnk28 raised the priority of this task from to High.
prnk28 updated the task description. (Show Details)
prnk28 added subscribers: prnk28, Jsalsman, Maribelacosta and 3 others.
prnk28 lowered the priority of this task from High to Medium. Nov 7 2015, 9:11 PM
prnk28 set Security to None.

This task was created on Nov 07. The application deadline was on Nov 02. Could you elaborate?

I have spoken to the organizers and they are ready to extend the deadline.

Hi @prnk28, as @Aklapper says, the deadline for Outreachy submissions was November 2. This deadline is established by the Outreachy program and is common to all participant organizations.

As far as I'm aware, we didn't hear from you before the deadline. Accepting your proposal now would show a lack of recognition and respect for the effort put in by the rest of the candidates, who submitted their proposals on time (some of them applying to the same project idea and mentors as you). We cannot make this exception, and I hope you understand.

Wikimedia has many venues to get involved as a volunteer contributor today. We are also participating in every Outreachy and Google Summer of Code round. We encourage you to get involved now in order to become a strong candidate for these programs at their next round in a few months.

I understand the responsibility organizers have towards the program and its mission. As I had conveyed to you earlier, I have decided to go ahead with this project without Outreachy.