Name: Priyanka Mandikal
Web Page / Blog / Microblog / Portfolio: priyankamandikal.wordpress.com
Location: Goa, India
Typical working hours: 14:00 hrs to 24:00 hrs (IST)
Create a mediawiki-utilities bot to find articles in given categories, category trees, and lists. For each such article, find passages with (1) facts and statistics that are likely to have become out of date and have not been updated in a given number of years, and optionally (2) phrases that are likely unclear. Record the location and text of those passages by adding templates to the page in question, by writing to a bookkeeping page with the other page names as headings, and/or by storing them in a database local to the bot.
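The tagging step could look roughly like the following sketch. The template name ({{Stale statistic}}), the function name, and the in-memory dict standing in for the bot's local database are all illustrative assumptions, not part of mediawiki-utilities:

```python
# Sketch: wrap a flagged passage in a maintenance template and log it
# in a local "database" (a plain dict here; the real bot would persist it).

def tag_passage(page_title, wikitext, passage, db):
    """Mark `passage` with a hypothetical {{Stale statistic}} template
    and record its location in `db` keyed by page title."""
    tagged = "{{Stale statistic|text=%s}}" % passage
    new_text = wikitext.replace(passage, tagged, 1)
    db.setdefault(page_title, []).append({
        "passage": passage,
        "offset": wikitext.find(passage),  # character offset in the old revision
    })
    return new_text

db = {}
text = "The city has a population of 91,000. It lies on the coast."
updated = tag_passage("Example town", text,
                      "The city has a population of 91,000.", db)
print(updated)
```

Keeping the original passage text inside the template (and in the local record) is what later lets reviewers see exactly what was flagged, even after the article changes.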
Use a customizable array of keywords, regular expressions, and measures of text comprehensibility (or, optionally, the DELPH-IN LOGON parser [ http://erg.delph-in.net/logon ]) to find such passages for review. Before processing each article of interest, use an algorithm at least as good as that in T89763#1066043 to pre-compute the age of each word in the article (to avoid the move and blanking issues described in, e.g., http://wikitrust.soe.ucsc.edu/talks-and-papers ).
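Combining the keyword/regex list with the precomputed ages might look like this sketch. The patterns, the per-sentence ages, and the three-year threshold are assumptions for illustration; the real bot would derive ages from the T89763#1066043 word-age algorithm:

```python
import re

# Illustrative patterns for facts that tend to go stale: hedged time
# phrases and numeric statistics. The real list would be configurable.
STALE_PATTERNS = [
    re.compile(r"\b(as of|currently|recent(ly)?)\b", re.I),
    re.compile(r"\b\d{1,3}(,\d{3})*(\.\d+)?\b"),
]

def find_stale_passages(sentences, sentence_ages, max_age_years=3):
    """Return sentences that match a stale-fact pattern and whose text
    has not been touched for more than `max_age_years` years."""
    flagged = []
    for sent, age in zip(sentences, sentence_ages):
        if age > max_age_years and any(p.search(sent) for p in STALE_PATTERNS):
            flagged.append(sent)
    return flagged

sents = ["As of 2010 the town had 5,000 residents.",
         "The river is long.",
         "It currently hosts a festival."]
ages = [5.2, 6.0, 1.0]  # assumed years since each sentence last changed
print(find_stale_passages(sents, ages))
```

Note that age alone is not enough (stable prose ages harmlessly) and pattern matches alone are not enough (a freshly updated statistic is fine); it is the conjunction that flags a candidate.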
Present flagged passages to one or more subscribed reviewers. Update the source template, if any, with the reviewers' responses, but keep the original text as part of the template. When reviewers disagree, update the template, if any, to reflect that fact, and present the question to a third reviewer to break the tie.
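The agree / disagree / tie-break logic can be modeled as a small decision function over reviewer verdicts. This is a sketch; the verdict labels and function name are assumptions:

```python
from collections import Counter

def review_outcome(responses):
    """Decide the next step given reviewer verdicts (e.g. 'keep'/'flag').

    Returns ('agreed', verdict) when a clear majority exists, or
    ('needs_tiebreak', None) when reviewers are split, so the passage
    can be shown to an additional reviewer."""
    counts = Counter(responses)
    if len(counts) == 1:
        return ("agreed", responses[0])
    (top, n1), (_, n2) = counts.most_common(2)
    if n1 > n2:
        return ("agreed", top)        # majority reached (e.g. after tie-break)
    return ("needs_tiebreak", None)   # split vote: ask another reviewer

print(review_outcome(["flag", "flag"]))
print(review_outcome(["flag", "keep"]))
print(review_outcome(["flag", "keep", "flag"]))
```

Running the same function again after the third reviewer responds resolves the tie, since the vote count is then odd.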
I have completed the following microtasks:
- I have outlined the steps for running the authorship and edit-history code (WikiwhoRelationships.py) in the following wiki.
- I wrote a program that extracts the revision history of numeric data and statistics from wiki pages. The code and output files can be viewed here: https://github.com/priyankamandikal/wiki_accuracy_review/blob/master/revision_history_1.py
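The core of that microtask can be summarized as a regex pass that extracts numeric tokens from revision texts and diffs them between revisions. This is a simplified sketch, not the actual revision_history_1.py code:

```python
import re

# Numeric tokens, allowing thousands separators and decimals.
NUM_RE = re.compile(r"\b\d+(?:,\d{3})*(?:\.\d+)?\b")

def numbers_in(text):
    """All numeric tokens appearing in one revision's text."""
    return [m.group(0) for m in NUM_RE.finditer(text)]

def changed_numbers(old_rev, new_rev):
    """Numbers present in the new revision but absent from the old one."""
    return sorted(set(numbers_in(new_rev)) - set(numbers_in(old_rev)))

old = "Population: 90,000 (2001 census)."
new = "Population: 105,000 (2011 census)."
print(changed_numbers(old, new))
```

Tracking which revision last changed each number is what lets the bot estimate how long a statistic has gone without an update.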
Tasks: Document the updated schema. Improve candidate-passage queue management, the reviewer workflow, the reviewer-reputation database, and reporting. Include double-blinded identity and action codings for the reviewer-reputation database.
{| class="wikitable"
! Period !! Task
|-
| 7/12/15 – 13/12/15 || Explore the code for authorship and edit interactions and use it as inspiration for extracting revision dates
|-
| 14/12/15 – 20/12/15 || Create a data scraper to pull a few sample articles from the XML dump for analysis
|-
| 21/12/15 – 3/1/16 || Modify revision_history_1.py to pre-compute the age of each word in an article
|-
| 4/1/16 – 10/1/16 || Develop an algorithm for extracting questionable content and controversial statements
|-
| 11/1/16 – 17/1/16 || Improve the previous algorithm to also extract student editors' work
|-
| 18/1/16 – 24/1/16 || Pull out sentences that contain more than 60 words
|-
| 25/1/16 – 31/1/16 || Integrate all the code written so far to build the candidate passage queue and run all possible test cases
|-
| 1/2/16 – 7/2/16 || Present flagged passages to one or more subscribed reviewers and update the template based on their responses
|-
| 8/2/16 – 14/2/16 || Introduce a third reviewer in case of a tie and verify with the reviewer-reputation database
|-
| 16/2/16 – 24/2/16 || Send reviewed results to the Wikipedia editor community
|-
| 25/2/16 – 28/2/16 || Run a final test to catch and fix bugs
|-
| 29/2/16 – 7/3/16 || Document all the work done and complete the project
|}
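The long-sentence item in the timeline (pulling out sentences of more than 60 words) reduces to a short helper like the sketch below. The crude punctuation-based splitter is an assumption; the actual implementation could use a proper sentence tokenizer:

```python
import re

def long_sentences(text, max_words=60):
    """Split `text` on sentence-ending punctuation and return the
    sentences containing more than `max_words` words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]

sample = "Short sentence. " + " ".join(["word"] * 65) + "."
print([len(s.split()) for s in long_sentences(sample)])
```

Such overlong constructs are a cheap proxy for the "likely unclear" passages mentioned in the project description, to be confirmed later by a comprehensibility measure or parser.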
I will remain in constant contact with my mentors and ask for help when needed. I will also stay active on the IRC channels, and I have created a personal blog that I will update regularly to reflect my progress.
University: BITS Pilani, India
Major: Computer Science (BE) and Physics (Int. MSc)
Degree level: 4th year undergrad
Graduation year: 2017
Are you eligible?: Yes
Other commitments: I will be completely free during the winter break from December to the end of January. Regarding coursework, I will be taking 12 credits (the maximum a student can take is 30), which will require around 12 hours a week, so I will dedicate 50 hours per week to the Outreachy internship without fail.
How I heard about this program: From past interns from my university
Why is this project important to me: I am currently taking a data mining course in college and am deeply interested in extracting information from data. This led me to take an online course on Data Science (CS 109) offered by Harvard University, and I am absolutely loving it. I hope to apply what I learn from these courses to this project and to learn more along the way. Moreover, I intend to pursue higher education in related areas, which makes this project all the more important to me, not only as an internship experience but as a learning process in itself. I hope to contribute a lot through this project and to gain a lot from it.
Please describe your experience with any other FOSS projects as a user and as a contributor:
I am a user of the Linux operating system and I absolutely love it; I admire the flexibility it offers its users. And of course, Wikipedia itself is a great inspiration for venturing into open-source software. I have contributed to it as part of this application and hope to contribute much more through the internship.
Please describe any relevant projects that you have worked on previously and what knowledge you gained from working on them (Provide Links):
- I built a data scraper and explorer in Python for the IMDb top 10,000 movies and explored some interesting properties of the obtained data based on ratings, genres, release year, etc.
- I have completed an R programming course and an exploratory data analysis course offered by Johns Hopkins University on Coursera. My certificates can be viewed at the following links:
- I am doing the CS 109 online course in Data Science offered by Harvard University. I am learning a lot through this course with respect to cleaning up data, various data analysis libraries, machine learning and data visualization.
- I am currently using machine learning algorithms to predict protein-protein interaction sites. I used Logistic Regression, Random Forest, AdaBoost, and SVM classifiers and applied grid search over them to find the best classifier based on MCC scores.
- I am also working on a data mining project that predicts the click-through rate (CTR) of users who view advertisements on a range of mobile phones via various apps. I'll be implementing decision-tree and clustering algorithms in Hadoop using the MapReduce paradigm.