Weekly progress made on the GSoC project T129536 - Accuracy Review of Wikipedia
= Community Bonding Period: April 22 - May 22 =
=== Report ===
- A detailed Bonding Period report can be viewed at T135665
- Discussed the project plan with the mentor
- Initiated Village Pump and Administrators' Noticeboard discussions to identify appropriate backlog categories that our flagging algorithms could address
- Began taking the course [[ http://cs224d.stanford.edu/ | CS224d ]], Stanford's lectures on 'Deep Learning for Natural Language Processing', to help with implementing a neural net
- Went over chapters 9 and 10 from the book [[ http://flaskbook.com/ | Flask Web Development ]] by Miguel Grinberg (I had completed the first 8 chapters during the application period)
- Held a meeting with the mentor on Google Hangouts
=== Meeting 1: 10th May 2016 - UTC 5pm ===
The following was the agenda of the meeting. Details can be viewed in the 'Meeting Minutes' section of T135665
# Architecture of a neural net we may want to implement
# Wikipedia Backlogs
# Citation Hunt
# Top 3 flagging concerns that we want to address
# Administrative tasks for GSoC
***
= Week 1: May 23 - May 29 =
=== Report ===
- T136290 - Refactored the wikiwho API code to work from the shell rather than the browser. It takes in an article's name and outputs the revision IDs in JSON format.
- T136292 - Defined roles (reviewer and admin) for use of the app. Added the necessary permissions for actions such as commenting, reviewing and administering. Errors arising during db migration still need to be fixed.
- Held a meeting with the mentor on Google Hangouts
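The role and permission setup in T136292 could follow the bitmask pattern from Grinberg's //Flask Web Development// (the book mentioned above). This is a minimal sketch of that idea; the class names and permission values are assumptions, not the actual app code:

```python
# Sketch of a role/permission model for T136292, following the permission-
# bitmask pattern from "Flask Web Development". Names and bit values are
# assumptions for illustration.

class Permission:
    COMMENT = 0x01
    REVIEW = 0x02
    ADMINISTER = 0x80

ROLES = {
    # role name -> combined permission bits
    'reviewer': Permission.COMMENT | Permission.REVIEW,
    'admin': Permission.COMMENT | Permission.REVIEW | Permission.ADMINISTER,
}

def can(role, permission):
    """Return True if the given role has the given permission bit set."""
    return ROLES[role] & permission == permission
```

Checking a permission is then a single bitwise test, which keeps the role table easy to extend without schema changes.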
=== Meeting 2: 28th May 2016 - UTC 4am ===
The following was the agenda of the meeting. Details can be viewed in T136536
# Useful tools during the coding phase
# Generating article dumps
# Creating user roles and profile pages
# Clarity on the blueprint version of the bot and project schedule
***
= Weeks 2 & 3: May 30 - June 12 =
=== Report ===
- T136292 - Fixing the db issues arising from SQLite. Renaming tables as a stopgap for now; a better solution is needed in the future
- T137638 - Created reviewer profiles
-- Created a landing page for each reviewer
-- Added functionality to view and edit profiles
-- Admins additionally have permission to edit the profiles of other users
-- Also added functionality for users to update their profile picture
- Held the biweekly meeting with the mentor
- Added the Apache 2.0 license for the app
- Updated the blog post [[ https://priyankamandikal.wordpress.com/wiki-accuracy-review/reviewer-profiles/ | here ]]
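The profile-edit rule from T137638 (users edit their own profile, admins edit anyone's) reduces to a one-line check. A minimal sketch, with argument names assumed for illustration:

```python
# Sketch of the profile-edit permission rule described in T137638:
# a reviewer may edit their own profile, while an admin may edit anyone's.
# Parameter names are assumptions, not the actual app code.

def can_edit_profile(actor_name, actor_is_admin, target_name):
    """Allow editing if the profile is the actor's own, or the actor is an admin."""
    return actor_is_admin or actor_name == target_name
```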
=== Meeting 3: 11th June 2016 - UTC 4am ===
The following was the agenda of the meeting. Details can be viewed in T137639
# Discuss and resolve db issues
# Format of the review system
# License for the app
# DataFlowBot
***
= Weeks 4 & 5: June 13 - June 26 =
=== Report ===
- Updated the db schema doc [[ https://docs.google.com/document/d/1qeX6fcrEXPESWYJX_kqkjz07tmguN2PQz-wOapT9DDc/edit?usp=sharing | here ]]
- Went over the vlerm documentation
- Coded the review system to work from the command line. The design of the initial version is [[ https://www.dropbox.com/s/cfavg0v4shapzxp/reviewsystem_initial.pdf?dl=0 | here ]] and the code is [[ https://github.com/priyankamandikal/reviewsystem | here ]].
- Held a meeting with the mentor on Google Hangouts
- **Mid-term evaluation**
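A command-line review system like the one above might be organized around subcommands. This is a hypothetical `argparse` sketch; the command and flag names are assumptions, not the actual `reviewsystem` repo code:

```python
import argparse

# Hypothetical sketch of a command-line interface for a review system.
# Subcommand and flag names are assumptions for illustration; the real
# interface lives in the reviewsystem repository linked above.

def build_parser():
    parser = argparse.ArgumentParser(prog='reviewsystem')
    sub = parser.add_subparsers(dest='command')

    # Submit a new question for review
    ask = sub.add_parser('ask', help='submit a question for review')
    ask.add_argument('--file', required=True, help='path to the question file')

    # Answer a pending question
    review = sub.add_parser('review', help='answer a pending question')
    review.add_argument('--id', required=True, help='question identifier')

    return parser
```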
=== Meeting 4: 26th June 2016 - UTC 1 pm ===
The following was the agenda of the meeting.
# [[ https://github.com/jsalsman/minireview | New version ]] of minireview which makes use of a single directory for file storage
# Performing datetime analysis
# Setting up a labs instance
***
= Week 6: June 27 - July 3 =
- T136290 - Updated the wikiwho [[ https://github.com/priyankamandikal/wikiwho_api/commit/09f6b721fcc9b6f8a6c55c72898c4f38d4fd54da | code ]] to extract dates from revids. Extracted dates were in ISO 8601 format (e.g. 2014-05-31T00:14:13Z)
- T138953 - Converted the extracted dates to Julian date format to help in the analysis. Found the mean date of an article. Code is [[ https://www.dropbox.com/s/xm21ykctrrybapi/dates_analysis.py?dl=0 | here ]].
- T139077 - Wrote a [[ https://github.com/priyankamandikal/minireview/blob/master/backlog_script.py | script ]] to obtain all article names along with urls from a given WP:Backlog category. Further, this script creates question files in the 'records' directory for each article in the given backlog category and month. The file contents are of the form: <category_name>\n<article_title>\n<url>
- T139078 - Worked on displaying urls/diffs/permalinks in iframes. Code is [[ https://github.com/priyankamandikal/minireview/commit/24c9be8fa6979ec912674762823f6fbfd58b72fe | here ]].
- T139261 - Added url hyperlinking functionality in the texts. Code is [[ https://github.com/priyankamandikal/minireview/commit/24c9be8fa6979ec912674762823f6fbfd58b72fe | here ]].
- Held **Meeting 5** with the mentor on Google Hangouts on 3rd July 2016 at UTC 1 pm
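The date analysis in T138953 can be sketched as follows: convert each ISO 8601 revision timestamp to a (fractional) Julian date and average the results to get an article's mean date. The constant 1721424.5 maps Python's proleptic Gregorian ordinal to the Julian Day Number; the function names are assumptions, not the linked script's code:

```python
from datetime import datetime

# Sketch of the T138953 analysis: ISO 8601 revision timestamps (as
# extracted by wikiwho) -> Julian dates -> mean date of an article.
# Function names are assumptions for illustration.

def iso_to_julian(ts):
    """Convert e.g. '2014-05-31T00:14:13Z' to a fractional Julian date."""
    dt = datetime.strptime(ts, '%Y-%m-%dT%H:%M:%SZ')
    day_fraction = (dt.hour * 3600 + dt.minute * 60 + dt.second) / 86400.0
    # toordinal() counts days from 0001-01-01; 1721424.5 shifts to JD at 00:00 UTC
    return dt.toordinal() + 1721424.5 + day_fraction

def mean_julian_date(timestamps):
    """Mean Julian date of a list of ISO 8601 timestamps."""
    jds = [iso_to_julian(ts) for ts in timestamps]
    return sum(jds) / len(jds)
```

Working in Julian dates turns timestamps into plain floats, so means and other statistics become simple arithmetic.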
***
= Weeks 7 & 8: July 4 - July 17 =
- Worked on the inspect() function and inspect.html templates to display summary statistics for each file type
- Fixed errors in ratio and recommend.html
- T141295 - Refactored the templates to use Bootstrap. Commit is [[ https://github.com/priyankamandikal/arowf/commit/bfd624269591cc995e88ded93698d235ca4f0663 | here ]]. The design and styling still need improvement; this will be done later.
- Made wikiwho search for the word 'recent' in article dumps (although we later switched to using the search function for this purpose). Defined the function [[ https://github.com/priyankamandikal/wikiwho_api/blob/master/Wikiwho_simple.py#L650 | findWord() ]] for this.
- T140559 - Used [[ https://en.wikipedia.org/w/index.php?title=Special:Search&limit=20&offset=0&profile=default&search=recently&searchToken=385h88b6gyqtgysdeov432vx3 | Wikipedia's Special Search ]] to get articles containing the word 'recent' or 'recently'. Then used wikiwho to find the date when these words were inserted. Generated questions with article title, link and date. Code is [[ https://github.com/priyankamandikal/arowf/blob/master/recent_script.py | here ]].
- T140568 - Worked on getting the page views per article for the past month from the [[ https://wikimedia.org/api/rest_v1/?doc#!/Pageviews_data/get_metrics_pageviews | pageviews API ]]. Code is [[ https://github.com/priyankamandikal/arowf/blob/master/pageviews.py | here ]].
- T140560 - Wrote a script for creating questions from low Flesch-Kincaid readability scores and high pageview counts. Almost complete; fixing a few remaining issues.
- Discussed the action plan for the next stage of the project with the mentor. List of tasks to be completed has been updated and documented.
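The T140560 ranking can be sketched as: compute Flesch reading ease from word/sentence/syllable counts, z-standardize the scores across articles, and combine them with standardized pageviews so that hard-to-read, highly viewed articles rank first. The exact weighting and function names here are assumptions, not the linked script's code:

```python
# Sketch of the T140560 ranking: low readability + high pageviews first.
# The combination (z(views) - z(ease)) is an assumption for illustration.

def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch reading-ease formula (higher = easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def standardize(values):
    """Z-standardize a list of numbers (zero mean, unit variance)."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5 or 1.0
    return [(v - mean) / std for v in values]

def rank_articles(articles):
    """articles: list of (title, ease_score, pageviews).
    Returns titles sorted so low readability + high views come first."""
    eases = standardize([a[1] for a in articles])
    views = standardize([a[2] for a in articles])
    scored = [(views[i] - eases[i], articles[i][0]) for i in range(len(articles))]
    return [title for _, title in sorted(scored, reverse=True)]
```

Standardizing both signals first keeps either one from dominating purely because of its scale.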
***
= Weeks 9 & 10: July 18 - July 31 =
- Wrote a new [[ https://priyankamandikal.wordpress.com/wiki-accuracy-review/arowf/ | blog post ]] describing the new system
- T141840 - Began work on the project report and completed a major portion; the hypothesis-testing part remains
- Fixed the UnicodeDecodeError in /answer
- T140560 - Completed the script to generate questions from poor readability scores. Combined standardized FK scores with pageviews to get the final ranking score for an article
- T140684 - Tried various methods to handle caching, such as setting headers and turning off autocomplete. The issue still needs to be resolved satisfactorily
- Work was disrupted for a few days due to bandwidth outages
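For the header-based part of the T140684 caching attempts, the standard set of HTTP response headers that disables client and proxy caching looks like this. How the headers get attached to responses depends on the framework; only the header values below are standard:

```python
# Standard HTTP headers that ask browsers and proxies not to cache a
# response, as tried for T140684. Attaching them to each response is
# framework-specific and not shown here.

def no_cache_headers():
    """Response headers that disable client and proxy caching."""
    return {
        'Cache-Control': 'no-store, no-cache, must-revalidate, max-age=0',
        'Pragma': 'no-cache',  # HTTP/1.0 compatibility
        'Expires': '0',        # an already-expired date
    }
```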
***
= Week 11: Aug 1 - Aug 7 =
- T141842 - Wrote help docs, added a new endpoint /help, also linked specific help docs from each endpoint
- T142038 - Added custom error pages for errors 404 and 500
- T140785 - Completed the optional registration system
-- Wrote the register() function and register.html template. Added functionality for sending a salted SHA-512 hash of the email address to registered reviewers, which they can use as a token while answering questions
-- Created a new directory called 'registered' with a file named after each user's token. Each file contains the user's info and the files he/she worked on
-- Set the token in the session key; appended tokens to files when the token is set in the session; added checking and setting of tokens in the new endpoint /token
-- Added username and email validation, checking for clashes with existing registered users
- Used the [[ https://www.mediawiki.org/wiki/API:Categorymembers | Categorymembers API ]] to get articles from different backlog categories and integrated it with the pageviews API to extract the top 20%.
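The registration token from T140785 above is a salted SHA-512 digest of the reviewer's email address, which doubles as a filename in the 'registered' directory. A minimal sketch using the standard library; the salt handling is an assumption (a real deployment would keep the salt secret and out of source control):

```python
import hashlib

# Sketch of the T140785 registration token: a salted SHA-512 hex digest of
# the reviewer's email address. Salt handling here is an assumption for
# illustration only.

def make_token(email, salt):
    """Return the hex SHA-512 of salt + email (128 hex characters)."""
    return hashlib.sha512((salt + email).encode('utf-8')).hexdigest()
```

Because the digest is deterministic for a given salt, the same email always maps to the same token file, while the token itself does not reveal the email.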
***