
Outreachy Proposal for Improving MediaWikiAnalysis
Closed, Declined (Public)

Description

Profile

Name: Anmol Kalia
Email: anmol.kalia731@gmail.com
IRC or IM networks/handle(s): anmolkalia
Location: Guwahati, India
Typical working hours: 10:00 - 18:00 and 20:00 - 00:00 (UTC +5:30)

Proposal

Synopsis:
MediaWikiAnalysis is a tool to collect statistics from MediaWiki sites via the MediaWiki API. It is a part of the MetricsGrimoire toolset, and it is currently used for getting information from the MediaWiki.org site, among others. The statistics currently collected by MediaWikiAnalysis are only a part of what is feasible to collect, and the tool itself can be improved.

We plan to complete the following microtasks:

  1. T114437 -> Port MediaWikiAnalysis to SQLAlchemy (Completed)

SQLAlchemy is a Python ORM that could simplify MediaWikiAnalysis's interaction with the database and make it, in effect, database-type-independent. This task consists of porting MediaWikiAnalysis to use SQLAlchemy at the ORM level instead of the current MySQLdb package.
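To illustrate what the port buys us, here is a minimal sketch of an ORM-mapped table; the `Page` class and its columns are assumptions for illustration, not the actual MediaWikiAnalysis schema.

```python
# Minimal SQLAlchemy ORM sketch: the engine URL is the only
# database-specific piece, so the same code runs against SQLite,
# MySQL, PostgreSQL, etc.
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Page(Base):
    """Hypothetical table for wiki pages (illustrative, not the real schema)."""
    __tablename__ = "pages"
    id = Column(Integer, primary_key=True)
    title = Column(String(255), nullable=False)

# Swapping "sqlite://" for e.g. "mysql+mysqldb://user:pw@host/db"
# changes nothing else in the code below.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

Session = sessionmaker(bind=engine)
session = Session()
session.add(Page(title="Main Page"))
session.commit()

print(session.query(Page).count())
```

The key point is that all queries go through the ORM, so no raw MySQL-specific SQL remains in the tool.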

  2. T114440 -> Implement some missing information from the MediaWiki API

This task includes a study of how the tool works and what the MediaWiki API offers, to find information that the API provides but the tool does not yet collect. I would implement calls to that part of the API to retrieve the information and store it in the tool's database. To decide which information from the API could be useful, besides my own investigation, I could benefit from discussions with seasoned developers on the Phabricator task and on wikimedia.biterg.io. Another challenge is deciding the schema for the modified database that will contain all the new information extracted from the API. I will make an entity-relationship (ER) model of the problem and apply schema normalisation to reduce storage of redundant information. Then I will test it with several real cases of different sizes to gauge the performance of the tool.
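As a sketch of what "collecting a missing attribute" looks like in practice, the snippet below builds a revisions query against the public MediaWiki API and flattens the nested JSON response into rows ready for the database. The API parameters (`action=query`, `prop=revisions`, `rvprop`) are real, but the helper functions and the choice of per-revision `size` as the "missing" attribute are illustrative assumptions; the canned response stands in for a live HTTP call.

```python
# Build a MediaWiki API request URL and flatten its JSON response
# into (title, revid, user, size) rows for database insertion.
from urllib.parse import urlencode

API = "https://www.mediawiki.org/w/api.php"

def revisions_url(title, limit=50):
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|size",  # "size" as an example extra field
        "rvlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)

def parse_revisions(payload):
    """Flatten the nested API response into rows ready for storage."""
    rows = []
    for page in payload["query"]["pages"].values():
        for rev in page.get("revisions", []):
            rows.append((page["title"], rev["revid"], rev["user"], rev["size"]))
    return rows

# A canned response in the API's JSON shape, used here instead of a
# live request so the sketch is self-contained.
sample = {"query": {"pages": {"1": {"title": "Main Page", "revisions": [
    {"revid": 42, "user": "Alice",
     "timestamp": "2015-11-01T00:00:00Z", "size": 1234}]}}}}

print(parse_revisions(sample))
```

In the real tool, the flattened rows would be inserted through the ORM layer from the previous microtask.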

  3. T114439 -> Improve performance of MediaWikiAnalysis

Here I will analyse how MediaWikiAnalysis accesses the MediaWiki API and find ways to retrieve the same information in less time. For this I will try to reduce redundant API calls and increase the overall efficiency of the fetching process. I will time the tool with and without each improvement, fetching information of the same kind but with results of different sizes, to measure how much efficiency was gained, and plot graphs of the results.
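Two generic techniques for cutting redundant API traffic are memoising identical queries and batching many titles into one request (the MediaWiki API accepts multiple titles joined with `|`). The sketch below demonstrates both with a stand-in `fetch` function; the call counter simulates HTTP round trips, and the batching limit is an assumption rather than the tool's actual configuration.

```python
# Sketch: memoise repeated queries and batch titles per request,
# counting simulated HTTP round trips to show the saving.
import functools

CALLS = {"count": 0}

@functools.lru_cache(maxsize=None)
def fetch(params_key):
    CALLS["count"] += 1  # stands in for one HTTP round trip
    return {"params": params_key}

def batched(titles, size=50):
    """Group titles so each API call covers many pages at once."""
    for i in range(0, len(titles), size):
        yield "|".join(titles[i:i + size])

for chunk in batched(["A", "B", "C"], size=2):
    fetch(chunk)          # two calls: "A|B" and "C"
fetch("A|B")              # repeated query: served from cache, no new call

print(CALLS["count"])
```

Timing the tool before and after such changes, as described above, would quantify the saving for result sets of different sizes.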

  4. T116509 -> Develop Data Analytics toolset for MediaWikiAnalysis

The database obtained from a successful completion of the above microtasks will contain a good amount of information. Some queries on this database will be more important or useful than others. For example, finding which authors were active in a given time frame, and who the top x among them were, could be a useful query for analysing authors across different wikis. Similarly, there could be other useful metrics that would be cumbersome to fetch by writing queries each time. Developing a library in Python/Pandas that provides a transparency layer between the database and the user for such analysis would be a useful tool to have. The current GrimoireLib will be an inspiration for this line of development. I will study this library to see which metrics tend to be useful in an open source environment, and I will also engage in discussions on the Phabricator task and wikimedia.biterg.io to get opinions from seasoned developers. Documentation will be very important here to ensure that users can make full use of the library, and I will pay specific attention to it.
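The "top x authors in a time frame" query mentioned above could be exposed as a one-line helper; the sketch below shows the idea in pandas. The function name, column names, and in-memory DataFrame are assumptions for illustration, not the existing GrimoireLib API.

```python
# Sketch of an analytics helper: top contributors in a time window,
# computed with pandas over a table of page changes.
import pandas as pd

def top_authors(changes, start, end, x=3):
    """Return the x most active authors between start (inclusive)
    and end (exclusive), with their change counts."""
    window = changes[(changes["date"] >= start) & (changes["date"] < end)]
    return window["author"].value_counts().head(x)

# Toy change log standing in for the real database table.
changes = pd.DataFrame({
    "author": ["alice", "bob", "alice", "carol", "alice", "bob"],
    "date": pd.to_datetime(["2015-12-10", "2015-12-11", "2015-12-20",
                            "2016-01-05", "2016-01-06", "2016-02-01"]),
})

print(top_authors(changes, "2015-12-01", "2016-01-31", x=2))
```

In the real library, `changes` would be loaded from the MediaWikiAnalysis database rather than built by hand, and similar helpers could cover other metrics.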

Primary mentor: @jgbarah
Co-mentor: @Dicortazar

Significance:

The intention of the project is to extract all history of interest from the MediaWiki system and produce a database with it, organised in a way similar to other MetricsGrimoire tools and easy to query to calculate parameters of interest. The stored data will contain all changes to all pages (such as edits, changes in name, etc.), with all the available information for each change (author, date, kind of change, etc.). Thereafter, an analytics toolset will make it easier to analyse and visualise the data in the database.

Timeline:

Time Window | Task | Deliverable
Week 1 - Week 3 (7/12/15 - 26/12/15) | T114440 | In this time, I will investigate the MediaWiki API to understand what further information could be extracted from it. I will confer with my mentor and wikimedia.biterg.io to find out which of that information would be useful to extract. Then I will prepare an ER diagram and a schema for the proposed database structure and ask for suggestions. Thereafter, I will implement API calls to extract the chosen attributes, and publish the code to git after testing for bugs and gauging performance by extracting information of varying sizes.
Week 4 - Week 5 (27/12/15 - 9/1/16) | T114439 | During this time, I will analyse the tool to eliminate redundant API calls and increase the overall efficiency of information retrieval. To check for improvements, I will run tests of different sizes and plot graphs of the time taken. Then I will commit the code to git.
Week 6 (10/1/16 - 16/1/16) | Documentation | I will use this week to document what I have completed so far.
Week 7 - Week 10 (17/1/16 - 13/2/16) | T89135 | I will analyse GrimoireLib to understand which metrics are generally considered useful by the open source community. This will be followed by discussions with my mentors and wikimedia.biterg.io to finalise the metrics that are useful and important. Then I will develop an analytics library to compute these metrics. The library will be tested for accuracy and efficiency on real problems of varying sizes. After that, I will publish the code to git.
Week 11 - Week 12 (14/2/16 - 27/2/16) | Documentation and wrap-up report | These weeks I will devote to documenting the analytics toolset, and then I will prepare a wrap-up report for the project.
Week 13 (29/2/16 - 6/3/16) | Mid-semester examination | I will have my college exams this week. I will complete all my work before this week.

Participation:

This project will require making updates to the git repository for MediaWikiAnalysis. I will communicate my progress via reports on Phabricator (short reports every week) and emails (long reports according to the timeline).
I will seek help on Phabricator, on IRC channels such as MediaWiki-General and #wikimedia-dev, and on wikimedia.biterg.io, as I have done in the past. I have found all of these very helpful and will continue to use them for any further queries I have.
I am also planning to start my own blog, which I will update every week to document the project better.

About me

Education Status: I am in the final year of a B.Tech at the Indian Institute of Technology (IIT) Guwahati. My major is Mathematics and Computing.

Eligible?: Yes

Other commitments: Besides this project, the only other commitment I have during the same time frame would be my coursework. I have 36 credits next semester; I usually have 35-40 credits each semester. I have a total of 4 courses (Computational Finance, Graph Theory, Parallel Computing, and an HSS elective on Indian history) plus a course project. I will have classes between 9 AM and 11 AM (UTC+5:30).

I heard about this program from: A classmate who participated in GSoC

What making this project happen means to me: I am very enthusiastic about working in Computer Science. In fact, I am in the process of applying to graduate schools for a Masters in CS with a specialisation in Data Science. This project is important to me because it is a chance to make an impact in the field of my interest, i.e., Data Science. Contributing to this project would use much of what I have learnt in my Databases coursework and give me a chance to solve a real-world problem. I would get to understand the practical problems users face when dealing with database systems and to address them through the analytics toolset I will be developing, which is something I have always looked forward to.

Past Experience

Please describe your experience with any other FOSS projects as a user and as a contributor:
I contributed a small bug fix in T114437, where I got the chance to learn about the Wikipedia Android application before contributing to this project, and in T106781 during the course of working towards this project. Otherwise, I am fairly new to this field.

Please describe any relevant projects that you have worked on previously and what knowledge you gained from working on them:
I have worked with Python on several occasions, in research projects and coursework. I have worked extensively with database management through MySQL while building a ride-sharing website and a library management portal, so it is easy for me to adapt to different SQL toolkits. I took a course in database management just last semester, where I learnt a great deal about schema refinement and normalisation to prevent redundancy in the database while also reducing the number of memory accesses. I feel all of this will be of use to me in the coming months as I work on this project.

Do you have any past experience working in open source projects (MediaWiki or otherwise)?
Yes. I have completed the following tasks:

  1. T114437
  2. T106781

Event Timeline

Anmolkalia raised the priority of this task from to Lowest.
Anmolkalia updated the task description.
Anmolkalia moved this task from Backlog to Proposals Submitted on the Outreachy-Round-11 board.

@jgbarah, @Dicortazar, please suggest improvements in the proposal. Thank you.

We are approaching the Outreachy Round 11 application deadline. If you want your proposal considered for this round, sign up and add your proposal at https://outreachy.gnome.org/ before 2 November 2015, 07:00 pm UTC. You can copy-paste the above proposal into the Outreachy application system and keep polishing it here. Keep in mind that your mentors and the organisation team will be evaluating your proposal here in Phabricator, and you are free to ask for and get more reviews, following https://www.mediawiki.org/wiki/Outreach_programs/Life_of_a_successful_project#Answering_your_questions

We see that you will have university/school during the Outreachy Round 11 internship period (Dec 2015 - March 2016). Please also fill in the following details in your proposal description so that we stick to the Outreachy norms.

https://wiki.gnome.org/Outreachy#Application_Form

Will you have any other time commitments, such as school work, exams, research, another job, planned vacation, etc., between December 7, 2015 and March 7, 2016? How many hours a week do these commitments take? If a student, please list the courses you will be taking between December 7, 2015 and March 7, 2016, how many credits you will be taking, and how many credits a full-time student normally takes at your school:

Thank you for your proposal. Sadly, the Outreachy administration team has made it a strict rule that candidates with any kind of academic or other commitments are not approved for this round. Please consider talking with your mentors about further steps; we hope to see your application (the same one, if the consensus still exists) in the next round of GSoC/Outreachy. Closing this as declined, and all the very best for the next round!