##Profile
Name: Anmol Kalia
Email: anmol.kalia731@gmail.com
IRC or IM networks/handle(s): anmolkalia
Location: Guwahati, India
Typical working hours: 10:00 - 18:00 and 21:00 - 01:00 (UTC +5:30)
##Proposal
**Synopsis:**
MediaWikiAnalysis is a tool that collects statistics from MediaWiki sites via the MediaWiki API. It is part of the MetricsGrimoire toolset, and it is currently used to get information from the MediaWiki.org site, among others. The statistics the tool currently collects are only a part of what is feasible to collect, and the tool itself can be improved.
We plan to achieve the following **microtasks**:
1. T114437 -> Port MediaWikiAnalysis to SQLAlchemy
SQLAlchemy is a Python ORM that could simplify MediaWikiAnalysis's relationship with the database and make it effectively database-type-independent. This task consists of porting MediaWikiAnalysis to use SQLAlchemy at the ORM level instead of the current MySQLdb package.
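The shape of such a port can be sketched with SQLAlchemy's declarative ORM. The table and column names below are illustrative only, not the actual MediaWikiAnalysis schema; the key point is that only the engine URL is database-specific:

```python
# Minimal sketch of a declarative SQLAlchemy model (illustrative
# schema, not the real MediaWikiAnalysis tables).
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Page(Base):
    __tablename__ = 'pages'
    id = Column(Integer, primary_key=True)
    title = Column(String(255))

# Swapping 'sqlite://...' for 'mysql://...' here is the only
# database-specific change the code would need.
engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(Page(title='Main Page'))
session.commit()
print(session.query(Page).count())  # 1
```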
2. T114440 -> Implement some missing information from the MediaWiki API
This task includes studying how the tool works and what the MediaWiki API offers, and finding information that the API provides but the tool does not collect. I would implement calls to that part of the API to retrieve the information and store it in the tool database. To decide what information from the API could be useful, besides my own investigation, I could benefit from discussions with seasoned developers on the Phabricator task and in #analytics-tech-community-metrics. Another challenge here would be deciding the schema for the modified database that will contain all the new information extracted from the API. I will make an entity-relationship (ER) model of the problem and apply schema normalisation to it so that we can reduce storage of redundant information. I will then test it with several real cases of different sizes to gauge the tool's performance.
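As an example of the kind of API data involved, the `list=allusers` module (`api.php?action=query&list=allusers&auprop=editcount&format=json`) returns per-user edit counts. A hedged sketch of the parsing step, using a hand-written sample payload in place of a live HTTP request:

```python
import json

def parse_allusers(payload):
    """Extract (name, editcount) pairs from a 'list=allusers' reply,
    i.e. api.php?action=query&list=allusers&auprop=editcount&format=json.
    Users without an editcount in the reply default to 0."""
    return [(u['name'], u.get('editcount', 0))
            for u in payload['query']['allusers']]

# Hand-written sample mimicking the JSON shape the API returns.
sample = json.loads(
    '{"query": {"allusers": ['
    '{"userid": 1, "name": "Admin", "editcount": 42},'
    '{"userid": 2, "name": "Bob"}]}}')
print(parse_allusers(sample))  # [('Admin', 42), ('Bob', 0)]
```

In the real tool the payload would come from an HTTP request to the wiki's `api.php` endpoint, and the parsed pairs would be written to the database.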
3. T114439 -> Improve performance of MediaWikiAnalysis
Here I will analyze how the MediaWikiAnalysis tool accesses the MediaWiki API, and find ways of retrieving the same information in less time. For this I will try to reduce redundant API calls and increase the overall efficiency of the fetching process. I will time the tool with and without the improvement on different cases, fetching information of the same kind but with results of different sizes, to measure how much efficiency was gained, and plot graphs of the results.
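One concrete way to cut redundant calls, sketched below under the assumption that per-page requests are being made one at a time, is batching: the MediaWiki API accepts several titles in a single request, joined by `|` (up to 50 for normal users). The helper name is hypothetical:

```python
def batch_titles(titles, size=50):
    """Group page titles so that one API request covers up to `size`
    pages (the MediaWiki API accepts titles joined by '|')."""
    for i in range(0, len(titles), size):
        yield '|'.join(titles[i:i + size])

titles = ['Page%d' % n for n in range(120)]
calls = list(batch_titles(titles))
print(len(calls))  # 3 API requests instead of 120
```

Wrapping the fetch loop with simple wall-clock timing before and after such a change would give the data points for the improvement graphs.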
4. T116509 -> Develop Data Analytics toolset for MediaWikiAnalysis
The database obtained after the successful completion of the above microtasks will contain a good amount of information, and some queries on it will be more important or useful than others. For example, finding out which authors were active in a given time frame, and who the top x among them were, could be a useful query for analysing authors across different wikis. Similarly, there could be other useful metrics that would be cumbersome to fetch by writing queries each time. Developing a library, using Python/Pandas, that provides a transparent layer between the database and the user for such analyses would be a useful tool to have. The current GrimoireLib will be an inspiration for this line of development: I will study it to see what metrics tend to be useful in an open source environment, and I will also engage in discussions on the Phabricator task and in #analytics-tech-community-metrics to get opinions from seasoned developers. Here the documentation process will be very important to ensure that users can make full use of the library, and I will pay very specific attention to it.
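The "top x active authors in a time frame" metric mentioned above can be sketched in Pandas. The DataFrame and its column names are hypothetical stand-ins for a table of revisions, not the real schema:

```python
import pandas as pd

# Illustrative revision data (hypothetical columns, not the real schema).
revs = pd.DataFrame({
    'author': ['alice', 'bob', 'alice', 'carol', 'bob', 'alice'],
    'date': pd.to_datetime(['2015-01-05', '2015-01-20', '2015-02-01',
                            '2015-02-10', '2015-03-03', '2015-03-15']),
})

def top_authors(df, start, end, x):
    """Top-x authors by revision count in the window [start, end)."""
    window = df[(df['date'] >= start) & (df['date'] < end)]
    return window['author'].value_counts().head(x)

print(top_authors(revs, '2015-01-01', '2015-03-01', 2))
```

A library of such functions would let users run common analyses without writing SQL each time, which is the role GrimoireLib plays for other MetricsGrimoire tools.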
**Primary mentor: @jgbarah**
**Co-mentor: @Dicortazar**
**Deliverables:**
1. **Week 1 - Week 3 (7/12/15 - 26/12/15)**: T114440 -> Investigation, Coding, Commits
2. **Week 4 - Week 6 (27/12/15 - 16/1/16)**: T114439 -> Investigation, Coding, Commits, Graphs of Improvement
3. **Week 7 (17/1/16 - 23/1/16)**: Testing for bugs, Documentation
4. **Week 8 - Week 10 (24/1/16 - 13/2/16)**: T89135 -> Investigation, Coding, Commits
5. **Week 11 - Week 12 (14/2/16 - 27/2/16)**: Testing for bugs, Documentation
6. **Week 13 (28/2/16 - 5/3/16)**: Buffer Time
**Participation:**
This project would require making updates to the Git repository for MediaWikiAnalysis. I will communicate my progress via reports on Phabricator.
I will seek help on Phabricator and on IRC channels such as #mediawiki, #wikimedia-dev and #analytics-tech-community-metrics, as I have done in the past. I have found all of these very helpful and will continue to use them for any further queries I have.
I am also planning to start my own blog, which I will update every week in order to document the project better.
##About me
**Education Status:** I am in the final year of a B.Tech at the Indian Institute of Technology (IIT) Guwahati. My major is Mathematics and Computing.
**Eligible?:** Yes
**Other commitments:** Besides this project, the only other commitment I will have during the same time frame would be my coursework.
**I heard about this program from:** A classmate who participated in GSoC
**What making this project happen means to me:** I am very enthusiastic about working in Computer Science. In fact, I am in the process of applying to grad schools for a Masters in CS with a specialization in Data Science. This project is important to me because it is a chance to make an impact in the field of my interest, i.e., Data Science. Contributing to this project would make use of much of what I have learnt in my coursework on databases. It would also give me a chance to solve a real-world problem, understand the practical problems users face when dealing with database systems, and provide novel solutions to them, which is something I have always looked forward to.
##Past Experience
**Please describe your experience with any other FOSS projects as a user and as a contributor:**
I contributed a small bug fix in T114437 prior to this project, and worked on T106781 in the course of working towards it. Otherwise, I am fairly new to this field.
**Please describe any relevant projects that you have worked on previously and what knowledge you gained from working on them:**
I have worked with Python on several occasions, for research projects and coursework. I have worked extensively with database management while building a ride-sharing website and a library management portal. Just last semester I took a course in database management, where I learnt a great deal about schema refinement and normalization, both to prevent redundancy in the database and to reduce the number of memory accesses. I feel all of this will be of use to me in the coming months as I work on this project.
**Do you have any past experience working in open source projects (MediaWiki or otherwise)?**
Yes. I have completed the following tasks:
1. T114437
2. T106781