##Profile
Name: Anmol Kalia
Email: anmol.kalia731@gmail.com
IRC or IM networks/handle(s): anmolkalia
Location: Guwahati, India
Typical working hours: 10:00 - 18:00 and 20:00 - 00:00 (UTC +5:30)
##Proposal
**Synopsis:**
MediaWikiAnalysis is a tool to collect statistics from MediaWiki sites via the MediaWiki API. It is part of the MetricsGrimoire toolset and is currently used to collect information from the MediaWiki.org site, among others. The statistics currently collected by MediaWikiAnalysis are only a part of what is feasible to collect, and the tool itself can be improved.
We plan to complete the following **microtasks**:
1. T114437 -> Port MediaWikiAnalysis to SQLAlchemy (**Completed**)
SQLAlchemy is a Python ORM that could simplify how MediaWikiAnalysis interacts with the database and make it effectively database-type-independent. This task consists of porting MediaWikiAnalysis to use SQLAlchemy at the ORM level instead of the current MySQLdb package (the first sketch after this list shows what such a mapping could look like).
2. T114440 -> Implement some missing information from the MediaWiki API
This task includes studying how the tool works and what the MediaWiki API offers, and finding information that the API provides but the tool does not yet collect. I would implement calls to that part of the API to retrieve that information and store it in the tool's database. To decide what information from the API could be useful, besides my own investigation, I could benefit from discussions with seasoned developers on the Phabricator task and in #analytics-tech-community-metrics. Another challenge here is deciding the database schema for the modified database that will contain all the new information extracted from the API. I will make an entity relationship (ER) model of the problem and apply schema normalisation to it so that we can reduce storage of redundant information. Then, I will test it with several real cases of different sizes to gauge the performance of the tool. The second sketch after this list shows the style of API access involved.
3. T114439 -> Improve performance of MediaWikiAnalysis
Here I will analyse how the MediaWikiAnalysis tool accesses the MediaWiki API and find ways of retrieving the same information in less time. For this I will try to reduce redundant API calls and increase the overall efficiency of the fetching process; batching requests and following API continuation, as in the second sketch after this list, is one such technique. I will time the tool with and without the improvements for different cases, fetching information of the same kind but with results of different sizes, to measure how much efficiency was gained, and plot graphs of the results.
4. T116509 -> Develop Data Analytics toolset for MediaWikiAnalysis
The database obtained from a successful completion of the above microtasks will contain a good amount of information. Some queries on this database will be more important or useful than others. Consider, for example, finding out which authors were active in a given time frame, and who the top x among them were: this could be a useful query for analysing authors across different wikis. Similarly, there could be other useful metrics that would be cumbersome to fetch by writing queries for them each time. Developing a library, using Python/Pandas, which provides a transparent layer between the database and the user for such analyses would be a useful tool to have (the third sketch after this list gives a minimal example). The current GrimoireLib will be an inspiration for this line of development. I will study this library to see what metrics tend to be useful in an open source environment, and I will also engage in discussions on the Phabricator task and in #analytics-tech-community-metrics to get opinions from seasoned developers. Here the documentation process will be very important to ensure that the user is able to make full use of the library, and I will pay specific attention to this.
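To make these microtasks concrete, here are three minimal sketches of what I have in mind. All table, column, and function names in them are illustrative assumptions on my part, not the actual MediaWikiAnalysis schema or code.

For the SQLAlchemy port (microtask 1), a declarative mapping could look like this; only the connection string ties it to a particular database engine:

```python
# A minimal SQLAlchemy sketch; the Page/Revision models are illustrative
# assumptions, not the actual MediaWikiAnalysis schema.
from sqlalchemy import create_engine, Column, Integer, String, DateTime, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, sessionmaker

Base = declarative_base()

class Page(Base):
    __tablename__ = 'pages'
    id = Column(Integer, primary_key=True)
    title = Column(String(255))
    revisions = relationship('Revision', backref='page')

class Revision(Base):
    __tablename__ = 'revisions'
    id = Column(Integer, primary_key=True)
    page_id = Column(Integer, ForeignKey('pages.id'))
    user = Column(String(255))
    date = Column(DateTime)
    comment = Column(String(255))

# Only the connection string is database-specific: swapping the SQLite URL
# below for e.g. 'mysql://user:pass@host/db' requires no other change.
engine = create_engine('sqlite:///mediawiki.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
```

For microtasks 2 and 3, the key observation is that the MediaWiki API serves revisions in batches and supports continuation, so one well-formed request can replace many smaller calls. A sketch using the public `action=query` interface (the helper name is hypothetical):

```python
# Sketch: fetch the full revision history of one page in as few HTTP
# requests as possible, by asking for maximal batches and following
# the API's 'continue' tokens.
import requests

API_URL = 'https://www.mediawiki.org/w/api.php'

def fetch_revisions(title):
    """Yield revision dicts for `title`, one API batch at a time."""
    params = {
        'action': 'query',
        'prop': 'revisions',
        'titles': title,
        'rvprop': 'ids|timestamp|user|comment',
        'rvlimit': 'max',   # as many revisions per request as the API allows
        'format': 'json',
        'continue': '',
    }
    while True:
        data = requests.get(API_URL, params=params).json()
        for page in data['query']['pages'].values():
            for rev in page.get('revisions', []):
                yield rev
        if 'continue' not in data:
            break
        params.update(data['continue'])  # resume where the API left off
```

For the analytics toolset (microtask 4), the "top x authors active in a time frame" example from the description could become a single library call built on Pandas:

```python
# Sketch: "top n authors active between start and end" as one function.
# The 'revisions' table and its columns are hypothetical names.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///mediawiki.db')

def top_authors(start, end, n=10):
    """Return the n authors with the most revisions in [start, end]."""
    query = text("SELECT user, date FROM revisions "
                 "WHERE date BETWEEN :start AND :end")
    df = pd.read_sql_query(query, engine, params={'start': start, 'end': end})
    return df.groupby('user').size().sort_values(ascending=False).head(n)

print(top_authors('2015-01-01', '2015-12-31'))
```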
**Primary mentor: @jgbarah**
**Co-mentor: @Dicortazar**
**Significance:**
The intention of the project is to extract all history of interest in the MediaWiki system and produce a database with it, organized in a way similar to other MetricsGrimoire tools and easy to query to calculate parameters of interest (an example of such a query follows below). The data to be stored will contain all changes to all pages (such as edits, changes in name, etc.), with all the available information for each change (author, date, kind of change, etc.). Thereafter, there will be an analytics toolset to better analyse and visualize the data in the database.
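As a small illustration of the kind of "parameter of interest" such a database should make easy to compute, here is a minimal sketch, reusing the hypothetical `revisions` table from the sketches above, that counts changes per calendar month:

```python
# Sketch: monthly count of changes across all pages -- one example of a
# query the resulting database should make straightforward.
# The 'revisions' table and 'date' column are hypothetical names.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///mediawiki.db')
df = pd.read_sql_table('revisions', engine, parse_dates=['date'])

edits_per_month = df.set_index('date').resample('M').size()
print(edits_per_month)
```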
**Timeline:**
| Time Window | Task | Deliverable |
| --- | --- | --- |
| Week 1 - Week 3 (7/12/15 - 26/12/15) | T114440 | In this time, I will **investigate the MediaWiki API** to understand what further information could be extracted from it. I will confer with my mentor and #analytics-tech-community-metrics to find out what of that information would be useful to extract. Then I will prepare an **ER diagram and schema** of the proposed database structure and ask for suggestions. Thereafter, I will implement API calls to extract the chosen attributes, and **publish the code** to git after **testing for bugs and gauging performance** by extracting information of varying sizes. |
| Week 4 - Week 5 (27/12/15 - 9/1/16) | T114439 | During this time, I will **analyse the tool** to **eliminate redundant API calls** and increase the overall efficiency of information retrieval. To check for improvements, I will run **tests** of different sizes and **plot graphs** of the time taken. Then I will **commit the code** to git. |
| Week 6 (10/1/16 - 16/1/16) | Documentation | I will use this week for **documentation** of what I have completed up to this point. |
| Week 7 - Week 10 (17/1/16 - 13/2/16) | T116509 | Now, I will **analyse GrimoireLib** to understand what metrics are generally considered useful for the open source community. This will be followed by a **discussion** with my mentors and #analytics-tech-community-metrics to finalize the metrics that are useful and important. Then I will develop an analytics library to compute these metrics. This library will undergo **testing** on real problems of varying sizes to ensure accuracy and efficiency. After that, I will **publish the code** to git. |
| Week 11 - Week 12 (14/2/16 - 27/2/16) | Documentation and wrapping-up report | I will devote these weeks to **documenting** the analytics toolset, and then I will prepare a **wrapping-up report** of the project. |
| Week 13 (29/2/16 - 6/3/16) | Midsemester Examination | I will have my college exams this week. I will complete all my work before this week. |
**Participation:**
This project will require making updates to the git repository for MediaWikiAnalysis. I will communicate my progress via reports on Phabricator (short reports every week) and emails (long reports, according to the timeline).
I will seek help on Phabricator and on IRC channels such as #mediawiki, #wikimedia-dev, and #analytics-tech-community-metrics, as I have done in the past. I have found all of these very helpful and will continue to use them for any further queries I have.
I am also planning to start my own blog, which I will update every week in order to document the project better.
##About me
**Education Status:** I am in the final year of my B.Tech at the Indian Institute of Technology (IIT) Guwahati. My major is Mathematics and Computing.
**Eligible?:** Yes
**Other commitments:** Besides this project, the only other commitment I have during the same time frame would be my coursework.
**I heard about this program from:** A classmate who participated in GSoC
**What making this project happen means to me:** I am very enthusiastic about working in Computer Science. In fact, I am in the process of applying to graduate schools for a Masters in CS with a specialization in Data Science. This project is important to me because it is a chance to make an impact in the field of my interest, i.e., Data Science. Contributing to it would make use of a lot of what I have learnt in my Databases coursework, and it would give me a chance to solve a real-world problem: understanding the practical difficulties users face when dealing with database systems and addressing them through the analytics toolset I will be developing. This is something I have always looked forward to.
##Past Experience
**Please describe your experience with any other FOSS projects as a user and as a contributor:**
Prior to contributing to this project, I contributed a small bug fix in T114437, where I got the chance to learn about the Wikipedia Android application; I also completed T106781 during the course of working towards this project. Otherwise, I am fairly new to this field.
**Please describe any relevant projects that you have worked on previously and what knowledge you gained from working on them:**
I have worked with Python on several occasions, for research projects and coursework. I have worked extensively with database management through MySQL while building a ride-sharing website and a library management portal, so it is easy for me to adapt to different SQL toolkits. Just last semester, I took a course in database management in which I learnt a great deal about schema refinement and normalization, which prevent redundancy in the database while also reducing the number of memory accesses. I feel all of this will be of use to me in the coming months as I work on this project.
**Do you have any past experience working in open source projects (MediaWiki or otherwise)?**
Yes. I have completed the following tasks:
1. T114437
2. T106781