Profile Information
Name: Akash
Zulip username: akashsuper2000
Web Profile: https://akashsuper2000.github.io/
Resume: https://akashsuper2000.github.io/resume.pdf
Location: Chennai, India
Typical working hours: IST
Proposal document: https://phabricator.wikimedia.org/T306268
Synopsis
Short summary describing your project and how it will benefit Wikimedia projects
Project document: https://phabricator.wikimedia.org/T304826
Campaigns are an integral part of the Wikimedia community, aimed at encouraging new and existing users to contribute data and information to the repositories. It is therefore essential to understand the impact of such campaigns and the user retention they produce. The goal of this project is to develop a metrics dashboard that provides insights into user retention over different time intervals. To achieve this, an ETL pipeline is built that ingests data from the relevant sources and processes it into a graph-ready format; insightful graphs are then generated from this data and displayed to the user.
Possible Mentors
@Jayprakash12345: https://phabricator.wikimedia.org/p/Jayprakash12345/
@KCVelaga: https://phabricator.wikimedia.org/p/KCVelaga/
@Sadads: https://phabricator.wikimedia.org/p/Sadads/
Have you contacted your mentors already?
Yes, I have contacted the mentors through Wikimedia's Zulip chat.
System design
Proposed system
Metrics for the selected campaigns are to be updated on a periodic basis. The proposal is to create a new service that handles data ingestion from Wikimedia's common database (or any other source), processing of that data, and rendering of the required plots.
Architecture diagram
Note: Wikimedia's backend server would query this database/application, assuming the retention data is to be displayed on the campaign page or a sub-page.
Application
A Python application handles the required processing; the UI requests are served by a web layer built on the Python-based Flask framework.
The service can be run on-demand or as a cron job.
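As a rough sketch of this layer (the endpoint, database file, and table/column names below are assumptions for illustration, not finalized design):

```python
# Minimal sketch of the Flask layer serving pre-aggregated retention data.
# The route, database file, and column names are placeholders.
import sqlite3

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/campaigns/<int:campaign_id>/retention")
def retention(campaign_id):
    # Fetch the pre-computed retention rows for one campaign.
    con = sqlite3.connect("retention.db")
    rows = con.execute(
        "SELECT interval_months, retained_users, total_users "
        "FROM retention_stats WHERE campaign_id = ?",
        (campaign_id,),
    ).fetchall()
    con.close()
    return jsonify(
        [{"interval_months": m, "retained": r, "total": t} for m, r, t in rows]
    )

if __name__ == "__main__":
    app.run(debug=True)
```

In the on-demand mode this process only does work when a request arrives; the cron variant described next pre-computes the statistics instead.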
Cron job
The cron job would query the database for campaign timelines to check whether any campaign ended 1, 3, or 6 months ago. With this information, the user data for the corresponding campaigns is retrieved, the retention statistics are calculated, and the required charts are rendered.
This job is triggered daily/weekly depending on the requirements.
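A minimal sketch of that periodic check, assuming a `campaigns` table with an `end_date` column (both names are illustrative, not the real schema):

```python
# Sketch of the cron entry point: find campaigns that ended exactly
# 1, 3, or 6 months ago so their retention stats can be refreshed.
# Table and column names are placeholders, not the real schema.
from datetime import date

import sqlite3
from dateutil.relativedelta import relativedelta

INTERVALS_MONTHS = (1, 3, 6)

def campaigns_due_for_update(con, today=None):
    today = today or date.today()
    due = []
    for months in INTERVALS_MONTHS:
        cutoff = today - relativedelta(months=months)
        rows = con.execute(
            "SELECT id FROM campaigns WHERE end_date = ?",
            (cutoff.isoformat(),),
        ).fetchall()
        due.extend((campaign_id, months) for (campaign_id,) in rows)
    return due  # e.g. [(42, 1), (17, 3)] -> recompute stats for these
```

Each `(campaign_id, months)` pair would then drive the retrieval and aggregation of that campaign's user data.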
Database
A database is required to store the retention statistics either as rendered images for static graphs or as raw data that can be readily consumed by the front-end application. The structure/type of the database depends on the data that is expected to be stored.
| Static (rendered images) | Raw data |
| --- | --- |
| Quicker load times | Slower load times, since the graphs must be rendered from the data before display |
| Does not allow user interactivity | Allows user interactivity |
| Storage space depends on the expected image quality | Storage space depends on the size of the aggregated data |
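If the raw-data option is chosen, the store could be as small as one aggregate table; a sketch of a possible schema (names and types are illustrative only):

```python
# Illustrative schema for the raw-data option: one row per campaign per
# retention interval, ready to be consumed directly by the front end.
import sqlite3

con = sqlite3.connect("retention.db")
con.execute(
    """
    CREATE TABLE IF NOT EXISTS retention_stats (
        campaign_id     INTEGER NOT NULL,
        interval_months INTEGER NOT NULL,  -- 1, 3, or 6
        retained_users  INTEGER NOT NULL,  -- users still active after the interval
        total_users     INTEGER NOT NULL,  -- users who participated in the campaign
        computed_on     TEXT    NOT NULL,  -- ISO date the row was last refreshed
        PRIMARY KEY (campaign_id, interval_months)
    )
    """
)
con.commit()
```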
ETL pipeline
An Extract-Transform-Load (ETL) pipeline is developed within the main application and is responsible for data ingestion, cleaning, processing, and aggregation. The NumPy and Pandas libraries serve as data containers throughout the pipeline: NumPy provides fast data manipulation, while Pandas keeps the data in an easily accessible tabular form.
Input
Input to this pipeline is structured/raw data about the users who were active (per the active-user criteria) in a given campaign.
Output
The end result is neatly structured, aggregated data pertaining to each of the campaigns that are being tracked.
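To make the transform step concrete, here is a hedged Pandas sketch; the input columns (`user`, `edit_date`) and the 30-day activity window are assumptions standing in for the agreed active-user criteria:

```python
# Sketch of the aggregate step: from per-user edit events to a retention
# fraction for one campaign and one interval. Column names and the
# activity rule are illustrative, not the finalized criteria.
import pandas as pd

def retention_for_interval(edits: pd.DataFrame,
                           campaign_end: pd.Timestamp,
                           months: int) -> float:
    """Fraction of participants with at least one edit in the 30-day
    window starting `months` months after the campaign ends."""
    window_start = campaign_end + pd.DateOffset(months=months)
    window_end = window_start + pd.Timedelta(days=30)
    participants = set(edits.loc[edits["edit_date"] <= campaign_end, "user"])
    active = set(
        edits.loc[edits["edit_date"].between(window_start, window_end), "user"]
    )
    return len(participants & active) / len(participants) if participants else 0.0
```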
Visualization library
The charts can be rendered by a mix of libraries selected from Matplotlib, Plotly, Seaborn, and Bokeh. Which library is used for which graph depends purely on the quality and interactivity expected of that graph. For example, Plotly is a great tool for rendering a choropleth map, providing users with ready-made interactivity features such as pan, zoom, and hover.
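For instance, an interactive choropleth can be produced in a few lines with Plotly Express; the DataFrame and its columns below are invented for illustration:

```python
# Sketch of an interactive choropleth: pan, zoom, and hover come built in.
# The data and column names are made up, not real pipeline output.
import pandas as pd
import plotly.express as px

df = pd.DataFrame(
    {"country": ["IND", "USA", "DEU"], "retained_users": [120, 80, 45]}
)
fig = px.choropleth(
    df,
    locations="country",        # ISO-3 country codes
    color="retained_users",     # value mapped onto the color scale
    color_continuous_scale="Blues",
)
fig.show()
```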
Deliverables
The following would be delivered at the end of the program:
- A functional web application server, complete with database, authentication, and other integrations.
- A responsive user interface with the required (interactive) charts that display the user retention data.
- Detailed documentation and guide to work on the developed codebase.
- Detailed test reports for the end-to-end test suite and load tests.
- A summary document and a detailed report for the project/program.
Timeline
May 20 - June 12
- Community bonding - connect with experts and fellow contributors.
- Refine the proposal by getting it reviewed by the mentors.
- Finalize the following:
- Web framework based on stability, speed, simplicity, developer friendliness, etc.
- UI design based on responsiveness, visual appeal, etc.
- Visualization library and graphs based on usefulness, clarity, lack of ambiguity, etc.
- Process type (on-demand, cron job, or preset).
- Access restrictions, integrations, and other minor design decisions.
- Get a working understanding of the technologies that are required for the coding phase.
- Acquire the necessary permissions to work in the Wikimedia developer ecosystem.
June 13 - June 26
- Ramp up on the developer workflow and code standards.
- Build the infrastructure for the ETL pipeline.
- Set up the application server using Flask (or another web framework).
June 27 - July 10
- Modify existing API/database permissions to allow required data to be queried by the service.
- Enable authentication and authorization to the application.
- Write relevant queries to import the appropriate data and convert it into a DataFrame (or another data container); a sketch of this step follows the list.
- Explore if parallelization and stream reads are necessary, given the size of the data.
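As a rough illustration of the query-to-DataFrame step (the connection string, table, and column names are placeholders, not Wikimedia's actual schema):

```python
# Sketch: pull per-user campaign activity straight into a Pandas DataFrame.
# The DSN, table, and column names are placeholders for the real replica.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:pass@host/wikidb")  # placeholder DSN
query = text(
    "SELECT user_id, campaign_id, edit_date "
    "FROM campaign_edits WHERE campaign_id = :cid"
)
df = pd.read_sql(query, engine, params={"cid": 42})
print(df.head())
```

If the result sets turn out to be large, the same call can be made with `chunksize=` to stream the rows instead of loading them all at once.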
July 11 - July 24
- Clean and process the ingested data into a form ready to be consumed by the plots.
- Modify the data to accommodate the requirements for each of the graphs.
- Complete the mid-project report for phase-1 evaluation.
July 25
- Phase-1 Evaluation.
July 26 - August 7
- Develop the finalized graphs using the finalized visualization library.
- Develop the web controllers that serve the web pages.
August 8 - August 21
- Build the user interface using HTML/CSS and enable placeholders for data display.
- Forward the graphs to the front-end for display.
- Test the webpage responsiveness and compatibility across browsers and devices.
August 22 - Sept 4
- Integrate, if required, with internal/external wiki pages.
- Dockerize the application, if required, and deploy the service.
- Perform end-to-end integration tests to expose bugs, security vulnerabilities, and other unexpected behavior.
Sept 5 - Sept 11
- Monitor the metrics and perform load tests to ensure scalability.
- Complete the necessary documentation guides (different from code documentation) and final project report document.
Sept 12 - Sept 19
- Final Evaluation.
Participation
Describe how you plan to communicate progress and ask for help, where you plan to publish your source code, etc
During the period of the program, I would do the following:
- Push my code to the designated remote repository after performing the required tests and addressing code review comments.
- Write detailed weekly reports through Wiki pages or my blog.
- Stay up-to-date with my goals as outlined in the timeline.
- Communicate regularly with mentors and keep them updated about my progress and challenges. Wikimedia mentors use Zulip chat for communication.
- Submit evaluations on time.
- Attend any program-related meetings that are hosted.
- Fulfill any other requirements set forth by the organization or GSoC.
About Me
Education
I completed my bachelor's in Computer Science with distinction in 2021 from Amrita University, which hosts one of India's top computer science programs. I have also completed multiple specialization courses in Data Science and Machine Learning.
How did you hear about this program?
I have known about GSoC for a long time and have previously submitted a proposal: https://akashsuper2000.github.io/blog/gsoc-2020-proposal
Will you have any other time commitments, such as school work, another job, planned vacation, etc, during the duration of the program?
I recently started working as a Software Engineer after graduating in 2021. I have also been accepted into a university in the United States for my Master's in Computer Science. I would therefore be available for the entirety of the program, except for one week (August 1-7, 2022) when I will be busy relocating. Neither my job nor my relocation would in any way affect my ability to contribute to the program.
We advise all candidates eligible for Google Summer of Code and Outreachy to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)?
I am only applying through the Google Summer of Code program.
What does making this project happen mean to you?
Wikimedia's mission is to bring free education to the world, a mission that deeply resonates with me. This opportunity would allow me to directly improve that system while learning new technologies, building critical infrastructure, and networking with people who share the same vision. Specific to this project, I would put my data science skills to good use by enabling users to understand the impact of various campaigns, which translates into more efficient spending to grow the community. It would also be my gateway into contributing to open source.
Past Experience
Describe any relevant projects that you've worked on previously and what knowledge you gained from working on them.
Web development
Throughout my undergraduate years, I worked on web development projects that let me build solutions with immediate impact. These include the 'Faculty Dashboard', built with ReactJS, which provides a centralized portal for the faculty of my institution, and the 'Voice-based transport inquiry system', built with Java Spring MVC, which features a built-in voice I/O system. I have also worked with Python-based web frameworks like Flask to build quick applications for deploying stats visualizations, running cron jobs, and hosting machine learning models.
Links to applications that are hosted at the moment
- COVID-19 dashboard using Flask: https://akashsuper2000.pythonanywhere.com/
- Python executor using Flask: https://akash2000.pythonanywhere.com/
- Faculty dashboard using ReactJS: https://akashsuper2000.github.io/faculty-dashboard/
Data Science
I have hands-on experience with a range of projects that use data science concepts such as clustering, hypothesis testing, ranking, regression, and SVMs, gained in the "Fundamentals of Data Science" course I took in college. As part of the course, I worked with tools like NumPy, Pandas, Matplotlib, Seaborn, Plotly, and Bokeh, which will allow me to quickly ramp up in Wikimedia's development ecosystem.
Big data
Through the "Big Data" course I attended in my college and as part of working as a Software Engineer in a huge organization, I got the opportunity to explore and work on big data tools in the Apache Hadoop ecosystem such as MapReduce, Hive, and Pig.
Databases
I have worked extensively with a variety of databases, including MySQL, MongoDB, Aurora RDS, DynamoDB, Cassandra, and Google BigQuery. I believe this experience will let me transition smoothly into the MariaDB ecosystem at Wikimedia.
Other - Competitions
My project work is complemented by my involvement in hackathons and competitions. I have participated in numerous Kaggle competitions, securing multiple medals and ranking among the top 200 globally. I have also participated in CTF contests, where my team ranked in the national top 100 for two consecutive years.
Describe any open source projects you have contributed to as a user and contributor
While I have a good number of open-sourced projects of my own, such as "license plate detection", "voice-based ticket booking system", and "COVID-19 tracker", I do not have first-hand experience contributing to an external open-source project. I believe this program would be a good starting point for exactly that. Moreover, through the program I can build valuable connections in the community and move into active open-source participation and contribution.
Other Information
Pre-requisites for the project
Microtasks
I completed both of the microtasks assigned to evaluate my candidacy for this project and got them approved by the mentor.
- Microtask 1: https://phabricator.wikimedia.org/T304974
- Microtask 2: https://phabricator.wikimedia.org/T305309