
GSoC 2022 (Proposal): Campaigns Retention Metrics Dashboard
Closed, Declined (Public)

Description

Profile Information

Name: Akash
Zulip username: akashsuper2000
Web Profile: https://akashsuper2000.github.io/
Resume: https://akashsuper2000.github.io/resume.pdf
Location: Chennai, India
Typical working hours: IST
Proposal document: https://phabricator.wikimedia.org/T306268

Synopsis

Short summary describing your project and how it will benefit Wikimedia projects

Project document: https://phabricator.wikimedia.org/T304826
Campaigns are an integral part of the Wikimedia community, aimed at encouraging new and existing users to contribute data/information to the repository. It is therefore essential to understand the impact of such campaigns and how well they retain users. The goal of this project is to develop a metrics dashboard that provides insight into user retention over different time intervals. To achieve this, an ETL pipeline would be built that ingests data from the relevant sources and processes it into a format that can be fed directly to graphs. Insightful graphs are then created from this data and displayed to the user.

Possible Mentors

@Jayprakash12345: https://phabricator.wikimedia.org/p/Jayprakash12345/
@KCVelaga: https://phabricator.wikimedia.org/p/KCVelaga/
@Sadads: https://phabricator.wikimedia.org/p/Sadads/

Have you contacted your mentors already?

Yes, I have contacted the mentors through Wikimedia's Zulip chat.

System design

Proposed system

Metrics for the selected campaigns are to be updated periodically. The proposal is to create a new service that handles data ingestion from Wikimedia's common database (or any other source), processing, and rendering of the required plots.

Architecture diagram

[Image: architecture_diagram.jpg]

Note: Assuming the retention data is displayed on the Campaign page or a sub-page, Wikimedia's backend server would query this database/application.

Application

The required processing is handled by a Python application. UI requests are served by a web layer built on the Python-based Flask framework.
The service can be run on-demand or as a cron job.
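
A minimal sketch of how such a Flask service could expose retention data to the front end. The route, the `RETENTION` dictionary, and all values in it are hypothetical placeholders; in the proposed design the statistics would be read from the service's database.

```python
from flask import Flask, abort, jsonify

app = Flask(__name__)

# Hypothetical, made-up statistics used only for illustration.
# In the proposed design these would come from the service's database.
RETENTION = {
    "example-campaign": {"1_month": 0.42, "3_months": 0.25, "6_months": 0.10},
}


@app.route("/campaigns/<campaign_id>/retention")
def campaign_retention(campaign_id: str):
    """Return the stored retention statistics for one campaign as JSON."""
    stats = RETENTION.get(campaign_id)
    if stats is None:
        abort(404)  # unknown campaign
    return jsonify(campaign=campaign_id, retention=stats)
```

Saved as `app.py`, the service could be started with `flask --app app run` during development; the same processing functions could be called from a cron job without going through the web layer.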

Cron job

The cron job would query the database for campaign timelines to check whether any campaign ended 1 month, 3 months, or 6 months ago. With this information, the user data for the corresponding campaigns is retrieved to calculate the retention statistics and render the required charts.
This job is triggered daily or weekly, depending on the requirements.
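
The checkpoint test could look roughly like the sketch below, assuming a daily run and calendar-month arithmetic. The function names and the dictionary of campaign end dates are hypothetical; a weekly run would need a date range instead of an exact match.

```python
import calendar
from datetime import date

CHECKPOINT_MONTHS = (1, 3, 6)


def months_before(day: date, months: int) -> date:
    """Return the date `months` calendar months before `day`.

    The day of month is clamped, so one month before March 31 is the
    last day of February.
    """
    index = day.year * 12 + (day.month - 1) - months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def due_checkpoints(campaign_end_dates: dict[str, date], today: date) -> list[tuple[str, int]]:
    """List (campaign, months) pairs whose checkpoint falls on `today`."""
    due = []
    for name, end in campaign_end_dates.items():
        for months in CHECKPOINT_MONTHS:
            if end == months_before(today, months):
                due.append((name, months))
    return due
```

Each returned pair identifies a campaign and the retention window to recompute on that run.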

Database

A database is required to store the retention statistics either as rendered images for static graphs or as raw data that can be readily consumed by the front-end application. The structure/type of the database depends on the data that is expected to be stored.

Static | Raw data
Quicker load times | Slower load times because the graphs have to be rendered using the data before displaying it to the UI
Does not allow user interactivity | Allows user interactivity
Storage space depends on the quality of the images expected | Storage space depends on the size of the aggregated data

ETL pipeline

An Extract-Transform-Load (ETL) pipeline is developed within the main application and is responsible for data ingestion, cleaning, processing, and aggregation. The NumPy and Pandas libraries are used as data containers throughout the pipeline: NumPy provides fast data manipulation, whereas Pandas keeps the data in an easily accessible form.
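
As an illustration of the transform step, the sketch below computes a retention rate per campaign with Pandas. The column names (`campaign`, `user`, `campaign_end`, `edit_time`) are hypothetical, and the definition of "retained" used here (at least one edit within `window_days` after the campaign ended) is an assumption, not the project's active user criteria.

```python
import pandas as pd


def retention_by_campaign(
    participants: pd.DataFrame, edits: pd.DataFrame, window_days: int
) -> pd.Series:
    """Share of each campaign's participants with at least one edit
    within `window_days` after the campaign ended.

    participants: columns campaign, user, campaign_end (datetime)
    edits:        columns user, edit_time (datetime)
    """
    # Left join keeps participants who never edited (edit_time is NaT).
    merged = participants.merge(edits, on="user", how="left")
    delta = merged["edit_time"] - merged["campaign_end"]
    # Comparisons against NaT evaluate to False, so non-editors are not retained.
    merged["retained"] = (delta > pd.Timedelta(0)) & (
        delta <= pd.Timedelta(days=window_days)
    )
    # One flag per (campaign, user), then the mean of the flags per campaign.
    per_user = merged.groupby(["campaign", "user"])["retained"].any()
    return per_user.astype(float).groupby(level="campaign").mean()
```

Calling the function once per window (for example 30, 90, and 180 days) would yield the series needed for the 1-month, 3-month, and 6-month views.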

Input

Input to this pipeline is structured/raw data about users who are active (see active user criteria) for a given campaign.

Output

The end result is neatly structured, aggregated data pertaining to each of the campaigns that are being tracked.
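
One possible shape for this output, stored as raw data, is sketched below. SQLite from the Python standard library stands in for whichever database is eventually chosen, and the table and column names are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS retention (
    campaign         TEXT    NOT NULL,
    months_after_end INTEGER NOT NULL,
    participants     INTEGER NOT NULL,
    retained         INTEGER NOT NULL,
    PRIMARY KEY (campaign, months_after_end)
)
"""


def save_retention(conn: sqlite3.Connection, rows) -> None:
    """Insert or overwrite (campaign, months, participants, retained) rows."""
    conn.execute(SCHEMA)
    conn.executemany("INSERT OR REPLACE INTO retention VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def load_retention(conn: sqlite3.Connection, campaign: str) -> list[dict]:
    """Return the retention rate per checkpoint for one campaign."""
    cursor = conn.execute(
        "SELECT months_after_end, participants, retained FROM retention "
        "WHERE campaign = ? ORDER BY months_after_end",
        (campaign,),
    )
    return [
        {"months_after_end": months, "retention_rate": kept / total if total else 0.0}
        for months, total, kept in cursor
    ]
```

Storing counts rather than a precomputed rate keeps the rows easy to re-aggregate if the front end needs a different view.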

Visualization library

The charts can be rendered with a mix of libraries selected from Matplotlib, Plotly, Seaborn, and Bokeh. Which library is used for which graph depends purely on the quality and interactivity expected from that graph. For example, Plotly is a great tool for rendering a choropleth map because it provides users with ready-made interactivity features like pan, zoom, and hover.
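
For the static option, a chart could be rendered to PNG bytes roughly as follows. This is a sketch using Matplotlib's non-interactive Agg backend; the function name and the input format (a mapping of checkpoint label to retention rate) are assumptions.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend, suitable for a server or cron job
import matplotlib.pyplot as plt


def render_retention_png(campaign: str, rates: dict[str, float]) -> bytes:
    """Render a bar chart of retention rates and return it as PNG bytes."""
    labels = list(rates)
    values = [rates[label] for label in labels]

    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(labels, values)
    ax.set_ylim(0, 1)  # rates are fractions between 0 and 1
    ax.set_ylabel("Share of participants retained")
    ax.set_title(f"Retention: {campaign}")

    buffer = io.BytesIO()
    fig.savefig(buffer, format="png")
    plt.close(fig)  # free the figure; important in long-running processes
    return buffer.getvalue()
```

The returned bytes can be written to disk or stored in the database as a rendered image; an interactive chart would instead be built in the browser from the raw data.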

Deliverables

The following would be delivered at the end of the program:

  • A functional web application server, complete with database, authentication, and other integrations.
  • A responsive user interface with the required (interactive) charts that display the user retention data.
  • Detailed documentation and guide to work on the developed codebase.
  • Detailed test reports for the end-to-end test suite and load tests.
  • A summary document and a detailed report for the project/program.

Timeline

May 20 - June 12
  • Community bonding - connect with experts and fellow contributors.
  • Refine the proposal by getting it reviewed with the mentor.
  • Finalize the following:
    • Web framework based on stability, speed, simplicity, developer friendliness, etc.
    • UI design based on responsiveness, visual appeal, etc.
    • Visualization library and graphs based on usefulness, clarity, lack of ambiguity, etc.
    • Process type (on-demand, cron job, or preset).
    • Access restrictions, integrations, and other minor design decisions.
  • Get a working understanding of the technologies that are required for the coding phase.
  • Acquire the necessary permissions to work in the Wikimedia developer ecosystem.
June 13 - June 26
  • Ramp up on the developer workflow and code standards.
  • Build the infrastructure for the ETL pipeline.
  • Set up the application server using Flask (or any other web framework).
June 27 - July 10
  • Modify existing API/database permissions to allow required data to be queried by the service.
  • Enable authentication and authorization to the application.
  • Write relevant queries to import the appropriate data and convert it into a DataFrame (or any other data container).
  • Explore if parallelization and stream reads are necessary, given the size of the data.
July 11 - July 24
  • Clean and process the ingested data to convert it into a suitable form ready to be consumed by the plots.
  • Modify the data to accommodate the requirements for each of the graphs.
  • Complete the mid-project report for phase-1 evaluation.
July 25
  • Phase-1 Evaluation.
July 26 - August 7
  • Develop the finalized graphs using the finalized visualization library.
  • Develop the web controllers to accommodate the web pages.
August 8 - August 21
  • Build the user interface using HTML/CSS and enable placeholders for data display.
  • Forward the graphs to the front-end for display.
  • Test the webpage responsiveness and compatibility across browsers and devices.
August 22 - Sept 4
  • Integrate, if required, with internal/external wiki pages.
  • Dockerize the application, if required, and deploy the service.
  • Perform end-to-end integration tests to expose bugs, security vulnerabilities, and other unnatural behavior.
Sept 5 - Sept 11
  • Monitor the metrics and perform load tests to ensure scalability.
  • Complete the necessary documentation guides (different from code documentation) and final project report document.
Sept 12 - Sept 19
  • Final Evaluation.

Participation

Describe how you plan to communicate progress and ask for help, where you plan to publish your source code, etc

During the period of the program, I would do the following:

  • Push my code to the designated remote code repository after performing the required tests and addressing code review comments.
  • Write detailed weekly reports through Wiki pages or my blog.
  • Stay up-to-date with my goals as outlined in the timeline.
  • Communicate regularly with mentors and keep them updated about my progress and challenges. Wikimedia mentors use Zulip chat for communication.
  • Submit evaluations on time.
  • Attend any program-related meetings that are hosted.
  • Fulfill any other requirements set forth by the organization or GSoC.

About Me

Education

I completed my bachelor's in Computer Science in 2021, with a distinction, from Amrita University, which hosts one of India's top computer science programs. I have also completed multiple specialization courses in Data Science and Machine Learning.

How did you hear about this program?

I have known about GSoC for a long time and even submitted a proposal last year: https://akashsuper2000.github.io/blog/gsoc-2020-proposal

Will you have any other time commitments, such as school work, another job, planned vacation, etc, during the duration of the program?

I recently started working as a Software Engineer (after graduating in 2021). However, I have been accepted into a university in the United States for my Master's in Computer Science. I would therefore be available for the entirety of the program, except for one week (August 1st, 2022 to August 7th, 2022), when I would be busy with my relocation. Neither my job nor my relocation would affect, in any way, my ability to contribute to the program.

We advise all candidates eligible for Google Summer of Code and Outreachy to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)?

I am only applying through the Google Summer of Code program.

What does making this project happen mean to you?

Wikimedia's mission is to bring free education to the world, a mission that deeply resonates with me. This opportunity allows me to directly improve this system while being able to learn new technologies, build critical infrastructure, and network with people who also share this vision. Specific to this project, I would be able to put my data science skills to good use by enabling users to understand the impact of various campaigns which translates to a more efficient financial expenditure to grow this community. This is also my gateway to start contributing to open source.

Past Experience

Describe any relevant projects that you've worked on previously and what knowledge you gained from working on them.
Web development

Throughout my undergraduate years, I was involved in web development projects that enabled me to build solutions with an immediate impact. These include the 'Faculty Dashboard', built using ReactJS, which addresses the need for a centralized portal for the faculty of my institution, and the 'Voice-based transport inquiry system', built using Java SpringMVC, which features an inbuilt voice IO system. I have also worked with Python-based web frameworks like Flask to build quick applications for deploying stats visualizations, running cron jobs, and hosting machine learning models.

Links to applications that are hosted at the moment

Data Science

I have hands-on experience with a range of projects that use data science concepts such as clustering, hypothesis testing, ranking, regression, and SVM, gained through the "Fundamentals of Data Science" course I took in college. As part of the course, I worked with tools like NumPy, Pandas, Matplotlib, Seaborn, Plotly, and Bokeh, which would allow me to quickly ramp up on Wikimedia's development ecosystem.

Big data

Through the "Big Data" course I took in college and my work as a Software Engineer in a large organization, I got the opportunity to explore and work with big data tools in the Apache Hadoop ecosystem such as MapReduce, Hive, and Pig.

Databases

I have extensively used a variety of databases such as MySQL, MongoDB, Aurora RDS, DynamoDB, Cassandra, and Google BigQuery. I believe that these experiences would enable me to transition smoothly into the MariaDB ecosystem here at Wikimedia.

Other - Competitions

My efforts in a diverse set of projects are complemented by my involvement in hackathons and competitions. I have participated in numerous Kaggle competitions, securing multiple medals to rank among the top 200 globally. I have also participated in CTF contests where my team ranked top 100 nationally for two consecutive years.

Describe any open source projects you have contributed to as a user and contributor

While I have a good number of "open-sourced" projects under my belt such as "license plate detection", "voice-based ticket booking system", and "COVID-19 tracker", I do not have first-hand experience contributing to an external open-source project. I believe that this program would be a good starter for just that. Moreover, through this program, I can build valuable connections in the community and get into active open-source participation and contribution.

Other Information

Pre-requisites for the project
Microtasks

I completed both microtasks assigned to evaluate my candidacy for this project, and both were approved by the mentor.

Event Timeline

akashsuper2000 renamed this task from Insert project title here to GSoC 2022: Campaigns Retention Metrics Dashboard. Apr 15 2022, 5:55 PM
akashsuper2000 updated the task description.
akashsuper2000 set the point value for this task to 350.

@akashsuper2000 Hi! I am Srishti, one of the org admins - it's great to see your interest in applying to GSoC with Wikimedia! You can safely ignore this message if you have already followed our participants' guide. As you develop your proposal, we want to ensure that you follow the application process steps: https://www.mediawiki.org/wiki/Google_Summer_of_Code/Participants#Application_process_steps, primarily communicate with project mentors, integrate their feedback in your proposal, adhere to the guidelines around proposal submission, contribute to microtasks, etc. Let us know if there are any questions!

KCVelaga renamed this task from GSoC 2022: Campaigns Retention Metrics Dashboard to GSoC 2022 (Proposa): Campaigns Retention Metrics Dashboard. Apr 16 2022, 6:06 AM
KCVelaga renamed this task from GSoC 2022 (Proposa): Campaigns Retention Metrics Dashboard to GSoC 2022 (Proposal): Campaigns Retention Metrics Dashboard.
KCVelaga removed the point value for this task.

As the GSoC deadline is less than 24 hours away (April 19, 2022, 18:00 UTC), please ensure that the information in your proposal on Phabricator is complete and that you have already submitted it on Google's program website in the recommended format. When you have done so, please move your proposal on the Phabricator workboard https://phabricator.wikimedia.org/project/board/5716/ from "Proposals in Progress" to the "Proposals Submitted" column by simply dragging it. Let us know if you have any questions.

Gopavasanth added a subscriber: Gopavasanth.

@akashsuper2000 We are sorry to say that we could not allocate a slot for you this time. Please do not consider the rejection to be an assessment of your proposal. We received over 75 quality applications, and we could only accept 10 students. We were not able to give a slot to every applicant who would have deserved one, and these were some very tough decisions to make. Please know that you are still a valued member of our community and we by no means want to exclude you. Many students who we did not accept in 2021 have become Wikimedia maintainers, contractors and even GSoC students and mentors this year!

Your ideas and contributions to our projects are still welcome! As a next step, you could consider finishing up any pending pull requests or inform us that someone has to take them over. Here is the recommended place for you to get started as a newcomer: https://www.mediawiki.org/wiki/New_Developers.

If you would still be eligible for GSoC next year, we look forward to your participation.