Background and Intro
In Wikimedia, campaigns are activities run annually by many volunteer and partner-led communities to encourage new and existing users to contribute images, data and information in categories such as earth, science, monuments and art. One key impact of campaigns is that they introduce the Wikimedia projects to users who are new to contributing. Campaigns are considered an easy gateway for new users to get acquainted with adding and modifying content across the Wikimedia projects, for example by uploading a document or image to Wikimedia Commons or editing an article on Wikipedia. Because new users enter the Wikimedia ecosystem through campaigns, tracking statistics about these users would help us understand user retention and quantify the impact of campaigns in this area.
Taking inspiration from the Wiki Loves stats tool, the idea is to develop a dashboard that tracks and shares retention metrics of participants, especially newcomers, after a particular campaign ends. For this, we need to monitor users' contributions across Wikimedia projects after the end of the campaign. Initially, the scope will be limited to new users from photo campaigns, and we will measure their retention across Wikimedia projects at regular intervals after the campaign ends: 3, 6, and 12 months.
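To make the metric concrete, here is a minimal sketch of how retention at each interval could be computed from a log of contributions. It is an illustration, not the project's agreed definition: it assumes a participant counts as retained at N months if they made at least one contribution on any project on or after the campaign end plus N months, and it approximates a month as 30 days. The campaign end date, user names and contribution rows are all made up.

```python
from datetime import datetime, timedelta

CAMPAIGN_END = datetime(2021, 9, 30)  # hypothetical campaign end date
INTERVALS_MONTHS = (3, 6, 12)

# Toy contribution log: (user, project, timestamp). All values are invented.
contributions = [
    ("alice", "commonswiki", datetime(2021, 9, 10)),
    ("alice", "enwiki", datetime(2022, 1, 15)),
    ("alice", "commonswiki", datetime(2022, 10, 2)),
    ("bob", "commonswiki", datetime(2021, 9, 12)),
    ("bob", "commonswiki", datetime(2021, 11, 1)),
    ("carol", "commonswiki", datetime(2021, 9, 20)),
    ("carol", "wikidatawiki", datetime(2022, 4, 5)),
]


def retention_rates(contributions, campaign_end, intervals_months):
    """Return {months: share of participants still contributing after that long}.

    Assumed definition: a participant is anyone with a contribution on or
    before campaign_end; they are retained at N months if their latest
    contribution falls on or after campaign_end + 30 * N days.
    """
    participants = {user for user, _, ts in contributions if ts <= campaign_end}

    # Latest contribution per participant, across all projects.
    last_seen = {}
    for user, _, ts in contributions:
        if user in participants:
            last_seen[user] = max(last_seen.get(user, ts), ts)

    rates = {}
    for months in intervals_months:
        threshold = campaign_end + timedelta(days=30 * months)
        retained = sum(1 for user in participants if last_seen[user] >= threshold)
        rates[months] = retained / len(participants) if participants else 0.0
    return rates


rates = retention_rates(contributions, CAMPAIGN_END, INTERVALS_MONTHS)
for months, rate in rates.items():
    print(f"{months:>2} months: {rate:.0%} retained")
```

Other definitions are equally plausible, for instance counting activity within each window rather than after it; the choice changes the numbers and would need to be settled before building the dashboard.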
Project Stages
- Stage 1 - ETL pipeline and dataset prep: As we get started with the project, we will need to plan how to extract, transform and load the data before it can be put into the dashboard. We might start by exploring a single campaign/category and then scale up to 3-4 campaigns. This stage will include using SQL to extract data from MediaWiki MariaDB databases, processing MediaWiki history dumps if needed, and applying the necessary transformations in Python to arrive at the required metrics. We will start by preparing a dataset for a single campaign to track contributions of participants after the campaign is over, over a period of 3, 6, 9 and 12 months across all Wikimedia projects.
- Stage 2 - Visualisations and Dashboarding: The prepared dataset needs to be visualized and a web app created. We are open about the final framework or library used for deployment; it depends on how the project needs emerge and on the skills of the selected student. Some options are an HTML/CSS front-end with Flask/Django, Streamlit, and Dash.
- Stage 3: Time permitting, we will enable country-level filtering of the data, which can help us understand user retention metrics in a particular country.
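As a rough illustration of the transform step in Stage 1 and the country-level grouping in Stage 3, the sketch below takes rows shaped like the output of a SQL query and picks out newcomers per country. MediaWiki stores timestamps as 14-digit `YYYYMMDDHHMMSS` strings, which the database driver may return as bytes. Everything else is an assumption made for the example: the row layout, the `country` field (how a participant is mapped to a country is not defined here), the helper names, and the rule that a newcomer is someone who registered during the campaign window.

```python
from collections import defaultdict
from datetime import datetime

MW_TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit MediaWiki timestamp, e.g. 20210905101500


def parse_mw_timestamp(value):
    """Convert a MediaWiki timestamp (str or bytes) to a datetime."""
    if isinstance(value, bytes):
        value = value.decode("ascii")
    return datetime.strptime(value, MW_TS_FORMAT)


def newcomers_by_country(rows, campaign_start, campaign_end):
    """Group users who registered during the campaign window by country.

    `rows` is a hypothetical extract with one dict per upload; a user may
    appear several times, so names are de-duplicated with a set.
    """
    start = parse_mw_timestamp(campaign_start)
    end = parse_mw_timestamp(campaign_end)
    grouped = defaultdict(set)
    for row in rows:
        if row["user_registration"] is None:
            continue  # skip rows with no registration date recorded
        registered = parse_mw_timestamp(row["user_registration"])
        if start <= registered <= end:
            grouped[row["country"]].add(row["user_name"])
    return {country: sorted(users) for country, users in grouped.items()}


# Invented sample rows standing in for the result of an extraction query.
rows = [
    {"user_name": "alice", "user_registration": b"20210905101500", "country": "IN"},
    {"user_name": "alice", "user_registration": b"20210905101500", "country": "IN"},
    {"user_name": "bob", "user_registration": "20150101000000", "country": "IN"},
    {"user_name": "carol", "user_registration": "20210918120000", "country": "BR"},
    {"user_name": "dave", "user_registration": None, "country": "BR"},
]

result = newcomers_by_country(rows, "20210901000000", "20210930235959")
print(result)
```

With real data volumes the same logic would more likely be expressed as a Pandas group-by or pushed into the SQL query itself; the plain-Python version is only meant to show the shape of the transformation.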
Mentors
Skills required
Must haves
- SQL and Python for data analysis (Pandas and NumPy libraries)
- Knowledge of at least one Python visualization library (Matplotlib, Seaborn, Plotly, Bokeh etc.) and willingness to learn others if required.
- Knowledge of HTML/CSS and Flask/Django (basic understanding is fine; or be open to learning during the community bonding period).
Preferred
- Experience with big data tools such as Spark, Hive
- Experience with building data-related web applications or web-apps in general.
- Basic knowledge of Kubernetes
Time commitment / Difficulty
- 350 hours / hard: The complexity of the project comes mainly from the initial processing of a very large volume of edit data (whether from MariaDB or from dumps) to determine user activity.
Getting started
- Understand how Wikipedia works
- Understand Wikimedia Commons
- Understand what campaigns are and how they work
- Get familiar with the MediaWiki database layout
- Get familiar with the MediaWiki Query API
- Read about MediaWiki history dumps (optional, not required to do the micro-tasks)
Micro-tasks
- Microtask 1: T304974 [Time estimate to complete: 30 min - 2 hrs (depending on familiarity with the MediaWiki database)]
- Microtask 2: T305309 [Time estimate to complete: 30 min]
Note: The micro-tasks are not interdependent and can be completed in any order; however, Microtask 1 will carry greater weight during application evaluation.
Additional information for applicants
- If you have any questions, post them on this ticket or ask on the Zulip stream #gsoc2022: campaign retention metrics dashboard
- Please share your past work in your application to demonstrate your experience with data analysis/handling/processing using either Python or SQL. You may share links or the files directly as well.