Wikipedia Analytics Request
Purpose
Please provide as much context as possible as well as what the produced insights or services will be used for.
The TechWish team is targeting the following behavior change in 2026:
- We want to see an increase in newcomers (< 100 edits) returning to make constructive reference edits. If we see an increase during the period in which they are still defined as newcomers, we hypothesise that the improvements we make are helping them understand and learn how to use the citation tooling. This also means that when they grow into more experienced users (as measured by the number of edits), they are equipped to continue making constructive edits and are supported by the tooling.
Constructive reference edits are defined as published edits that include a reference and are not reverted within 48 hours of being published.
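As a rough illustration of this definition, the check could be sketched as below. It uses the wmf.mediawiki_history fields and the editcheck-newreference tag listed under Sub Tasks. The `revision_tags` field name, the plain-dict row shape, and all function names are assumptions made for this sketch, not the final metric logic.

```python
# Sketch only: each row is a plain dict standing in for a wmf.mediawiki_history record.
REVERT_WINDOW_SECONDS = 48 * 60 * 60  # 172800 seconds, i.e. the 48-hour window
NEWCOMER_MAX_EDITS = 100
REFERENCE_TAG = "editcheck-newreference"


def is_newcomer(row):
    """Newcomer: fewer than 100 edits at the time of the revision."""
    count = row.get("event_user_revision_count")
    return count is not None and count < NEWCOMER_MAX_EDITS


def is_reverted_within_window(row):
    """True if the revision was identity-reverted within 48 hours."""
    if not row.get("revision_is_identity_reverted"):
        return False
    seconds = row.get("revision_seconds_to_identity_revert")
    return seconds is not None and seconds <= REVERT_WINDOW_SECONDS


def is_constructive_reference_edit(row):
    """Tagged as adding a new reference and not reverted within 48 hours."""
    tags = row.get("revision_tags") or []  # assumed field name for revision tags
    return REFERENCE_TAG in tags and not is_reverted_within_window(row)
```

Under this sketch, an edit reverted after more than 48 hours still counts as constructive, which follows the definition above.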
The success metric for this is:
The percentage of newcomers that have successfully published their first reference using VisualEditor (with or without ReferenceCheck), that return to add another reference without ReferenceCheck within y days, increases by x%.
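To make the metric concrete, a minimal sketch of the calculation follows. It assumes the edits have already been filtered upstream: `first_reference_at` holds each newcomer's first published VisualEditor reference edit, and `followups_by_user` holds their later reference edits made without ReferenceCheck. Both inputs and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def return_rate(first_reference_at, followups_by_user, y_days):
    """Percentage of newcomers with a follow-up reference edit within y_days.

    first_reference_at: dict of user id -> datetime of the first reference edit.
    followups_by_user: dict of user id -> list of datetimes of later reference
        edits made without ReferenceCheck (assumed to be filtered upstream).
    """
    window = timedelta(days=y_days)
    returned = 0
    for user, first in first_reference_at.items():
        followups = followups_by_user.get(user, [])
        # A return is any follow-up strictly after the first edit and inside the window.
        if any(first < ts <= first + window for ts in followups):
            returned += 1
    total = len(first_reference_at)
    return 100.0 * returned / total if total else 0.0
```

Newcomers whose first reference edit falls less than y days before the end of the data have not had a full window to return, so they may need to be excluded before calling this.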
Please see notes from chat with Megan Neisler.
Questions to be answered
- At what rate, and within how many days, do newcomers return to make an edit?
- At what rate, and within how many days, do newcomers return to make an edit with references?
- What is the percentage of newcomers that return to add another reference?
- What is a plausible increase that we can work towards in 2026? This determines the x% in the success metric.
- What is the total number of newcomers?
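One way to approach the "within how many days" questions is to look at the cumulative share of newcomers who have returned by each candidate window, which may also help in choosing y. The sketch below assumes one days-to-return value per newcomer, with `None` for those who never returned. The function name and input shape are illustrative only.

```python
def share_returning_by_day(days_to_return, candidate_windows):
    """Cumulative percentage of newcomers who returned within each window.

    days_to_return: one entry per newcomer; None means they never returned.
    candidate_windows: iterable of window lengths in days, e.g. [7, 30, 90].
    """
    total = len(days_to_return)
    shares = {}
    for window in candidate_windows:
        count = sum(1 for d in days_to_return if d is not None and d <= window)
        shares[window] = 100.0 * count / total if total else 0.0
    return shares
```

Comparing the shares across windows shows where the curve flattens, which is one possible basis for picking y.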
Desired Outputs
The desired outputs of this task are listed and confirmed as being finished below.
- Superset report for de.wiki showing total number of newcomers over time.
- Superset report showing percentage of newcomers returning to add another reference within y days.
Deadline
Please make the time sensitivity of this request clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.
- First round of results preferably by 30.11.2025 so that we can adjust the plan if need be.
- The data points have to be available by the beginning of Q1 2026 so that the numbers can be updated on meta.wikipedia.org.
Information below this point is filled out by the task assignee.
Assignee Planning
Sub Tasks
A full breakdown of the steps to complete this task.
- Explore potentially useful tables (DataHub)
- wmf.mediawiki_history
- editcheck-newreference revision tag for all edits made using the visual editor to pages in the main namespace that involve an edit where people add a net new reference
- revision_is_identity_reverted AND revision_seconds_to_identity_revert <= 172800 for reverted in 48 hrs (not a "constructive" edit)
- event_user_revision_count < 100 for newcomer
- research.mediawiki_content_diff - parse parent_revision_diff to identify URLs
- canonical_data.wikis for deriving Wikipedias (database_code to join and database_group = 'wikipedia')
- Derive base check metric to compare results against
- Check metric: Check against metrics in Contributors Metrics dashboards
- Read related research
- Research:Analyzing_sources_on_Wikipedia
- References mostly done via:
- Research:Analyzing_sources_on_Wikipedia
- Check past work related to this process
- collect_revision_citation_data.ipynb
- reference-changes.ipynb
- newcomer-peacock (can be edited to derive revisions that added URLs)
- collect_constructive_edits.ipynb
- Investigate other metadata that's easily accessible with the edits to see if potential breakdowns would be useful
- No other fields within wmf.mediawiki_history looked useful alongside these fields, and the process is quite resource intensive, so I decided to stick with what we have
- Finalize outputs for metrics data process (check with TechWish Product and Engineering)
- Write needed create table queries
- Transfer work into metrics HQL scripts
- Write Airflow DAG and testing files
- Test and deploy Airflow DAG
Estimation
Estimate: 5 days
Actual:
Data
The tables that will be referenced in this task.
- wmf.mediawiki_history
- canonical_data.wikis
Notes
Things that came up during the completion of this task, questions to be answered and follow up tasks.
- Note