[Spike] Full-year editing stats for Year in Review
Open, MediumPublic

Description

Background

In 2024-2025 the iOS team built a personalized Wikipedia Year in Review feature (T371946). We had to make tradeoffs in how we did our editing stats, using a mix of the userContributions API and the Growth Impact Module. The Mobile Apps team would like to include editing stats again in 2025's version, improve how they are displayed, and maybe expand what we show to users.

Our goal with this spike is to understand what our options are for improving our access to editing statistics in a scalable way. One possibility discussed has been improvements to the Impact Module API, but this spike has been edited to be solution-agnostic.

Wishlist for 2025 Year in Review
  • Update frequency: daily, and ability to support a batch request for users at scale in December 2025
    • [Highest priority] Total edit count for ALL of 2025, but not limited to 500 edits
    • [Second priority] Views accumulated in 2025 on pages they've edited (edits can be in 2025 or all time)
    • (Nice to have) # of days that they edited in 2025, with possibility for calculating edit streak
    • (Nice to have) List of article titles they've edited in 2025
    • (Nice to have) Total bytes they've added in 2025
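
To make the wishlist concrete, here is a minimal sketch of how these statistics could be derived for one user, assuming a per-edit record with a calendar day, an article title, and a byte difference is available from some source. The `Edit` type, its field names, and `summarize_year` are hypothetical and only illustrate the calculations; they do not describe an existing API. Views on edited pages are left out because they need a separate pageview source.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Edit:
    """Hypothetical edit record; field names are illustrative only."""
    day: date        # calendar day on which the edit was made
    title: str       # title of the edited article
    size_diff: int   # byte change of the edit (negative for removals)


def summarize_year(edits, year):
    """Compute the wishlist editing stats for one user and one calendar year."""
    in_year = [e for e in edits if e.day.year == year]
    days = sorted({e.day for e in in_year})

    # Longest run of consecutive calendar days with at least one edit.
    longest = current = 0
    previous = None
    for d in days:
        if previous is not None and d - previous == timedelta(days=1):
            current += 1
        else:
            current = 1
        longest = max(longest, current)
        previous = d

    return {
        "edit_count": len(in_year),
        "days_edited": len(days),
        "longest_streak": longest,
        "titles": sorted({e.title for e in in_year}),
        # Only positive diffs count as bytes added.
        "bytes_added": sum(e.size_diff for e in in_year if e.size_diff > 0),
    }
```

Every value here depends on having the full list of the year's edits, which is why the uncapped total edit count is the highest-priority item.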
Engineering notes
  • For mobile apps we need the ability to query at a scale sufficient to cover Android & iOS active account holders, plus enough margin to allow for potential growth in response to the feature's release. There will likely be a spike in requests for statistics when YiR is released. If folks find out that the Apps feature includes editing stats, we might get folks downloading the app and logging in just to see it, but this tail should be more distributed.
  • The mobile apps would be making the calls, so the data needs to be retrievable via an externally accessible endpoint. We would need coordination on the type of authentication mechanism.
Open questions
  • Are there any changes that could be made to the Growth Impact Module's statistics so that clients could access the following statistics for the current calendar year (instead of the last 60 days or past 1000 edits) for the above requirements? Please note if any are substantially easier than others (we don't need to use all of them in Year in Review!)
    • Is it possible to include timestamps with the dates in the response? This would be for time-of-day calculations for a slide like "In 2025, you edited most in the evenings", etc. If it's possible to receive a local edit time, that would be good. Otherwise we will just have to assume all edits were made from the device's timezone at the time of slide creation.
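
As a rough illustration of the fallback described above (treating every edit as if it were made in the device's current timezone), the sketch below buckets UTC timestamps into parts of the day. The function names, the bucket boundaries, and the use of a fixed UTC offset are assumptions made for this example only; a real client would more likely use the platform's timezone APIs, which also handle daylight saving time.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def part_of_day(hour):
    """Map an hour (0-23) to a coarse bucket; the boundaries are arbitrary."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 22:
        return "evening"
    return "night"


def busiest_part_of_day(utc_timestamps, device_utc_offset_hours):
    """Bucket UTC edit timestamps using the device's current UTC offset."""
    device_tz = timezone(timedelta(hours=device_utc_offset_hours))
    counts = Counter()
    for ts in utc_timestamps:
        # Timestamps are expected in ISO 8601 UTC form, e.g. 2025-03-01T23:30:00Z.
        utc_time = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        counts[part_of_day(utc_time.astimezone(device_tz).hour)] += 1
    # Return the most frequent bucket, or None when there are no edits.
    return counts.most_common(1)[0][0] if counts else None
```

The same edits can land in different buckets depending on the offset applied, which is the inaccuracy the device-timezone assumption introduces for users who travel or edit from several locations.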
Relevant tasks
Relevant data

3% of freetext feedback asked for improved editing statistics:

  • Users got confused when editing stats did not cover the full year (edit views coming from the Growth Impact Module API, T376353) and assumed they were incorrect.
  • Users want to be reminded of which pages they edited
  • Users are interested to see graphs or lists about their edits
  • One user wanted specific edit count for year, not "500+ edits"

Estimated for a 12-month period, ~95% of iOS editors have fewer than 500 edits, and 99.6% have fewer than 10,000 (source). So solutions that cover the past calendar year (2025), and support up to 1000 edits work for iOS (TBD for Android), and solutions that support up to 10,000 edits and the past calendar year would cover the vast majority of editors.

Event Timeline

HNordeenWMF moved this task from Needs Triage to Up next on the Wikipedia-iOS-App-Backlog board.
HNordeenWMF added a subscriber: KStoller-WMF.
HNordeenWMF added a subscriber: ovasileva.
HNordeenWMF renamed this task from [Spike] Full-year editing stats for Year in Review through Impact Module API to [Spike] Full-year editing stats for Year in Review. May 21 2025, 5:07 PM

Growth recently ran into a performance blocker when it raised the limit of edits a user sees from 1000 to 10000 (T341599#10839555). This wasn't surprising: we knew that the Impact Module API wasn't meant to be consumed at such a scale when it was initially built; otherwise we would have designed and implemented it differently.

I'm proposing that we advocate for the work on T341649: Provide an easy way for MediaWiki to fetch aggregate data from the data lake, which would benefit more feature teams that want to access such aggregate data in a performant way. I can see that Data Platform Engineering has pulled this task into its essential work bucket (T341649#10744904) and that it is in their Up Next column.

Thank you, @HNordeenWMF and team, for creating this overview of the requirements!

+1 to what @DMburugu said: The proper, sustainable, and scalable solution for this is T341649: Provide an easy way for MediaWiki to fetch aggregate data from the data lake.

If that is not at all possible, then in principle a second system like the GrowthExperiments Impact module could be built that would collect the Impact data as required for YiR 2025. However, that would be a new system largely separate from what currently exists in Growth.
The current system in GrowthExperiments stores the calculated user-impact data for 2 (two!) days and then throws it away completely. GrowthExperiments then recalculates it from scratch if it is needed again. That is not a sensible basis for what is requested here.
So both how we collect the data (for a static past period vs. a rolling 60-day/X-edits window) and how we have to store it (stored once and never expiring vs. being continuously deleted and recreated) are very different from what exists today.
Creating such a system would be a substantial time investment, not something that we can do on the side. That means it would trade off against our other annual plan work and we would need a dedicated hypothesis around it. Something like "If a MediaWiki-based system is built in Q1 that can calculate and store the impact of all users that edited in that year, then the Apps teams can build a highly engaging Year in Review experience on top of that in Q2." And leadership would need to approve us (or whoever) spending engineering resources on that and less on our current Q1 engineering hypothesis.
To be clear: that would mean building another hacky single-purpose system on top of infrastructure that was very much not intended for this use case. This is strictly inferior to T341649.

A third option: Could you share more about how the data for Year in Review 2024 was collected? Maybe it might be more feasible to build on top of that approach for 2025, and Growth could consult on that work as needed?

Moving to Triaged on Growth's workboard for now, since we won't be prioritizing this unless leadership communicates that this is higher-priority than our current annual plan work.

A third option: Could you share more about how the data for Year in Review 2024 was collected? Maybe it might be more feasible to build on top of that approach for 2025, and Growth could consult on that work as needed?

@Michael for 2024, we were in a "scrappy experiment" mindset. We only did 2 editing insights, using what we could access:

  • Total number of edits (T376320), sourced from the user contributions API, https://www.mediawiki.org/wiki/API:Usercontribs. We ran into scaling concerns from other teams, so we worked around that by only calling the API once per user, and just getting the first 500 edits of their year.
  • Views on edits recently (T376353), sourced from the Growth Impact API endpoint. We said "recently", because this was only the past 60 days of their edit views.
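
For illustration, the single capped call described in the first bullet could look roughly like the sketch below. The `list=usercontribs` module and its `ucuser`, `ucdir`, `ucstart`, `ucend`, `uclimit`, and `ucprop` parameters are part of the MediaWiki Action API, but the exact parameters the app sent in 2024 are not stated in this task, so the values here (including the English Wikipedia endpoint and both helper names) are assumptions.

```python
from urllib.parse import urlencode

USERCONTRIBS_LIMIT = 500  # per-request cap for regular (non-bot) accounts


def usercontribs_url(username, year, api="https://en.wikipedia.org/w/api.php"):
    """Build one usercontribs request covering a user's edits in a calendar year."""
    params = {
        "action": "query",
        "format": "json",
        "list": "usercontribs",
        "ucuser": username,
        "ucdir": "newer",  # enumerate oldest first, so ucstart is the earlier bound
        "ucstart": f"{year}-01-01T00:00:00Z",
        "ucend": f"{year}-12-31T23:59:59Z",
        "uclimit": USERCONTRIBS_LIMIT,
        "ucprop": "timestamp",
    }
    return f"{api}?{urlencode(params)}"


def edit_count_label(response):
    """Return an exact count, or "500+" when the API signals more results."""
    contribs = response.get("query", {}).get("usercontribs", [])
    # A top-level "continue" key means the result set was truncated at the limit.
    if "continue" in response:
        return f"{USERCONTRIBS_LIMIT}+"
    return str(len(contribs))
```

Because the client stops after one request instead of following the continuation, any user past the cap can only be shown "500+", which is the limitation the 2025 wishlist asks to remove.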

Thanks! Maybe that is something that can be improved upon. If we succeed in rolling out T341599: Impact Module: improvements for former newcomers to all wikis, then you should hopefully be able to use the existing Growth Impact API endpoint to get the number of edits for 2025 for almost all of your users. Even more so if you accumulate that data and store it locally (which I would recommend anyway).

Improving the number of views would be trickier. We merely use the PageViewInfo extension for those ourselves as well, and that is limited to 60 days. I'm not familiar with that extension or why it has that limit. It was created by @Legoktm and https://www.mediawiki.org/wiki/Developers/Maintainers lists it as Unassigned.

All of that is to say: if we can get T341649 done instead, then that would be great! Because it would allow Growth to display better statistics too! (And also get rid of a lot of legacy code, which I'm also a big fan of.)

Regarding load on the Growth Impact API, I'm not sure. We don't have good data for this. But when did you roll out YIR 2024? If it was mid to end January, then that big blue spike was probably you:

image.png (343×604 px, 27 KB)

Which would mean only double the normal amount of recalculations to do (assuming that this data is correct, which I'm not sure of), and that does not sound too bad. Though, to be honest, our insight into the health of that endpoint is rather limited. I would hope that this amount of traffic should also be OK with a limit of 10,000 edits, but I would also feel better if we could think about mitigation strategies. (If we go this route this year at all, rather than T341649, which would be better.)