Create "EditCountByMonth" method
Open, LowPublic

Description

As part of the QuickSurvey enhancement, we need a method that returns the edit count of a specific user by month. The discussion behind this requirement is captured in this document: Discussion notes on Quick Survey auditing enhancement.

AC:

  • Create a new public method that fetches edit counts by month
  • The endpoint will NOT accept date ranges
  • The endpoint will return the last 2 years plus the months of the current year so far (for example, right now it would return 2022, 2023, and 2024, i.e. from this year - 2 onward)
  • The endpoint needs to support caching (this can be done across multiple patches to streamline the process and make review easier)
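
To make the window in the acceptance criteria concrete, here is a minimal Python sketch of the bucketing logic only. It is not MediaWiki code: the function names, the "YYYY-MM" key format, and the choice to include months with zero edits are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def month_keys(now: datetime) -> list[str]:
    """List every 'YYYY-MM' key from January of (now.year - 2) through now's month."""
    keys = []
    for year in range(now.year - 2, now.year + 1):
        # Past years contribute all 12 months; the current year stops at the current month.
        last_month = now.month if year == now.year else 12
        for month in range(1, last_month + 1):
            keys.append(f"{year:04d}-{month:02d}")
    return keys


def edit_count_by_month(edit_times: list[datetime], now: datetime) -> dict[str, int]:
    """Count edits per month inside the window; edits outside the window are ignored."""
    counts = Counter(t.strftime("%Y-%m") for t in edit_times)
    # Assumption: months without edits are reported as 0 rather than omitted.
    return {key: counts.get(key, 0) for key in month_keys(now)}
```

Called with `now` in May 2024, this yields 29 keys, from "2022-01" through "2024-05", which matches the "this year - 2" rule above.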

Event Timeline

Amdrel changed the task status from Open to In Progress. May 16 2024, 7:23 PM
Amdrel removed Amdrel as the assignee of this task.
Amdrel claimed this task.
Amdrel subscribed.

Hi @Tgr,

If I am not mistaken, Sai Suman already informed you of my need to ask for some information regarding caching.

Context
We are adding a method to the userEditTracker that returns a user's edits by month. QuickSurvey will use this to sample its users, and we think caching is a good idea to reduce the load on the DB server, since the query is light but still complex.

I am looking for some guidance on the caching strategy that we should adopt. We had an idea to follow the current implementation provided by the "EditUserCount", but that is just an INT per user, so I'm not sure whether returning an object changes things.

For each user, we will fetch a MAX of 2 years, and the output looks something like the following image:

image.png (270×576 px, 30 KB)

Things to consider:

  • Our cache will have to be updated when a user edits an article (similar to what the editUserCount does)
  • Our cache will need to be invalidated or updated at the start of every new month.
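
One possible way to satisfy both considerations is sketched below in Python, using a plain dict as a stand-in for the real cache backend. Everything here is hypothetical (the class name, the `loader` callable, the key layout); it only illustrates the idea of putting the current month into the cache key so that a new month is automatically a cache miss, and of bumping the current month's bucket when an edit is recorded.

```python
from datetime import datetime


class MonthlyEditCountCache:
    """Toy in-memory cache; a stand-in for whatever cache backend is chosen."""

    def __init__(self, loader):
        # Hypothetical callable: (user_id, now) -> dict mapping 'YYYY-MM' to a count.
        self._loader = loader
        self._store = {}

    @staticmethod
    def _key(user_id: int, now: datetime) -> tuple[int, str]:
        # The current month is part of the key, so the first lookup in a new
        # month misses the cache and triggers a fresh load.
        return (user_id, now.strftime("%Y-%m"))

    def get(self, user_id: int, now: datetime) -> dict[str, int]:
        key = self._key(user_id, now)
        if key not in self._store:
            self._store[key] = dict(self._loader(user_id, now))
        # Return a copy so callers cannot mutate the cached entry.
        return dict(self._store[key])

    def record_edit(self, user_id: int, now: datetime) -> None:
        """Bump the current month's bucket if the user already has a cached entry."""
        entry = self._store.get(self._key(user_id, now))
        if entry is not None:
            month = now.strftime("%Y-%m")
            entry[month] = entry.get(month, 0) + 1
```

This toy version never evicts entries keyed by past months; a real cache would presumably rely on a TTL or eviction policy for that.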

Our questions and assumptions

  • Do you have examples of similar implementations of what we are trying to achieve?
  • Are we ok to re-use what is currently available in the EditUserCount? (the overall architecture)
  • EditUserCount uses a mix of "in memory" cache and changes to the "user record". At this stage (and due to our data) I do not think we will be able to store things in the user record as well, so we will probably rely on in-memory caching alone, and this can get messy due to the size of our data. What do you suggest?

I currently live in the UK but work long hours, so I'm happy to jump on a call if that is easier for you.

Hi, sorry for the slow answer. (Feel free to ping me on Slack for more timely responses.)

I thought we had monthly edit counts in English Wikipedia until a couple years ago, but can't find any trace of it. Maybe someone in WMF Product Analytics remembers?
The GrowthExperiments extension tracks daily edits, but it uses its own DB table to cache the data.

Whether you return an int or a smallish object doesn't really make any difference. But the total edit count is stored in the DB (so it's a primary key lookup), whereas for a per-month count you need to scan all the revision rows belonging to the user, which in edge cases could be hundreds of thousands of rows, so query performance on a cache miss is very different. I think the most feasible approach is to limit the query to (say) 1000 edits. This means all users with 1000+ edits in the last 2 years are in the same bucket.
(Note this only works if your query matches a DB index, presumably rev_actor_timestamp. If you e.g. limit the query to main namespace articles, then a LIMIT clause won't necessarily limit the number of rows scanned and things get more difficult.)
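
As a rough illustration of the capped query, here is a Python sketch against a toy SQLite table. The two-column schema and the index definition are simplifications assumed for the example (the real revision table and its rev_actor_timestamp index have more columns), and the function names are hypothetical. Timestamps are assumed to be 14-character YYYYMMDDHHMMSS strings.

```python
import sqlite3
from collections import Counter

EDIT_CAP = 1000  # the "(say) 1000 edits" limit suggested above


def make_toy_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the revision table (simplified schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revision (rev_actor INTEGER, rev_timestamp TEXT)")
    # Toy counterpart of an actor + timestamp index, so the WHERE and ORDER BY
    # below can be served by a single index range scan.
    conn.execute(
        "CREATE INDEX rev_actor_timestamp ON revision (rev_actor, rev_timestamp)"
    )
    return conn


def capped_monthly_counts(conn, actor_id: int, since: str, cap: int = EDIT_CAP):
    """Return (counts per 'YYYY-MM', capped flag) for the newest `cap` edits."""
    rows = conn.execute(
        "SELECT rev_timestamp FROM revision "
        "WHERE rev_actor = ? AND rev_timestamp >= ? "
        "ORDER BY rev_timestamp DESC LIMIT ?",
        (actor_id, since, cap),
    ).fetchall()
    counts = Counter(f"{ts[:4]}-{ts[4:6]}" for (ts,) in rows)
    # If the cap was reached, older months may be undercounted, so the caller
    # should treat the user as belonging to the "cap or more edits" bucket.
    return dict(counts), len(rows) >= cap
```

Returning the `capped` flag alongside the counts lets the caller tell an exact result apart from a truncated one.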

Are we ok to re-use what is currently available in the EditUserCount? (the overall architecture)

I am not sure I understand the question. This code would live in QuickSurvey, right?

SimoneThisDot changed the task status from In Progress to Open. Jun 3 2024, 6:39 AM

This update has been parked for now and will be reworked in the future when more time is allocated to the project.

Jdlrobson-WMF added a project: QuickSurveys.
Jdlrobson-WMF moved this task from Not triaged to Next on the QuickSurveys board.
Jdlrobson-WMF subscribed.

I think this API would be useful, but it shouldn't be exclusive to QuickSurveys, as it could potentially be used elsewhere. It could be added to MediaWiki core, for example. Once that API exists, you can use the API described in https://www.mediawiki.org/wiki/Extension:QuickSurveys#Advanced_audience_targetting