
Build statistics toolset to support WM-HU editor retention grant
Open, Needs Triage, Public


Wikimedia Hungary is kicking off a one-year project to improve editor retention on the Hungarian Wikipedia (supported by a WMF grant). See details here. Doing this effectively will require much more detailed statistics than are currently available. The task is to build a contributor statistics portal for the Hungarian Wikipedia that displays statistics and editor lists relevant to the program.

Functional requirements

The portal should provide the following information:

  • a "funnel" view of the Hungarian Wikipedia community: given a number of user categories (e.g. users with 1-10 edits, users with 100+ edits in the last 30 days, administrators, users who did more than 10 reviews in the last 30 days),
    • show the size of each group
    • show the transitions between the groups (the number of editors who moved from one group to another in some given time frame)
    • (stretch goal) show historic trends for these groups
  • (stretch goal) where it makes sense, the same statistics with the number of edits instead of the number of editors (e.g. the number of edits coming from editors with 1-10 edits).
  • lists of editors who are potential targets for intervention: those transitioning from one group to another (e.g. recently registered editors; editors who have recently stopped participating), those who reached some achievement with no one following up yet (e.g. recently registered editors who have not been welcomed yet, or editors who recently reached their 1000th edit and have not been congratulated yet), and those who had some negative interaction (e.g. first edit reverted)
    • (stretch goal) annotate lists with data pulled in from other sources (such as the ORES edit scoring service, or the review API) to identify users who are special in some way (e.g. well-intentioned but struggling with wiki syntax, or stuck in the review queue)
    • an API to expose this information in a machine-readable way
    • export in whatever format is convenient to the people who will follow up on these lists (e.g. wikitable or CSV)
  • top lists of editors who perform a certain task (e.g. administrative actions, edit reviews, template edits) plus ratio of the total amount of tasks they perform
    • an API to expose this information in a machine-readable way
    • export in whatever format is convenient to the people who will follow up on these lists (e.g. wikitable or CSV)
  • where it makes sense, support filtering / splitting results on manually provided username lists (this will be used to assess the effectiveness of interventions)
  • (stretch goal) a registration cohort view of the editor community: grouping users by the year or month they started editing,
    • show the relative size of each group
    • show historic trends for these groups
    • show retention rate over time (i.e. how many of the editors registered in year X are still active in year Y)
    • some combination of this with the groups from the funnel (
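
A minimal sketch of how group sizes and transitions between groups could be computed from two per-editor snapshots. Everything here is an assumption made for illustration: the snapshot field names (`edits_total`, `edits_30d`, `reviews_30d`), the `GROUPS` definitions, and the rule that each editor lands in the first matching group (the categories in the task may well overlap, e.g. an administrator with 100+ recent edits). How the snapshots are obtained from the database is left out.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical per-editor snapshot; the field names are illustrative only.
Snapshot = Dict[str, int]

# Ordered group definitions; the first matching predicate wins (an assumption).
GROUPS: List[Tuple[str, Callable[[Snapshot], bool]]] = [
    ("100+ edits in last 30 days", lambda s: s["edits_30d"] >= 100),
    ("more than 10 reviews in last 30 days", lambda s: s["reviews_30d"] > 10),
    ("1-10 edits", lambda s: 1 <= s["edits_total"] <= 10),
]


def classify(snapshot: Snapshot) -> str:
    """Return the name of the first group whose predicate matches."""
    for name, predicate in GROUPS:
        if predicate(snapshot):
            return name
    return "other"


def group_sizes(snapshots: Dict[str, Snapshot]) -> Counter:
    """Count how many editors fall into each group."""
    return Counter(classify(s) for s in snapshots.values())


def transitions(before: Dict[str, Snapshot], after: Dict[str, Snapshot]) -> Counter:
    """Count editors who moved from one group to another between snapshots.

    Editors missing from `before` are treated as newly registered.
    """
    moves: Counter = Counter()
    for user, snapshot in after.items():
        old = classify(before[user]) if user in before else "not registered"
        new = classify(snapshot)
        if old != new:
            moves[(old, new)] += 1
    return moves
```

Keeping the group definitions in a single ordered list means a funnel can be reconfigured by editing data rather than query code, which is in the spirit of the flexibility asked for in the architecture requirements.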

Architecture requirements

  • The portal is to be hosted on Toolforge (Wikimedia's platform-as-a-service). It should use the data from the replica of the wiki database and cache the results (and probably prime the cache with a periodic job; depends on how expensive the queries turn out to be.)
    • (Stretch goal: use data from the edit history reconstruction project. This contains information about the context of editor actions (e.g. how many edits did the user have when they made the given edit) but is currently not publicly available and is hosted on a dedicated set of servers, so you'll need to create a job that runs there, extracts the relevant information and makes it available for the portal. Most of the features above do not require this.)
  • The exact set of reports available on the portal will need to change frequently, even after the end of the internship, so it should be written in a flexible way where the building blocks of reports can be easily reconfigured.
  • The portal should be written with reuse on other wikis in mind: specifics of the database should be abstracted away to the extent possible, and it should support internationalization (translation, date/number formats etc)
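
The caching idea above (cache query results, and prime the cache from a periodic job if the queries turn out to be expensive) could look roughly like the sketch below. It is an in-memory illustration only: `ReportCache`, its method names, and the `compute` callables are hypothetical, and a real Toolforge deployment would more likely keep cached results in a shared store so that the web service and the periodic job see the same data.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ReportCache:
    """Tiny time-to-live cache for report results (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return a fresh cached value, or run `compute` and cache its result."""
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = compute()
        self._store[key] = (now, value)
        return value

    def prime(self, reports: Dict[str, Callable[[], Any]]) -> None:
        """Recompute every report up front; meant to be run by a periodic job."""
        for key, compute in reports.items():
            self._store[key] = (self._clock(), compute())
```

A request handler would call `get_or_compute("funnel", run_funnel_query)`, where `run_funnel_query` stands in for whatever function queries the database replica, while a scheduled job calls `prime(...)` with the same mapping so that visitors rarely wait for an expensive query.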

Event Timeline

@srishakatux, @Tgr I have been contributing to WikiEduDashboard for a while now and recently came across this task. Having read about it, it looks like a big project: according to the details page it is scheduled to run from March-19 to Feb-20. I would like to contribute to this project (either as part of GSoC 19 or outside of it). I'm not familiar with Hungarian, but I do meet the other 'Applicant Requirements' to get started!

Welcome @Hjhimanshu! I'm at a conference, will follow up over the weekend.

@Tgr, great! I would like to discuss the details on IRC, if possible.

@Hjhimanshu sorry for taking so long to respond, things got busier at WMCON than expected. Not considering that it's in the middle of the GSoC application period was a planning fail on my side :/

You are hjhimanshu01 on GitHub, I imagine?

My IRC nick is tgr on Freenode (or tgr_ or tgr|away sometimes, which means I probably won't respond for a while but will see the message eventually).

Do you have an idea what technology you would use for the project? Also, any ideas on what small bugs to fix (which is a part of the application process)? Since the project is about a tool which is yet to be written, we need to find bugs somewhere else; it's easier if it's something you are already familiar with or at least interested in. One option would be writing some Hungarian Wikipedia related SQL queries in our query runner, but it would be nice to find a web application related coding task as well.

@Tgr I would like to take this project up. I am proficient in Python. I would prefer to use either Flask or Django for this project.
As for the coding tasks, I think a prototype of the project might be a good starting point.
I would start with writing the proposal for this along with the quick prototype of this project.

@Tgr, yes, that's my GitHub profile. I'm also proficient with Python and Node as well as PHP. My contributions and bug fixes to a web app (WikiEduDashboard) include these, though they might not be very relevant here:


Also, I'm willing to contribute to the project for an extended period beyond GSoC 19, as the project runs for almost a year and its scope also includes a survey of existing editors and their ideas. So after GSoC and developing the portal on Toolforge, I would like to keep helping with that. I have read about the research conducted on possible solutions for retaining and growing the Hungarian wiki community; one of the article links is probably not up on the details page (
. I would like to discuss more beyond this, though, and will certainly ping you on IRC.

@Tgr, following up on our conversation: I think Kubernetes as the backend with Node as the scripting engine, and a front end of choice (possibly React?). Apart from that, PHP could be used along with Node, with Symfony as the templating engine? Any suggestions about which one to use?

Is this Task T223892 open to contribution?

Is there anything remaining in this task from GSoC'19? If not, then please consider marking it as resolved! If yes, and it would need another GSoC or volunteer help, then consider creating a new task with the leftover items. Thanks!

Some other possible metrics that might be worth considering:

  • standard WMF metrics such as activation and retention
  • mobile as a possible dimension to split things by
  • number of thanks received
  • revert rate (this used to be hard but the ongoing GSoC project T254074: Implement the reverted edit tag makes it quite simple)

The funnel view (the first module of the statistical toolset) has been implemented and is available for huwiki:
Adding other wikis and/or configuring different funnels is possible on request.

(Just to avoid confusion: this is being done now with the help of a grant, not as a GSoC project.)