
Build statistics toolset to support WM-HU editor retention grant
Open, Needs Triage, Public

Description

Wikimedia Hungary is kicking off a one-year project to improve editor retention in the Hungarian Wikipedia (supported by a WMF grant). See details here. Doing this effectively will require much more detailed statistics than are currently available. The goal of this task is to build a contributor statistics portal for the Hungarian Wikipedia that displays statistics and editor lists relevant to the program.

Functional requirements

The portal should provide the following information:

  • a "funnel" view of the Hungarian Wikipedia community: given a number of user categories (e.g. users with 1-10 edits, users with 100+ edits in the last 30 days, administrators, users who did more than 10 reviews in the last 30 days),
    • show the size of each group
    • show the transitions between the groups (the number of editors who moved from one group to another in some given time frame)
    • (stretch goal) show historic trends for these groups
  • (stretch goal) where it makes sense, the same statistics with the number of edits instead of the number of editors (number of edits coming from editors with 1-10 edits etc).
  • lists of editors who are potential targets for intervention: editors transitioning from one group to another (e.g. recently registered editors; editors who have recently stopped participating), editors who reached some milestone that no one has followed up on yet (e.g. recently registered editors who have not been welcomed yet, or editors who recently reached their 1000th edit and have not been congratulated yet), and editors who had some negative interaction (e.g. first edit reverted)
    • (stretch goal) annotate lists with data pulled in from other sources (such as the ORES edit scoring service, or the review API) to identify users who are special in some way (e.g. well-intentioned but struggling with wiki syntax, or stuck in the review queue)
    • an API to expose this information in a machine-readable way
    • export in whatever format is convenient to the people who will follow up on these lists (e.g. wikitable or CSV)
  • top lists of editors who perform a certain task (e.g. administrative actions, edit reviews, template edits), plus the share of all such tasks that each of them performs
    • an API to expose this information in a machine-readable way
    • export in whatever format is convenient to the people who will follow up on these lists (e.g. wikitable or CSV)
  • where it makes sense, support filtering / splitting results on manually provided username lists (this will be used to assess the effectiveness of interventions)
  • (stretch goal) a registration cohort view of the editor community: grouping users by the year or month they started editing,
    • show the relative size of each group
    • show historic trends for these groups
    • show retention rate over time (i.e. how many of the editors registered in year X are still active in year Y)
    • some combination of this with the groups from the funnel (
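To make the funnel requirement more concrete, here is a minimal sketch of how group sizes and transitions between two snapshots could be computed. It is only an illustration: the group names, the thresholds, the `UserStats` shape and the sample users are all assumptions made up for this example, and a real implementation would read the per-user numbers from the database replica rather than from in-memory dictionaries.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical per-user snapshot: total edit count and edits in the last 30 days.
UserStats = Dict[str, int]

# Ordered (group name, predicate) pairs; the first matching group wins.
# Names and thresholds are illustrative, loosely based on the examples above.
GROUPS: List[Tuple[str, Callable[[UserStats], bool]]] = [
    ("very_active", lambda u: u["edits_last_30d"] >= 100),
    ("newcomer", lambda u: 1 <= u["total_edits"] <= 10),
    ("other", lambda u: True),
]


def classify(user: UserStats) -> str:
    """Return the name of the first group whose predicate matches."""
    for name, predicate in GROUPS:
        if predicate(user):
            return name
    raise ValueError("no group matched")  # unreachable while "other" matches all


def group_sizes(snapshot: Dict[str, UserStats]) -> Counter:
    """Count the users in each group for one snapshot."""
    return Counter(classify(stats) for stats in snapshot.values())


def transitions(before: Dict[str, UserStats], after: Dict[str, UserStats]) -> Counter:
    """Count (old group, new group) pairs for users whose group changed."""
    moves: Counter = Counter()
    for username, old_stats in before.items():
        if username not in after:
            continue  # user missing from the later snapshot; skipped in this sketch
        old_group, new_group = classify(old_stats), classify(after[username])
        if old_group != new_group:
            moves[(old_group, new_group)] += 1
    return moves


# Made-up sample data for two points in time.
before = {
    "Alice": {"total_edits": 5, "edits_last_30d": 5},
    "Bob": {"total_edits": 500, "edits_last_30d": 120},
    "Carol": {"total_edits": 40, "edits_last_30d": 2},
}
after = {
    "Alice": {"total_edits": 30, "edits_last_30d": 25},
    "Bob": {"total_edits": 620, "edits_last_30d": 120},
    "Carol": {"total_edits": 40, "edits_last_30d": 0},
}

print(group_sizes(before))
print(transitions(before, after))
```

Keeping the groups as an ordered list of predicates means a new category can be added without touching the counting code, which may help with the reconfigurability requirement below.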

Architecture requirements

  • The portal is to be hosted on Toolforge (Wikimedia's platform-as-a-service). It should use the data from the replica of the wiki database and cache the results (and probably prime the cache with a periodic job; depends on how expensive the queries turn out to be.)
    • (Stretch goal: use data from the edit history reconstruction project. This contains information about the context of editor actions (e.g. how many edits did the user have when they made the given edit) but is currently not publicly available and is hosted on a dedicated set of servers, so you'll need to create a job that runs there, extracts the relevant information and makes it available for the portal. Most of the features above do not require this.)
  • The exact set of reports available on the portal will need to be changed frequently even after the end of the internship so it should be written in a flexible way where the building blocks of reports can be easily reconfigured.
  • The portal should be written with reuse on other wikis in mind: specifics of the database should be abstracted away to the extent possible, and it should support internationalization (translation, date/number formats etc)
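As a rough illustration of the "reconfigurable building blocks" and caching ideas above, the sketch below defines reports as plain configuration entries and runs them through a small caching wrapper. Everything here is an assumption made for the example: the `REPORTS` mapping, the `ReportRunner` class, and the `demo_user` table are hypothetical, and an in-memory SQLite database stands in for the wiki database replica (the real schema and connection setup on Toolforge will differ).

```python
import sqlite3
import time
from typing import Any, Dict, List, Tuple

# Hypothetical report definitions: changing the set of reports means editing
# this mapping. Table and column names are made up, not the MediaWiki schema.
REPORTS: Dict[str, Dict[str, Any]] = {
    "editors_by_min_edits": {
        "sql": (
            "SELECT name, edit_count FROM demo_user "
            "WHERE edit_count >= :min_edits ORDER BY edit_count DESC"
        ),
        "defaults": {"min_edits": 100},
        "ttl_seconds": 3600,
    },
}


class ReportRunner:
    """Runs configured reports and caches results per (report, parameters)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self._cache: Dict[Tuple[str, Tuple], Tuple[float, List[tuple]]] = {}

    def run(self, name: str, **params: Any) -> List[tuple]:
        report = REPORTS[name]
        merged = {**report["defaults"], **params}
        key = (name, tuple(sorted(merged.items())))
        now = time.monotonic()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < report["ttl_seconds"]:
            return cached[1]  # still fresh, skip the (possibly expensive) query
        rows = self.conn.execute(report["sql"], merged).fetchall()
        self._cache[key] = (now, rows)
        return rows


def make_demo_connection() -> sqlite3.Connection:
    """Create an in-memory stand-in for the replica with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demo_user (name TEXT, edit_count INTEGER)")
    conn.executemany(
        "INSERT INTO demo_user VALUES (?, ?)",
        [("Alice", 5), ("Bob", 500), ("Carol", 150)],
    )
    return conn


runner = ReportRunner(make_demo_connection())
print(runner.run("editors_by_min_edits"))
print(runner.run("editors_by_min_edits", min_edits=1))
```

A periodic job could call `run` for the default parameters of every configured report to prime the cache; in a real deployment the cache would more likely live in a shared store than in process memory.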

Applicant requirements

  • familiarity with PHP (preferred), Python or Node.js
  • familiarity with SQL
  • familiarity with MediaWiki's API and/or database schema is a plus but not required
  • familiarity with Hungarian Wikipedia and being able to speak Hungarian are a plus but not required

GSoC information

Generic information: Google guide, Wikimedia guide, GSoC homepage
Primary mentor: @Tgr (mainly focusing on technology) - contact info
Secondary mentor: @Samat (mainly focusing on product requirements)

Event Timeline

Tgr created this task. Mar 14 2019, 2:31 AM
Restricted Application added a subscriber: Aklapper. Mar 14 2019, 2:31 AM
Tgr updated the task description.
Tgr moved this task from Backlog to Huwiki on the User-Tgr board. Mar 18 2019, 7:33 AM
Samat added a subscriber: Samat. Mar 18 2019, 3:12 PM
Samat updated the task description. Mar 18 2019, 3:16 PM

@srishakatux, @Tgr I have been contributing to WikiEduDashboard for a while now and recently stumbled on this task. I read about it, and it seems like a big project: according to the details page it is scheduled to run from March-19 to Feb-20. I would like to contribute to this project (either as part of GSoC 19 or outside of it). I'm not familiar with Hungarian, but I do meet the other 'Applicant requirements' to get started!

Tgr added a comment. Mar 29 2019, 10:55 PM

Welcome @Hjhimanshu! I'm at a conference, will follow up over the weekend.

Hjhimanshu added a comment. Edited Mar 31 2019, 5:03 AM

@Tgr, great, I would like to discuss the details on IRC, if possible.

Tgr added a comment. Apr 2 2019, 7:19 PM

@Hjhimanshu sorry for taking so long to respond, things got busier at WMCON than expected. Not considering that it's in the middle of the GSoC application period was a planning fail on my side :/

You are hjhimanshu01 on GitHub, I imagine?

My IRC nick is tgr on Freenode (or tgr_ or tgr|away sometimes, which means I probably won't respond for a while but will see the message eventually).

Do you have an idea what technology you would use for the project? Also, any ideas on what small bugs to fix (which is a part of the application process)? Since the project is about a tool which is yet to be written, we need to find bugs somewhere else; it's easier if it's something you are already familiar with or at least interested in. One option would be writing some Hungarian Wikipedia related SQL queries in our query runner, but it would be nice to find a web application related coding task as well.

Usmanmuhd added a subscriber: Usmanmuhd. Edited Apr 3 2019, 1:48 AM

@Tgr I would like to take this project up. I am proficient in Python. I would prefer to use either Flask or Django for this project.
As for the coding tasks, I think a prototype of the project might be a good starting point.
I would start with writing the proposal for this along with the quick prototype of this project.

@Tgr, exactly, that's my GitHub profile. I'm also proficient with Python and Node as well as PHP. My contributions and bug fixes to a web app (WikiEduDashboard) include the following, though they might not be very relevant here:

  1. https://github.com/WikiEducationFoundation/WikiEduDashboard/pull/2641
  2. https://github.com/WikiEducationFoundation/WikiEduDashboard/pull/2640
  3. https://github.com/WikiEducationFoundation/WikiEduDashboard/pull/2604
  4. https://github.com/WikiEducationFoundation/WikiEduDashboard/pull/2550

I'm also willing to contribute to the project for an extended period beyond GSoC 19, as this project runs for almost a year and its scope also includes a survey of the existing editors and their ideas. So after GSoC and after developing a portal on Toolforge, I would like to extend my help with that. I have read about the research conducted on possible solutions for retaining and growing the Hungarian wiki community; one of the article links is probably not up on the details page (https://meta.wikimedia.org/wiki/Research:Alternative_life_cycles_of_new_users). I would like to discuss more beyond this and will certainly ping you on IRC.

@Tgr, following up on the conversation: I think Kubernetes as the backend with Node as the scripting engine, and a front end of choice (possibly React?). Alternatively, PHP could be used along with Node, with Symfony as the templating engine? Any suggestions about which one to use?

Hsync7 added a subscriber: Hsync7. Apr 9 2019, 6:36 PM

@Tgr
Is this Task T223892 open to contribution?

putnik added a subscriber: putnik. Sun, Aug 18, 3:55 PM