
Develop an insightful, data-driven clustering of wikis
Open, Needs Triage, Public


There is currently no standard framework for clustering Wikimedia's many wikis. Individual people can and do group them by their project family, or number of active editors, or language, but trying to account for more than a single dimension increases the difficulty exponentially. This makes it difficult to effectively grapple with the rampant diversity of our 824 public-facing wikis.

To address this difficulty, we should produce a set of standard, well-discussed wiki-clusters taking into account as many dimensions as possible.

Some work has already been done in the wikimedia-research/wiki-segmentation repo on GitHub: code in the "data-collection" folder generates an initial set of dimensions, while code in the "clustering-initial" folder generates a first-pass clustering.

Good steps for continuing this work would be:

  • Investigate ways to quantify the internal social dynamics among wiki contributors, since this is probably the biggest gap in the data already collected
  • Consider ideas from the "possible new dimensions" tab of the wiki segmentation spreadsheet
  • Eliminate wikis that are totally or mostly inactive from the clustering
  • Consider removing highly correlated dimensions from the clustering (possibly by converting absolute numbers to ratios; for example, converting "new active editors" to "new active editors per active editor").
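
The last two steps can be sketched in a few lines of code. This is only a rough illustration, not the code in the wiki-segmentation repo: the wiki names, metric values, the inactivity cutoff of 5 active editors, and the 0.9 correlation threshold are all made-up assumptions, and the helper names are hypothetical.

```python
from math import sqrt

# Hypothetical per-wiki metrics; names and numbers are invented for illustration.
wikis = {
    "wiki_a": {"active_editors": 1000, "new_active_editors": 100},
    "wiki_b": {"active_editors": 200, "new_active_editors": 30},
    "wiki_c": {"active_editors": 50, "new_active_editors": 4},
    "wiki_d": {"active_editors": 10, "new_active_editors": 2},
    "wiki_e": {"active_editors": 2, "new_active_editors": 0},
}


def drop_inactive(wikis, min_active_editors=5):
    """Keep only wikis at or above an (assumed) activity cutoff."""
    return {
        name: metrics
        for name, metrics in wikis.items()
        if metrics["active_editors"] >= min_active_editors
    }


def to_ratio(wikis):
    """Replace 'new active editors' with 'new active editors per active editor'."""
    converted = {}
    for name, metrics in wikis.items():
        metrics = dict(metrics)  # copy so the input is left untouched
        new_editors = metrics.pop("new_active_editors")
        metrics["new_active_editors_per_active_editor"] = (
            new_editors / metrics["active_editors"]
        )
        converted[name] = metrics
    return converted


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def correlated_pairs(wikis, threshold=0.9):
    """List pairs of dimensions whose absolute correlation meets the threshold."""
    dims = sorted(next(iter(wikis.values())))
    columns = {d: [metrics[d] for metrics in wikis.values()] for d in dims}
    return [
        (a, b)
        for i, a in enumerate(dims)
        for b in dims[i + 1:]
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]


active = drop_inactive(wikis)
raw_pairs = correlated_pairs(active)              # absolute counts
ratio_pairs = correlated_pairs(to_ratio(active))  # after converting to a ratio
```

With this toy data, the two absolute counts are flagged as a highly correlated pair, while the ratio version is not. Real data may behave differently, so the cutoff and threshold would need to be chosen by looking at the actual distributions.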

Key stakeholders

  • Audiences product managers
  • Design Research team
  • Trust and Safety team
  • Community Relations team

Event Timeline

Neil_P._Quinn_WMF triaged this task as High priority.
Neil_P._Quinn_WMF updated the task description.
MBinder_WMF reassigned this task from Neil_P._Quinn_WMF to mpopov.
MBinder_WMF moved this task from Triage to Next Up on the Product-Analytics board.

I've put together the results of the much, much clustering that I did into

We have a meeting scheduled to discuss these results, and then it'll be up to the rest of the folks to figure out which clustering they want to use until the data is iterated on and ready to be re-clustered (by me! :D)

@Neil_P._Quinn_WMF fair to call this task resolved?

mpopov moved this task from Next Up to Doing on the Product-Analytics board. Oct 17 2018, 2:12 PM
mpopov removed mpopov as the assignee of this task. Dec 10 2018, 2:41 PM
mpopov moved this task from Doing to Stalled on the Product-Analytics board.
mpopov added subscribers: kzimmerman, mpopov.

@kzimmerman, @Neil_P._Quinn_WMF, and I need to schedule a discussion with @MNovotny_WMF to follow up on the results.

Neil_P._Quinn_WMF claimed this task. Edited Dec 24 2018, 6:30 AM
Neil_P._Quinn_WMF moved this task from Stalled to Next Up on the Product-Analytics board.

@kzimmerman and I talked and came up with a tentative plan for me to take the lead on this again in the new year.

I want to do some simple tweaks to Mikhail's clustering (removing inactive wikis and fixing some broken data) and then organize that meeting to review the current clusters.

Neil_P._Quinn_WMF renamed this task from Construct and personify wiki clusters [segmentation phase 3] to Develop an insightful, data-driven clustering of wikis. Apr 19 2019, 1:31 AM
Neil_P._Quinn_WMF raised the priority of this task from High to Needs Triage.
Neil_P._Quinn_WMF updated the task description.
Neil_P._Quinn_WMF removed a subscriber: Tbayer.
Neil_P._Quinn_WMF removed Neil_P._Quinn_WMF as the assignee of this task. Apr 19 2019, 1:35 AM

I've updated this task to reflect the fact that we have decided to deprioritize it. We (sadly) won't be doing it during the current fiscal year, although we do hope to take it up at some point in the future when we have more capacity.