
Develop an insightful, data-driven clustering of wikis
Open, Low, Public

Description

There is currently no standard framework for clustering Wikimedia's many wikis. Individual people can and do group them by project family, number of active editors, or language, but accounting for more than one dimension at a time quickly becomes much harder. This makes it difficult to grapple effectively with the wide diversity of our 824 public-facing wikis.

To address this difficulty, we should produce a set of standard, well-discussed wiki-clusters taking into account as many dimensions as possible.

Some work has already been done in the wikimedia-research/wiki-segmentation repo on GitHub: code in the "data-collection" folder generates an initial set of dimensions, while code in the "clustering-initial" folder generates a first-pass clustering.

Good steps for continuing this work would be:

  • Investigate ways to quantify the internal social dynamics among wiki contributors, since this is probably the biggest gap in the data already collected
  • Consider ideas from the "possible new dimensions" tab of the wiki segmentation spreadsheet
  • Eliminate wikis that are totally or mostly inactive from the clustering
  • Consider removing highly correlated dimensions from the clustering (possibly by converting absolute numbers to ratios; for example, converting "new active editors" to "new active editors per active editor").
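
As a rough illustration of the last two steps above (dropping inactive wikis and converting an absolute count to a ratio), here is a minimal sketch in plain Python. It is not the code from the wiki-segmentation repo: the wiki names, the numbers, the `MIN_ACTIVE_EDITORS` cutoff, and the helper names are all made up for the example. The idea is only to show how a ratio can be much less correlated with "active editors" than the raw count is.

```python
from math import sqrt

# Hypothetical per-wiki metrics; names and numbers are invented for illustration.
wikis = {
    "wiki_a": {"active_editors": 4000, "new_active_editors": 400},
    "wiki_b": {"active_editors": 1000, "new_active_editors": 150},
    "wiki_c": {"active_editors": 200, "new_active_editors": 10},
    "wiki_d": {"active_editors": 50, "new_active_editors": 10},
    "wiki_e": {"active_editors": 0, "new_active_editors": 0},  # inactive
}

# Assumed cutoff for "totally or mostly inactive"; the real threshold is undecided.
MIN_ACTIVE_EDITORS = 5


def drop_inactive(metrics, min_active=MIN_ACTIVE_EDITORS):
    """Keep only wikis with at least `min_active` active editors."""
    return {
        name: m for name, m in metrics.items()
        if m["active_editors"] >= min_active
    }


def add_ratio(metrics):
    """Add "new active editors per active editor" to each wiki's metrics.

    Assumes inactive wikis were already dropped, so the denominator is nonzero.
    """
    return {
        name: {**m, "new_per_active": m["new_active_editors"] / m["active_editors"]}
        for name, m in metrics.items()
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


active = add_ratio(drop_inactive(wikis))

editors = [m["active_editors"] for m in active.values()]
new_abs = [m["new_active_editors"] for m in active.values()]
new_ratio = [m["new_per_active"] for m in active.values()]

# Correlation of "active editors" with the absolute count vs. with the ratio.
r_absolute = pearson(editors, new_abs)
r_ratio = pearson(editors, new_ratio)
print(f"absolute: {r_absolute:.2f}, ratio: {r_ratio:.2f}")
```

With this toy data the absolute count tracks wiki size closely while the ratio does not, which is the motivation for the conversion. Real data may behave differently, so the correlations would need to be checked on the actual dimensions.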

Key stakeholders

  • Audiences product managers
  • Design Research team
  • Trust and Safety team
  • Community Relations team

Event Timeline

nshahquinn-wmf created this task.
nshahquinn-wmf updated the task description. (Show Details)
MBinder_WMF removed a project: Epic.
MBinder_WMF moved this task from Triage to Next Up on the Product-Analytics board.

I've put together the results of the much, much clustering that I did into https://github.com/wikimedia-research/wiki-segmentation/tree/master/clustering-initial/deliverable

We have a meeting scheduled to discuss these results and then it'll be up to the rest of the folks to figure out which clustering they want to use until the data is iterated on and ready to be re-clustered (by me! :D)

@Neil_P._Quinn_WMF fair to call this task resolved?

mpopov moved this task from Doing to Blocked on the Product-Analytics board.
mpopov added subscribers: kzimmerman, mpopov.

@kzimmerman, @Neil_P._Quinn_WMF, and I need to schedule a discussion with @MNovotny_WMF to follow up on the results

nshahquinn-wmf moved this task from Blocked to Next Up on the Product-Analytics board.

@kzimmerman and I talked and came up with a tentative plan for me to take the lead on this again in the new year.

I want to do some simple tweaks to Mikhail's clustering (removing inactive wikis and fixing some broken data) and then organize that meeting to review the current clusters.

nshahquinn-wmf renamed this task from Construct and personify wiki clusters [segmentation phase 3] to Develop an insightful, data-driven clustering of wikis. (Apr 19 2019, 1:31 AM)
nshahquinn-wmf raised the priority of this task from High to Needs Triage.
nshahquinn-wmf updated the task description. (Show Details)
nshahquinn-wmf removed a subscriber: Tbayer.
nshahquinn-wmf moved this task from Next Up to Icebox on the Product-Analytics board.

I've updated this task to reflect the fact that we have decided to deprioritize it. We (sadly) won't be doing it during the current fiscal year, although we do hope to take it up at some point in the future when we have more capacity.

RHo subscribed.

Removing inactive project tag.

mpopov added a subscriber: OSefu-WMF.

Declining this task for now.

@OSefu-WMF: For what it's worth, I think this is a good idea that Movement-Insights may want to pick up at some point (maybe 1-2 FYs from now?) to enhance high-level metric reporting. It could also benefit trend/pattern extraction & forecasting, but it would need a new set of stakeholders (since the main ones have all left the organization).

This task has quite a bit of useful information for the distant day when we do this 😅