There is currently no standard framework for clustering Wikimedia's many wikis. People can and do group them by project family, number of active editors, or language, but accounting for more than one dimension at a time quickly becomes much harder. This makes it difficult to grapple effectively with the diversity of our 824 public-facing wikis.
To address this difficulty, we should produce a set of standard, well-discussed wiki-clusters taking into account as many dimensions as possible.
Some work has already been done in the wikimedia-research/wiki-segmentation repo on GitHub: code in the "data-collection" folder generates an initial set of dimensions, while code in the "clustering-initial" folder generates a first-pass clustering.
Good steps for continuing this work would be:
- Investigate ways to quantify the internal social dynamics among wiki contributors, since this is probably the biggest gap in the data already collected
- Consider ideas from the "possible new dimensions" tab of the wiki segmentation spreadsheet
- Exclude wikis that are totally or mostly inactive from the clustering
- Consider removing highly correlated dimensions from the clustering, possibly by converting absolute numbers to ratios (for example, converting "new active editors" to "new active editors per active editor")
Teams relevant to this work:
- Audiences product managers
- Design Research team
- Trust and Safety team
- Community Relations team
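
As a rough illustration of two of the steps listed above (excluding inactive wikis, and replacing an absolute count with a ratio to reduce correlation between dimensions), here is a minimal Python sketch. It is not taken from the wiki-segmentation repo. The wiki names, the metric values, and both thresholds (`MIN_ACTIVE_EDITORS`, `MAX_ABS_CORRELATION`) are invented for illustration, and a real analysis would need to choose them deliberately.

```python
from math import sqrt

# Hypothetical per-wiki metrics. Names and numbers are made up for illustration.
wikis = {
    "wiki_a": {"active_editors": 1000, "new_active_editors": 100},
    "wiki_b": {"active_editors": 200, "new_active_editors": 40},
    "wiki_c": {"active_editors": 50, "new_active_editors": 5},
    "wiki_d": {"active_editors": 0, "new_active_editors": 0},
}

MIN_ACTIVE_EDITORS = 5      # assumed cutoff for "totally or mostly inactive"
MAX_ABS_CORRELATION = 0.9   # assumed cutoff for "highly correlated"


def drop_inactive(metrics, min_active_editors):
    """Keep only wikis with at least `min_active_editors` active editors."""
    return {
        name: m for name, m in metrics.items()
        if m["active_editors"] >= min_active_editors
    }


def add_ratio(metrics):
    """Add "new active editors per active editor" to each wiki's metrics.

    Assumes inactive wikis were already dropped, so the divisor is non-zero.
    """
    return {
        name: {
            **m,
            "new_active_editors_per_active_editor":
                m["new_active_editors"] / m["active_editors"],
        }
        for name, m in metrics.items()
    }


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def column(metrics, dimension):
    """Values of one dimension across all wikis, in a stable order."""
    return [metrics[name][dimension] for name in sorted(metrics)]


active = add_ratio(drop_inactive(wikis, MIN_ACTIVE_EDITORS))

# Correlation of the absolute count with the size dimension...
r_absolute = pearson(
    column(active, "active_editors"),
    column(active, "new_active_editors"),
)
# ...versus correlation of the ratio with the same size dimension.
r_ratio = pearson(
    column(active, "active_editors"),
    column(active, "new_active_editors_per_active_editor"),
)

print(f"absolute count vs. active editors: r = {r_absolute:.2f}")
print(f"ratio vs. active editors:          r = {r_ratio:.2f}")
```

In this toy data the absolute count tracks wiki size closely while the ratio does not, which is the motivation for the conversion. Whether the same holds for the real dimensions would have to be checked against the collected data.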