
Measure equity impact of current recommender systems
Closed, Resolved · Public

Description

Overview

Goal: for at least two Wikimedia content recommender systems, do a full pipeline analysis of their impact on content equity.

Details / Definitions

Wikimedia content recommender systems: I have been compiling a table of recommender systems here. Initially I will focus on recommender systems where analysis is feasible as I develop the analysis process, though eventually this will hopefully be applied to all of the recommender systems.

Content equity: content refers to articles on Wikipedia, images on Commons, items on Wikidata, etc. There are many ways to define gaps / equity (see taxonomy), but for this work I focus on two: gender and geography. The simplest metrics for these are:

  • Gender: % of biographies that are about men vs. women or non-binary genders. This is most directly established by linking a piece of content to its Wikidata item, checking that its instance-of (P31) value is human (Q5), and then recording its sex-or-gender (P21) value.
  • Geography: distribution of content by relevant country. There are a variety of ways to link content to a country; the most direct is again via Wikidata, using one of several geography-related properties (example). A minimal sketch of both lookups follows this list.
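
For illustration, a minimal sketch of both lookups against the Wikidata Query Service. The endpoint and the schema:about sitelink pattern are standard WDQS usage, but the helper name is hypothetical and country-of-citizenship (P27) merely stands in for the fuller set of geography-related properties:

```
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
HEADERS = {"User-Agent": "equity-pipeline-sketch/0.1 (research)"}

def gender_and_country(title):
    """Hypothetical helper: for an English Wikipedia article title, resolve
    its Wikidata item via the sitelink, keep it only if it is a human
    (P31 = Q5), and return its sex-or-gender (P21) and, where available,
    country-of-citizenship (P27) labels."""
    query = """
    SELECT ?genderLabel ?countryLabel WHERE {
      ?article schema:about ?item ;
               schema:isPartOf <https://en.wikipedia.org/> ;
               schema:name "%s"@en .
      ?item wdt:P31 wd:Q5 ;
            wdt:P21 ?gender .
      OPTIONAL { ?item wdt:P27 ?country . }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }""" % title
    r = requests.get(SPARQL_ENDPOINT,
                     params={"query": query, "format": "json"},
                     headers=HEADERS)
    r.raise_for_status()
    rows = r.json()["results"]["bindings"]
    return [(row["genderLabel"]["value"],
             row.get("countryLabel", {}).get("value"))
            for row in rows]

print(gender_and_country("Ada Lovelace"))  # -> list of (gender, country) pairs
```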

Full pipeline analysis: study content equity at several stages:

  • Baseline: current distribution of content in the project -- e.g., % of biographies on English Wikipedia by gender.
  • Candidates: distribution of content for items eligible for recommendations -- e.g., if a recommender system only recommends stub articles, this might be % of stub biographies on English Wikipedia by gender. Note that this may be the same as baseline in many cases.
  • Recommendations: distribution of content among the articles/items/images actually recommended to users -- e.g., if the recommender system ranks content by pageviews, this might look at the % of biographies by gender for the top-k candidates. Note that this may be the same as candidates if the recommender system serves up random content and doesn't apply any ranking criteria on top.
  • Edits: distribution of content that is actually edited via the recommender system. Editors don't necessarily follow all recommendations and, e.g., might introduce bias towards one type of biography. A sketch of the stage-by-stage comparison follows this list.
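
To make the stage-by-stage comparison concrete, here is a minimal sketch with purely illustrative numbers; in practice the labels would come from the Wikidata lookups sketched above:

```
from collections import Counter

def distribution(labels):
    """Turn a list of category labels into a normalized distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Illustrative gender labels for biographies at each pipeline stage.
stages = {
    "baseline":        ["male"] * 81 + ["female"] * 18 + ["non-binary"] * 1,
    "candidates":      ["male"] * 83 + ["female"] * 16 + ["non-binary"] * 1,
    "recommendations": ["male"] * 85 + ["female"] * 15,
    "edits":           ["male"] * 86 + ["female"] * 14,
}

baseline = distribution(stages["baseline"])
for stage, labels in stages.items():
    dist = distribution(labels)
    # Positive delta = this stage over-represents the group vs. baseline.
    deltas = {g: dist.get(g, 0.0) - p for g, p in baseline.items()}
    print(f"{stage:15s}", {g: f"{d:+.1%}" for g, d in deltas.items()})
```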

Impact on content equity: the impact part of this is tough. For many projects, it's abundantly clear that there is a bias towards men and towards North America / Europe on wiki. For this project, I won't explicitly define desired end-states for gender / geography; instead, I'll measure positive impact on equity as an increase in diversity -- i.e., a more uniform distribution of gender in biographies and of content across geography. Given that this is an analysis of a moment in time (as opposed to a strategy for future recommender systems), this should be sufficient. At each stage in the pipeline above, the content distribution will be compared with the baseline to see whether it pushes the project towards a more or less diverse distribution of content for the aspects of equity being analyzed.
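
One simple way to operationalize "more uniform" is normalized Shannon entropy (1.0 = perfectly uniform over the observed categories). The metric choice and the numbers below are assumptions of this sketch, not part of the task definition:

```
import math

def normalized_entropy(dist):
    """Shannon entropy of a distribution, scaled to [0, 1]; 1.0 = uniform."""
    probs = [p for p in dist.values() if p > 0]
    if len(probs) <= 1:
        return 0.0
    return -sum(p * math.log(p) for p in probs) / math.log(len(dist))

# Illustrative numbers: an edits-stage distribution slightly more uniform
# than the baseline would count as a positive impact on equity here.
baseline = {"male": 0.81, "female": 0.18, "non-binary": 0.01}
edits    = {"male": 0.78, "female": 0.21, "non-binary": 0.01}
print(normalized_entropy(baseline))  # ~0.48
print(normalized_entropy(edits))     # ~0.52 -> more diverse than baseline
```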

Results

I will document progress here: https://meta.wikimedia.org/wiki/Research:Prioritization_of_Wikipedia_Articles/Recommendation

Event Timeline

weekly update: no progress, though the start of the Outreachy project on a country classifier for articles will help greatly with the geographic-equity component of this work (T263646)

weekly update:

  • wrote up a notebook for gathering data on how the Suggested Edits module has actually been used. This will complement the analysis of what types of recommendations are made and indicate whether there is any bias towards skipping recommendations along gender / geography lines. For instance, since May 2020 (v4 of Suggested Edits), there have been 28,331 edits made via the module to images that have associated Wikipedia articles (and therefore I can directly infer the gender / geography associated with those images; see the sketch after this list)
  • TODO: read through results from Growth experiments to help guide impact analysis of that module
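
As a reference for tying edited images back to articles, a minimal sketch assuming the standard MediaWiki API on Commons with the GlobalUsage extension; the helper name and example file are hypothetical, and gender / geography then follow from the Wikidata lookup sketched in the task description:

```
import requests

COMMONS_API = "https://commons.wikimedia.org/w/api.php"
HEADERS = {"User-Agent": "equity-pipeline-sketch/0.1 (research)"}

def wikipedia_articles_using(filename):
    """Hypothetical helper: Wikipedia pages embedding a Commons file,
    found via prop=globalusage (GlobalUsage extension)."""
    params = {
        "action": "query",
        "format": "json",
        "titles": f"File:{filename}",
        "prop": "globalusage",
        "gulimit": "max",
    }
    r = requests.get(COMMONS_API, params=params, headers=HEADERS)
    r.raise_for_status()
    pages = r.json()["query"]["pages"]
    return [(u["wiki"], u["title"])
            for page in pages.values()
            for u in page.get("globalusage", [])
            if u["wiki"].endswith("wikipedia.org")]

print(wikipedia_articles_using("Example.jpg"))
```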

weekly update:

  • Equity-impact analysis complete for Suggested Edits. Summary:
The analyses demonstrated that relying largely on a random selection of content for recommendation reinforces the status quo around gender and geography -- i.e., a heavy imbalance towards men, the United States, and the United Kingdom -- and therefore the net effect of the recommender is to improve content about men more than content about women or other gender identities, and content about the US/UK more than content about other regions. Exactly which regions are improved depends heavily on language -- i.e., the US/UK for English Wikipedia but Japan for Japanese Wikipedia or Germany for German Wikipedia -- but the trend remains that editors do not themselves seem to exert additional selection bias over the recommendations. Analogously, the gender associated with the recommended content does not seem to affect whether editors choose to make an edit or not.
  • I would still like to repeat this analysis on another recommender system. I'll likely go with Newcomer tasks because there is good data collection, it has been used quite a bit in a number of languages beyond English, and it has the added variables of maintenance templates (which might skew the potential recommendations) and topic preferences (which might skew which recommendations are actually shown). So while Suggested Edits ended up being a pretty straightforward story (biased content -> biased recommendations -> biased edits), Newcomer tasks might be more complicated.