
SuggestBot Experimentation
Closed, Resolved · Public

Description

Goal: design and implement an experiment to explore editor willingness to balance personalization with content equity in edit recommender systems.

Subtasks

This is an attempt at a semi-complete listing -- we may choose to decline some of these steps:

Documentation

Offline Analysis

Experiment

  • SuggestBot onboarding -- get code running fully
  • Power analysis: should be able to see small effects even with three experimental conditions across several months on enwiki
  • Experiment design
  • Get experimental code running on test instance
  • Offline analysis of simulation of experiment on test instance to verify assumptions / power analysis
  • Design review w/ Morten + Loren
  • Coordinate with Morten for keeping Mo's fork and main branch in alignment so deploy/undeploy is easy
  • Coordinate with Morten/Isaac to be maintainers for the code in case of issues
  • Dry-run of code on the official SuggestBot instance where we log what we would have done but don't change any recommendations for a week or two, to make sure the output matches what we saw with Mo's fork (we scrapped this in favor of a careful roll-out)
  • Deployment to small pilot group to make sure no errors
  • Deployment to all for at least one month and ideally a few months (decide in advance how many rec sets to cut off at)
  • IRB approval
  • Survey/Interviews (next project)

Event Timeline

Weekly updates:

  • Got SuggestBot code (mostly) fully running and generating recommendations. Need access to VM running live SuggestBot to do the rest. (Asked Morten for access, which will be granted.)
  • Implemented re-ranking algorithm in SuggestBot. Still needs logic for assigning users to experimental groups. (Waiting on access to SuggestBot VM for this as well.)
  • Retrieved data for re-run of offline analyses. New analyses currently in progress.

Weekly updates:

  • Base SuggestBot code is fully functional on VM.
  • Power analysis completed. Conclusion: getting enough data for testing our hypotheses should be relatively trivial.
  • SuggestBot offline analyses re-done and new supplementary analysis completed. Currently thinking about potential modifications to these analyses.
  • (Tentatively) decided on a high-level experimental design: three groups -- control, explicit feature-based, and campaign/WikiProject-based -- with the experimental group randomized per recommendation set rather than assigning users to fixed experimental groups (see the sketch after this list).
  • Currently debugging re-ranking algorithm code and sanity-checking the output. Next step after that is offline analysis of outcomes of re-ranking (i.e., looking at how recs are different after re-ranking).
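
As a rough illustration of the per-recommendation-set randomization, here's a minimal sketch in Python. The condition labels and function name are hypothetical; the real SuggestBot code may implement this differently:

```python
import random

# Hypothetical condition labels matching the three groups above.
CONDITIONS = ["control", "explicit-feature", "campaign-wikiproject"]

def assign_condition(rng=random):
    """Draw a condition independently for each recommendation set,
    rather than fixing a single condition per user."""
    return rng.choice(CONDITIONS)

# The same user can land in different conditions across rec sets:
rng = random.Random(42)
print([assign_condition(rng) for _ in range(5)])
```

Randomizing per recommendation set means every user contributes observations to every condition, which helps control for between-user variation.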

Weekly updates:

  • Spent a good amount of time thinking about the logic of the various parameters in Sonboli et al.'s re-ranking algorithm (esp. feature weights). Had a chat with Nasim Sonboli where she confirmed my intuitions about directionality of feature weights.
  • Decided to go with a more heavily modified version of the Sonboli et al. algorithm (a rough sketch of the weighted re-ranking idea follows this list).
  • Still conducting offline analysis of the re-ranking outputs. I've tried various weighting schemes, none of which have been great (very little effect on the composition of the recs), but I'm on track to settle on reasonable algorithm parameters for the experiment by early next week.
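
For context, the general shape of a weighted fairness-aware re-ranker in this family looks roughly like the sketch below. This is a generic illustration, not the exact Sonboli et al. algorithm or our modified version; all names, the fairness-gain term, and the trade-off parameter `lam` are assumptions for illustration:

```python
def rerank(items, relevance, has_feature, weights, lam=0.5, k=10):
    """Greedily build a top-k list, trading off model relevance against
    coverage of weighted (protected) features.

    items:       list of item ids
    relevance:   dict item -> model score
    has_feature: dict (item, feature) -> bool
    weights:     dict feature -> weight (higher weight pushes the
                 feature harder -- the directionality question above)
    """
    selected = []
    covered = {f: 0 for f in weights}
    pool = set(items)
    while pool and len(selected) < k:
        def score(i):
            # Fairness gain: reward items carrying features the
            # list so far does not yet cover.
            gain = sum(w for f, w in weights.items()
                       if has_feature.get((i, f), False) and covered[f] == 0)
            return (1 - lam) * relevance[i] + lam * gain
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
        for f in weights:
            if has_feature.get((best, f), False):
                covered[f] += 1
    return selected
```

One way a weighting scheme can end up having little visible effect on the recs' composition, as noted above, is when the relevance term dominates the fairness gain, e.g., if `lam` or the weights are small relative to the scale of the model scores.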

Weekly updates:

  • After some discussion, we decided to go with a much simplified re-ranking algorithm. The current implementation has a 50% chance of adding an additional filtering criterion (e.g., the article must be a biography of a woman, or must pertain to the Global South); a sketch follows this list.
  • Did some testing of the aforementioned filtering. Had to decide on an appropriate fallback for when no article that meets the additional filtering criteria can be found. Tested 1) falling back to recommending a random article that meets all filter criteria and 2) falling back to the normal, less stringent, filtering criteria. Decided on the latter.
  • With the above completed, we have a (mostly) settled experimental design. The one big missing part is a filtering feature that is tied to campaigns as opposed to explicit criteria like gender and region. Brainstorming possibilities here and will test any plausible approaches.
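
A minimal sketch of that coin-flip filtering plus the chosen fallback (option 2). Predicate and function names are hypothetical, and picking a random qualifying article is a simplification of how SuggestBot actually selects recommendations:

```python
import random

def pick_article(candidates, base_filters, extra_filter, rng=random):
    """Apply the normal filters; with probability 0.5 also require the
    additional criterion, falling back to the normal filtering when no
    article satisfies it (the fallback chosen above)."""
    passes_base = [a for a in candidates if all(f(a) for f in base_filters)]
    if rng.random() < 0.5:
        stricter = [a for a in passes_base if extra_filter(a)]
        if stricter:
            return rng.choice(stricter)
        # No article meets the extra criterion -> fall back to the
        # normal, less stringent criteria.
    return rng.choice(passes_base) if passes_base else None
```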

Weekly updates:

  • Selected a new filtering criterion that is tied to campaigns. An article is a candidate for the "equity" experimental group if it is in one of 11 equity-based WikiProjects (Disability, Politics, Agriculture, Medicine, Education, Water, Sanitation, Energy, Environment, Climate change, Human rights).
  • Finalized experimental design. Presented on it to the Research team and to Loren (my PhD adviser). Overall feedback was very positive, so we're good to move forward!
  • Spent some time tweaking the probabilities for selection into the various experimental groups. They are now set at a 20% chance of baseline, 34.3% gender, 22.9% geography, and 22.9% equity WikiProject; failing to generate a recommendation for one of the treatment groups results in a fallback to the baseline recommendation (see the sketch after this list). In practice this yields a breakdown of ~50% baseline and ~15-17% for each of the other groups (based on 840 recommendations generated for a random subset of ~10 users).
  • Meeting with Morten next Tuesday to coordinate on deployment of experiment.
  • Did additional power analyses based on Monte Carlo simulations. High-level conclusion is that ~3 months of data collection should be sufficient.
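
To make the fallback mechanics concrete, here is a sketch of the group draw with the stated probabilities; the fallback is what inflates the realized baseline share toward ~50%. Function and label names are illustrative, not the actual SuggestBot code:

```python
import random

GROUPS = ["baseline", "gender", "geography", "equity-wikiproject"]
WEIGHTS = [0.200, 0.343, 0.229, 0.229]  # choices() normalizes these

def assign_group(rng=random):
    return rng.choices(GROUPS, weights=WEIGHTS, k=1)[0]

def generate(group, try_treatment_rec, baseline_rec):
    """try_treatment_rec(group) returns None when no article satisfies
    that treatment's filter; we then serve the baseline recommendation,
    which is why ~50% of served recs end up baseline in practice."""
    if group != "baseline":
        rec = try_treatment_rec(group)
        if rec is not None:
            return group, rec
    return "baseline", baseline_rec()
```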

Weekly updates:

  • Decided on a plan for deployment of experiment: start with code review, complete "dry run" on fresh VPS instance, deploy to live in limited capacity (10ish users), then full deployment.
  • Code review completed.
  • Dry run on fresh VPS instance was successful. Wrote up checklist and bash scripts to make full deployment relatively seamless.
  • Limited deployment scheduled for sometime early next week (probably Monday or Tuesday).
  • @nettrom_WMF and I met last week to set up a limited deployment. I am currently receiving the modified/experimental recommendations while all other users get the normal recommendations. As of right now there are no issues, and SuggestBot is functioning as intended.
  • IRB asked for one final change to the protocol. I resubmitted this morning, so we can probably expect approval by early next week. Full deployment of the experiment will happen ASAP after that.

We launched the experiment on August 29, 2022. On September 7, 2022, we updated the experiment to also include single-request users (not just regular subscribers). We have been monitoring SuggestBot to make sure everything is working as intended, and have noticed no issues so far. The experiment is expected to run for approximately 3 months. At the 3-month mark, we will see how many recommendation sets have been served and plug that into the power analysis code I set up before the experiment. If we determine the sample size to be large enough, we will stop the experiment. Otherwise, we will update here with a new timeline.
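
The power check at the 3-month mark could look roughly like the Monte Carlo sketch below. The baseline rate, effect size, and outcome metric here are placeholders, not the experiment's actual parameters:

```python
import math
import numpy as np

def mc_power(n_per_arm, p_control=0.05, lift=0.01,
             sims=5000, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which a two-proportion
    z-test detects an assumed lift, given n recommendation sets per arm."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        c = rng.binomial(n_per_arm, p_control)         # control successes
        t = rng.binomial(n_per_arm, p_control + lift)  # treatment successes
        p_pool = (c + t) / (2 * n_per_arm)
        se = math.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
        if se > 0:
            z = ((t - c) / n_per_arm) / se
            if math.erfc(abs(z) / math.sqrt(2)) < alpha:  # two-sided p-value
                hits += 1
    return hits / sims

# e.g., plug in the number of rec sets actually served per arm:
print(mc_power(n_per_arm=4000))
```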

Just noting that the experiment was turned off January 9th -- we'll have to wait several weeks to start the analyses though.

Isaac updated the task description.

Paper submitted -- resolving the task, since any further revisions will hopefully be minor.