
Conduct analysis for Alt Text experiment 15 days and 30 days after experiment start
Open, Low, Public

Description

Background
  • Release date: September 5, 2024, in version 7.5.8 (3979) and onwards
  • 15 days: September 20, 2024
  • 30 days: October 5, 2024
  • 60 days: November 4, 2024
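The checkpoint dates above can be re-derived from the release date. This is a minimal sketch; the names RELEASE_DATE and checkpoint_dates are illustrative and not part of any existing analysis notebook.

```python
from datetime import date, timedelta

# Release date taken from the Background section of this task.
RELEASE_DATE = date(2024, 9, 5)


def checkpoint_dates(start, offsets=(15, 30, 60)):
    """Map each day offset to the calendar date that many days after start."""
    return {days: start + timedelta(days=days) for days in offsets}


checkpoints = checkpoint_dates(RELEASE_DATE)
for days, day in checkpoints.items():
    print(f"{days} days: {day.isoformat()}")
```

If the release date shifts, changing RELEASE_DATE is enough to recompute all three analysis windows.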
The Task
  • Compare results for experiment groups to control group data
  • Visualize and present the data in a way that is easily understandable to the team
Requirements
  • The data should be based on the metrics in the Epic
At 15 days:

Check metric-specific leading indicators:

  1. 100 edits with alt text values from at least 25 unique editors, with at least 25 of those edits from newer editors
  2. More than 15 unique editors have been assigned to each experiment group
  3. 70% task acceptance rate for group B, at least 10% acceptance rate for group C (# of people who enter the flow / impressions of prompt)
  4. Revert rate for newer editors' edits in any single group does not exceed 18%
  5. Add data into Grading Sheet
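Indicators 1 to 4 above could be checked with something like the sketch below. It is illustrative only: the GroupCounts fields and function names are hypothetical, the 70% figure for group B is assumed to be a minimum, and the example counts are placeholders rather than experiment results.

```python
from dataclasses import dataclass


@dataclass
class GroupCounts:
    """Raw 15-day counts for one experiment group (field names are hypothetical)."""
    unique_editors: int
    prompt_impressions: int
    flow_entries: int          # editors who entered the alt text flow
    newer_editor_edits: int
    newer_editor_reverts: int


def safe_rate(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def check_leading_indicators(total_edits, unique_editors, newer_editor_edits, groups):
    """Evaluate indicators 1 to 4 and return a dict of indicator name -> bool."""
    min_acceptance = {"B": 0.70, "C": 0.10}  # thresholds from indicator 3
    return {
        "edit_volume": (
            total_edits >= 100 and unique_editors >= 25 and newer_editor_edits >= 25
        ),
        "group_size": all(g.unique_editors > 15 for g in groups.values()),
        "acceptance": all(
            safe_rate(groups[name].flow_entries, groups[name].prompt_impressions) >= floor
            for name, floor in min_acceptance.items()
        ),
        "newer_editor_reverts": all(
            safe_rate(g.newer_editor_reverts, g.newer_editor_edits) <= 0.18
            for g in groups.values()
        ),
    }


# Placeholder counts for illustration only; these are not experiment results.
example_groups = {
    "B": GroupCounts(40, 200, 150, 20, 2),
    "C": GroupCounts(35, 300, 45, 15, 3),
}
result = check_leading_indicators(
    total_edits=120, unique_editors=60, newer_editor_edits=35, groups=example_groups
)
print(result)
```

With these placeholder counts, group C's newer-editor revert rate is 3 / 15 = 20%, so the revert indicator fails while the other three pass.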
At 30 days (September 5 - October 5, 2024):

Measure KRs that require control group:

  • Edit return rate of editors in group B or C who have received an Alt text prompt does not differ from controls by more than 10%
  • Revert rate for edits from experiment groups does not exceed controls by more than 5 percentage points
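The two KRs above use different units: the revert rate limit is stated in percentage points, while the return rate limit is stated as "10%" without saying whether that is relative or absolute. The sketch below shows both calculations. Treating the return rate limit as a relative difference is an assumption, and all function names are hypothetical.

```python
def percentage_point_diff(treatment_rate, control_rate):
    """Absolute difference between two rates, expressed in percentage points."""
    return (treatment_rate - control_rate) * 100


def relative_diff(treatment_rate, control_rate):
    """Difference between two rates as a fraction of the control rate."""
    if control_rate == 0:
        raise ValueError("control rate must be non-zero for a relative comparison")
    return (treatment_rate - control_rate) / control_rate


def revert_guardrail_ok(treatment_revert_rate, control_revert_rate, max_points=5.0):
    """True when the treatment revert rate exceeds control by at most max_points."""
    return percentage_point_diff(treatment_revert_rate, control_revert_rate) <= max_points


def return_rate_within(treatment_return_rate, control_return_rate, max_relative=0.10):
    """True when the return rates differ by at most max_relative (assumed relative)."""
    return abs(relative_diff(treatment_return_rate, control_return_rate)) <= max_relative
```

For example, a treatment revert rate of 12% against a control rate of 8% is a 4 percentage point difference, which passes the guardrail, even though it is a 50% relative increase. Confirming which reading applies to the return rate KR before reporting would avoid ambiguity.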

Measure curiosities:

  • What is the most common reason that users decide not to act on the prompt? (Survey responses)
  • Did the overall constructive activation rate in the iOS app increase when we made Image recommendations available to brand new editors?
  • Add data into Grading Sheet
  • Pull matching data to this tab from image recs edits during same time period, for es, pt, zh, and fr (September 5 - October 5, 2024)
At 60 days:

Key Indicators

  1. 60% of group B editors publish an additional edit with alt text for the image they were prompted on
  2. 4% of group C users add alt text when prompted after editing an article
  3. Of group B editors who saw the treatment and then made a subsequent image recommendations edit in the next 15 days, 25% add alt text as part of that edit
  4. 200 images are improved with Alt text, by at least 50 unique editors
  5. Add data into Grading Sheet
  6. Pull matching data to this tab from image recs edits during same time period, for es, pt, zh, and fr (September 5 - November 4, 2024)

Decision Matrix for next steps:

  1. If at least 70% of edits are scored a 3 or higher, we will scale the feature. If fewer than 70% of edits are scored a 3 or higher, we will improve guidance or use AI to better assist users.
  2. If quality scores for newer editors are more than 50% worse than quality scores for experienced editors, we will not recommend this task be available to newer editors.
  3. If at least 60% of respondents say they would use a feature that provided a feed of images in need of alt text, we will have the confidence to pursue a feed of alt-text suggested edits
  4. If 60% or more of respondents say they would be interested in similar edit notifications for articles they are working on, and 60% of respondents are satisfied with the feature (Group C survey responses), we share this information and consider future edit prompts.
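The first decision rule can be expressed as a short calculation over the graded scores. This is a sketch under assumptions: it takes the cutoff to be 70%, treats scores as plain integers on whatever scale the graders use, and uses hypothetical function names and placeholder scores.

```python
def share_at_or_above(scores, threshold=3):
    """Fraction of graded edits whose score is at least threshold."""
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(1 for score in scores if score >= threshold) / len(scores)


def quality_decision(scores, scale_cutoff=0.70):
    """Apply decision rule 1: scale the feature or improve guidance."""
    if share_at_or_above(scores) >= scale_cutoff:
        return "scale the feature"
    return "improve guidance or use AI assistance"


# Placeholder grades for illustration only; these are not real grading results.
example_scores = [4, 3, 2, 5, 3, 1, 4, 3, 3, 2]
print(share_at_or_above(example_scores), quality_decision(example_scores))
```

Here 7 of the 10 placeholder grades are 3 or higher, so the share is exactly at the assumed cutoff and the rule returns "scale the feature".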

Guardrails

  1. Edit return rate of editors in group B or C who have received an Alt text prompt does not differ from controls by more than 10%
  2. Revert rate for edits from experiment groups does not exceed controls by more than 5 percentage points
  3. Human-graded* or actual revert rate for newer editors in experiment groups does not exceed 18%
  4. Alt text task completion rate for newer editors is above 25% (Completion rate = number of alt text edits published / those who said “yes” to the prompt and started the flow)
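The completion rate defined in guardrail 4 is a simple ratio. A minimal sketch follows, with hypothetical function names; it reads "above 25%" as a strict inequality.

```python
def completion_rate(edits_published, flows_started):
    """Alt text edits published divided by flows started; None if no flows started."""
    if flows_started == 0:
        return None
    return edits_published / flows_started


def meets_completion_guardrail(edits_published, flows_started, minimum=0.25):
    """True when the completion rate is strictly above the minimum."""
    rate = completion_rate(edits_published, flows_started)
    return rate is not None and rate > minimum
```

For instance, 30 published edits from 100 started flows gives a 30% completion rate and passes, while 25 from 100 does not, because the guardrail asks for a rate above 25%.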

Curiosities (nice to have)

  1. Did the overall constructive activation rate in the iOS app increase when we made Image recommendations available to brand new editors?
  2. How do the task completion rate, return rate, and revert rate for newer editors’ alt text edits compare with those of experienced editors, and with comparable rates from Growth suggested edits?
  3. How does the human-graded revert rate compare to Android's Image Captions Suggested Edit?
  4. Is there a difference in the number of alt text edits and unique editors by language and geography? (For example, breaking down edits from Latin America vs Europe for Spanish)
  5. What is the most common reason that users decide not to act on the prompt?

*Note: For the quality scores and human-graded revert scores, we will partner with an accessibility organization that will review and grade the alt text.

{1} Definition of newer editors: Editors who had fewer than 10 edits on the wiki they were editing at the point they entered the experiment

Event Timeline

30 Day Analysis Data. Constructive activation metric on hold until mediawiki_history is rolled up for 2024-10.