
AB Test Logic for Machine Assisted Article Descriptions
Closed, Resolved · Public

Description

The team is working on Machine Assisted Article Descriptions and will conduct an A/B test over 30 days from the release date, with a check-in by the Data Scientist at the 15-day mark to see if we are on track to meet our goal.
During initial evaluation of the model, its suggestions were preferred more than 50% of the time over human-generated article descriptions. Additionally, Descartes held a 91.3% accuracy rate in testing; however, it was tested with a much smaller audience.

Research Questions

In running this A/B test, we are seeking to learn whether introducing machine assistance to article descriptions will cause the following to happen:

  1. New users produce higher quality article descriptions when exposed to suggestions
  2. Prior performance of the model holds up across mBART25 languages when exposed to more users
  3. Stickiness of the task increases by 5%

Decision to be made

This A/B test will help us make the following decision:

  • Expand the feature to all users
  • Use suggestions as a means to train new users and remove the 3-edit minimum gate
  • Migrate the model to a more permanent API
  • Show 1 or 2 beams
  • Expand to mBART50

ABC Logic Explanation

  • The experiment will include only logged-in users, in order to stabilize group assignment.
  • The only users that will see the suggestions are those on mBART25 wikis.
  • Of those on mBART25 wikis, half will see suggestions (B: treatment) and half will not see suggestions (control).
  • Of those on mBART25 wikis, only users with more than 50 edits can see suggestions for Biographies of Living Persons (BLPs); users assigned to the non-BLP group remain in it even if they cross 50 edits during the experiment.

Additionally, we care about how the answers to our experiment will differ by language wiki and user experience (New: <50 edits vs. Experienced: 50+ edits).
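For illustration, here is a minimal Python sketch of how this assignment logic could work, assuming deterministic bucketing on a hash of the user ID performed once at enrollment; the function name, group labels, and the MBART25_WIKIS constant are illustrative assumptions, not the app's actual implementation.

```python
import hashlib

# Illustrative set of mBART25 wiki language codes (assumption, not the canonical list).
MBART25_WIKIS = {"en", "ru", "vi", "ja", "de", "ro", "fr", "fi", "ko", "es",
                 "zh", "it", "nl", "ar", "tr", "hi", "cs", "lt", "lv", "kk",
                 "et", "ne", "si", "gu", "my"}

def assign_group(user_id: int, wiki: str, edit_count_at_enrollment: int,
                 is_logged_in: bool) -> str:
    """Assign an experiment group once, at enrollment. Assignments never change
    mid-experiment, even if the user later crosses 50 edits."""
    if not is_logged_in or wiki not in MBART25_WIKIS:
        return "not_in_experiment"
    # Deterministic 50/50 split based on a hash of the user ID.
    bucket = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16) % 2
    if bucket == 0:
        return "control"            # sees no suggestions
    if edit_count_at_enrollment > 50:
        return "treatment_blp"      # sees suggestions, including on BLP articles
    return "treatment_non_blp"      # sees suggestions, excluding BLP articles
```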

Decision Matrix

  • If the accuracy rate for edits that came from the suggestion is lower than that of manually written descriptions, we will not keep the feature in the app. The accuracy rate will be determined based on manual patrolling.
  • If the accuracy rate for edits that came from the suggestion is less than 80%, we will not keep the feature in the app. The accuracy rate will be determined based on manual patrolling.
  • If the time spent to complete the task using the suggestion is double the average of those who do not see suggestions, we will compare it against performance reports to see if there are performance issues.
  • If the time spent to complete the task using the suggestion is less than the average without a negative impact on the accuracy rate, we will consider it a positive indicator for expanding the feature to more users.
  • If users who see suggestions modify them more often than submitting them unmodified, we will compare their accuracy rate to that of users who did not see suggestions, to determine whether the suggestion is a good starting point and how this differs by user experience.
  • If users who see suggestions modify them more often than submitting them unmodified, we will look for trends in the modifications and offer a recommendation to EPFL for updating the model.
  • If beam one is chosen more than 25% more often than beam two while maintaining an equal or higher accuracy rate, we will show only beam one in the future.
  • If users who see the treatment return to the task multiple times (at 1, 2, 7, and 14 days) at a rate 15% or more higher than the control group, without a negative impact on accuracy, we will take steps to expand the feature.
  • If our risks are triggered, we will implement our contingency plan.
  • If users who see the treatment do not select a suggestion more than 50% of the time after viewing the suggestions, we will not expand the feature.
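As a hedged illustration, some of the quantitative thresholds above could be checked against aggregated per-group metrics roughly as in the following Python sketch; the GroupMetrics fields and decision keys are assumptions made for the example, not the actual reporting schema.

```python
from dataclasses import dataclass

@dataclass
class GroupMetrics:
    # Illustrative aggregate metrics per experiment group; field names are assumptions.
    accuracy_rate: float              # from manual patrolling, 0..1
    avg_task_time_sec: float          # average time to complete the task
    suggestion_selection_rate: float  # share of sessions where a suggestion was selected
    return_rate: float                # share of users returning to the task (1/2/7/14 days)

def evaluate_decision_matrix(treatment: GroupMetrics, control: GroupMetrics) -> dict:
    """Map the task's decision-matrix thresholds onto boolean outcomes."""
    return {
        # Keep only if suggestion edits are at least as accurate as manual ones and >= 80%.
        "keep_feature": (treatment.accuracy_rate >= control.accuracy_rate
                         and treatment.accuracy_rate >= 0.80),
        # Double the task time triggers a performance investigation.
        "check_performance": treatment.avg_task_time_sec >= 2 * control.avg_task_time_sec,
        # Faster completion without an accuracy penalty is a positive expansion indicator.
        "positive_time_indicator": (treatment.avg_task_time_sec < control.avg_task_time_sec
                                    and treatment.accuracy_rate >= control.accuracy_rate),
        # Return rate 15%+ above control without an accuracy penalty supports expansion.
        "expand_feature_stickiness": (treatment.return_rate >= 1.15 * control.return_rate
                                      and treatment.accuracy_rate >= control.accuracy_rate),
        # Suggestions selected 50% of the time or less means no expansion.
        "do_not_expand": treatment.suggestion_selection_rate <= 0.50,
    }
```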

Wikis

This section will contain the list of wikis participating in the A/B test.

NOTE: In aggregate, there should be at least 1,500 people, with a stretch goal of 2,000 people and 4,000 edits, included in the A/B test across the following mBART25 wikis: English, Russian, Vietnamese, Japanese, German, Romanian, French, Finnish, Korean, Spanish, Chinese (simplified), Italian, Dutch, Arabic, Turkish, Hindi, Czech, Lithuanian, Latvian, Kazakh, Estonian, Nepali, Sinhala, Gujarati, and Burmese.

Done

  • A report is published that evaluates the Decision Matrix above, the Risk Register, and the experiment questions in the concept deck.

Reference materials
Context Deck
Project Page

Event Timeline

@JTannerWMF @scblr Now that our analytics are opt-in (i.e. turned off) by default upon install, have we thought about how we'll ensure we get a sufficient amount of data from the test? E.g., should we be building an explicit opt-in switch into the onboarding screens (T331272)?

@Dbrant @SNowick_WMF @JTannerWMF, do we have before/after opt-in metrics for the design changes in T326173?

From a UX perspective, it's not recommended to ask users to opt-in twice, so I'm curious about what the data tells us.

Looking at unique users going through the Onboarding Funnel daily with the new screen events (accept/reject), the daily average acceptance rate is 88.7%. (Data is for 17 days, version 2.7.50431-r-2023-02-22 only.)

For comparison, our prior funnel had data sharing set to accept by default. We can approximate a quasi acceptance rate as 'uniques who completed the funnel' minus 'unique users who opted out of sharing', divided by 'uniques who started the funnel', which comes to 82.5%. (See deck.)
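To make that formula explicit, here is a minimal Python sketch of the calculation; the counts in the usage example are placeholders chosen only to illustrate the arithmetic, not the actual funnel numbers.

```python
def quasi_acceptance_rate(started: int, completed: int, opted_out: int) -> float:
    """(uniques who completed the funnel - uniques who opted out of sharing)
    divided by uniques who started the funnel."""
    return (completed - opted_out) / started

# Placeholder counts for illustration only (not real data):
# 10,000 started, 8,500 completed, 250 opted out -> 0.825 (82.5%)
print(quasi_acceptance_rate(10_000, 8_500, 250))
```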

Hi @JTannerWMF ,

If a user has an edit count <50 when assigned to a group, and they continue making edits during the experiment and at some point cross 50, should we bump them to the BLP group or keep the initial assignment?
My assumption is that in a true ABC test sense, the assignments shouldn't change mid-experiment, but please let me know.

Updating here what we discussed offline: users shouldn't switch groups mid-experiment. So Group 2 users remain in that group even if they cross 50 edits; they will not get bumped up.

Adding a reminder that the beams should be randomized on the client side, but we should still let Shay know which beam was chosen.

@JTannerWMF Validating that the API order and display order are indicated in the event stream for the edit success event, so we will be able to track the chosen beam based on both the original API order and the display order. See the spreadsheet for event-to-OKR validation based on the latest code.
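A minimal Python sketch of client-side display-order randomization that keeps both orderings available for the edit success event; the function and field names are assumptions for illustration, not the app's actual event schema.

```python
import random

def randomize_beams(api_beams: list[str]) -> list[dict]:
    """Shuffle the display order of the beams while remembering the API order,
    so the chosen beam can be reported against both orderings."""
    display = list(enumerate(api_beams))  # pairs of (api_index, suggestion_text)
    random.shuffle(display)
    return [
        {"text": text, "api_index": api_index, "display_index": display_index}
        for display_index, (api_index, text) in enumerate(display)
    ]

def build_edit_success_event(beams: list[dict], chosen_display_index: int) -> dict:
    """Assumed event payload: records the chosen beam's position in both orders."""
    chosen = beams[chosen_display_index]
    return {
        "chosen_api_index": chosen["api_index"],
        "chosen_display_index": chosen["display_index"],
    }
```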

After a short discussion with @JTannerWMF, we decided to restrict the experiment to only logged-in users in order to stabilize user group assignment. Updated the task description accordingly:
"Logged-in users editing article descriptions through the SE feed, or through articles which do not have an existing description [if it is a change to an existing description we do not show machine-generated suggestions], are in the experiment."