The team is working on Machine Assisted Article Descriptions and will conduct an A/B test over 30 days from the release date, with a check-in by the Data Scientist at the 15-day mark to see if we are on track to meet our goal.
During initial evaluation, the model's output was preferred over human-generated article descriptions more than 50% of the time, and Descartes held a 91.3% accuracy rate in testing. However, it was tested with a much smaller audience than this experiment will reach.
Research Questions
In running this A/B test, we are seeking to learn whether introducing Machine Assistance to Article Descriptions causes the following:
- New users produce higher-quality article descriptions when exposed to suggestions
- Prior performance of the model holds up across the mBART25 languages when exposed to more users
- Stickiness of the task increases by 5% (one possible measurement is sketched after this list)
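This section does not formally define stickiness, so the sketch below is one plausible measurement, assuming stickiness means the share of users who return to the task within a given window (matching the 1/2/7/14-day checkpoints in the Decision Matrix). The function name and data shape are illustrative, not the actual instrumentation.

```python
from datetime import datetime, timedelta

def stickiness(task_events: dict[str, list[datetime]], window_days: int = 7) -> float:
    """Share of users who return to the task within `window_days` of their
    first completion. `task_events` maps user_id -> sorted completion times.
    NOTE: this definition is an assumption; the experiment may measure
    stickiness differently (e.g. per-session, or at each of 1/2/7/14 days).
    """
    returned = eligible = 0
    for user, events in task_events.items():
        if not events:
            continue
        eligible += 1
        cutoff = events[0] + timedelta(days=window_days)
        # A "return" is any later completion inside the window.
        if any(events[0] < e <= cutoff for e in events[1:]):
            returned += 1
    return returned / eligible if eligible else 0.0
```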
Decisions to be made
This A/B test will help us make the following decisions:
- Expand the feature to all users
- Use suggestions as a means to train new users and remove the 3-edit minimum gate
- Migrate the model to a more permanent API
- Show one or two beams
- Expand to mBART50
A/B/C Logic Explanation
- The experiment will include only logged-in users, in order to stabilize the distribution.
- The only users who will see suggestions are those on mBART25-language wikis.
- Of those in mBART25, half will see suggestions (B: Treatment) and half will not see suggestions (Control).
- Of those in mBART25, only users with more than 50 edits can see suggestions for Biographies of Living Persons (BLP). Users in the non-BLP group remain in it even if they cross 50 edits during the experiment. (A code sketch of this assignment logic follows this list.)
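A minimal sketch of the assignment rules above, assuming a deterministic hash of a stable user identifier so that bucketing stays fixed for the whole experiment; the wiki list, function, and field names are illustrative, not the production implementation:

```python
import hashlib

MBART25_WIKIS = {"enwiki", "frwiki", "dewiki"}  # placeholder; the real set is the mBART25 languages
BLP_EDIT_GATE = 50

def assign_group(user_id: str, wiki: str, is_logged_in: bool,
                 edit_count_at_start: int) -> dict:
    """Bucket a user for the A/B test.

    Uses the edit count captured at experiment start, so a user who
    crosses 50 edits mid-experiment stays in the non-BLP group.
    """
    if not is_logged_in or wiki not in MBART25_WIKIS:
        return {"group": "excluded", "blp_suggestions": False}

    # Deterministic 50/50 split: hashing the user id means the same
    # user always lands in the same bucket.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
    group = "treatment" if bucket == 0 else "control"

    # BLP suggestions require both treatment assignment and >50 edits.
    blp = group == "treatment" and edit_count_at_start > BLP_EDIT_GATE
    return {"group": group, "blp_suggestions": blp}
```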
Additionally, we care about how the answers to our experiment questions differ by language wiki and by user experience (<50 edits: New vs. 50+ edits: Experienced); a segmentation sketch follows.
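A minimal sketch of that segmentation, assuming a per-user results table; the column names and example rows are hypothetical:

```python
import pandas as pd

# Hypothetical per-user results; column names are illustrative only.
results = pd.DataFrame({
    "wiki": ["enwiki", "enwiki", "frwiki", "frwiki"],
    "edit_count": [12, 230, 48, 75],
    "group": ["treatment", "control", "treatment", "control"],
    "accurate": [1, 1, 0, 1],
})

# Band users by experience, mirroring the <50 / 50+ split above.
results["experience"] = results["edit_count"].map(
    lambda n: "Experienced (50+)" if n >= 50 else "New (<50)")

# Break each metric out by language wiki and experience band.
by_segment = results.groupby(["wiki", "experience", "group"])["accurate"].mean()
print(by_segment)
```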
Decision Matrix
- If the accuracy rate for edits that came from the suggestion is lower than that of manually written descriptions, or is less than 80%, we will not keep the feature in the app. The accuracy rate will be determined through manual patrolling.
- If the time spent to complete the task using the suggestion is double the average of users who do not see suggestions, we will compare it against performance reports to see if there are performance issues
- If the time spent to complete the task using the suggestion is less than the average without suggestions, with no negative impact on the accuracy rate, we will consider it a positive indicator for expanding the feature to more users
- If users who see the suggestion modify it more often than submitting it without modification, we will compare its accuracy rate to that of users who did not see suggestions, determine whether the suggestion is a good starting point for users, and examine how this differs by user experience
- If users who see the suggestion modify it more often than submitting it without modification, we will look for trends in the modifications and offer a recommendation to EPFL to update the model
- If beam one is chosen at least 25% more often than beam two while having an equal or higher accuracy rate, we will show only beam one in the future
- If users who see the treatment return to the task repeatedly (at 1, 2, 7, and 14 days) at a rate 15% or more above the control group, without a negative impact on accuracy, we will take steps to expand the feature
- If any of our risks are triggered, we will implement our contingency plan
- If users who see the treatment do not select a suggestion more than 50% of the time after viewing the suggestions, we will not expand the feature (the quantitative thresholds in this matrix are sketched as decision logic below)
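A minimal sketch of how the quantitative rules above might be applied to the experiment's summary metrics. The `Metrics` container and every field name are hypothetical, "no negative impact on accuracy" is assumed to mean suggestion accuracy is at least manual accuracy, and the analysis-only rules (modification trends, beam choice) are omitted.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    # All fields are hypothetical names for the experiment's summary metrics.
    acc_suggested: float          # accuracy of suggestion-based edits (manual patrolling)
    acc_manual: float             # accuracy of manually written descriptions
    time_suggested: float         # mean seconds to complete the task with suggestions
    time_control: float           # mean seconds to complete the task without suggestions
    retention_lift: float         # treatment return rate over control at 1/2/7/14 days (0.15 = 15%)
    suggestion_select_rate: float # share of suggestion views where a suggestion was selected

def evaluate(m: Metrics) -> list[str]:
    """Apply the Decision Matrix thresholds and return the triggered outcomes."""
    outcomes = []
    if m.acc_suggested < m.acc_manual or m.acc_suggested < 0.80:
        outcomes.append("do not keep the feature in the app")
    if m.time_suggested >= 2 * m.time_control:
        outcomes.append("compare against performance reports")
    elif m.time_suggested < m.time_control and m.acc_suggested >= m.acc_manual:
        outcomes.append("positive indicator: consider expanding to more users")
    if m.retention_lift >= 0.15 and m.acc_suggested >= m.acc_manual:
        outcomes.append("take steps to expand the feature")
    if m.suggestion_select_rate <= 0.50:
        outcomes.append("do not expand the feature")
    return outcomes
```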
Wikis
(This section will contain the list of wikis participating in the A/B test.)
Done
- A report is published that evaluates the Decision Matrix above, the Risk Register, and the experiment questions in the concept deck
Reference materials
Context Deck
Project Page