Background
The team has reached 30 days of the experiment and has sufficient data to start determining next steps for the Machine Assisted Article Description experiment. We have deployed a feature flag, and the feature is no longer visible to users. This task is the parent for closing out the feature.
Close Tasks
- Remove Model Card @Isaac
- Update FAQ page to explain where the Machine Assisted Article Descriptions task went @JTannerWMF
- Report findings for Machine Assisted Article Descriptions based on key indicators listed below @SNowick_WMF (June 6 2023 first draft)
- Update ticket and project page with next steps @JTannerWMF (June 16 2023)
Key Indicators
- Machine Assisted Article Descriptions have a higher accuracy score than human generated article descriptions, and this holds up across mBART languages
- Share how this score changes when Modified is T vs Modified is F
- 80% of Machine Assisted Article Descriptions have a score of 3 or higher
- Machine Assisted Article Descriptions accuracy score is not substantially lower for new users than for experienced users (experienced: 50+ edits; new: fewer than 50 edits)
- Time spent on a Machine Assisted Article Description is about the same as time spent on a human generated article description
- Beam one has a higher accuracy and selection score than beam two
- We have a higher proportion of users publishing the machine suggestions without modifications than with modifications
- People with the experiment treatment complete a higher number of descriptions in a day than those without it
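
As a rough illustration of how the score-based indicators above could be checked, here is a minimal Python sketch. The record layout (`source`, `modified`, `score`) and every value in it are hypothetical placeholders, not experiment data, and the 80% threshold check is just one plausible reading of the indicator.

```python
from statistics import mean

# Hypothetical graded records; field names and values are illustrative only,
# not data from the experiment.
records = [
    {"source": "machine", "modified": False, "score": 5},
    {"source": "machine", "modified": True, "score": 4},
    {"source": "machine", "modified": False, "score": 2},
    {"source": "machine", "modified": True, "score": 3},
    {"source": "human", "modified": False, "score": 4},
    {"source": "human", "modified": False, "score": 2},
]


def mean_score(rows):
    """Average accuracy score for a list of graded descriptions."""
    return mean(r["score"] for r in rows)


def share_at_least(rows, threshold=3):
    """Fraction of descriptions whose score is at or above the threshold."""
    return sum(r["score"] >= threshold for r in rows) / len(rows)


machine = [r for r in records if r["source"] == "machine"]
human = [r for r in records if r["source"] == "human"]

# Indicator: machine assisted accuracy vs. human generated accuracy.
machine_mean = mean_score(machine)
human_mean = mean_score(human)

# Indicator: how the score changes when Modified is T vs. F.
modified_mean = mean_score([r for r in machine if r["modified"]])
unmodified_mean = mean_score([r for r in machine if not r["modified"]])

# Indicator: 80% of machine assisted descriptions score 3 or higher.
machine_share = share_at_least(machine)
meets_threshold = machine_share >= 0.80
```

The same two helpers could be applied per language or per user tenure bucket by filtering `records` before calling them.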
Guardrails
- The revert rate will not be higher for those that see machine assisted article descriptions than for those that did not receive the experiment
- The rewrite rate will be higher for those that see machine assisted article descriptions than those that did not receive the experiment
- The revert and rewrite rates will be lower for those that modified machine assisted article descriptions than for those that published the machine assisted article description without modifications or wrote purely human generated article descriptions
- Less than 2% of users used the report function to indicate we displayed inappropriate content
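
The guardrails above reduce to comparing simple proportions. The sketch below shows one way that arithmetic might look; the `rate` helper, the group names, and all counts are hypothetical and chosen only to make the example run.

```python
def rate(events, total):
    """Proportion of `total` items for which the event occurred."""
    if total <= 0:
        raise ValueError("total must be positive")
    return events / total


# Hypothetical counts, for illustration only (not experiment results).
counts = {
    "treatment": {"published": 200, "reverted": 6, "users": 100, "reporters": 1},
    "control": {"published": 200, "reverted": 8},
}

treatment_revert = rate(counts["treatment"]["reverted"],
                        counts["treatment"]["published"])
control_revert = rate(counts["control"]["reverted"],
                      counts["control"]["published"])

# Guardrail: less than 2% of users used the report function.
report_rate = rate(counts["treatment"]["reporters"],
                   counts["treatment"]["users"])
report_guardrail_ok = report_rate < 0.02
```

A real analysis would likely add a significance test or confidence interval before concluding that one group's rate differs from the other's.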
Additional Questions to Answer
- What is the frequency of our experiment group (people exposed to machine assisted article descriptions) selecting the machine suggestion and hitting publish vs. modifying a suggestion vs. typing out the suggestion?
- What feedback did we get through the reporting feature, and what was the distribution of that feedback?
- How often are users coming back to try machine assisted article descriptions again in a 30 day period (1, 2, 7, 14) and does it differ from the users who did not get the experiment?
- Mean vs. median length of time to complete tasks by user tenure and response time under 5s