Hypothesis
If we enable ≥3 volunteers to evaluate ≥30 sample edits each in every one of the 10 new languages we want to scale Tone Check to, we will learn how often volunteers agree with the model's predictions and be able to decide which new wikis Tone Check is ready to be deployed to.
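The hypothesis does not say how agreement will be measured. As one possible reading, the sketch below counts the share of individual volunteer judgments that match the model's prediction for the same edit. The function name, the input shapes, and the boolean labels are assumptions made for illustration, not the team's agreed method.

```python
def agreement_rate(model_labels, volunteer_labels):
    """Share of individual volunteer judgments that match the model.

    model_labels: dict of edit_id -> bool (hypothetical: True means the
        model flagged the edit as having a tone issue).
    volunteer_labels: dict of edit_id -> list of bool, one per volunteer.
    """
    total = 0
    matches = 0
    for edit_id, votes in volunteer_labels.items():
        for vote in votes:
            total += 1
            if vote == model_labels[edit_id]:
                matches += 1
    # Return 0.0 when there are no judgments, to avoid dividing by zero.
    return matches / total if total else 0.0
```

Computed per language, a rate like this could feed the deployment decision. Any cutoff for "ready" would still need to be chosen by the team.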
Work included
- Analyze probability score distributions across all prioritized languages to rule out languages where we do not expect the model to make high-quality predictions
- For every language whose probability score distribution indicates the model might perform well, prepare evaluation data (150 samples per language)
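The two work items above can be sketched roughly as follows, assuming the model's scores for a language are available as a list of floats in [0, 1]. The bin count, the 0.2 and 0.8 cutoffs, and all function names are hypothetical placeholders. Only the 150-sample size comes from this task.

```python
import random
from collections import Counter

SAMPLES_PER_LANGUAGE = 150  # sample size stated in this task


def score_histogram(scores, bins=10):
    """Count scores in equal-width bins over [0, 1]."""
    # A score of exactly 1.0 is folded into the last bin.
    counts = Counter(min(int(s * bins), bins - 1) for s in scores)
    return [counts.get(i, 0) for i in range(bins)]


def confident_share(scores, low=0.2, high=0.8):
    """Fraction of scores at or beyond the (hypothetical) cutoffs."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s <= low or s >= high) / len(scores)


def draw_eval_sample(edits, n=SAMPLES_PER_LANGUAGE, seed=0):
    """Draw up to n edits uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(edits, min(n, len(edits)))
```

A histogram with most of its mass near 0.5 would suggest the model rarely commits to a prediction in that language. What counts as too flat to proceed is a judgment this sketch does not make. Uniform sampling is also only one option, and sampling stratified by score bin may suit the evaluation better.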
Tentative timeline
Sep 1, 2025 to Sep 5, 2025 - George on rotation
Sep 8, 2025 to Sep 12, 2025 -
- George working on eval data generation
- Sucheta putting together Annotool instances
Sep 15, 2025 - George uploads eval data to Annotool instances
Sep 16, 2025 - Ready to begin gathering community feedback
Reporting format
Progress update on the hypothesis for the week, including if something has shipped:
Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):
Any emerging blockers or risks:
Any unresolved dependencies:
New lessons from the hypothesis:
Changes to the hypothesis scope or timeline: