This task covers converging on a process for validating Tone Check model evaluation data in languages that staff members do not speak.
The need for this task emerged in response to the October 2025 Tone Check model evaluation wherein the Editing and ML Teams:
- Did not vet the eval data before sharing it with volunteers
- Learned that the eval data in at least four languages (Dutch, Latvian, German, and Polish) was "...heavily contaminated with vandalism, tiny irrelevant edits (e.g., adding categories, images with short captions), and obvious vandalism reverts." | source
Story
As an experienced volunteer interested in helping to vet a new tone detection model in a language I am fluent in, I want the data I'm being asked to review to be relevant to this aim, so that I do not need to allocate time and attention to work that is not aligned with the task I volunteered for.
Open questions
- 1. How will samples be generated?
- 2. How will samples be validated?
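While both questions above remain open, a heuristic pre-filter is one possible starting point for sample validation: screen out the kinds of contamination observed in the October 2025 eval (vandalism reverts and tiny irrelevant edits) before any human review. The sketch below is purely illustrative; the field names (`comment`, `size_delta`), the revert markers, and the size threshold are assumptions, not a vetted spec or an actual MediaWiki API schema.

```python
# Illustrative heuristic pre-filter for candidate eval edits.
# Field names, markers, and thresholds are hypothetical placeholders.

REVERT_MARKERS = ("revert", "undid", "rollback")
MIN_SIZE_DELTA = 50  # bytes changed; placeholder value, not a vetted threshold


def is_candidate(edit: dict) -> bool:
    """Return True if an edit looks substantive enough for tone review."""
    comment = edit.get("comment", "").lower()
    # Exclude obvious vandalism reverts based on edit-summary keywords.
    if any(marker in comment for marker in REVERT_MARKERS):
        return False
    # Exclude tiny edits (e.g., adding a category or a short caption).
    if abs(edit.get("size_delta", 0)) < MIN_SIZE_DELTA:
        return False
    return True


edits = [
    {"comment": "Expanded history section", "size_delta": 420},
    {"comment": "Reverted edits by 192.0.2.1", "size_delta": 420},
    {"comment": "add category", "size_delta": 18},
]
candidates = [e for e in edits if is_candidate(e)]
```

A filter like this would only narrow the candidate pool; a second step (e.g., spot-checking by a fluent reviewer) would still be needed to confirm relevance.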
Eval data vetting process
- TBD
- TBD
- TBD
- TBD
- TBD
Done
- Define and document a process for ensuring model evaluation data is high enough quality for volunteers to review