In T384651, we are conducting an initial internal review of the Peacock language detection model.
This task builds on that initial review by inviting volunteers to assess the evaluations the model has made so that we can identify what adjustments, if any, need to be made for the model to meet the precision thresholds we've set.
Stories
- As an experienced volunteer/moderator motivated to ensure that new(er) volunteers are receiving helpful feedback that aligns with Wikipedia policies, I want to inspect the evaluations the Peacock Check model is making so that I can be confident the feedback new(er) volunteers will ultimately be receiving is constructive for them and the wiki at-large.
- As an experienced volunteer/moderator understandably skeptical of AI...
- As a tester, I need a simple interface to check the model.
Volunteer Review Requirements
In order for WMF staff to evaluate the performance of the model using volunteer input, they will need to:
- Recruit at least 3 (but ideally 5+) human evaluators per language
- Use fresh content in our eval dataset (published in last 12 months)
Previously, we'd listed the Annotool requirements here; we've since moved them to T392324.
Looking for volunteers
If you want to help us review the Peacock language detection model, please sign up here.
The list of languages for this first phase is final. Please avoid adding another language.
The first phase of evaluation will start with the languages listed as Priority == yes and Eval Data Status == Available.
This is a first test; the end goal remains to provide Peacock Check to all wikis.
Languages
| Wiki | Language | Priority – Editing Team | Eval Data Status – ML Team | Human Evaluation Status | Volunteer Contacts – Editing Team | Notes |
|---|---|---|---|---|---|---|
| ar.wiki | Arabic | Yes | Available | Annotool not yet ready | signup for test | This community is potentially interested in testing the model. |
| cs.wiki | Czech | Yes | TBD | Annotool not yet ready | @matej_suchanek | |
| de.wiki | German | | TBD | Annotool not yet ready | | |
| en.wiki | English | Yes | Available | ⏳ Annotool set up in progress | signup for test | |
| es.wiki | Spanish | Yes | Available | ⏳ Annotool set up in progress | signup for test | This community is interested in testing the model. |
| fa.wiki | Persian | | TBD | Annotool not yet ready | @Huji, @Ladsgroup | |
| fr.wiki | French | Yes | TBD | ⏳ Annotool set up in progress | | |
| he.wiki | Hebrew | Yes | TBD | Annotool not yet ready | | |
| hy.wiki | Armenian | | Not available | Annotool not yet ready | @Mari_Avetisyan_WMAM | |
| id.wiki | Indonesian | Yes | TBD | Annotool not yet ready | | This community is interested in testing the model. |
| it.wiki | Italian | | TBD | Annotool not yet ready | | |
| ja.wiki | Japanese | Yes | Available | ⏳ Annotool set up in progress | signup for test | |
| mk.wiki | Macedonian | | Not available | Annotool not yet ready | @Ehrlich91 | |
| nl.wiki | Dutch | | TBD | Annotool not yet ready | | |
| no.wiki | Norwegian Bokmål | | TBD | Annotool not yet ready | | |
| pl.wiki | Polish | Yes | TBD | Annotool not yet ready | @PMG, @Msz2001, Sławek Borewicz | |
| pt.wiki | Portuguese | Yes | Available | ⏳ Annotool set up in progress | signup for test | |
| ro.wiki | Romanian | | TBD | Annotool not yet ready | @Strainu | |
| ru.wiki | Russian | | TBD | Annotool not yet ready | @Iniquity | |
| tr.wiki | Turkish | Yes | TBD | Annotool not yet ready | | |
| uk.wiki | Ukrainian | | TBD | Annotool not yet ready | | |
| uz.wiki | Uzbek | | TBD | Annotool not yet ready | @Panpanchik | |
| zh.wiki | Chinese | Yes | TBD | Annotool not yet ready | @SCP-2000, @Hamishcn, @Yiming, @Stang | |
Process
- ✅ Define what tool we'll use to gather feedback
- ✅ Define requirements for evaluation data* – T392324: Prepare annotool for Tone Check model evaluation
- Invite volunteer input
- Evaluate results
*NOTE: if we choose to move forward with Annotool, the instance might need to be updated to meet the evaluation requirements (e.g., a free-form text field for reviewers to express nuance in their evaluations).
Open questions
- 1. What tool/format will we invite volunteers to use to offer feedback? E.g., might we build a dedicated UI (see "References" below)? Might we use a spreadsheet? Something else?
- Per T388471#10755841, we're going to use Annotool to gather input. The work to prepare Annotool will happen in T392324.
- 2. How many edits will we need each language community to review for the results to be meaningful to the ML Team?
References
- mw:Add an image/2025 algorithm test via Growth Team
- https://annotool.toolforge.org/projects/8
- https://alis-evaluation.toolforge.org/
- A tool the Growth Team built to evaluate article-level image suggestions
- https://editcheck.sthottingal.workers.dev/?revision=1263744496

