
Invite volunteers to review Peacock language model (v1)
Closed, ResolvedPublic

Description

In T384651, we are conducting an initial internal review of the Peacock language detection model.

This task builds on that initial review by inviting volunteers to evaluate the judgments the model has made, so that we can identify what – if any – adjustments need to be made to it in order to meet the precision thresholds we've set.

Stories

  • As an experienced volunteer/moderator motivated to ensure that new(er) volunteers are receiving helpful feedback that aligns with Wikipedia policies, I want to inspect the evaluations the Peacock Check model is making so that I can be confident the feedback new(er) volunteers will ultimately be receiving is constructive for them and the wiki at-large.
  • As an experienced volunteer/moderator understandably skeptical of AI...
  • As a tester, I need a simple interface to check the model.

Volunteer Review Requirements

In order for WMF staff to evaluate the performance of the model using volunteer input, they will need to:

  • Recruit at least 3 (but ideally 5+) human evaluators per language
  • Use fresh content in our eval dataset (published in last 12 months)

Previously, we listed the requirements for Annotool here; we've since moved them to T392324.

Looking for volunteers

If you want to help us review the Peacock language detection model, please sign up here.
The list of languages for this first phase is final. Please avoid adding another language.
The first phase of evaluation will start with the languages listed as Priority == yes and Eval Data Status == Available.
This is a first test. The end goal remains to provide Peacock check to all wikis.

Languages

| Wiki | Language | Priority – Editing Team | Eval Data Status – ML Team | Human Evaluation Status | Volunteer Contacts – Editing Team | Notes |
|---|---|---|---|---|---|---|
| ar.wiki | Arabic | Yes | Available | Annotool not yet ready | signup for test | This community is potentially interested in testing the model. |
| cs.wiki | Czech | Yes | TBD | Annotool not yet ready | @matej_suchanek | |
| de.wiki | German | | TBD | Annotool not yet ready | | |
| en.wiki | English | Yes | Available | ⏳ Annotool set up in progress | signup for test | |
| es.wiki | Spanish | Yes | Available | ⏳ Annotool set up in progress | signup for test | This community is interested in testing the model. |
| fa.wiki | Persian | | TBD | Annotool not yet ready | @Huji, @Ladsgroup | |
| fr.wiki | French | Yes | TBD | ⏳ Annotool set up in progress | | |
| he.wiki | Hebrew | Yes | TBD | Annotool not yet ready | | |
| hy.wiki | Armenian | | Not available | Annotool not yet ready | @Mari_Avetisyan_WMAM | |
| id.wiki | Indonesian | Yes | TBD | Annotool not yet ready | | This community is interested in testing the model. |
| it.wiki | Italian | | TBD | Annotool not yet ready | | |
| ja.wiki | Japanese | Yes | Available | ⏳ Annotool set up in progress | signup for test | |
| mk.wiki | Macedonian | | Not available | Annotool not yet ready | @Ehrlich91 | |
| nl.wiki | Dutch | | TBD | Annotool not yet ready | | |
| no.wiki | Norwegian Bokmål | | TBD | Annotool not yet ready | | |
| pl.wiki | Polish | Yes | TBD | Annotool not yet ready | @PMG, @Msz2001, Sławek Borewicz | |
| pt.wiki | Portuguese | Yes | Available | ⏳ Annotool set up in progress | signup for test | |
| ro.wiki | Romanian | | TBD | Annotool not yet ready | @Strainu | |
| ru.wiki | Russian | | TBD | Annotool not yet ready | @Iniquity | |
| tr.wiki | Turkish | Yes | TBD | Annotool not yet ready | | |
| uk.wiki | Ukrainian | | TBD | Annotool not yet ready | | |
| uz.wiki | Uzbek | | TBD | Annotool not yet ready | @Panpanchik | |
| zh.wiki | Chinese | Yes | TBD | Annotool not yet ready | @SCP-2000, @Hamishcn, @Yiming, @Stang | |
NOTE: languages currently noted as "Editing Team Priorities" reflect: 1) technical priorities per T388471#10781906, 2) projects that see relatively high volumes of newcomers, specifically newcomers living in Sub-Saharan Africa, and 3) projects that have expressed a willingness to experiment with Peacock Check

Process

  1. Define what tool we'll use to gather feedback
  2. Define requirements for evaluation data *
  3. T392324: Prepare annotool for Tone Check model evaluation
  4. Invite volunteer input
  5. Evaluate results

*NOTE: if we choose to move forward with Annotool, the instance might need to be updated to meet the evaluation requirements (e.g. free-form text for reviewers to express nuance around their evaluations)

Open questions

  • 1. What tool/format will we invite volunteers to use to offer feedback? E.g. might we build a dedicated UI (see "References" below)? Might we use a spreadsheet? Something else?
  • Per T388471#10755841, we're going to use Annotool to gather input. The work to prepare Annotool will happen in T392324.
  • 2. How many edits will we need each language community to review for the results to be meaningful to the ML Team?

References

Related Objects

Event Timeline


If/when we come to learn volunteers do not find the spreadsheet approach – which Sucheta is going to mockup – easeful enough, we'll consider alternative approaches.

Oh, and @jhsoby, since you're here: might you be open to reviewing/offering feedback about the spreadsheet Sucheta is going to draft?

Hi. Responding to T388215#10655076, I can help review the model in Romanian.

Oh, and @jhsoby, since you're here: might you be open to reviewing/offering feedback about the spreadsheet Sucheta is going to draft?

Sure 👍

@jhsoby, @Matej_Orlicky, and @Strainu: wonderful! Thank you for being open to this, y'all 🙏🏼

Next step
@SSalgaonkar-WMF will be sharing two things for you to review:

  1. The mock spreadsheet I referred to in T388471#10655725
  2. A prototype for what it could look like to evaluate the model using a dedicated tool (https://annotool.toolforge.org/)

Hello, I can help review the model in Chinese (zh.wiki). Thanks.

Oh, wonderful! Thank you, @SCP-2000. We'll @ mention you here when there is something ready for you to review...

...two things for you to review:

  1. The mock spreadsheet I referred to in T388471#10655725
  2. A prototype for what it could look like to evaluate the model using a dedicated tool (https://annotool.toolforge.org/)

Update

Alright! The two artifacts are ready for review...

Request
@jhsoby, @Matej_Orlicky, @SCP-2000, @Strainu, and anyone else here: Might you be able to review the two artifacts below (spreadsheet + annotool)?

We're specifically interested in learning:

  1. Which of the two options below do you think would be most easeful to use to offer feedback about the model's evaluations?
  2. What – if any – information beyond the below can you see yourself needing in order to express an opinion about whether you think an edit introduces non-neutral language?
    • Text an edit adds/changes
    • Paragraph the added/modified text is a part of
    • Name of article the edit was made to
    • Diff showing the paragraph before and after the change
| Format | Option #1: Spreadsheet | Option #2: Annotool |
|---|---|---|
| Screenshot | image.png (598×2 px, 247 KB) | image.png (1×2 px, 226 KB) |
| Link to review | EXAMPLE: Peacock language detection - model evaluation | https://annotool.toolforge.org/projects/11 |
WARNING: for demonstration purposes, both options above are only showing edits in English. When we're ready for y'all to review actual edits, evaluation data will be made available for each of the languages listed in the task description.
NOTE: a big thank you to @MunizaA for creating the annotool instance 🙏🏼👏🏼

I added myself to help with Persian. I also think the wrong Matej was tagged by @ppelberg above; @matej_suchanek is the one who should have been tagged.

Personally, I prefer option #2 and embedding the review process into the Wiki interface.

@Ladsgroup this may be of interest to you too. And the wiki-embedded tooling you created for your revert-bot (Shahbaz) may be helpful here too.

Per Huji, I would prefer option 2. Perhaps we can write a remark for each individual edit if available, instead of simply tagging "yes" or "no". Thank you.

Per what was stated above, I can help review the model in Arabic (ar.wiki).

In my opinion, opt. 2 is the most easeful. The labeling narrows to simple yes/no, and it also contains the diff information (which the check will also use). But we won't get the "dictionary" (If yes, what words or phrases should be detected?).

Personally, I prefer option #2 and embedding the review process into the Wiki interface.

I agree. I dream of a day when we could add such information on the diff page and feed the data back to the ML infrastructure to retrain the models or do online learning.

@Ladsgroup this may be of interest to you too. And the wiki-embedded tooling you created for your revert-bot (Shahbaz) may be helpful here too.

Yeah. I think I can be both ways. Keep me in the loop.

Update

We're specifically interested in learning:

  1. Which of the two options below do you think would be most easeful to use to offer feedback about the model's evaluations?
  2. What – if any – information beyond the below can you see yourself needing in order to express an opinion about whether you think an edit introduces non-neutral language?

@Huji, @Ladsgroup, @matej_suchanek, @SCP-2000: thank you all for thinking through this decision with us and sharing what you think about them. 🙏🏼

DECIDED
We're going to move forward with the approach y'all converged on: Option #2: Annotool.

NEXT STEPS

  1. The ML and Editing Teams will be working together to...
    1. Update annotool with: A) a text field wherein y'all (volunteers) can offer context about what specifically you consider to be non-neutral about a given diff/edit and B) evaluation data for the languages listed in the task description
      • Note: this work will happen in T392324.
  2. As evaluation data becomes ready ("B)"), we'll @-mention you in a comment here to signal that you're clear to begin reviewing

While the above is happening, an ask of y'all: can you please invite a few peers you know who you think would be interested and equipped to help out with evaluating the model in the languages listed above? Ideally, there are ≥3 people/language reviewing the model.

Ok! That's it for this comment. I'm going to follow up in a separate one with responses to some other ideas/thoughts that were raised...

@Panpanchik, @Mari_Avetisyan_WMAM, @Ehrlich91, and @PMG: I've boldly added you all as people who would be open to reviewing the Peacock Check model in Uzbek, Armenian, Macedonian, and Polish when data becomes available.

If I've at all misinterpreted your willingness to contribute to the project in this way, please do let me know (here or via email)!

Regardless, thank you all for all of the insightful and thoughtful comments you shared during Wednesday's CEE - APP Call 🙏🏼

Hi, if it's okay for you, I can help w/ zhwiki's.

You helping with zh.wiki would be wonderful, @Hamishcn! Thank you for volunteering ^ _ ^

A resulting question: might there be a page on zh.wiki that people are likely to check who would be interested in also reviewing the model?

Reason I ask: we're seeking 3 volunteers per wiki.

Language priorities

While the languages we prioritize supporting to start will depend, in large part, on what evaluation data is available (Sucheta and Aiko are looking into this), the Editing Team deems the following languages as priorities:

  • Wikis that use the “variants” feature (Chinese) — because the model has to infer across different language varieties
  • Languages that don’t space-separate words (Chinese, Japanese) — where the results will be very dependent on the tokenizer
  • Agglutinative languages (Turkish, Indonesian) — where the model will be very dependent on the tokenizer
  • RTL (Arabic, Hebrew) — because of UX issues

I've updated the task description to reflect the above...

Could give a hand on zhwiki if still needed

Appreciated! It looks like there's no need to post on zh.wiki now. :)

Hi @ppelberg, I can help zhwiki. :)

In T388471#10782832, @Stang wrote:

Could give a hand on zhwiki if still needed

@Yiming + @Stang: Amazing! We'd value both of y'all's help – thank you for offering ^ _ ^

I'll @ mention you when there is something ready for you to see...

I can help with Polish

Wonderful! Thank you, @Msz2001 – I'll @ mention you when the evaluation data is ready.

Wonderful, @PMG – thank you! I've updated the task description to include them: T388471#10814490.

ppelberg updated the task description.

The first phase of evaluation is now done. The machine learning team has received enough evaluations for Spanish, Japanese, English, Portuguese, and French.

Future evaluations will be conducted and covered by a separate ticket. Users are invited to sign up on-wiki: https://www.mediawiki.org/wiki/Edit_check/Tone_Check/Model_evaluation

Hi y'all! Just wanted to send a BIG thank you to everyone who participated in evaluating the Tone Check model! We really appreciate your help.

I also wanted to post a link to the results of this evaluation: https://www.mediawiki.org/wiki/Edit_check/Tone_Check#Evaluating_the_model

If you have any questions about our learnings, please use the Tone Check discussion page to get in touch: https://www.mediawiki.org/wiki/Talk:Edit_check/Tone_Check