
Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages
Open, Needs TriagePublic

Description

Hypothesis

If we enable ≥3 volunteers to evaluate ≥30 sample edits each, for each of the 10 new languages we are seeking to scale Tone Check to, we will learn how often volunteers agree with model predictions and be able to decide which new wikis Tone Check is ready to be deployed to.

Work included

  • Analyze probability score distributions across all prioritized languages to rule out languages where we don't predict the model will be able to make high quality predictions
  • For all languages that have probability score distributions that indicate the model might perform well, prepare evaluation data (150 samples per language)
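The screening step above could be sketched as follows. This is a minimal illustration, not the actual analysis: the scores, the `screen` helper, and the 0.6 cut-off are all assumptions chosen for the example.

```python
# Sketch of the screening step: rule out languages whose positive-class
# score distribution looks weak. Scores and cut-off are illustrative
# assumptions, not values from the actual analysis.
from statistics import median

# scores[wiki] = model probabilities on candidate positive (peacock) samples
scores = {
    "cs": [0.81, 0.74, 0.79, 0.88, 0.70],
    "lv": [0.52, 0.55, 0.58, 0.51, 0.54],
}

MIN_MEDIAN = 0.6  # hypothetical cut-off for "model might perform well"

def screen(scores_by_wiki, min_median=MIN_MEDIAN):
    """Return wikis whose median positive-class score clears the cut-off."""
    return [w for w, s in scores_by_wiki.items() if median(s) >= min_median]

print(screen(scores))  # cs passes, lv is ruled out
```

In practice the analysis would look at the full distribution (quantiles, class overlap), not just a single summary statistic.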

Tentative timeline

Sep 1, 2025 to Sep 5, 2025 - George on rotation
Sep 8, 2025 to Sep 12, 2025 -

  • George working on eval data generation
  • Sucheta putting together Annotool instances

Sep 15, 2025 - George upload eval data to Annotool instances
Sep 16, 2025 - Ready to begin gathering community feedback

Reporting format

Progress update on the hypothesis for the week, including if something has shipped:

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

Any emerging blockers or risks:

Any unresolved dependencies:

New lessons from the hypothesis:

Changes to the hypothesis scope or timeline:

Event Timeline

I note that the number of evaluators and the number of edits to be reviewed both changed from the last evaluation. What led you to lower these numbers?

@Trizek-WMF good question! We still ideally want to have 5+ evaluators per language, like we did last time, but we don't want this to be a blocker to moving forward if we have a strong and consistent signal about the model's performance in that language. I believe that the number of samples per evaluator is the same as last time. cc @ppelberg

Thank you, Sucheta. We will keep the 5+ as a goal, but knowing that we can have a 3+ possible end goal will help.

Report

Progress update on the hypothesis for the week, including if something has shipped:

  • Results:
| wiki_db | True | False | Total |
|---|---|---|---|
| ar | 31 | 438 | 469 |
| cs | 1020 | 1249 | 2269 |
| de | 92 | 259 | 351 |
| es | 205 | 146 | 351 |
| fa | 135 | 216 | 351 |
| he | 102 | 148 | 250 |
| id | 106 | 165 | 271 |
| it | 511 | 176 | 687 |
| lv | 12 | 17 | 29 |
| nl | 67 | 295 | 362 |
| no | 35 | 42 | 77 |
| pl | 136 | 215 | 351 |
| ro | 66 | 38 | 104 |
| ru | 120 | 231 | 351 |
| tr | 97 | 254 | 351 |
| zh | 1890 | 1008 | 2898 |

{F66017806}
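A per-wiki count table like the one above can be produced by tallying (wiki, prediction) pairs. The rows below are synthetic stand-ins, not the real dataset:

```python
from collections import Counter

# Synthetic (wiki, prediction) pairs standing in for the real data
rows = [
    ("ar", True), ("ar", False), ("ar", False),
    ("lv", True), ("lv", False),
]

counts = Counter(rows)
print("wiki_db | True | False | Total")
for wiki in sorted({w for w, _ in rows}):
    t, f = counts[(wiki, True)], counts[(wiki, False)]
    print(f"{wiki} | {t} | {f} | {t + f}")
```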

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • The data format is: wiki | revision_id | label | score
  • All data gathered can be found here.

Any emerging blockers or risks:

  • None

Any unresolved dependencies:

  • None

New lessons from the hypothesis:

  • Searching data via regexes in arwiki & nlwiki returned only a small number of data points.
  • We concatenated data from two methods:
    • Selecting diffs and revisions whose edit comments matched regexes about peacock language (better data quality and class balance, but a smaller sample space).
    • Randomly selecting edits and diffs from newcomers, without regex-matching the comments (worse data quality and class imbalance, but a wider sample space).
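The two-source strategy above could be sketched roughly like this. The regex, field names, and sample structure are assumptions for illustration, not the actual pipeline:

```python
import random
import re

# Hypothetical comment pattern for "peacock" discussions (assumed, per-wiki
# patterns would differ in practice)
PEACOCK_COMMENT_RE = re.compile(r"peacock|promotional|puffery", re.IGNORECASE)

def gather(revisions, n_random=2, seed=0):
    """Concatenate regex-matched revisions with a random newcomer sample."""
    # Method 1: edit comments that discuss peacock wording (higher precision)
    matched = [r for r in revisions if PEACOCK_COMMENT_RE.search(r["comment"])]
    # Method 2: random newcomer edits, regardless of comment (wider coverage)
    newcomers = [r for r in revisions if r["is_newcomer"]]
    rng = random.Random(seed)
    sampled = rng.sample(newcomers, min(n_random, len(newcomers)))
    # Concatenate, de-duplicating by revision id
    seen, combined = set(), []
    for r in matched + sampled:
        if r["rev_id"] not in seen:
            seen.add(r["rev_id"])
            combined.append(r)
    return combined
```

The de-duplication matters because a newcomer edit can also match the comment regex and would otherwise appear twice.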

Changes to the hypothesis scope or timeline:

I've created Annotool instances for most of the additional languages in which we want to evaluate the model. Project numbers, translations, and labels for each of these languages can be found in this doc.

| Wiki | Project number | Translations | Labels added | Eval data added |
|---|---|---|---|---|
| Czech (cs) | 18 | X | X | |
| German (de) | 19 | X | X | |
| Persian (fa) | 20 | X | X | |
| Indonesian (id) | 21 | X | X | |
| Italian (it) | 22 | X | X | |
| Latvian (lv) | 23 | X | X | |
| Dutch (nl) | 24 | X | X | |
| Norwegian (no) | 25 | X | X | |
| Polish (pl) | 26 | X | X | |
| Romanian (ro) | 27 | X | X | |
| Russian (ru) | 28 | X | X | |
| Turkish (tr) | 29 | X | X | |
| Chinese (zh) | 30 | X | X | |
| Arabic (ar) | | | | |
| Hebrew (he) | | | | |

I encountered errors when trying to create projects for Arabic and Hebrew. I didn't have time to fully troubleshoot these errors, but I'll try again and add more detail about the errors on Monday.

Update

  • I was facing some issues with Annotool on Toolforge and could not upload datasets.
  • The issue is now fixed: my name was added to Admin_users, so I have permission to upload the data. ✅
  • Data are gathered ✅
  • Annotool projects are created ✅
  • Translation of the prediction labels is done ✅
  • In the meantime, I created good-quality samples (class-balanced wherever feasible), so we can deliver enough positive (peacock) data points for the wikis where that was possible. ✅
  • We can wrap up tomorrow morning; I will update the ticket.

Update

Datasets uploaded for the following wikis:

| Wiki | Project number | Translations | Labels added | Eval data added |
|---|---|---|---|---|
| Czech (cs) | 18 | | | |
| German (de) | 19 | | | |
| Persian (fa) | 20 | | | |
| Indonesian (id) | 21 | | | |
| Italian (it) | 22 | | | |
| Latvian (lv) | 23 | | | |
| Dutch (nl) | 24 | | | |
| Norwegian (no) | 25 | | | |
| Polish (pl) | 26 | | | |
| Romanian (ro) | 27 | | | |
| Russian (ru) | 28 | | | |
| Turkish (tr) | 29 | | | |
| Chinese (zh) | 30 | | | |
| Arabic (ar) | 33 | | | |
| Hebrew (he) | 34 | | | |
  • Good-quality data with class balance (on the wikis where it was feasible)

Here are the average probabilities (scores) per wiki and per prediction. The model's probability (score) for a prediction indicates its level of certainty in that decision. We tried to gather predictions with scores as high as possible (high certainty) for both negative and positive results (peacock/not-peacock). Below are the average probability scores for the data we delivered to the community via Annotool.

| wiki_db | Prediction | Avg. probability |
|---|---|---|
| cs | Nechat tak, jak je (False) | 0.74 |
| cs | Tón by měl být revidován (True) | 0.79 |
| de | Belassen, wie es ist (False) | 0.73 |
| de | Der Ton sollte überarbeitet werden (True) | 0.66 |
| fa | لحن باید اصلاح شود (True) | 0.69 |
| fa | همانطور که هست بگذارید (False) | 0.63 |
| he | השאר כפי שהוא (False) | 0.60 |
| he | יש לשנות את הטון (True) | 0.64 |
| id | Biarkan apa adanya (False) | 0.59 |
| id | Tone-nya perlu direvisi (True) | 0.64 |
| it | Il tono dovrebbe essere rivisto (True) | 0.82 |
| it | Lascia così com'è (False) | 0.69 |
| lv | Atstāt kā ir (False) | 0.66 |
| lv | Tons ir jāpārskata (True) | 0.58 |
| nl | De toon moet worden herzien (True) | 0.60 |
| nl | Laat zoals het is (False) | 0.74 |
| no | La være som det er (False) | 0.63 |
| no | Tonen bør revideres (True) | 0.62 |
| pl | Ton powinien zostać zmieniony (True) | 0.68 |
| pl | Zostaw tak, jak jest (False) | 0.59 |
| ro | Lasă așa cum este (False) | 0.59 |
| ro | Tonul ar trebui revizuit (True) | 0.67 |
| ru | Оставить как есть (False) | 0.70 |
| ru | Тон следует пересмотреть (True) | 0.69 |
| tr | Olduğu gibi bırak (False) | 0.65 |
| tr | Ton revize edilmelidir (True) | 0.64 |
| zh | 保持原样 (False) | 0.70 |
| zh | 语气需要修改 (True) | 0.75 |
| arwiki | يجب مراجعة النبرة (True) | 0.59 |
| arwiki | اتركه كما هو (False) | 0.74 |
| hewiki | יש לשנות את הטון (True) | 0.64 |
| hewiki | השאר כפי שהוא (False) | 0.60 |
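Averages like these can be computed by grouping (wiki, prediction) pairs and averaging the scores. A minimal sketch with synthetic rows (the real data lives in the delivered eval sets):

```python
from collections import defaultdict

# Synthetic (wiki_db, prediction, score) rows for illustration only
rows = [
    ("cs", True, 0.80), ("cs", True, 0.78), ("cs", False, 0.74),
]

sums = defaultdict(lambda: [0.0, 0])
for wiki, pred, score in rows:
    sums[(wiki, pred)][0] += score
    sums[(wiki, pred)][1] += 1

averages = {k: round(total / n, 4) for k, (total, n) in sums.items()}
print(averages)  # {('cs', True): 0.79, ('cs', False): 0.74}
```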

@gkyziridis I'm sharing something we discussed earlier.
There seem to be some data quality issues in our language-expansion Annotool projects. I only checked the first two projects (Hebrew and Arabic), but found different problems in each:

  • For Hebrew, some samples are just adding references like https://… or <ref>..</ref>
  • For Arabic, there are many samples just adding [[category:...]] templates

These samples should have been filtered out during data cleaning, as they're not actual content. It seems the filtering applied may not work for these languages. We should check all the other languages to see whether they have similar issues and how widespread those issues are.
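A content filter for the cases reported above could look roughly like this. The patterns and the `is_content_edit` helper are illustrative assumptions, not the actual cleaning code:

```python
import re

# Hypothetical filters: reject samples whose added text is only a
# reference (URL or <ref> tag) or only a category link.
REF_ONLY_RE = re.compile(r"^\s*(<ref[^>]*>.*?</ref>|https?://\S+)\s*$", re.DOTALL)
CATEGORY_ONLY_RE = re.compile(r"^\s*\[\[\s*category\s*:[^\]]*\]\]\s*$", re.IGNORECASE)

def is_content_edit(added_text: str) -> bool:
    """Reject samples that only add a reference or a category link."""
    if REF_ONLY_RE.match(added_text) or CATEGORY_ONLY_RE.match(added_text):
        return False
    return bool(added_text.strip())
```

Real cleaning would need per-wiki category namespace names (e.g. the localized equivalents of `Category:`) and would likely parse wikitext rather than rely on regexes alone.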

I can confirm the issue with sources exists in Romanian as well.

Update

During ad hoc post-processing of each wiki's dataset, we can remove the problematic data points from the samples.
The issue is that in some cases (e.g., arwiki) we don't have many data points that the model predicts as True (peacock).
I will use some points from the training dataset wherever we lack data samples.
The goal is to provide 150 samples for each wiki, ideally with balanced classes (75 True / 75 False).
I will keep this ticket updated.

Current Status:

  • hewiki: clean dataset obtained
  • arwiki: need to find more samples (use training data if needed)
  • rowiki: pending
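The 150-sample target with training-data backfill could be sketched as follows. The function and data shapes are assumptions for illustration, not the actual pipeline:

```python
import random

# Sketch of the 75/75 goal: fill each class from the eval pool first,
# then backfill from training data when a class runs short.
def build_eval_set(pool, training, per_class=75, seed=0):
    rng = random.Random(seed)
    out = []
    for label in (True, False):
        candidates = [s for s in pool if s["label"] == label]
        if len(candidates) < per_class:
            # Not enough eval samples for this class: borrow from training
            candidates = candidates + [s for s in training if s["label"] == label]
        rng.shuffle(candidates)
        out.extend(candidates[:per_class])
    return out
```

The True/False ratios in the table above show this ideal isn't always reachable (e.g. nowiki at 41/109), in which case the set stays imbalanced.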

Update

I have reviewed and applied ad hoc post-processing to the languages below.
Please click the link on each wiki to retrieve the new post-processed data.
✅ : Done
❌ : Not Yet

| Wiki | Reviewed by me | Reviewed by native speaker | Postprocessed | True/False | New data uploaded to Annotool |
|---|---|---|---|---|---|
| arwiki | | | | 75/75 | |
| cswiki | | | | 75/75 | |
| dewiki | | | | 75/75 | |
| fawiki | | | | 75/75 | |
| hewiki | | | | 65/85 | |
| idwiki | | | | 75/75 | |
| itwiki | | | | 75/75 | |
| lvwiki | | | | 45/105 | |
| nlwiki | | | | 75/75 | |
| nowiki | | | | 41/109 | |
| plwiki | | | | 58/95 | |
| rowiki | | | | 55/95 | |
| ruwiki | | | | 75/75 | |
| trwiki | | | | 75/75 | |
| zhwiki | | | | 75/75 | |

Would it be wise to review the new data before uploading them to Annotool?
That way we can have native speakers confirm the quality and avoid issues like the ones above.
I will update the ticket gradually for each wiki.

@gkyziridis Could you please write an update here about the overall status of this work that I can use in Asana reporting? I have to submit these every Friday. I can write the sections about how many reviews we have per language right now, but I would really appreciate your updates on the investigation and the overall ticket.

@Sucheta-Salgaonkar-WMF, George is OoO this week, let me update the investigation here.

Progress update on the hypothesis for the week, including if something has shipped:

  • We looked into the four languages that volunteers reported issues with (Polish, Dutch, German, and Latvian) to see if we could gather cleaner eval data from the training dataset and the data generation process. We came to realize that this data can't be used in Annotool, because it doesn't guarantee that the tone-issue text appears in the diff (more details in T407155#11311431).

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

  • N/A

Any emerging blockers or risks:

Any unresolved dependencies:

  • N/A

New lessons from the hypothesis:

  • N/A

Changes to the hypothesis scope or timeline:

  • N/A

Progress update on the hypothesis for the week, including if something has shipped:

  • I've collected data samples for Tone Check volunteer evaluation for the target languages: Dutch, Latvian, German, and Polish. For Dutch and German, where it was available, I used Tone Check training/validation data (which ensures data quality on tone issues). For Latvian and Polish, training/validation data were not available, so I queried wmf_history to gather data samples.
  • We decided to skip Annotool and use plain spreadsheets for the evaluation: they give us more flexibility in loading raw text and predictions, and let us easily reuse the training/validation data. Annotool expects a specific data schema based on revision IDs, which made it hard to use the data we had already gathered; those samples are guaranteed to contain tone issues, since they were used for model training/validation.
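The spreadsheet handoff above amounts to writing raw text plus prediction per row. A minimal sketch with assumed column names and a synthetic row:

```python
import csv
import io

# Hypothetical evaluation rows: raw text plus model prediction and score
rows = [
    {"wiki": "nlwiki", "text": "Een geweldige, beroemde schrijver.",
     "prediction": True, "score": 0.71},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["wiki", "text", "prediction", "score"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()  # in practice, write to a file shared with evaluators
```

This sidesteps Annotool's revision-ID schema entirely: evaluators see the raw text and the prediction, nothing more is required.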

Any updates on metrics related to this hypothesis (including baseline, target, or actuals, if applicable):

Any emerging blockers or risks:

  • N/A

Any unresolved dependencies:

  • N/A

New lessons from the hypothesis:

  • The main lesson learnt from this project is that it is wise to first dig into the data we already have in order to plan the evaluation strategy.
  • First use the best-quality data, which is the training/validation data, and only then gather data from wmf_history using querying and filtering methods.
  • We need to upgrade Annotool to allow uploading raw text with predictions, avoiding the strict schema that is based on diffs only: https://phabricator.wikimedia.org/T409866

Changes to the hypothesis scope or timeline:

  • N/A