
[SPIKE] Define process for validating Tone Check model eval data for languages staff members do not speak
Open, Needs Triage, Public

Description

This task involves the work of converging on a process for validating Tone Check model eval data for languages staff members do not speak.

The need for this task emerged in response to the October 2025 Tone Check model evaluation wherein the Editing and ML Teams:

  1. Did not vet the eval data before sharing it with volunteers
  2. Came to learn that the eval data in at least 4 languages (Dutch, Latvian, German, and Polish) was "...heavily contaminated with vandalism, tiny irrelevant edits (e.g., adding categories, images with short captions), and obvious vandalism reverts." | source

Story

As an experienced volunteer interested in helping to vet a new tone detection model in a language I am fluent in, I want the data I am asked to review to be relevant to this aim, so that I do not have to allocate time and attention to work that is not aligned with the task I volunteered for.

Open questions

  1. How will samples be generated?
  2. How will samples be validated?

Eval data vetting process

  1. TBD
  2. TBD
  3. TBD
  4. TBD
  5. TBD

Done

  • Define and document a process for ensuring model evaluation data is high enough quality for volunteers to review

Event Timeline

Update

Open questions

How will samples be generated?
The data that exists on Annotool were gathered using two methods:

  1. Query revisions from wmf.mediawiki_history on a specific snapshot and date window, search the event comment with a regex capturing tone-check-relevant comments (e.g. 'nowiki': 'egenreklame|npov|objektivitet|reklame|WP:Nøytralt_ståsted|WP:Selvbiografi|WP:Biografier_om_levende_personer'), and select those samples on the assumption that the comments make them relevant. A query sketch follows this list.

This method filters out a lot of data and does not return enough data for specific wikis such as: lvwiki, nlwiki, arwiki, dewiki, plwiki, nowiki

  2. For the languages where the above logic could not retrieve many samples, we queried data from a much broader space without restricting to tone-check-related event comments. We searched the new edits and revisions without filters in order to gather more samples. This gave us more data samples, but we could not assure that they are tone-check related. It is always a tradeoff between quantity and quality.
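A minimal sketch of the first method, assuming a PySpark session on the analytics cluster; the snapshot, wiki, date window, and regex below are illustrative placeholders rather than the exact values used for the eval data:

```
# Sketch of the comment-regex sampling method (method 1 above).
# Snapshot, wiki_db, date window, and the regex are placeholder values.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

tone_comment_regex = (
    "egenreklame|npov|objektivitet|reklame|"
    "WP:Nøytralt_ståsted|WP:Selvbiografi|WP:Biografier_om_levende_personer"
)

samples = spark.sql(f"""
    SELECT wiki_db, revision_id, event_timestamp, event_comment
    FROM wmf.mediawiki_history
    WHERE snapshot = '2025-09'
      AND wiki_db = 'nowiki'
      AND event_entity = 'revision'
      AND event_type = 'create'
      AND event_timestamp BETWEEN '2025-01-01' AND '2025-09-30'
      AND event_comment RLIKE '(?i)({tone_comment_regex})'
""")
samples.show(10, truncate=False)
```

The second method is essentially the same query with the event_comment filter dropped, which is why it returns far more rows but with no guarantee of tone relevance.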

How will samples be validated?
Validation of the text quality (how tone-check/peacock related it is) is a manual process of translating and reviewing the gathered samples one by one. A native speaker can judge the quality of the text diff and whether it contains peacock wording that corresponds with the model outcome (prediction and probability); otherwise we can use translation.
Validation with respect to data cleaning is handled by post-processing steps during sampling.
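A sketch of what such post-processing could look like; the specific rules and thresholds below (minimum diff size, revert-comment filter, category-only edits) are assumptions based on the contamination described in this task, not the actual cleaning code:

```
import re

# Hypothetical cleaning rules applied after sampling; thresholds and patterns
# are assumptions for illustration, not the post-processing actually used.
MIN_DIFF_CHARS = 50
REVERT_RE = re.compile(r"(?i)\b(revert(ed)?|rv|undid|undo)\b")
CATEGORY_ONLY_RE = re.compile(r"^\s*\[\[(Category|Kategorie|Kategoria|Categorie)[^\]]*\]\]\s*$")

def keep_sample(diff_text: str, comment: str) -> bool:
    """Return True if a sampled diff looks reviewable for tone issues."""
    if len(diff_text.strip()) < MIN_DIFF_CHARS:      # tiny, irrelevant edits
        return False
    if REVERT_RE.search(comment or ""):              # vandalism reverts
        return False
    if CATEGORY_ONLY_RE.match(diff_text.strip()):    # category-only edits
        return False
    return True

# Example rows; the real input shape (diff_text / event_comment) is an assumption.
raw_samples = [
    {"diff_text": "[[Category:Banks of Latvia]]", "event_comment": "added category"},
    {"diff_text": "X is widely regarded as one of the best and most innovative banks in the region.",
     "event_comment": ""},
]
cleaned = [s for s in raw_samples if keep_sample(s["diff_text"], s["event_comment"])]
```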

Possible Solutions
  • We can use samples from the model's training/evaluation datasets to ensure that the data samples we'll provide on Annotool are strongly "tone-check/peacock" related. This addresses the concern about the quality of the text and the edits. Using training data we are also flexible in delivering high-probability predictions and balanced data samples (75/75 - True/False).
  • We have training data only for the following wikis: arwiki, dewiki, eswiki, enwiki, frwiki, jawiki, nlwiki, ptwiki, ruwiki, zhwiki (only some of them are included in the priority list by the Editing team here: https://phabricator.wikimedia.org/T394448).
  • For the languages for which we do not have training data, we plan to tweak our data_generation_pipeline and make some adjustments in order to generate tone-check training data for the languages we need (currently the generation pipeline works only for English).

I have already created these datasets from the training data; they are ready for pre-evaluation before loading them on Annotool (they include the text for review; when we load them into Annotool we will change their schema).
Please click the links to see the data samples (these specific data samples are also posted in https://phabricator.wikimedia.org/T400423#11213744):

We can use samples from the model's training/evaluation datasets to ensure that the data samples we'll provide on Annotool are strongly "tone-check/peacock" related.

This might not be true. In the model’s training data, we use page-level templates (e.g. {{Peacock}}) and capture revisions where these templates are added or removed. However, this doesn't mean those revisions' diff contains "tone-check/peacock" related text.

This isn't a problem for training, since we feed the revision's content into the model. But for Annotool, only the diff (the difference between this revision and the previous revision) is shown.

For example, here's a data sample I collected for lvwiki: https://lv.wikipedia.org/w/index.php?title=Rietumu_banka&diff=2578942

The diff only added the {{POV}} template—this is what will be shown in Annotool.
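To sanity-check what volunteers would actually see, one can fetch the diff itself rather than the full revision text. A sketch using the standard MediaWiki compare API (how Annotool renders diffs is an assumption here; note that torelative=prev shows the comparison in reverse direction, which does not matter for checking whether the tone-related text appears in the diff at all):

```
import requests

def fetch_diff_html(wiki_host: str, rev_id: int) -> str:
    """Fetch the HTML diff between a revision and its previous revision."""
    resp = requests.get(
        f"https://{wiki_host}/w/api.php",
        params={
            "action": "compare",
            "fromrev": rev_id,
            "torelative": "prev",   # compare against the previous revision
            "format": "json",
            "formatversion": 2,
        },
        headers={"User-Agent": "tone-check-eval-vetting (contact: your-wmf-email)"},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()["compare"]
    # HTML diff is under "body" with formatversion=2 ("*" in the legacy format)
    return data.get("body") or data.get("*", "")

# Example: the lvwiki revision mentioned above, whose diff only adds {{POV}}
diff_html = fetch_diff_html("lv.wikipedia.org", 2578942)
print("POV" in diff_html)
```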

@gkyziridis In the dataset you uploaded, the "clean_text" column is the revision's content, which is fed into the model for training, not the same as the diff shown in Annotool.

[...]
I have already created these datasets from the training data; they are ready for pre-evaluation before loading them on Annotool (they include the text for review; when we load them into Annotool we will change their schema).
Please click the links to see the data samples (these specific data samples are also posted in https://phabricator.wikimedia.org/T400423#11213744):

I checked for dewiki and that still contains lots of English text. What was the comment regex used for dewiki? I'm not an active editor, but I'm sure we can ask WMDE for quick suggestions for more dewiki-specific comment-terms indicating a tone-issue.

I checked for dewiki and that still contains lots of English text. What was the comment regex used for dewiki? I'm not an active editor, but I'm sure we can ask WMDE for quick suggestions for more dewiki-specific comment-terms indicating a tone-issue.

This does happen in dewiki indeed; I spotted that point as well when digging into the training data.
The above list of wiki samples was parsed directly from the training data, so we did not use any regexes or preprocessing steps, since these are the data that were fed into the model in order to train it to capture peacock language. Since these data are used in the training process, we assume they contain high-quality signals for training a model to capture peacock tone.
The problem that arises right now is that, on the one hand, these data contain clear signals for peacock language (since they were selected as training data, we can get high-probability predictions), while on the other hand the revisions' diffs for these samples may not contain "tone-check/peacock" related text, which is what we want for Annotool.
As @AikoChou says:

We can use samples from the model's training/evaluation datasets to ensure that the data samples we'll provide on Annotool are strongly "tone-check/peacock" related.

This might not be true. In the model’s training data, we use page-level templates (e.g. {{Peacock}}) and capture revisions where these templates are added or removed. However, this doesn't mean those revisions' diff contains "tone-check/peacock" related text.
This isn't a problem for training, since we feed the revision's content into the model. But for Annotool, only the diff (the difference between this revision and the previous revision) is shown.

I think we need to connect the following dots and bridge the gap that arises:

  1. Querying/parsing data from "wmf.mediawiki_history" by searching for peacock-related revision_ids via their comments does not return enough data, and the model probabilities are close to 0.6 (not high certainty for the prediction).
  2. Querying/parsing data from "wmf.mediawiki_history" without searching the revisions' comments (broader space, unfiltered) returns a lot of data, but it is more random, not particularly peacock related, and noisy.
  3. Digging into the training/evaluation data we can get high-quality peacock signals in the text and high probability scores on predictions, so the text quality is high, BUT that does not mean their revision_ids' diffs include peacock-related text.

Based on all the above, @AikoChou do you believe that, if we tweak the data_generation_pipeline to generate data for all the languages we need, we could then assure the following prerequisites for evaluation (a filtering/balancing sketch follows the list)?:

  • High-quality peacock-related text with clear peacock signals
  • High probabilities, i.e. high confidence in the predictions
  • Enough data: 150 samples, ideally balanced 75/75 between the binary classes
  • The revision_ids' diffs of those samples include peacock-related text for Annotool
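A sketch of how the probability and balance prerequisites could be enforced once predictions exist, assuming a pandas DataFrame whose column names (pred_label, pred_score) follow the Annotool upload fields mentioned in this discussion; the 0.8 threshold is an assumption:

```
import pandas as pd

def select_eval_set(df: pd.DataFrame, min_score: float = 0.8,
                    per_class: int = 75, seed: int = 42) -> pd.DataFrame:
    """Keep confident predictions and draw a class-balanced evaluation sample.

    df is assumed to hold one candidate sample per row with columns
    rev_id, wiki, pred_label (True/False) and pred_score (model probability).
    """
    confident = df[df["pred_score"] >= min_score]
    parts = []
    for _, group in confident.groupby("pred_label"):
        # If a class has fewer than per_class confident samples, take what exists.
        parts.append(group.sample(n=min(per_class, len(group)), random_state=seed))
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)
```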

I checked for dewiki and that still contains lots of English text. What was the comment regex used for dewiki? I'm not an active editor, but I'm sure we can ask WMDE for quick suggestions for more dewiki-specific comment-terms indicating a tone-issue.

I looked into the training process the research team conducted for the current tone-check model. Indeed, we didn't use any dewiki-specific page templates because none exist (see below, from this notebook). The model only used enwiki signals for dewiki, which explains why we see so much English text in the dewiki training dataset.

'de': ([],
 {'advert': None,
  'autobiography': None,
  'fanpov': None,
  'peacock': None,
  'weasel': None}),

For the eval dataset we collected previously in T400423, we used the jargon terms and policy pages listed for dewiki in T389445:

['WP:Neutraler_Standpunkt', 'NPOV', 'POV', 'Werbung', 'Spam', 'neutraler']

Our process for collecting eval data differs from the training/eval data collection process the research team used when creating the tone check model. The main difference: the research team uses page-level templates as signals. To evaluate the model at a more granular level on paragraphs, we use different signals to locate specific paragraphs with tone issues. Our signals are comments on reverted edits mentioning "peacock" or related terms indicating a tone issue.

Based on all the above, @AikoChou do you believe that, if we tweak the data_generation_pipeline to generate data for all the languages we need, we could then assure the following prerequisites for evaluation?

No, that data_generation_pipeline reproduced the training data generation process the research team used, so these revisions don't guarantee tone issue text in the diff.

The only way to ensure the diff includes tone issue text for Annotool is using the eval collection process (notebook) you used previously. This process uses comment terms indicating a tone issue and contains code to extract the diff for the revisions.

You can try tweaking the filters in the notebook, such as loosening the diff size conditions, expanding the revert time periods, or asking the community for more signals if possible. Alternatively, gather as many revisions as possible (broader space, unfiltered), use the same code in the notebook to extract diffs, run the model, and select high-probability predictions. Also, based on learnings from T401968, filtering by article topics (e.g. Biography) might be a good idea.
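On the topic-filtering idea, one low-tech proxy would be to check each article's categories via the MediaWiki API and keep only pages whose categories match biography-related keywords. A sketch follows; the per-wiki keyword lists are rough assumptions, not vetted terms, and would need community input:

```
import requests

# Keyword fragments per wiki used to guess "biography-like" categories.
# These are illustrative assumptions, not vetted terms.
BIO_CATEGORY_HINTS = {
    "de.wikipedia.org": ["Geboren", "Mann", "Frau", "Person"],
    "pl.wikipedia.org": ["Urodzeni", "Biografie"],
    "lv.wikipedia.org": ["dzimuš", "Personas"],
    "nl.wikipedia.org": ["Persoon", "Geboren"],
}

def looks_like_biography(wiki_host: str, page_title: str) -> bool:
    """Rough heuristic: does any category of the page match a biography hint?"""
    resp = requests.get(
        f"https://{wiki_host}/w/api.php",
        params={
            "action": "query",
            "prop": "categories",
            "titles": page_title,
            "cllimit": "max",
            "format": "json",
            "formatversion": 2,
        },
        headers={"User-Agent": "tone-check-eval-vetting (contact: your-wmf-email)"},
        timeout=30,
    )
    resp.raise_for_status()
    pages = resp.json().get("query", {}).get("pages", [])
    categories = [c["title"] for p in pages for c in p.get("categories", [])]
    hints = BIO_CATEGORY_HINTS.get(wiki_host, [])
    return any(hint.lower() in cat.lower() for cat in categories for hint in hints)
```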

Update

You can try tweaking the filters in the notebook, such as loosening the diff size conditions, expanding the revert time periods, or asking the community for more signals if possible.

I have been experimenting with all of the above, tweaking the filters, dates, diff sizes, etc. But still, I cannot find enough data for the evaluation.

Alternatively, gather as many revisions as possible (broader space, unfiltered), use the same code in the notebook to extract diffs, run the model, and select high-probability predictions.

I already used that approach of gathering data from a broader space without filtering, but in the end the data do not include tone issues and are noisy as well. The data gathered that way can be found in this table: https://phabricator.wikimedia.org/T400423#11213744 .

So, the current status is the following:

  1. Searching wmf.mediawiki_history for diffs whose comments include any discussion related to "tone issues" does not return enough data for all the languages; for some of them this method returns only about 30 data points. This method finds data with good enough tone-issue signals, but not enough samples.
  2. Tweaking filters did not help much: even with a really wide diff-size condition (>=10 & <=100000 bytes), a bigger date window, etc., the results did not change significantly.
  3. Searching wmf.mediawiki_history freely, without checking the comments, returns a higher number of samples, but they do not have clear tone signals for evaluation.
  4. Using the training/validation data that we used for training and tuning the model is a good idea, since we are fairly sure those data include tone-issue signals. The problem in this case is that for Annotool we need to upload rev_id, wiki, pred_label and pred_score, not just the raw text and prediction. So the problem that arises here is that the corresponding diffs/rev_ids of the training data do not include the tone issue. This is also mentioned by @AikoChou:

No, the data_generation_pipeline reproduced the training data generation process the research team used, so these revisions don't guarantee tone issue text in the diff. The only way to ensure the diff includes tone issue text for Annotool is using the eval collection process (notebook) you used previously. This process uses comment terms indicating a tone issue and contains code to extract the diff for the revisions.

If we avoid using Annotool and just use a spreadsheet for the evaluation, we can easily use the training data: we avoid the diffs and rev_ids and use just the raw input text, wikiname, pred_label and pred_score, so we do not need to retrieve or analyse wiki diffs.
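If the spreadsheet route is taken, the export itself is straightforward; a sketch assuming a pandas DataFrame of training-data samples with model predictions (the column names "text" and "wikiname" are assumptions matching the fields listed above):

```
import pandas as pd

# Columns needed for a spreadsheet-based review; no rev_id or diff required.
columns_for_review = ["text", "wikiname", "pred_label", "pred_score"]

def export_for_spreadsheet(df: pd.DataFrame, path: str) -> None:
    """Write only the review-relevant columns, sorted by model confidence."""
    (df[columns_for_review]
       .sort_values("pred_score", ascending=False)
       .to_csv(path, index=False))

# Toy example rows standing in for the real training-data samples.
df = pd.DataFrame([
    {"text": "Das beste und innovativste Unternehmen der Region ...",
     "wikiname": "dewiki", "pred_label": True, "pred_score": 0.93},
    {"text": "Die Bank wurde 1992 gegründet.",
     "wikiname": "dewiki", "pred_label": False, "pred_score": 0.88},
])
export_for_spreadsheet(df, "dewiki_tone_check_eval.csv")
```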

As stated in this comment: https://phabricator.wikimedia.org/T407155#11277135, we have training data only for the following wikis: arwiki, dewiki, eswiki, enwiki, frwiki, jawiki, nlwiki, ptwiki, ruwiki, zhwiki (only some of them are included in the priority list by the Editing team here: https://phabricator.wikimedia.org/T394448).

Update

Target languages: Dutch, Latvian, German, and Polish
Target Wikis: nlwiki, lvwiki, dewiki, plwiki
Training/validation data available: arwiki, dewiki, eswiki, enwiki, frwiki, jawiki, nlwiki, ptwiki, ruwiki, zhwiki
Training data and Target languages intersection: nlwiki, dewiki.
We have training/validation data for Dutch and German so I have already created the corresponding two spreadsheets (please click the links): nlwiki and dewiki.

For dewiki we had spotted an issue, described in T407155#11311194, regarding the many English samples used for training the model on dewiki. To overcome this, I used translation only where English samples exist inside the dewiki dataset.
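A sketch of how the English rows inside the dewiki dataset could be flagged for translation, assuming the langdetect package; the "clean_text" column name follows the earlier comment, and the actual translation step is left out:

```
import pandas as pd
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make language detection deterministic

def needs_translation(text: str, expected_lang: str = "de") -> bool:
    """Flag rows whose detected language differs from the wiki's language."""
    try:
        return detect(text) != expected_lang
    except LangDetectException:
        return False  # empty or undetectable text: leave it alone

# Toy rows standing in for the dewiki training-data samples.
df = pd.DataFrame({
    "clean_text": [
        "Die Bank wurde 1992 gegründet und gilt als solide.",
        "The company is widely regarded as the most innovative in its field.",
    ]
})
df["needs_translation"] = df["clean_text"].map(needs_translation)
print(df)
```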

Regarding the remaining languages, namely Latvian and Polish, it is clear that we do not have training/validation data for the tone-check model, which means we need to use evaluation samples from other sources (not from the training/validation data). This means these two languages run into the issue that the data quality will not be the same as for the other languages, since we have no training data available for them.

I am digging again into the data in order to overcome this issue.
Here are some data samples for these two languages (keep in mind that these are NOT training/validation data, so we cannot guarantee their quality).
These two data samples were gathered in two ways: first, searching comments to find something related to "tone-issue", "NPOV", etc., so that the revision ids will be tone-check related; second, gathering unfiltered data from a broader space (e.g. new edits). Those samples are fed into the model in order to retrieve the predictions. After that, we filter on high probability scores (wherever we could) in order to deliver samples with high model certainty. So, in the end, we combine the two logics for gathering the data, since we do not have training/validation data available for these two languages.

| language | wiki | spreadsheet | Use training/validation data |
| -------- | ---- | ----------- | ---------------------------- |
| Dutch | nlwiki | | yes |
| German | dewiki | | yes |
| Latvian | lvwiki | | no |
| Polish | plwiki | | no |