This task aims to develop a methodology for analyzing peacock language detection models in languages without enough evaluation data, as identified in T388215.
Description
Details
- Due Date: Sep 29 2025, 11:00 PM
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | isarantopoulos | T400423 Q1 FY2025-26 Goal: Enable volunteer evaluation of Tone Check model in additional languages |
| Resolved | | diego | T398930 Score probability evaluation for languages without enough data |
Event Timeline
@achou thank you for creating this task! What support do you need from Research here? I think we could help expand the set of templates you are considering as "positive" peacock examples.
Hi Miriam :) @diego has been working on a notebook for this. I created this task so we can link the finalized notebook here and have a place to summarize and decide the next steps.
We have developed a method to analyze languages without enough evaluation data. A detailed explanation can be found in this Jupyter Notebook.
In summary:
Use this methodology to:
- Given a target language, estimate how likely it is to behave similarly to enwiki.
- Learn what the most common result in the target language is and compare it with known languages.
- Inform decisions about score thresholds.
- Decide to discard a language due to unexpected or unusual behavior.
Don't use this methodology to:
- Release a product in the target language without human evaluation.
- Replace human evaluation.
- Draw strong conclusions about model accuracy in the target languages.
Key Steps:
- For each language, we compute the peacock model scores for a random sample of revisions.
- We compute the probability distribution of the scores per language.
- We plot the distributions, check whether they are normal (expected), and look for outliers (see the first sketch after this list).
  - What to analyze:
    - If a distribution is not normal, that might imply unexpected model behavior in the target language.
    - If there are outliers (a specific score that occurs more often than expected), certain content (words, sentences) may be misinterpreted by the model in that language.
  - Actions to take:
    - If you find any of the problems mentioned above, consider a deeper investigation into that language. It is likely that more training data needs to be gathered for that language.
- We compare those distributions with *enwiki* (the language with the most evaluation data), obtaining the following information:
  - What to analyze:
    - Jensen–Shannon divergence: this number tells us how similar the behavior in the target language is to English. A Jensen–Shannon divergence above 0.10 might mean the performance (F1 score) of the model in the target language is low.
    - If the distribution is normal, we compare the distribution centers. This indicates how similar the score thresholds used in each language should be in order to obtain a similar ratio of positive cases.
  - Actions to take:
    - Jensen–Shannon divergence > 0.10 → this might indicate that the language is difficult for the model and/or that more training data is required.
    - Difference between distribution centers > 0.05 → this might mean that (i) the model is over- or under-estimating peacock language on the targeted wiki_db, or (ii) the targeted language has significantly more or less peacock language than English. In either case, the score threshold should be adapted for that language (see the second sketch after this list).
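The first sketch below illustrates the per-language normality and outlier check. This is not the notebook's code, just a minimal illustration: it assumes model scores are floats in [0, 1] computed for a random sample of revisions from one wiki, and the function name `analyze_language` is hypothetical.

```python
# Minimal sketch (not the notebook's code) of the normality / outlier check.
# Assumption: `scores` is an array of peacock model scores in [0, 1].
import numpy as np
from scipy import stats

def analyze_language(scores, n_bins=50, z_cutoff=3.0):
    """Test normality of a score distribution and flag histogram bins
    that are far more frequent than expected (potential outliers)."""
    scores = np.asarray(scores, dtype=float)

    # D'Agostino-Pearson normality test: a low p-value means the
    # distribution is not normal, which may indicate unexpected model
    # behavior in this language.
    _, p_value = stats.normaltest(scores)

    # Histogram the scores and flag bins whose counts sit several
    # standard deviations above the mean bin count: these are specific
    # scores that occur more often than expected.
    counts, edges = np.histogram(scores, bins=n_bins, range=(0.0, 1.0))
    z = (counts - counts.mean()) / counts.std()
    outliers = [(edges[i], edges[i + 1]) for i in np.where(z > z_cutoff)[0]]

    return {"looks_normal": p_value >= 0.05,
            "p_value": p_value,
            "outlier_score_ranges": outliers}
```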
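The second sketch covers the enwiki comparison step, under the same assumptions. One caveat: scipy's `jensenshannon` returns the Jensen–Shannon *distance* (the square root of the divergence), so it is squared here to be comparable with the 0.10 divergence threshold above; if the notebook applies the threshold to the distance instead, drop the squaring.

```python
# Minimal sketch (not the notebook's code) of the enwiki comparison.
import numpy as np
from scipy.spatial.distance import jensenshannon

def compare_to_enwiki(target_scores, enwiki_scores, n_bins=50,
                      jsd_threshold=0.10, center_threshold=0.05):
    """Compare a target wiki's score distribution against enwiki's."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    p, _ = np.histogram(target_scores, bins=bins)
    q, _ = np.histogram(enwiki_scores, bins=bins)

    # scipy returns the JS distance; square it to get the divergence.
    # (jensenshannon normalizes the histograms internally.)
    js_divergence = jensenshannon(p, q) ** 2

    # Compare distribution centers: a large shift suggests the score
    # threshold should be adapted for the target language.
    center_diff = float(np.mean(target_scores) - np.mean(enwiki_scores))

    return {
        "js_divergence": js_divergence,
        "hard_language_or_needs_data": js_divergence > jsd_threshold,
        "center_difference": center_diff,
        "adapt_threshold": abs(center_diff) > center_threshold,
    }
```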
