
Score probability evaluation for languages without enough data
Closed, Resolved · Public

Description

This task aims to develop a methodology for analyzing peacock language detection models in languages without enough evaluation data, as identified in T388215.

Details

Due Date
Sep 29 2025, 11:00 PM

Event Timeline

@achou thank you for creating this task! What support do you need from Research here? I think we can help expand the set of templates you are considering as "positive" peacock examples?

Hi Miriam :) @diego has been working on a notebook for this. I created this task so we can link the finalized notebook here and have a place to summarize and decide the next steps.

Miriam set Due Date to Sep 29 2025, 11:00 PM.
Miriam moved this task from Backlog to In Progress on the Research board.
Miriam added a subscriber: AikoChou.

Thank you @AikoChou !

We have developed a method to analyze languages without enough evaluation data. A detailed explanation can be found in this Jupyter Notebook.
In summary:

Use this methodology for:

  • Given a target language, estimate how likely it is to behave similarly to enwiki.
  • Learn which result is most common in the target language and compare it with known languages.
  • Inform decisions about score thresholds.
  • Decide to discard a language due to unexpected or unusual behavior.

Don't use this methodology for:

  • Releasing a product in the target language without human evaluation.
  • Replacing human evaluation.
  • Drawing strong conclusions about model accuracy in the target languages.

Key Steps:

  1. For each language, we compute the peacock model scores for a random sample of revisions.
  2. We compute the score's probability distribution per language.
  3. We plot the distributions, check whether they are normal (as expected), and look for outliers (see the first sketch after this list).
      • What to analyze:
        • If a distribution is not normal, that might imply unexpected behavior in the target language.
        • Outliers (a specific score that is more common than expected) could mean that certain content (words, sentences) is misinterpreted by the model in that language.
      • Actions to take:
        • If you find any of the problems mentioned above, consider a deeper investigation of that language. It is likely that significantly more training data needs to be gathered for it.
  4. We compare those distributions with *enwiki* (the language with the most evaluation data), obtaining the following information (see the second sketch after this list):
      • What to analyze:
        • Jensen–Shannon divergence: this number tells us how similar the behavior in the target language is to English. A Jensen–Shannon divergence above 0.10 might mean the performance (F1 score) of the model in the target language is low.
        • If the distribution is normal, we compare the distribution centers. This tells us how similar the thresholds used in each language should be in order to obtain a similar ratio of positive cases.
      • Actions to take:
        • Jensen–Shannon divergence > 0.10 → this might indicate that the language is difficult for the model, and/or that more training data is required.
        • Difference between distribution centers > 0.05 → this might mean that (i) the model is over- or under-estimating peacock language on the targeted wiki_db, or (ii) the targeted language has significantly more or less peacock behavior than English. Hence, the thresholds should be adapted for that language.
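
Here is a minimal Python sketch of steps 1–3. It assumes the scores from step 1 have already been computed and collected into `scores_by_lang`, a hypothetical dict mapping a wiki_db name to an array of model scores in [0, 1]; the Shapiro–Wilk normality test and the histogram-spike heuristic are illustrative choices, and the authoritative implementation is the linked notebook.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def check_normality(scores, alpha=0.05):
    """Step 3a: Shapiro-Wilk test. A p-value below `alpha` suggests the
    score distribution is not normal, which might imply unexpected
    behavior in the target language."""
    _, p_value = stats.shapiro(scores)
    return p_value >= alpha, p_value

def find_score_spikes(scores, bins=50, z_threshold=3.0):
    """Step 3b: flag histogram bins that are far more frequent than the
    rest of the distribution (a specific score that is more common than
    expected), which could mean certain content is misinterpreted."""
    hist, edges = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    if hist.std() == 0:
        return []
    z = (hist - hist.mean()) / hist.std()
    return [(edges[i], edges[i + 1]) for i in np.where(z > z_threshold)[0]]

def plot_distributions(scores_by_lang, bins=50):
    """Step 3: overlay the per-language score distributions."""
    for lang, scores in scores_by_lang.items():
        plt.hist(scores, bins=bins, range=(0.0, 1.0), density=True,
                 alpha=0.4, label=lang)
    plt.xlabel("peacock model score")
    plt.ylabel("density")
    plt.legend()
    plt.show()
```

A failed normality check or a non-empty spike list for a language is the cue for the deeper investigation described under "Actions to take" above.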
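And a sketch of the step 4 comparison against enwiki, using SciPy's `jensenshannon`. Two assumptions to verify against the notebook: SciPy returns the Jensen–Shannon *distance* (the square root of the divergence), so the snippet squares it before applying the 0.10 threshold, and the "distribution center" is taken here to be the mean.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

JSD_THRESHOLD = 0.10      # from "Actions to take" above
CENTER_THRESHOLD = 0.05

def compare_to_enwiki(target_scores, enwiki_scores, bins=50):
    """Compare a target language's score distribution with enwiki's."""
    rng = (0.0, 1.0)
    p, _ = np.histogram(target_scores, bins=bins, range=rng)
    q, _ = np.histogram(enwiki_scores, bins=bins, range=rng)
    # jensenshannon() normalizes the inputs and returns the JS distance;
    # squaring gives the divergence (check which one the notebook uses).
    jsd = jensenshannon(p, q, base=2) ** 2
    center_diff = abs(np.mean(target_scores) - np.mean(enwiki_scores))

    flags = []
    if jsd > JSD_THRESHOLD:
        flags.append("language may be hard for the model; "
                     "consider gathering more training data")
    if center_diff > CENTER_THRESHOLD:
        flags.append("consider adapting the score threshold for this wiki_db")
    return {"js_divergence": float(jsd),
            "center_diff": float(center_diff),
            "flags": flags}
```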

image.png (584×784 px, 41 KB)