
Formalize fallback rules for automatically determining ORES threshold levels
Open, Needs Triage, Public

Description

We have some standards for how we define ORES thresholds, assuming the models perform ideally. When models aren't so good, which might be the case for new languages or could happen due to a regression in an existing model, we need a strategy for how to degrade the threshold levels used on-wiki. @Catrope currently uses https://jsfiddle.net/catrope/50n1ekgu/ to generate https://docs.google.com/spreadsheets/d/1c8NMubBO0AS5KOFt5og0gz3dJK0IjEB3FwYOWEOXQ2g/edit?usp=sharing, then applies savage eyeball metrics to decide whether the models are satisfactory, and how to fudge the levels if they are not. For example, we might only have one level, "likely damaging", and no others, since the granularity isn't good enough to warrant extra levels.

It would be nice to formalize these rules so that we can one day automate the process or at least pass it on to additional humans.

Event Timeline

The defaults are as follows, but I eyeball the results and adjust where needed. For example, if the 90% precision level has a low recall (lower than 8% or so), I'll choose the 80% or 75% precision level instead. One formalism I do use: "maybe bad" is 15% precision or 90% recall, whichever of those two produces a narrower threshold range.

Defaults:

  • maybebad: 15% precision
  • likelybad: 60% precision
  • verylikelybad: 90% precision
  • likelygood: 99.5% precision
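
Here's a minimal sketch of those rules as code, to make the fallback logic concrete. The data shape and function names are hypothetical (not part of any existing tool), and it assumes each candidate level comes with the precision, recall, and score cut-off that ORES reports for a "maximum recall @ precision >= X" query. It also interprets "narrower threshold range" for "maybe bad" as the option with the higher score cut-off, i.e. the one that flags fewer edits.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    precision_target: float  # precision the threshold was optimized for
    precision: float         # precision actually achieved
    recall: float            # recall at that threshold
    threshold: float         # model score cut-off

MIN_RECALL = 0.08  # below ~8% recall, fall back to a lower precision target


def pick_verylikelybad(candidates: dict[float, Candidate]) -> Optional[Candidate]:
    """Try the 90% precision level first; fall back to 80%, then 75%,
    if the recall at that level is too low."""
    for target in (0.90, 0.80, 0.75):
        c = candidates.get(target)
        if c is not None and c.recall >= MIN_RECALL:
            return c
    return None  # model too weak: drop this level entirely


def pick_maybebad(at_15_precision: Candidate, at_90_recall: Candidate) -> Candidate:
    """'maybe bad' is 15% precision or 90% recall, whichever produces the
    narrower threshold range (here: the higher score cut-off)."""
    return max(at_15_precision, at_90_recall, key=lambda c: c.threshold)
```

The other levels (likelybad, likelygood) would follow the same pattern as pick_verylikelybad, each with its own precision target and fallback chain; formalizing exactly which fallbacks are acceptable per level is the open question in this task.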

The jsfiddle broke because it made lots of requests in parallel and triggered 429s (rate limiting). Here's a new one that does work: https://jsfiddle.net/catrope/7hfg3drv/
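
For reference, the same data-gathering step can be done sequentially with a pause between requests, which is how the new fiddle avoids the 429s. The sketch below is a rough Python equivalent, not the fiddle itself; the exact model_info query syntax is an assumption, so check the ORES API docs for the current form.

```python
import time
import requests

ORES = "https://ores.wikimedia.org/v3/scores"
# Precision targets for maybebad, likelybad, verylikelybad, likelygood
TARGETS = ["0.15", "0.6", "0.9", "0.995"]


def fetch_thresholds(wiki: str, model: str = "damaging") -> list[dict]:
    """Fetch threshold statistics one request at a time to stay under the rate limit."""
    results = []
    for target in TARGETS:
        # Assumed query form; verify against the ORES model_info documentation.
        query = f'statistics.thresholds.true."maximum recall @ precision >= {target}"'
        resp = requests.get(f"{ORES}/{wiki}", params={"models": model, "model_info": query})
        resp.raise_for_status()
        results.append(resp.json())
        time.sleep(1)  # pause between requests instead of firing them in parallel
    return results
```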