
Fine-tune and finalize ORES score ranges for the Quality and Intent filters
Closed, Resolved · Public

Description

In this task, we will fine-tune and finalize the score ranges that define the 7 ORES filters developed in T146333. In creating the 7 standardized filtering options for the ORES Quality and Intent filters, we strove to balance users' desire for accuracy against breadth of coverage. To balance these factors, we used the numerical tables @Halfak created in T146280, which correlate ranges of ORES damaging and good-faith scores with their predicted precision and coverage stats. (This spreadsheet provides a fuller-featured display of the same data.)

The numerical tables provide only a coarse-grained view of the effects of setting levels at different scores, since they progress in 10% increments. That means we may want to fine-tune some of these settings. In particular, for the "Very likely have problems" setting, the jump in the figures table from the 85% score to the 95% score involves a leap in precision from 56% to 100%, so every score point of difference in this range has a very large effect!
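The same precision and recall figures can be recomputed at any threshold from a labeled test set, which is how a finer-grained sweep would be produced. Below is a minimal sketch of that calculation; the sample data, the function name, and the assumption that a filter simply thresholds the damaging probability are illustrative, not taken from the actual ORES evaluation code.

```python
# Minimal sketch: precision and recall for the "damaging" model at an arbitrary
# threshold, computed from a (hypothetical) labeled sample of scored edits.

# (score, is_damaging) pairs from an illustrative labeled sample
scored_edits = [(0.97, True), (0.93, True), (0.91, False), (0.88, True),
                (0.86, False), (0.40, False), (0.12, False), (0.05, False)]

def precision_recall_at(scored_edits, threshold):
    """Treat every edit scoring at or above `threshold` as predicted damaging."""
    tp = sum(1 for s, damaging in scored_edits if s >= threshold and damaging)
    fp = sum(1 for s, damaging in scored_edits if s >= threshold and not damaging)
    fn = sum(1 for s, damaging in scored_edits if s < threshold and damaging)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Sweep the 0.85-0.95 band in 1-point steps, where the published tables jump
# from 56% to 100% precision.
for i in range(11):
    t = 0.85 + i / 100
    p, r = precision_recall_at(scored_edits, t)
    print(f"threshold {t:.2f}: precision {p:.1%}, recall {r:.1%}")
```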

Below, find the following for each of the 7 ORES filters: in square brackets, the approximate score ranges we've settled on; in parentheses, the precision and coverage stats a given range would produce; and in quotes, the standard filter descriptions. Following each set of Definitions, the Discussion sections sketch in the thinking behind the choices we've made. (A short code sketch after the Intent definitions illustrates how these ranges might be applied.)

Quality filters

Definitions

  • Very likely good [0%-55%] (98.7% precision, 92.7% recall) "Highly accurate at finding almost all problem-free edits."
  • May have problems [16%-100%] (14.9% precision, 91.1% recall) "Finds most flawed or damaging edits but with lower accuracy."
  • Likely have problems [75%-100%] (43% precision, 46% recall) "Finds half of flawed or damaging edits with medium accuracy."
  • Very likely have problems [94%-100%] (90% precision, 8.3% recall) "Highly accurate at finding the most obvious 10% of flawed or damaging edits."

Discussion

  • Very Likely Good: with both key stats in the 90s, this score range seems about right. Aaron, do you see a need to tweak?
  • May have problems: The broad score range here—with its accompanying low accuracy stat—was driven by a desire to peg coverage at a minimum of 90%. The good-faith filter that uses the parallel “may be”/“finds most” language covers only ~80%, but we judged broad coverage to be more important for damage than it is for good faith. Is 91.1% too high?
  • Likely have problems: The goal here was to achieve a good “medium” setting. Testing shows that this “mama bear” is a popular one. Should we inch the precision down to make good on our “half” language in the description, or is this close enough?
  • Very likely have problems: We’ve reserved the terms “very likely” and “highly accurate” for figures at 90% or above. Based on our figures, the score range for 90% precision would fall someplace between 85% and 95%. @Halfak, what score range would hit that 90% mark? [Answer required]

User Intent Filters

Definitions

  • Very likely good faith [35%-100%] (98.9% accuracy, 97.2% coverage) "Highly accurate at finding almost all good-faith edits."
  • May be bad faith [0%-65%] (18.8% accuracy, 77% coverage) "Finds most bad-faith edits but with a lower accuracy."
  • Likely bad faith [0%-15%] (49.1% accuracy, 26% coverage) "With medium accuracy, finds the most obvious 25% of bad-faith edits."
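For reference, here is one way the settled ranges from both Definitions lists could be encoded. This is a hypothetical sketch, not the filter's actual implementation; it assumes the Quality ranges apply to the ORES "damaging" probability and the Intent ranges to the "goodfaith" probability, both on a 0-1 scale (consistent with the 0.94 damaging threshold discussed below).

```python
# Sketch: the settled ranges above expressed as a lookup table over ORES
# probabilities. Names and structure are illustrative only.

QUALITY_FILTERS = {  # ranges over the "damaging" model score (assumption)
    "very_likely_good":          (0.00, 0.55),
    "may_have_problems":         (0.16, 1.00),
    "likely_have_problems":      (0.75, 1.00),
    "very_likely_have_problems": (0.94, 1.00),
}

INTENT_FILTERS = {  # ranges over the "goodfaith" model score (assumption)
    "very_likely_good_faith": (0.35, 1.00),
    "may_be_bad_faith":       (0.00, 0.65),
    "likely_bad_faith":       (0.00, 0.15),
}

def matching_filters(filters, score):
    """Return the names of every filter whose range contains `score`."""
    return [name for name, (lo, hi) in filters.items() if lo <= score <= hi]

# Example: an edit ORES scores as 0.95 damaging and 0.10 good faith.
print(matching_filters(QUALITY_FILTERS, 0.95))
# ['may_have_problems', 'likely_have_problems', 'very_likely_have_problems']
print(matching_filters(INTENT_FILTERS, 0.10))
# ['may_be_bad_faith', 'likely_bad_faith']
```

Encoding the spec this way would make it easy to nudge a single boundary (for example, the 0.94 cutoff) without touching the rest of the definitions.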

Discussion

  • Very likely good faith: Again, with both key stats in the 90s, there’s no obvious reason to adjust.
  • May be bad faith: Coverage is the goal of the “may” filters, so we wanted to pick a score range that would justify the “finds most” language in the description (and make this roughly parallel to the “May have problems” filter). 77% seems about right…
  • Likely bad faith: Again, we're looking for the mama bear setting. Getting to the 50% accuracy range makes this comparable to the “likely have problems” filter, so this setting seems like a reasonable compromise.
  • (Note: bad faith is hard to find, so no reasonable setting exists for a “very likely bad faith” filter.)

Event Timeline

jmatazzoni renamed this task from "Finalize ORES score settings for the Quality and Intent filter ranges" to "Fine-tune and finalize ORES score ranges for the Quality and Intent filters". Nov 2 2016, 12:28 AM

If you'd like to catch 30% of damaging edits, that would mean you'd want to set the threshold at about 0.846 and you'd get a precision of about 55%.

To get 90% precision with the damaging model, set the threshold at 0.94. That will capture 8.3% of damaging edits.

It's important to note that all of these values will change slightly when we rebuild models. So, I'll need to capture these thresholds as model evaluation statistics so that ORES can report where to set thresholds and what to expect from them.
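A rough sketch of the kind of "threshold at target precision" statistic described here: scan thresholds over a labeled test set, return the lowest one whose precision reaches the target, and report the recall at that point. This is illustrative only, assuming a `scored_edits` list of (damaging_probability, is_damaging) pairs; it is not the actual ORES or revscoring evaluation code.

```python
def threshold_at_precision(scored_edits, target_precision, steps=1000):
    """Return (threshold, precision, recall) for the lowest threshold whose
    precision reaches `target_precision`, or None if no threshold does."""
    positives = sum(1 for _, damaging in scored_edits if damaging)
    for i in range(steps + 1):
        threshold = i / steps
        predicted = [(s, d) for s, d in scored_edits if s >= threshold]
        tp = sum(1 for _, d in predicted if d)
        if predicted and tp / len(predicted) >= target_precision:
            recall = tp / positives if positives else 0.0
            return threshold, tp / len(predicted), recall
    return None

# e.g. threshold_at_precision(scored_edits, 0.90) would report the model's own
# answer to "what threshold hits 90% precision?" after each rebuild, rather
# than relying on a hard-coded 0.94.
```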

@Halfak writes:

If you'd like to catch 30% of damaging edits, that would mean you'd want to set the threshold at about 0.846 and you'd get a precision of about 55%.
To get 90% precision with the damaging model, set the threshold at 0.94. That will capture 8.3% of damaging edits.

While it's attractive to have 90% accuracy, 8.3% coverage feels pretty low. On a page of 500 results, that would translate to fewer than 4 hits--no fun! On the other hand, precision of 55% is too low for a "very likely" setting.

I think users would tolerate a little less accuracy for a substantial improvement in coverage. The lowest I'd want to push something dubbed "highly accurate" would be about 80 or 85% precision. Can you say what the coverage stats and score thresholds would be for these precision rates?

  • 80%
  • 81%
  • 82%
  • 83%
  • 84%
  • 85%
  • 86%
  • 87%
  • 88%

If that's too hard, then looking at it the other way: how low does precision have to go to get to 25% coverage? What about 20%?
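That question runs in the opposite direction: fix the coverage and ask what precision survives. A corresponding sketch, under the same illustrative assumptions as above (a `scored_edits` list of (damaging_probability, is_damaging) pairs; not the actual ORES evaluation code):

```python
def precision_at_recall(scored_edits, target_recall, steps=1000):
    """Return (threshold, precision, recall) for the highest threshold that
    still reaches `target_recall`, or None if no threshold does."""
    positives = sum(1 for _, damaging in scored_edits if damaging)
    for i in range(steps, -1, -1):
        threshold = i / steps
        predicted = [(s, d) for s, d in scored_edits if s >= threshold]
        tp = sum(1 for _, d in predicted if d)
        recall = tp / positives if positives else 0.0
        if recall >= target_recall:
            precision = tp / len(predicted) if predicted else 0.0
            return threshold, precision, recall
    return None

# e.g. precision_at_recall(scored_edits, 0.25) and precision_at_recall(scored_edits, 0.20)
```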

On the other hand, precision of 55% is too low for a "very likely" setting.

55% is 22 times more likely than random. That seems "pretty likely" to me. I think we may have reached the limit of what our own imaginations can usefully tell us, and we should let our users tell us instead.

I'll look into answering your other questions, but that will take some time.

@Halfak writes:

55% is 22 times more likely than random.

Yes, 22 times more likely should be very helpful. But 55% feels pretty similar to the "Likely" setting, at 43%.

So the question is: where do we set the line so that it feels like a significant enough differentiation to qualify as a separate setting? Or, looked at a little differently, how many times out of 100 can we be wrong but still have users feel like the predictions are "very likely" or "highly accurate"? I suppose we could push precision down to 75% or even as low as 66% and still qualify. We'd still be right twice as often as wrong. I'd be very interested to see what the tradeoffs would look like at those levels, if it's not too much trouble. Thanks!

@Halfak, we're programming this filter now. If you can please provide the info requested above, we'll be able to finalize the spec. (I also wanted to cite some of these numbers in the Help page, so that is waiting as well.) Thanks!

@Halfak reports that these numbers are laborious to calculate, so I'll try to be as specific as I can about what we're looking for:

Problem: Between results calculated for scores of 85% and 95% on this table, the “Damaging precision” figures make a huge jump, from 56% to 100%. Since each score point moves the precision needle so much, it’s hard to interpolate and guess where the right balancing point is.

Goal: We’re looking for a balancing point that feels like it’s right "most" of the time but still gets a fair number of hits. A threshold of .94, which gives 90% precision and 8.3% coverage, is very close. But I think users would sacrifice a little precision for more coverage. What do we get at thresholds of .93 or .92?