In this task, we will fine-tune and finalize the score ranges that define the 7 ORES filters evolved in T146333. In creating the 7 standardized filtering options for the ORES Quality and Intent filters, we strove to balance users' desires for accuracy versus breadth of coverage. To balance these factors, we used the numerical tables @Halfak created in T146280, which correlate ranges of ORES damaging and good-faith scores with their predicted precision and coverage stats. (This spreadsheet provides a more full-featured display of this data.)
The numerical tables provide only a coarse-grained view of the effects of setting levels at different scores, since the tables progress in 10% increments. That means we might want to fine-tune some of these settings. In particular, for the "Very likely have problems" setting, the jump in the figures table from the 85% score to the 95% score involves a leap in precision from 56% to 100%. So every score-point difference in this range has a very large effect!
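To see the effect of each score point rather than each 10% step, the same precision and coverage stats could be recomputed at 1% increments. The following is only a minimal sketch of that idea, not the method used to build the tables in T146280: the `SAMPLE` data, the function names, and the convention of reporting precision as 1.0 when nothing is flagged are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical labeled sample: (damaging score, edit really was damaging).
# These values are invented for illustration; they are not ORES data.
SAMPLE: List[Tuple[float, bool]] = [
    (0.975, True), (0.935, True), (0.915, False), (0.885, True),
    (0.865, False), (0.500, False), (0.200, True), (0.050, False),
]


def precision_recall(scored: List[Tuple[float, bool]],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when every edit scored >= threshold is flagged."""
    flagged = [is_damaging for score, is_damaging in scored if score >= threshold]
    true_pos = sum(flagged)
    total_pos = sum(is_damaging for _, is_damaging in scored)
    # Assumption of this sketch: with nothing flagged there are no false
    # positives, so precision is reported as 1.0.
    precision = true_pos / len(flagged) if flagged else 1.0
    recall = true_pos / total_pos if total_pos else 0.0
    return precision, recall


def sweep(scored: List[Tuple[float, bool]],
          start_pct: int, stop_pct: int) -> List[Tuple[int, float, float]]:
    """Stats at every whole percentage point from start_pct to stop_pct inclusive."""
    return [(pct, *precision_recall(scored, pct / 100))
            for pct in range(start_pct, stop_pct + 1)]


if __name__ == "__main__":
    for pct, precision, recall in sweep(SAMPLE, 85, 95):
        print(f"{pct}%: precision={precision:.1%} recall={recall:.1%}")
```

Stepping in whole percentage points (integers) avoids accumulating floating-point error in the threshold itself.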
Below, for each of the 7 ORES filters, you will find: in square brackets, the approximate score ranges we've settled on; in parentheses, the precision and coverage stats a given range would produce; and in quotes, the standard filter descriptions. Following the Definitions, the Discussion sections sketch the thinking behind the choices we've made.
Quality filters
Definitions
- Very likely good [0%-55%] (98.7% precision, 92.7% recall) "Highly accurate at finding almost all problem-free edits."
- May have problems [16%-100%] (14.9% precision, 91.1% recall) "Finds most flawed or damaging edits but with lower accuracy."
- Likely have problems [75%-100%] (43% precision, 46% recall) "Finds half of flawed or damaging edits with medium accuracy."
- Very likely have problems [94%-100%] (90% precision, 8.3% recall) "Highly accurate at finding the most obvious 10% of flawed or damaging edits."
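Because the ranges above overlap, a single edit can match more than one Quality filter. A minimal sketch of that, assuming the Quality filters are applied to the ORES damaging score and that both ends of each range are inclusive (neither is stated above; `QUALITY_FILTERS` and `matching_filters` are hypothetical names):

```python
from typing import Dict, List, Tuple

# Score ranges copied from the Definitions above, as fractions of 1.
# The 94% lower bound is still approximate, pending the answer requested
# in the Discussion.
QUALITY_FILTERS: Dict[str, Tuple[float, float]] = {
    "Very likely good": (0.00, 0.55),
    "May have problems": (0.16, 1.00),
    "Likely have problems": (0.75, 1.00),
    "Very likely have problems": (0.94, 1.00),
}


def matching_filters(damaging_score: float) -> List[str]:
    """Names of every Quality filter whose range contains the score.

    Assumption: range endpoints are inclusive.
    """
    return [name for name, (low, high) in QUALITY_FILTERS.items()
            if low <= damaging_score <= high]


if __name__ == "__main__":
    for score in (0.10, 0.30, 0.96):
        print(score, matching_filters(score))
```

A score of 0.30, for example, falls in both "Very likely good" and "May have problems", which follows directly from the 16%-55% overlap between those two ranges.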
Discussion
- Very Likely Good: with both key stats in the 90s, this score range seems about right. Aaron, do you see a need to tweak?
- May have problems: The broad score range here, with its accompanying low accuracy stat, was driven by a desire to peg coverage at a minimum of 90%. The good faith filter that uses the same “May have” and “finds most” language covers only ~80%, but we judged broader coverage to be more important for damage than it is for good faith. Is 91.1% too high?
- Likely have problems: The goal here was to achieve a good “medium” setting. Testing shows that this “mama bear” is a popular one. Should we inch the precision down to make good on our “half” language in the description, or is this close enough?
- Very likely have problems: We’ve reserved the terms “very likely” and “highly accurate” for figures at 90% or above. Based on our figures, the score range for 90% precision would fall someplace between 85% and 95%. @Halfak, what score range would hit that 90% mark? [Answer required]
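The open question for "Very likely have problems" is a search: walk the score range between 85% and 95% one point at a time and stop at the lowest threshold whose precision reaches 90%. Below is a rough sketch of that search under stated assumptions. The `SAMPLE` data is invented, `lowest_threshold_for_precision` is a hypothetical helper, and the real answer has to come from the actual ORES test data.

```python
from typing import List, Optional, Tuple

# Hypothetical labeled sample: (damaging score, edit really was damaging).
# Invented for illustration only; not ORES data.
SAMPLE: List[Tuple[float, bool]] = [
    (0.975, True), (0.935, True), (0.915, False), (0.885, True),
    (0.865, False), (0.500, False), (0.200, True), (0.050, False),
]


def lowest_threshold_for_precision(
    scored: List[Tuple[float, bool]],
    target: float,
    start_pct: int = 85,
    stop_pct: int = 95,
) -> Optional[Tuple[int, float, float]]:
    """Return (threshold %, precision, recall) for the lowest whole-percent
    threshold in [start_pct, stop_pct] whose precision is >= target,
    or None if no threshold in that window reaches the target."""
    total_pos = sum(is_damaging for _, is_damaging in scored)
    for pct in range(start_pct, stop_pct + 1):
        flagged = [d for score, d in scored if score >= pct / 100]
        if not flagged:
            break  # nothing scores this high, so higher thresholds flag nothing
        precision = sum(flagged) / len(flagged)
        if precision >= target:
            recall = sum(flagged) / total_pos if total_pos else 0.0
            return pct, precision, recall
    return None


if __name__ == "__main__":
    print(lowest_threshold_for_precision(SAMPLE, 0.90))
```

Taking the lowest qualifying threshold keeps coverage as high as possible for the chosen precision. Precision is not guaranteed to rise steadily with the threshold, so on real data it is worth inspecting the neighboring score points as well.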
User Intent Filters
Definitions
- Very likely good faith [35%-100%] (98.9% accuracy, 97.2% coverage) "Highly accurate at finding almost all good-faith edits."
- May be bad faith [0%-65%] (18.8% accuracy, 77% coverage) "Finds most bad-faith edits but with lower accuracy."
- Likely bad faith [0%-15%] (49.1% precision, 26% coverage) "With medium accuracy, finds the most obvious 25% of bad-faith edits."
Discussion
- Very likely good faith: Again, with both key stats in the 90s, there’s no obvious reason to adjust.
- May be bad faith: Coverage is the goal of the “may have” filters, so we wanted to pick a score range that would justify the “finds most” language in the description (and make this roughly parallel to the “May have problems” filter). 77% seems about right…
- Likely bad faith: Again, we're looking for the mama bear setting. Getting to the 50% accuracy range makes this comparable to the “likely have problems” filter, so this setting seems like a reasonable compromise.
- (Note: bad faith is hard to find, so no reasonable setting exists for a “very likely bad faith” filter.)