
Produce tables of stats for damaging and goodfaith models
Closed, Resolved · Public

Description

@jmatazzoni wanted to get an intuition for what the score probabilities mean. So, he's asked us to translate them into more intuitive terms.

Event Timeline

Halfak created this task.Sep 21 2016, 2:12 PM
Restricted Application added a subscriber: Aklapper. Sep 21 2016, 2:12 PM
Halfak added a comment.EditedSep 21 2016, 2:18 PM

Damaging

| Score | Precision | Neg. prec. | Recall | True-neg. rate |
| --- | --- | --- | --- | --- |
| 95% | 100.0% | 96.2% | 5.5% | 100% |
| 85% | 56.1% | 97.3% | 35.0% | 98.9% |
| 75% | 43.4% | 97.7% | 46.5% | 97.5% |
| 65% | 34.0% | 98.2% | 59.0% | 95.2% |
| 55% | 29.2% | 98.7% | 72.0% | 92.7% |
| 45% | 25.0% | 98.9% | 77.1% | 90.4% |
| 35% | 20.9% | 99.2% | 82.8% | 87.2% |
| 25% | 17.7% | 99.4% | 87.3% | 83.2% |
| 15% | 14.9% | 99.5% | 91.1% | 78.2% |
| 5% | 9.2% | 99.8% | 97.5% | 60.0% |

Good faith

| Score | Precision | Neg. prec. | Recall | True-neg. rate |
| --- | --- | --- | --- | --- |
| 95% | 99.9% | 8.1% | 70.7% | 96.0% |
| 85% | 99.7% | 12.4% | 82.5% | 91.9% |
| 75% | 99.6% | 15.2% | 87.2% | 84.8% |
| 65% | 99.4% | 18.8% | 90.9% | 77.8% |
| 55% | 99.2% | 22.9% | 93.7% | 69.7% |
| 45% | 99.1% | 28.3% | 95.5% | 65.7% |
| 35% | 98.9% | 36.3% | 97.2% | 59.2% |
| 25% | 98.6% | 41.4% | 98.3% | 44.4% |
| 15% | 98.1% | 49.1% | 99.3% | 26.0% |
| 5% | 97.7% | 65.8% | 99.9% | 7.4% |

Thanks so much for these @Halfak. The Damaging scores make sense. But I’m having trouble understanding the Good-Faith scores. According to what I see here:

  • If I’m looking for Good Faith and set the threshold at the worst setting here, 5%, I’d still be right 98% of the time and I’d capture virtually all the good faith edits.
  • Meanwhile, if I set the threshold at the best I can do when looking for bad faith (the same 5% mark) I’m wrong 98% of the time.
  • But if I go to the opposite end of the bad-faith spectrum (95% score), I’m almost never right but get 100% of bad faith edits.

Are the numbers right? Am I reading them wrong? What am I missing?

I think you are reading it wrong. At the 95% threshold, you'll find that 70.7% of good-faith edits are above the threshold and ~100% of bad-faith edits are below the threshold. You'll be right 99.9% of the time about good-faith edits.

At the 5% threshold, you'll find that 60% of bad-faith edits are below that threshold and 99.9% of good-faith edits are above the threshold.

I wasn't asked to produce a precision measure for bad-faith below a threshold. If I do a little math, we can get there.

So, we have 527 bad-faith examples and 19473 good-faith examples. At the 5% threshold, we capture 60% of the bad-faith examples. 0.6 * 527 = 316.2. Below that threshold, we also have 100 - 99.9% = 0.1% of the good-faith edits. 0.001 * 19473 = 19.5. The precision of False (bad-faith) below the 5% threshold is 316.2/(316.2 + 19.5) = 94.2%.
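The arithmetic above can be wrapped in a small helper. This is a sketch using the numbers from this comment; `neg_precision` is a name I'm using here for illustration, not something from the model code:

```python
def neg_precision(n_neg, n_pos, tnr, recall):
    """Precision of the False (bad-faith) class below a threshold.

    n_neg:  count of truly bad-faith examples
    n_pos:  count of truly good-faith examples
    tnr:    fraction of bad-faith examples below the threshold
    recall: fraction of good-faith examples above the threshold
    """
    true_neg = tnr * n_neg            # bad-faith edits correctly below the threshold
    false_neg = (1 - recall) * n_pos  # good-faith edits wrongly below the threshold
    return true_neg / (true_neg + false_neg)

# At the 5% threshold: 527 bad-faith and 19473 good-faith examples,
# 60% of bad-faith below it, 99.9% of good-faith above it.
print(round(neg_precision(527, 19473, 0.6, 0.999), 3))  # 0.942
```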

jmatazzoni renamed this task from Produce tables of stats for damaging and goodfaith models to Research how to present ORES scores to users in a way that is understandable and meets their reviewing goals.Sep 21 2016, 10:27 PM
jmatazzoni updated the task description.

I hope it's OK that I've adapted Aaron's ticket to be a place where we can discuss this issue generally.

I'd rather mark this task as done. It would be much better to just create a new task for the discussion, I think.

Halfak renamed this task from Research how to present ORES scores to users in a way that is understandable and meets their reviewing goals to Produce tables of stats for damaging and goodfaith models.Sep 21 2016, 10:34 PM
Halfak updated the task description.
Halfak updated the task description.

I just re-created a task for tracking the increased scope. See T146334

Looks like we'll be working from T146333 after all.

jmatazzoni added a comment.EditedSep 22 2016, 8:01 PM

I went over the numbers with Pau and Roan and we have one request and two questions.

Request:

@Halfak writes:

I wasn't asked to produce a precision measure for bad-faith below a threshold. If I do a little math, we can get there.

So true, and that is my bad. If you go back to the original table I created, I asked the questions in Column C wrong. For both Damaging and Good faith, what I meant to say—and what we think will be relevant to users—is "The probability an edit of this score or LOWER...." Not higher.

We've added a new column to the tables (D) that asks these questions properly. Sorry for making you do duplicate work, but can you please fill in this information?

Question #1
Why are the numbers for what you’re calling “True negative” in the tables above (and on the spreadsheet) the same for both Damaging and Good Faith tables?

Question #2
We think the answer to this is yes, but we just need you to confirm: are the probability figures you gave (in columns B and C) truly for ranges and not just for the score requested? E.g., the Damaging probability for a 75% score is listed as 43.4%. Is that, as the column heading puts it, “The probability an edit with a score of this or higher will have problems"? I.e., the probability, accounting for the distribution, over the 75%-100% range? As I say, we think that the answer is yes, but we want to be sure.

Question #1 That's my mistake. I probably accidentally re-used data. I'll look into getting it updated.

Question #2 It doesn't really make sense to think about a single score as having a value. This is where the confusion comes in. With classifiers, the evaluation metrics are all designed to answer the question, "What would our predictions look like if we were to set a threshold at this value?" When we set a threshold, we're saying all probabilities above our threshold are True and the rest are False. In order to do what you'd like (what's the precision at this exact prediction probability), I'd need to do some calibration of the output probabilities -- which was not on our agenda but might be added.

We're also probably shooting ourselves in the foot here with a feature weighting strategy that we're using to deal with the fact that damaging edits are much less common than non-damaging. See T145809.
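To make the thresholding idea concrete, here is a minimal sketch (the scores and labels are made up for illustration, not real model output):

```python
def predict_at_threshold(scores, threshold):
    """Everything at or above the threshold is predicted True."""
    return [s >= threshold for s in scores]

def precision(preds, labels):
    """Fraction of True predictions that are actually True."""
    true_pos = sum(p and l for p, l in zip(preds, labels))
    predicted_pos = sum(preds)
    return true_pos / predicted_pos if predicted_pos else None

# Toy example: five edits with model probabilities and true labels.
scores = [0.97, 0.80, 0.60, 0.30, 0.10]
labels = [True, True, False, False, False]

preds = predict_at_threshold(scores, 0.75)
print(preds)                     # [True, True, False, False, False]
print(precision(preds, labels))  # 1.0
```

Sliding the threshold down admits more edits as True, which trades precision for recall -- exactly the trade-off the tables above tabulate row by row.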

Question #1 That's my mistake. I probably accidentally re-used data. I'll look into getting it updated.
Question #2 It doesn't really make sense to think about a single score as having a value. This is where the confusion comes in. With classifiers, the evaluation metrics are all designed to answer the question, "What would our predictions look like if we were to set a threshold at this value?" When we set a threshold, we're saying all probabilities above our threshold are True and the rest are False. In order to do what you'd like (what's the precision at this exact prediction probability), I'd need to do some calibration of the output probabilities -- which was not on our agenda but might be added.

We're not asking for precision at a given probability, we are actually interested in thresholds. Joe just wanted to double-check that that is in fact what you gave us (which it looks and sounds like it is).

Thanks @Halfak. So sorry to pressure you again. But we need to see these numbers before we start user testing. How we present the ORES info is a big focus for the testing, and we may need to adjust the UI based on what we learn.

So, please let me know when you'll be able to fill in the requested data. (That's the corrected True-neg. rate, as above, and the new data I can only refer to as the "column D info," referring to the original data table, though I'm sure there is a more precise term.) Thanks.

Update complete.

Oh. Column D will take a little bit longer.

OK. Now the table above is completely fixed. I've replaced the useless 1-Precision column with "Negative precision" which is just a term I made up for our precision at predicting False.

This got a little tricky, and since it seems I'm doing this a lot, here's my code block and output for generating this:

>>> from numpy import linspace, interp
>>> import pickle
>>> 
>>> def print_stats(sm, tn, fn):
...   test_stats = sm.info()['test_stats']
...   threshs = linspace(0.05, 0.95, 10)
...   tnrs = interp(threshs, sorted(test_stats['roc']['thresholds']), sorted([1-fpr for fpr in test_stats['roc']['fprs']]))
...   recalls = interp(threshs, test_stats['precision_recall']['thresholds'], test_stats['precision_recall']['recalls'][:-1])
...   precisions = interp(threshs, test_stats['precision_recall']['thresholds'], test_stats['precision_recall']['precisions'][:-1])
...   negative_precision = lambda tn, fn, recall, tnr: (tnr*fn)/(tnr*fn + (1-recall)*tn)
...   for thresh, precision, recall, tnr in sorted(zip(threshs, precisions, recalls, tnrs), reverse=True):
...     print(thresh, round(precision, 3), round(negative_precision(tn, fn, recall, tnr), 3), round(recall, 3), round(tnr, 3))
... 
>>> 
>>> print_stats(pickle.load(open("models/enwiki.damaging.gradient_boosting.model", "rb")), 807, 19193)
0.95 1.0 0.962 0.055 1.0
0.85 0.561 0.973 0.35 0.989
0.75 0.434 0.977 0.465 0.975
0.65 0.34 0.982 0.59 0.952
0.55 0.292 0.987 0.72 0.927
0.45 0.25 0.989 0.771 0.904
0.35 0.209 0.992 0.828 0.872
0.25 0.177 0.994 0.873 0.832
0.15 0.149 0.995 0.911 0.782
0.05 0.092 0.998 0.975 0.6
>>> print_stats(pickle.load(open("models/enwiki.goodfaith.gradient_boosting.model", "rb")), 19473, 527)
0.95 0.999 0.081 0.707 0.96
0.85 0.997 0.124 0.825 0.919
0.75 0.996 0.152 0.872 0.848
0.65 0.994 0.188 0.909 0.778
0.55 0.992 0.229 0.937 0.697
0.45 0.991 0.283 0.955 0.657
0.35 0.989 0.363 0.972 0.592
0.25 0.986 0.414 0.983 0.444
0.15 0.981 0.491 0.993 0.26
0.05 0.977 0.658 0.999 0.074
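As an aside on the code above: `numpy.interp` assumes its x-coordinates are monotonically increasing (it returns nonsense rather than raising an error otherwise), which is presumably why the ROC thresholds and FPRs are sorted before interpolating. A tiny illustration:

```python
import numpy as np

# interp linearly interpolates y-values at the given x;
# the xp sequence must already be increasing.
xp = [0.0, 1.0]
fp = [0.0, 1.0]
print(np.interp(0.25, xp, fp))  # 0.25
```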
Halfak closed this task as Resolved.Sep 28 2016, 9:40 PM

Here are the CORRECTED versions of the tables Aaron created

Damaging

| Score | Precision | Neg. prec. | Recall | True-neg. rate |
| --- | --- | --- | --- | --- |
| 95% | 100.0% | 96.2% | 5.5% | 100% |
| 85% | 56.1% | 97.3% | 35.0% | 98.9% |
| 75% | 43.4% | 97.7% | 46.5% | 97.5% |
| 65% | 34.0% | 98.2% | 59.0% | 95.2% |
| 55% | 29.2% | 98.7% | 72.0% | 92.7% |
| 45% | 25.0% | 98.9% | 77.1% | 90.4% |
| 35% | 20.9% | 99.2% | 82.8% | 87.2% |
| 25% | 17.7% | 99.4% | 87.3% | 83.2% |
| 15% | 14.9% | 99.5% | 91.1% | 78.2% |
| 5% | 9.2% | 99.8% | 97.5% | 60.0% |

Good faith

| Score | Precision | Neg. prec. | Recall | True-neg. rate |
| --- | --- | --- | --- | --- |
| 95% | 99.9% | 8.1% | 70.7% | 96.0% |
| 85% | 99.7% | 12.4% | 82.5% | 91.9% |
| 75% | 99.6% | 15.2% | 87.2% | 84.8% |
| 65% | 99.4% | 18.8% | 90.9% | 77.8% |
| 55% | 99.2% | 22.9% | 93.7% | 69.7% |
| 45% | 99.1% | 28.3% | 95.5% | 65.7% |
| 35% | 98.9% | 36.3% | 97.2% | 59.2% |
| 25% | 98.6% | 41.4% | 98.3% | 44.4% |
| 15% | 98.1% | 49.1% | 99.3% | 26.0% |
| 5% | 97.7% | 65.8% | 99.9% | 7.4% |