
Conform ORES sensitivity levels to the new ERI standards
Closed, Resolved, Public

Description

The "ORES sensitivity" preference on the RC page controls which edits ORES flags with the "r" that means "should be reviewed." It also controls what edits get hidden when the user has selected "Hide probably good edits." This preference currently appears on Recent Changes, but it also controls Watchlist and Contributions (I assume).

Here are proposed changes for the Sensitivity controls, necessitated by the RC Filters beta and our move to take ORES out of beta:

Functional changes


- **Displayed by default:** The Sensitivity controls used to appear only to those who opted in to the ORES beta. They will now appear by default for all users on all ORES wikis once the RC Filters beta is released.
- **Moves to Watchlist:** The Sensitivity controls will move from the Recent Changes preferences to the Revision Scoring section of the Watchlist preferences.
- **Conform thresholds to ERI standard levels:** There are currently three sensitivity levels: Lowest, Low and High. These levels don't correspond with our newly standardized filter thresholds. That is not a problem on Watchlist, but on Recent Changes it will become apparent to anyone who is paying attention (e.g., when they use highlighting). So unless there is a reason to keep these as they are (@Halfak?), this task proposes to conform these levels to the new system. See below for details.
- **Text changes**, as described below, to explain the functions of these controls more fully.

Proposed UI text

[feature name]
Prediction threshold

[Explanatory text]
Sets the level of probability required before the machine-learning service ORES flags edits with an "r" to indicate that they are likely to have problems and "need review" on the Recent Changes, Watchlist and Contributions pages. Also affects which edits count as "probably good" for the "Hide probably good edits" preference on those pages.

[Menu options]

  • May have problems (flags most problem edits but includes many false positives)
  • Likely have problems (medium probability) [default level]
  • Very likely have problems (flags few false positives but finds a smaller % of problem edits)

@Halfak, please feel free to comment on the functionality and/or language here.

Event Timeline

jmatazzoni removed SBisson as the assignee of this task. Mar 15 2017, 9:16 PM
jmatazzoni created this task.

Looks OK to me. You might want to signal how much damage each one catches. This is the recall field in the corresponding test_stat. This is 82.5%, 33.6%, and 8.4% for English Wikipedia, respectively.

If not an actual value, maybe just state the tradeoff.

  • May have problems (flags more damage with more false positives)
  • Likely have problems (medium) [default level]
  • Very likely have problems (flags less damage with fewer false positives)
jmatazzoni closed this task as Invalid. Mar 15 2017, 9:43 PM

I think I defined this incorrectly (for one thing, I didn't realize that the Sensitivity control also controls the "Hide probably good" level). So I'm closing this task as invalid until I sort it all out.

jmatazzoni renamed this task from Conform 'ORES Sensitivity' levels to the new ERI standards to Move 'ORES Sensitivity' controls and conform levels to the new ERI standards. Mar 16 2017, 12:53 AM
jmatazzoni reopened this task as Open.
jmatazzoni updated the task description.

Looks OK to me. You might want to signal how much damage each one catches. This is the recall field in the corresponding test_stat. This is 82.5%, 33.6%, and 8.4% for English Wikipedia, respectively.

Here is the source (if I'm correct).

If not an actual value, maybe just state the tradeoff.

  • May have problems (flags more damage with more false positives)
  • Likely have problems (medium) [default level]
  • Very likely have problems (flags less damage with fewer false positives)

This explanation is really easy to understand!

jmatazzoni updated the task description.

Thanks for the suggestions. I've changed the menu language along the lines suggested by @Halfak. See Description above. @Etonkovidova and @SBisson, please note the changes to the menu language above.

Catrope renamed this task from Move 'ORES Sensitivity' controls and conform levels to the new ERI standards to Conform ORES sensitivity levels to the new ERI standards. Mar 18 2017, 12:01 AM

I looked at doing this, but one issue is that while "lowest" sensitivity (0.90) aligns well with "verylikelybad" (0.92), and "low" sensitivity (0.70) aligns well with "likelybad" (0.75), "high" sensitivity (0.50) does not align at all well with "maybebad" (0.16). It's also a bit strange IMO that the maybebad threshold (0.16) is below the likelygood threshold (0.55).

I suppose I could align "high" (0.50) with "likelygood" but that seems like a hack, especially since the underlying definition is in terms of the precision for non-damaging, not the precision for damaging.
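The mismatch described in this comment can be made concrete with a short sketch. It uses only the threshold values quoted above; the dictionary names and the pairing of each sensitivity level with a filter level are illustrative assumptions, not the ORES extension's actual configuration.

```python
# Thresholds as quoted in the comment above (ORES "damaging" scores).
sensitivity = {"lowest": 0.90, "low": 0.70, "high": 0.50}
filters = {"verylikelybad": 0.92, "likelybad": 0.75, "maybebad": 0.16}

# Assumed pairing: which filter level each old sensitivity level would be conformed to.
pairing = {"lowest": "verylikelybad", "low": "likelybad", "high": "maybebad"}

# Absolute gap between each old threshold and its proposed replacement.
gaps = {
    level: round(abs(sensitivity[level] - filters[name]), 2)
    for level, name in pairing.items()
}

for level, gap in gaps.items():
    print(f"{level:>6} -> {pairing[level]:<13} gap = {gap:.2f}")
```

The first two levels move by a few hundredths, while "high" would drop by roughly a third of the score range, which is the misalignment being raised here.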

I assume you mean the negative of likelygood? Yes, that would be a hack. In any case, the May Have Problems filter is designed to capture 90% of damage. That is its purpose. In the process it also scoops up something like 75% of good edits. So I suppose the question is, is that an appropriate level for saying that things "need review"? It is, in fact, arguable that this filter really only makes sense in combination with some of the other features available in the full filtering system (highlighting, other filters that increase likelihood...).

I'm not sure what the numbers you're quoting are: are those ORES scores, or are those the damaging precision levels we've pegged these tests to? They look like the ORES scores, in which case, remember that they are not the same as precision, and a 75% score provides only something like 45% precision, already less than what most people would call "likely."

So, the question is, what are people using these "r"s for? Do they want to capture all damage? Or do they use them to say "ah, this edit 'needs review'"? If the former, let's go with the plan as written. If the latter, the logical thing to do would be to eliminate the third setting altogether. Remember that even the higher figure you cite (the negative of likelygood) provides a probability that edits are bad of only around 25%. @Catrope, do we know what percentage of ORES users have a) changed their settings at all and b) changed them to the broadest setting?

@Pginer-WMF, @Halfak-- opinions?

We discussed the possibility of eliminating the "lowest" preference, as well as the possibility of just pegging it to something other than maybebad. @Ladsgroup / @Halfak can you shed some light on why "lowest" was introduced and what it was calibrated to? T150224 does say fpr=0.1 but I don't know what that means.

For usage info see below; "lowest" is used by 27 users globally, 19 of which are on enwiki.

Default is soft, except on wikidata. The number in (parentheses) next to each wiki name is the total number of users with ORES enabled on that wiki.

plwiki (444)
8 hard
2 softest

ptwiki (1612)
37 hard

fawiki (773)
12 hard

nlwiki (312)
5 hard
1 softest

ruwiki (2998)
24 hard
3 softest

trwiki (735)
4 hard
1 softest

wikidatawiki (675) [default=hard]
9 soft
1 softest

cswiki (97)
3 hard

enwiki (15102)
159 hard
19 softest
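
As a quick cross-check of the totals quoted with this listing, the per-wiki counts can be summed in a few lines. The numbers are copied from the listing above; the `usage` structure and `total` helper are just illustrative names.

```python
# Non-default preference counts per wiki, copied from the listing above.
usage = {
    "plwiki": {"hard": 8, "softest": 2},
    "ptwiki": {"hard": 37},
    "fawiki": {"hard": 12},
    "nlwiki": {"hard": 5, "softest": 1},
    "ruwiki": {"hard": 24, "softest": 3},
    "trwiki": {"hard": 4, "softest": 1},
    "wikidatawiki": {"soft": 9, "softest": 1},  # "hard" is the default here
    "cswiki": {"hard": 3},
    "enwiki": {"hard": 159, "softest": 19},
}


def total(setting: str) -> int:
    """Sum the users who explicitly chose `setting` across all wikis."""
    return sum(counts.get(setting, 0) for counts in usage.values())


print("softest:", total("softest"))
print("hard (explicitly chosen):", total("hard"))
```

Note that the "hard" total counts only users who explicitly selected it; Wikidata users who are on "hard" by default are not included.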

@Halfak, the reason Roan is asking for guidance in his question above is that we are proposing to eliminate the "High (flags more edits)" option, on the grounds that—especially once we conform this to the new ORES levels, but even before—flagging so many false positives as "needs review" undermines the credibility of ORES. And not that many people seem to use this option anyway...

Except on Wikidata, where it is the default. There must be a reason for that—can you please comment? So maybe we'd keep the broad filter on Wikidata and keep it as default?

that we are proposing to eliminate the "High (flags more edits)" option,

Oh, yes, sorry, we were proposing eliminating "hard", not "softest", good catch. I got confused because these preferences are named in reverse order. In that case, "hard" is the default on Wikidata and used by more than 200 users, so it's not as unused as I thought it was.

Even if we keep the "High" setting, I don't think we should conform it to "maybebad" because that would make it more problematic. We'd instead pick a precision percentage to peg it to that is more reasonable for displaying the "r" than maybebad's precision.

Damage patrollers need to catch most-all of the damaging edits. When you set a threshold for high recall (catching almost all of the damage), you're going to pick up a lot of not-damaging edits too. It's sort of like a squelch -- when you open it up, you get noise, but you can also pick up on weaker signals. The reason this is OK is because it filters out 90% of the edits that do not need review at all -- which saves a ton of work. Honestly, I think this focus on precision is backwards-braining an expert process. It simply does not reflect the actual process needs of damage patrollers. We should focus on recall first and foremost when choosing thresholds.
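
The recall-versus-precision tradeoff described here can be sketched with a few lines of code. The sample below is a made-up toy data set, not real ORES output, and `precision_recall` is a hypothetical helper; only the three threshold values come from earlier in this thread.

```python
# Hypothetical toy sample: (ORES damaging score, is the edit actually damaging?).
SAMPLE = [
    (0.98, True), (0.95, True), (0.93, False), (0.80, True),
    (0.78, False), (0.60, False), (0.40, True), (0.30, False),
    (0.20, False), (0.10, False), (0.05, False), (0.02, False),
]


def precision_recall(threshold, sample=SAMPLE):
    """Return (precision, recall) when flagging every edit scoring >= threshold."""
    flagged = [bad for score, bad in sample if score >= threshold]
    true_positives = sum(flagged)
    total_bad = sum(bad for _, bad in sample)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_bad if total_bad else 0.0
    return precision, recall


# Threshold values quoted earlier in the thread (verylikelybad, likelybad, maybebad).
for threshold in (0.92, 0.75, 0.16):
    p, r = precision_recall(threshold)
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}")
```

On this toy sample, lowering the threshold catches every damaging edit at the cost of more false positives, while still leaving the lowest-scoring edits unflagged, which is the "squelch" behavior described above.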

OK on to the specific concerns:

edits are bad of only around 25%

That's 10 times as likely to be vandalism as any random edit. You might say "that's not very accurate", but I think a patroller would say "that's very useful".

flagging so many false positives as "needs review" undermines the credibility of ORES.

ORES should not have credibility. ORES is a stupid/simple "first pass" that helps triage edits for human review. If people think that ORES is stupid but also useful, that's perfect.

And not that many people seem to use this option anyway...

That's OK. I am a fan of having the default not reflect the patrollers' use-case because damage patrolling is a specific, high expertise job and most people won't be doing it. However, the option must be available for people who are doing generalized damage patrolling.

OK. Thanks Aaron. Let's move forward as planned. Putting this in RFP.

I plan to work on this tomorrow morning if nobody does it between now and then.

@jmatazzoni I think the first 2 points of "functional changes" have been done as part of the preferences tickets. Can you confirm that everything else in the task description is up to date with the current thinking and discussion? Thanks!

Change 346188 had a related patch set uploaded (by Sbisson):
[mediawiki/extensions/ORES@master] Align damaging thresholds to filters thresholds

https://gerrit.wikimedia.org/r/346188

Change 346188 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Align damaging thresholds to filters thresholds

https://gerrit.wikimedia.org/r/346188

Change 350316 had a related patch set uploaded (by Catrope):
[mediawiki/extensions/ORES@master] Follow-up c047cd54d69ed: rename oresDamagingPref values back

https://gerrit.wikimedia.org/r/350316

Change 350316 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Follow-up c047cd54d69ed: rename oresDamagingPref values back

https://gerrit.wikimedia.org/r/350316

Change 351137 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/extensions/ORES@master] Deduplicate ores-help-damaging-pref

https://gerrit.wikimedia.org/r/351137

Change 351137 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Deduplicate ores-help-damaging-pref

https://gerrit.wikimedia.org/r/351137

@jmatazzoni

I've tested four edits with the following ORES scores:

1 - 0.998 - "verylikelybad"
2 - 0.998 - "verylikelybad"
3 - 0.823 - "likelybad"
4 - 0.510 - "maybebad"

And three 'Prediction threshold' levels:

  • May have problems (flags most problem edits but includes many false positives)
  • Likely have problems (medium probability)
  • Very likely have problems (flags few false positives but finds a smaller % of problem edits)

Watchlist will display:

(1) With May have problems (flags most problem edits but includes many false positives)
All four edits are displayed highlighted - two with intense yellow highlighting and two with pale pinkish color.
'Hide probably good edits' - all four edits are displayed

(2) Likely have problems (medium probability)
Three edits are displayed highlighted: 0.998, 0.998, and 0.823
0.510 - "maybebad" won't be highlighted.
'Hide probably good edits' enabled option will display only 0.998 scored edits ("verylikelybad"). 0.823 scored edit will be hidden.

(3) Very likely have problems (flags few false positives but finds a smaller % of problem edits)
Only 0.998 scored edits are displayed
'Hide probably good edits' enabled option will display only 0.998 scored edits ("verylikelybad").

So, only case (2) presents potentially confusing results to users: they see a highlighted, 'r'-marked result, but 'Hide probably good edits' will hide it.
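
The highlighting side of these observations can be reproduced with a minimal sketch. It assumes the filter thresholds quoted earlier in this thread (0.16, 0.75, 0.92) were the ones in effect during testing, which is not confirmed here; `THRESHOLDS` and `flagged` are illustrative names, and the "Hide probably good edits" behavior is deliberately not modeled.

```python
# Assumed thresholds per 'Prediction threshold' level (values quoted earlier
# in this thread; the deployed values at test time may have differed).
THRESHOLDS = {
    "May have problems": 0.16,
    "Likely have problems": 0.75,
    "Very likely have problems": 0.92,
}

# ORES scores of the four test edits listed above.
SCORES = [0.998, 0.998, 0.823, 0.510]


def flagged(level, scores=SCORES):
    """Return the scores that would be flagged/highlighted at the given level."""
    return [s for s in scores if s >= THRESHOLDS[level]]


for level in THRESHOLDS:
    print(f"{level}: {flagged(level)}")
```

Under that assumption the sketch flags four, three, and two edits respectively, matching the highlighting reported for cases (1) to (3).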

QA Recommendation: Product should weigh in

@Etonkovidova notes:

  1. Likely have problems (medium probability)

Three edits are displayed highlighted: 0.998, 0.998, and 0.823
0.510 - "maybebad" won't be highlighted.
'Hide probably good edits' enabled option will display only 0.998 scored edits ("verylikelybad"). 0.823 scored edit will be hidden.

Based on the thresholds for en.wiki documented in this spreadsheet, that does not seem like correct behavior. .823 is the beginning of the "Likely" range, so this shouldn't be hidden with those settings.

But thinking about all this raises a question about what the appropriate function of the "Show only likely problem edits (and hide probably good edits)" is supposed to be. It strikes me there are two possible interpretations of how it should work that I'd like @Ladsgroup , @Halfak and @Pginer-WMF to comment on:

  • Model #1, the "threshold" is a simple dividing line: Under this model, if I set my "prediction threshold" to "Likely have problems" and check "Show only likely problem edits (and hide probably good edits)", then everything above the "Likely have problems" floor (.823 on en.wiki) will get the "r" and colored highlighting, and everything below .823 will be hidden.
  • Model #2, "probably good" means "probably good": In this model, the only stuff that gets hidden as "probably good" are edits in the "Very likely good" range (which in en.wiki runs from 0 to .398). In this model, some edits might logically exist in a gray area between "likely problem edits" (which get colored highlighting) and "probably good edits" (which get hidden). These would simply be displayed with no marking.
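
To make the difference between the two interpretations concrete, here is a minimal sketch. It uses the en.wiki boundaries mentioned in the bullets (.823 and .398); the names `Display`, `model_1`, and `model_2` are hypothetical, and treating both boundaries as inclusive is an assumption.

```python
from enum import Enum

LIKELY_BAD_FLOOR = 0.823        # floor of "Likely have problems" on en.wiki
VERY_LIKELY_GOOD_CEILING = 0.398  # top of the "Very likely good" range on en.wiki


class Display(Enum):
    FLAGGED = "shown with 'r' and highlighting"
    PLAIN = "shown with no marking"
    HIDDEN = "hidden"


def model_1(score: float) -> Display:
    """Model #1: the threshold is a single dividing line."""
    return Display.FLAGGED if score >= LIKELY_BAD_FLOOR else Display.HIDDEN


def model_2(score: float) -> Display:
    """Model #2: only edits in the 'Very likely good' range are hidden."""
    if score >= LIKELY_BAD_FLOOR:
        return Display.FLAGGED
    if score <= VERY_LIKELY_GOOD_CEILING:
        return Display.HIDDEN
    return Display.PLAIN  # gray area between the two ranges


for score in (0.9, 0.6, 0.2):
    print(score, model_1(score).name, model_2(score).name)
```

The two models only disagree in the gray area between the boundaries: an edit scoring 0.6 is hidden under Model #1 but shown unmarked under Model #2.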

So, what do you guys think? What do Classic ORES users want? (Based on your answer, we may want to change the feature name.)

Re-tested with the ORES scoring ranges set in ORES model properties:

All works according to specs and, probably due to some readjustment of the ORES scoring ranges, my comment below is no longer valid:

(2) Likely have problems (medium probability)
Three edits are displayed highlighted: 0.998, 0.998, and 0.823
0.510 - "maybebad" won't be highlighted.
'Hide probably good edits' enabled option will display only 0.998 scored edits ("verylikelybad"). 0.823 scored edit will be hidden.

There might be some usability issues with the 'Prediction threshold' setting, but they are outside the scope of this Phab task.

jmatazzoni closed this task as Resolved. Aug 31 2017, 10:54 PM