
Conform ORES sensitivity levels to the new ERI standards
Closed, Resolved · Public

Description

The "ORES sensitivity" preference on the RC page controls which edits ORES flags with the "r" that means "should be reviewed." It also controls what edits get hidden when the user has selected "Hide probably good edits." This preference currently appears on Recent Changes, but it also controls Watchlist and Contributions (I assume).

Here are proposed changes for the Sensitivity controls, necessitated by the RC Filters beta and our move to take ORES out of beta:

Functional changes


- **Displayed by default:** The Sensitivity controls used to appear only to those who opted in to the ORES beta. Now they will appear by default for all users on all ORES wikis once the RC Filters beta is released.
- **Moves to Watchlist:** The Sensitivity controls will move from the Recent Changes preferences to the Revision Scoring section of the Watchlist preferences.
- **Conform thresholds to ERI standard levels:** There are currently three sensitivity levels: Lowest, Low, and High. These levels don't correspond to our newly standardized filter thresholds. That is not a problem on Watchlist, but on Recent Changes it will become apparent to anyone who is paying attention (e.g., when they use highlighting). So unless there is a reason to keep these as they are (@Halfak?), this task proposes to conform these levels to the new system. See below for details.
- **Text changes**, as described below, to explain the functions of these controls more fully.

== Proposed UI text ==

[feature name]
Prediction threshold

[Explanatory text]
Sets the level of probability required before the machine-learning service ORES flags edits with an "r" to indicate that they are likely to have problems and "need review" on the Recent Changes, Watchlist and Contributions pages. Also affects which edits count as "probably good" for the "Hide probably good edits" preference on those pages.

[Menu options]

  • May have problems (flags most problem edits but includes many false positives)
  • Likely have problems (medium probability) [default level]
  • Very likely have problems (flags few false positives but finds a smaller % of problem edits)
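For implementers, the intended correspondence between the three menu options and the standardized filter levels could be sketched as below. The level names (`maybebad`, `likelybad`, `verylikelybad`) come from the later discussion in this task; the exact mapping was still under debate, so treat this as an assumption, not a spec:

```python
# Hypothetical sketch: mapping the proposed "Prediction threshold" menu
# options onto the standardized ERI filter levels discussed in this task.
# The mapping itself was still under discussion (see the comments below
# about "maybebad" not aligning with the old "high" sensitivity).

SENSITIVITY_TO_FILTER_LEVEL = {
    "May have problems": "maybebad",               # high recall, many false positives
    "Likely have problems": "likelybad",           # default, medium probability
    "Very likely have problems": "verylikelybad",  # high precision, lower recall
}

def filter_level_for(pref: str) -> str:
    """Return the standardized filter level for a sensitivity preference."""
    return SENSITIVITY_TO_FILTER_LEVEL[pref]
```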

@Halfak, please feel free to comment on the functionality and/or language here.

Event Timeline

jmatazzoni created this task.

Looks OK to me. You might want to signal how much damage each one catches. This is the recall field in the corresponding test_stat. This is 82.5%, 33.6%, and 8.4% for English Wikipedia, respectively.

If not an actual value, maybe just state the tradeoff.

  • May have problems (flags more damage with more false positives)
  • Likely have problems (medium) [default level]
  • Very likely have problems (flags less damage with fewer false positives)
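To make the recall figures quoted above concrete: recall is the share of all damaging edits that a threshold catches, computed from the model's test statistics. A minimal sketch (the confusion-matrix counts below are invented for illustration; real figures come from the model's `test_stats`):

```python
# Illustration of the recall/precision terms used in this comment.
# The counts are hypothetical; only the definitions are the point.

def recall(tp: int, fn: int) -> float:
    """Share of all damaging edits that the threshold actually flags."""
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    """Share of flagged edits that are actually damaging."""
    return tp / (tp + fp)

# Hypothetical counts at a permissive threshold: out of 1,000 damaging
# edits in the test set, 825 are flagged and 175 are missed.
print(recall(tp=825, fn=175))  # 0.825 -- the "flags most problem edits" level
```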

I think I defined this incorrectly (for one thing, I didn't realize that the Sensitivity control also controls the "Hide probably good" level). So I'm closing this task as invalid until I sort it all out.

jmatazzoni renamed this task from Conform 'ORES Sensitivity' levels to the new ERI standards to Move 'ORES Sensitivity' controls and conform levels to the new ERI standards. Mar 16 2017, 12:53 AM
jmatazzoni reopened this task as Open.
jmatazzoni updated the task description. (Show Details)

Looks OK to me. You might want to signal how much damage each one catches. This is the recall field in the corresponding test_stat. This is 82.5%, 33.6%, and 8.4% for English Wikipedia, respectively.

Here is the source (if I'm correct).

If not an actual value, maybe just state the tradeoff.

  • May have problems (flags more damage with more false positives)
  • Likely have problems (medium) [default level]
  • Very likely have problems (flags less damage with fewer false positives)

This explanation is really easy to understand!

Thanks for the suggestions. I've changed the menu language along the lines suggested by @Halfak. See Description above. @Etonkovidova and @SBisson, please note the changes to the menu language above.

Catrope renamed this task from Move 'ORES Sensitivity' controls and conform levels to the new ERI standards to Conform ORES sensitivity levels to the new ERI standards. Mar 18 2017, 12:01 AM

I looked at doing this, but one issue is that while "lowest" sensitivity (0.90) aligns well with "verylikelybad" (0.92), and "low" sensitivity (0.70) aligns well with "likelybad" (0.75), "high" sensitivity (0.50) does not align at all well with "maybebad" (0.16). It's also a bit strange IMO that the maybebad threshold (0.16) is below the likelygood threshold (0.55).

I suppose I could align "high" (0.50) with "likelygood" but that seems like a hack, especially since the underlying definition is in terms of the precision for non-damaging, not the precision for damaging.

I assume you mean the negative of likelygood? Yes, that would be a hack. In any case, the May Have Problems filter is designed to capture 90% of damage. That is its purpose. In the process it also scoops up something like 75% of good edits. So I suppose the question is, is that an appropriate level for saying that things "need review"? It is, in fact, arguable that this filter really only makes sense in combination with some of the other features available in the full filtering system (highlighting, other filters that increase likelihood...).

I'm not sure what the numbers you're quoting are—are those ORES scores, or are those the damaging precision levels we've pegged these tests to? They look like the ORES scores, in which case, remember that they are not the same as precision, and a 75% score provides only something like 45% precision, already less than what most people would call "likely."

So, the question is, what are people using these "r"s for? Do they want to capture all damage? Or do they use them to say "ah, this edit 'needs review'"? If the former, let's go with the plan as written. If the latter, the logical thing to do would be to eliminate the third setting altogether. Remember that even the higher figure you cite (the negative of likelygood) provides a probability that edits are bad of only around 25%. @Catrope, do we know what percentage of ORES users have a) changed their settings at all and b) changed it to the broadest setting?
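The score-vs-precision distinction above is easy to miss, so here is a worked illustration. An ORES score is the model's per-edit probability estimate; precision is measured over all edits at or above a cutoff. The edit scores below are invented solely to show that a 0.75 cutoff need not yield 75% precision:

```python
# Worked sketch of "a 75% score is not 75% precision" with made-up data.

def precision_at_threshold(scored_edits, threshold):
    """Fraction of edits at/above `threshold` that are truly damaging."""
    flagged = [is_damaging for score, is_damaging in scored_edits
               if score >= threshold]
    return sum(flagged) / len(flagged)

# (score, actually_damaging) pairs -- a hypothetical sample
sample = [(0.98, True), (0.90, True), (0.85, False), (0.80, False),
          (0.78, True), (0.76, False), (0.75, False), (0.60, False)]

print(precision_at_threshold(sample, 0.75))  # ~0.43: well below 75%
```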

@Pginer-WMF, @Halfak-- opinions?

We discussed the possibility of eliminating the "lowest" preference, as well as the possibility of just pegging it to something other than maybebad. @Ladsgroup / @Halfak can you shed some light on why "lowest" was introduced and what it was calibrated to? T150224 does say fpr=0.1 but I don't know what that means.

For usage info see below; "lowest" is used by 27 users globally, 19 of which are on enwiki.

Default is soft, except on wikidata. Number in (parentheses) next to wiki name is total number of users with ORES enabled on that wiki.

plwiki (444)
8 hard
2 softest

ptwiki (1612)
37 hard

fawiki (773)
12 hard

nlwiki (312)
5 hard
1 softest

ruwiki (2998)
24 hard
3 softest

trwiki (735)
4 hard
1 softest

wikidatawiki (675) [default=hard]
9 soft
1 softest

cswiki (97)
3 hard

enwiki (15102)
159 hard
19 softest

@Halfak, the reason Roan is asking for guidance in his question above is that we are proposing to eliminate the "High (flags more edits)" option, on the grounds that—especially once we conform this to the new ORES levels, but even before—flagging so many false positives as "needs review" undermines the credibility of ORES. And not that many people seem to use this option anyway...

Except on Wikidata, where it is the default. There must be a reason for that—can you please comment? So maybe we'd keep the broad filter on Wikidata and keep it as default?

that we are proposing to eliminate the "High (flags more edits)" option,

Oh, yes, sorry, we were proposing eliminating "hard", not "softest", good catch. I got confused because these preferences are named in reverse order. In that case, "hard" is the default on Wikidata and used by more than 200 users, so it's not as unused as I thought it was.

Even if we keep the "High" setting, I don't think we should conform it to "maybebad" because that would make it more problematic. We'd instead pick a precision percentage to peg it to that is more reasonable for displaying the "r" than maybebad's precision.

Damage patrollers need to catch nearly all of the damaging edits. When you set a threshold for high recall (catching almost all of the damage), you're going to pick up a lot of not-damaging edits too. It's sort of like a squelch -- when you open it up, you get noise, but you can also pick up on weaker signals. The reason this is OK is because it filters out 90% of the edits that do not need review at all -- which saves a ton of work. Honestly, I think this focus on precision is backwards-braining an expert process. It simply does not reflect the actual process needs of damage patrollers. We should focus on recall first and foremost when choosing thresholds.
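A back-of-envelope sketch of the workload argument above. The base damage rate and false-positive rate below are assumptions chosen for illustration, not measured enwiki values:

```python
# How a high-recall flag still saves patrollers most of the work.
# All rates here are hypothetical; only the shape of the tradeoff matters.

def review_workload(n_edits, damage_rate, recall, fp_rate):
    """Return (edits flagged for review, damaging edits caught)."""
    damaging = n_edits * damage_rate
    good = n_edits - damaging
    flagged = damaging * recall + good * fp_rate
    caught = damaging * recall
    return flagged, caught

flagged, caught = review_workload(
    n_edits=10_000, damage_rate=0.03, recall=0.90, fp_rate=0.10)
# Reviewing ~1,240 flagged edits (instead of all 10,000) catches 270 of
# the 300 damaging ones: the squelch is open, but ~88% of the stream
# never needs a look.
print(flagged, caught)
```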

OK on to the specific concerns:

edits are in bad someplace around only %25

That's 10 times as likely to be vandalism as any random edit. You might say "that's not very accurate", but I think a patroller would say "that's very useful".
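The "10 times as likely" arithmetic, made explicit. The 2.5% base damage rate below is an assumption for illustration; the 25% figure is the precision quoted earlier in this thread:

```python
# Lift of a flagged edit over a random edit, under an assumed base rate.

base_rate = 0.025   # assumed share of all edits that are damaging
precision = 0.25    # share of flagged edits that are damaging (quoted above)
lift = precision / base_rate
print(lift)  # 10.0 -- a flagged edit is 10x likelier to be damaging
```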

flagging so many false positives as "needs review" undermines the credibility of ORES.

ORES should not have credibility. ORES is a stupid/simple "first pass" that helps triage edits for human review. If people think that ORES is stupid but also useful, that's perfect.

And not that many people seem to use this option anyway...

That's OK. I am a fan of having the default not reflect the patrollers' use-case because damage patrolling is a specific, high expertise job and most people won't be doing it. However, the option must be available for people who are doing generalized damage patrolling.

I plan to work on this tomorrow morning if nobody does it between now and then.

@jmatazzoni I think the first 2 points of "functional changes" have been done as part of the preferences tickets. Can you confirm that everything else in the task description is up to date with the current thinking and discussion? Thanks!

Change 346188 had a related patch set uploaded (by Sbisson):
[mediawiki/extensions/ORES@master] Align damaging thresholds to filters thresholds

https://gerrit.wikimedia.org/r/346188

Change 346188 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Align damaging thresholds to filters thresholds

https://gerrit.wikimedia.org/r/346188

Change 350316 had a related patch set uploaded (by Catrope):
[mediawiki/extensions/ORES@master] Follow-up c047cd54d69ed: rename oresDamagingPref values back

https://gerrit.wikimedia.org/r/350316

Change 350316 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Follow-up c047cd54d69ed: rename oresDamagingPref values back

https://gerrit.wikimedia.org/r/350316

Change 351137 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/extensions/ORES@master] Deduplicate ores-help-damaging-pref

https://gerrit.wikimedia.org/r/351137

Change 351137 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Deduplicate ores-help-damaging-pref

https://gerrit.wikimedia.org/r/351137

@jmatazzoni

I've tested four edits with the following ORES scores:

1. 0.998 - "verylikelybad"
2. 0.998 - "verylikelybad"
3. 0.823 - "likelybad"
4. 0.510 - "maybebad"

And three 'Prediction threshold' levels:

  • May have problems (flags most problem edits but includes many false positives)
  • Likely have problems (medium probability)
  • Very likely have problems (flags few false positives but finds a smaller % of problem edits)

Watchlist will display:

(1) With May have problems (flags most problem edits but includes many false positives)
All four edits are displayed highlighted: two with intense yellow highlighting and two with a pale pinkish color.
'Hide probably good edits' - all four edits are displayed.

(2) Likely have problems (medium probability)
Three edits are displayed highlighted: 0.998, 0.998, and 0.823
0.510 - "maybebad" won't be highlighted.
'Hide probably good edits' enabled option will display only 0.998 scored edits ("verylikelybad"). 0.823 scored edit will be hidden.

(3) Very likely have problems (flags few false positives but finds a smaller % of problem edits)
Only 0.998 scored edits are displayed
'Hide probably good edits' enabled option will display only 0.998 scored edits ("verylikelybad").

So, only case (2) presents potentially confusing results to users: users see a highlighted, 'r'-marked result, but 'Hide probably good edits' will hide it.

QA Recommendation: Product should weigh in

@Etonkovidova notes:

  (2) Likely have problems (medium probability)

Three edits are displayed highlighted: 0.998, 0.998, and 0.823
0.510 - "maybebad" won't be highlighted.
'Hide probably good edits' enabled option will display only 0.998 scored edits ("verylikelybad"). 0.823 scored edit will be hidden.

Based on the thresholds for en.wiki documented in this spreadsheet, that does not seem like correct behavior. .823 is the beginning of the "Likely" range, so this shouldn't be hidden with those settings.

But thinking about all this raises a question about what the appropriate function of the "Show only likely problem edits (and hide probably good edits)" is supposed to be. It strikes me there are two possible interpretations of how it should work that I'd like @Ladsgroup , @Halfak and @Pginer-WMF to comment on:

  • Model #1, the "threshold" is a simple dividing line: Under this model, if I set my "prediction threshold" to "Likely have problems" and check "Show only likely problem edits (and hide probably good edits)", then everything above the "Likely have problems" floor (.823 on en.wiki) will get the "r" and colored highlighting, and everything below .823 will be hidden.
  • Model #2, "probably good" means "probably good": In this model, the only stuff that gets hidden as "probably good" are edits in the "Very likely good" range (which in en.wiki runs from 0 to .398). In this model, some edits might logically exist in a gray area between "likely problem edits" (which get colored highlighting) and "probably good edits" (which get hidden). These would simply be displayed with no marking.
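The two models above could be sketched as predicates over an edit's ORES damaging score. The .823 and .398 cutoffs are the en.wiki figures quoted in this comment; treat them as illustrative, not canonical:

```python
# Sketch of the two candidate "hide probably good" behaviors.
# Thresholds are the en.wiki figures quoted above (illustrative only).

LIKELY_BAD_FLOOR = 0.823       # start of "Likely have problems" on en.wiki
VERY_LIKELY_GOOD_CEIL = 0.398  # top of "Very likely good" on en.wiki

def hidden_model_1(score: float) -> bool:
    """Model #1: the threshold is a dividing line; everything below it hides."""
    return score < LIKELY_BAD_FLOOR

def hidden_model_2(score: float) -> bool:
    """Model #2: only edits that are actually 'probably good' hide."""
    return score <= VERY_LIKELY_GOOD_CEIL

# A 0.510 edit ("maybebad") is hidden under model #1 but shown, unmarked,
# under model #2 -- the gray area this comment describes.
print(hidden_model_1(0.510), hidden_model_2(0.510))  # True False
```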

So, what do you guys think? What do Classic ORES users want? (Based on your answer, we may want to change the feature name.)

Re-tested with the ORES scoring ranges set in ORES model properties:

Everything works according to spec, and, probably due to some re-adjusting of the ORES scoring ranges, my comment below is no longer valid:

(2) Likely have problems (medium probability)
Three edits are displayed highlighted: 0.998, 0.998, and 0.823
0.510 - "maybebad" won't be highlighted.
'Hide probably good edits' enabled option will display only 0.998 scored edits ("verylikelybad"). 0.823 scored edit will be hidden.

There might be some usability issues with the 'Prediction threshold' setting, but they are outside the scope of this phab task.