Research how to present ORES scores to users in a way that is understandable and meets their reviewing goals
Closed, Resolved, Public

Description

By dividing the spectrum of ORES scores into discrete levels and labeling these in a way that makes their predictive value clear, enable users to effectively use the system to meet their reviewing needs. In the immediate future, this system will be used on the ERI revisions to Recent Changes and in the ReviewStream feed.

As of Nov. 1, 2016, here's how we're defining and describing these levels:

CONTRIBUTION QUALITY [Damaging]

Very likely good
Highly accurate at finding almost all problem-free edits.
May have problems
Finds most flawed or damaging edits but with lower accuracy.
Likely have problems
Finds half of flawed or damaging edits with medium accuracy.
Very likely have problems
Highly accurate at finding the most obvious 30% of flawed or damaging edits.

USER INTENT [Good Faith]

Very likely good faith
Highly accurate at finding almost all good-faith edits.
May be bad faith
Finds most bad-faith edits but with a lower accuracy.
Likely bad faith
Highly accurate at finding the most obvious 20% of bad-faith edits.

Related Objects

Status	Assigned	Task
Resolved	• DannyH	T171977 Annual Plan 2017-2018, Audiences 5: Increase current editor retention and engagement
Resolved	• DannyH	T171981 Annual Plan 2017-2018, Audiences 5, Goal 2: Give better ways to monitor contributions
Resolved	• jmatazzoni	T157642 Graduate New Filters UX out of beta on Recent Changes on ALL wikis
Resolved	• jmatazzoni	T144458 Launch ERI RC page features as a Beta Feature to all wikis
Resolved	Mooeypoo	T144448 Build all front-end elements for the new Recent Changes (RC) Page user interface
Resolved	Mooeypoo	T149385 Approved interface text for RC page interface elements
Resolved	SBisson	T164997 Change language describing "Likely" filters to avoid mentioning "May" filters
Resolved	Mooeypoo	T149391 Build user interface for Active Filter Display Area
Resolved	Mooeypoo	T156427 Implement the Conflict display states and messages
Resolved	Mooeypoo	T156861 RCFilters UI: Implement 'conflicting' property
Resolved	Mooeypoo	T160803 Implement corrected Conflict State tooltips and Results Area messages
Resolved	Mooeypoo	T160935 Make the generic "recentchanges-noresults" message bold
Resolved	• Mattflaschen-WMF	T161325 Remove Category Changs vs. ORES from list of Conflict States
Resolved	SBisson	T161665 Implement conflict state between Non-Minor Edits and Wikidata (and remove conflict between Minor Edits and Wikidata)
Resolved	• jmatazzoni	T156534 RC filters - some filter selections cause "No active filters. All contributions are shown" to be wrongly shown
Resolved	Mooeypoo	T156864 RCFilters: Implement 'subset' property for filter items
Resolved	Mooeypoo	T156860 RCFilters UI: Implement 'full coverage' status for groups
Resolved	Mooeypoo	T159966 Unbalanced spacing above and below the active filters label
Resolved	• jmatazzoni	T149435 Build user interface for the Filter Search Bar
Resolved	Mooeypoo	T156429 RC - Selecting all filters in a group makes them grey out
Resolved	• jmatazzoni	T156214 RC search filter bar double click - suggestions are not displayed
Resolved	SBisson	T156215 RC - filters group names and descriptions should be searchable
Resolved	Mooeypoo	T157189 RC filters: filter selector drop down is misplaced in RTL
Resolved	Mooeypoo	T159768 Implement arrow keys in the Dropdown Filter Panel for results found by filter search
Resolved	Mooeypoo	T161493 When pressing esc key the search field should lose the input focus
Resolved	• jmatazzoni	T149452 Build user interface for the Dropdown Filter Panel
Resolved	• jmatazzoni	T158118 Jarring scrolling bug in dropdown panel for RC Page filters
Open	None	T158956 The filter panel for Recent changes does not adapt well to narrow windows
Resolved	Mooeypoo	T159186 Implement 'What's This' Links on the dropdown filter panel
Resolved	• jmatazzoni	T160779 Change the text of the 'What's This?' popups
Resolved	Mooeypoo	T160213 Close small gap between top of the Dropdown Filter Panel and the bottom of the Filter Search Bar
Resolved	Mooeypoo	T149467 Build user interface for the Highlight Tools and implement highlighting in the Edit Results List
Declined	None	T159503 [minor] RC filters -should 'Restore default filters' return 'Highlight results' button to inactive state?
Resolved	Mooeypoo	T159586 RC filters - colored bullet points displayed overlapped
Resolved	Mooeypoo	T159587 RC filters - add tooltips to 'Highlight results' button and to the highlight menu
Resolved	• Mattflaschen-WMF	T151873 Redirect when a URL cannot be adapted to the new filter system for Recent Changes
Resolved	• Mattflaschen-WMF	T152754 Configure filters in a single extensible place
Resolved	• jmatazzoni	T150059 Make sure all Preferences for Recent Changes are compatible with new filtering system/page tools (and that users' preferences carry over)
Resolved	None	T159300 Turn off 'classic' ORES highlighting on the RC page
Invalid	None	T159369 Send data structure and messages for filters used in unstructured UI
Resolved	SBisson	T153949 Update RC page results without reloading the page (AJAX) when filters are changed
Resolved	Mooeypoo	T156532 RCFilters UI: Split ResourceLoader modules
Resolved	• jmatazzoni	T158006 The Active Filter Display Area - adjustments to RC page layout
Resolved	SBisson	T160092 Client should handle default for 'string_options' groups
Resolved	Catrope	T166377 Change four-dot ellipses to three-dots in 2 filter descriptions
Resolved	• jmatazzoni	T150715 Release strategy for RC page improvements: what wikis get the new features when?
Resolved	SBisson	T158819 Make it so the Help link on the RC page links to beta feature help for beta users
Resolved	Pginer-WMF	T147054 Design: banners for help pages directing users to either the beta help or the old RC Page help
Duplicate	None	T142783 Add ORES Good Faith test to Recent Changes pages (for wikis on which ORES is enabled)
Resolved	Pginer-WMF	T142785 Design interface for displaying and filtering ORES Good-Faith and Damaging scores as well as New Users flag
Resolved	• jmatazzoni	T145875 Create and maintain Edit Review Improvements documentation
Resolved	Trizek-WMF	T146669 Create dedicated pages for ERI Recent Changes Beta project
Resolved	• jmatazzoni	T146333 Research how to present ORES scores to users in a way that is understandable and meets their reviewing goals
Resolved	Catrope	T150959 Integrate a feedback page link in Recent Changes Beta filters

Event Timeline

These are the CORRECTED versions of the tables @Halfak created in T146280. There are also annotated versions that show where certain thresholds might be set in this Google Sheets document.

Damaging

Score	Precision	Neg prec.	Recall	True-neg. rate
95%	100.0%	96.2%	5.5%	100%
85%	56.1%	97.3%	35.0%	98.9%
75%	43.4%	97.7%	46.5%	97.5%
65%	34.0%	98.2%	59.0%	95.2%
55%	29.2%	98.7%	72.0%	92.7%
45%	25.0%	98.9%	77.1%	90.4%
35%	20.9%	99.2%	82.8%	87.2%
25%	17.7%	99.4%	87.3%	83.2%
15%	14.9%	99.5%	91.1%	78.2%
5%	9.2%	99.8%	97.5%	60.0%

Good faith

Score	Precision	Neg Prec.	Recall	True-neg. rate
95%	99.9%	8.1%	70.7%	96.0%
85%	99.7%	12.4%	82.5%	91.9%
75%	99.6%	15.2%	87.5%	84.8%
65%	99.4%	18.8%	90.9%	77.8%
55%	99.2%	22.9%	93.7%	69.7%
45%	99.1%	28.3%	95.5%	65.7%
35%	98.9%	36.3%	97.2%	59.2%
25%	98.6%	41.4%	98.3%	44.4%
15%	98.1%	49.1%	99.3%	26.0%
5%	97.7%	65.8%	99.9%	7.4%
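As a rough sketch of how a threshold for each filter level could be read off a table like the Damaging one above: the rows below are copied from that table, while the recall targets and the helper name `strictest_threshold_for_recall` are purely illustrative assumptions, not the thresholds the team chose. It also assumes an edit is selected when its score is at or above the "Score" value.

```python
# Rows copied from the Damaging table: (score threshold %, precision %, recall %).
DAMAGING_ROWS = [
    (95, 100.0, 5.5),
    (85, 56.1, 35.0),
    (75, 43.4, 46.5),
    (65, 34.0, 59.0),
    (55, 29.2, 72.0),
    (45, 25.0, 77.1),
    (35, 20.9, 82.8),
    (25, 17.7, 87.3),
    (15, 14.9, 91.1),
    (5, 9.2, 97.5),
]


def strictest_threshold_for_recall(rows, min_recall):
    """Return the highest-score row whose recall is at least min_recall, or None."""
    candidates = [row for row in rows if row[2] >= min_recall]
    return max(candidates, key=lambda row: row[0]) if candidates else None


# Hypothetical recall targets (in percent) for three filter levels.
TARGETS = {"very likely": 30.0, "likely": 50.0, "may": 90.0}

levels = {
    name: strictest_threshold_for_recall(DAMAGING_ROWS, target)
    for name, target in TARGETS.items()
}

for name, (score, precision, recall) in levels.items():
    print(f"{name}: score >= {score}%, precision {precision}%, recall {recall}%")
```

With these assumed targets, the strictest qualifying rows are the 85%, 65%, and 15% score thresholds, which shows how precision drops as the recall target rises.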

• jmatazzoni removed a parent task: T146280: Produce tables of stats for damaging and goodfaith models.Sep 21 2016, 10:37 PM

Halfak merged a task: T146334: Research how to present ORES scores to users in a way that is understandable and meets their reviewing goals.Sep 21 2016, 10:38 PM

Halfak mentioned this in T146280: Produce tables of stats for damaging and goodfaith models.Sep 21 2016, 10:40 PM

• jmatazzoni edited projects, added Collab-Team-Q1-July-Sep-2016; removed Collaboration-Team-Triage.Sep 21 2016, 10:41 PM

• jmatazzoni moved this task from Untriaged to Product/Design Work on the Collab-Team-Q1-July-Sep-2016 board.

• jmatazzoni claimed this task.Sep 21 2016, 10:43 PM

• jmatazzoni updated the task description. (Show Details)

• jmatazzoni added a parent task: T142785: Design interface for displaying and filtering ORES Good-Faith and Damaging scores as well as New Users flag.Sep 21 2016, 10:46 PM

For those not following the whole discussion, when thinking about this the concepts I found most helpful are:

Precision. How accurate or strict we are when selecting contributions given a criterion. The more precise, the fewer false positives we'll have. For example, a 99.9% precision prediction of good-faith edits will give you good-faith edits most of the time, with very few (0.1%) edits that are not good faith. The false positives are labelled as "1-Prec" in the tables.
Coverage ("Recall" in the tables). How many of the relevant contributions we are considering given a criterion. For example, the 99.9% precision filter is so strict that it will catch only 70.7% of the potential good-faith edits. For activities where users want to explore all possible good-faith edits (even if that implies being exposed to more false positives), a filter with wider coverage may be needed.

Precision and coverage are obviously competing requirements. The stricter you are in selecting only items that very surely meet your criteria (higher precision), the more items you leave out of the selection (lower recall).
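To make the tradeoff concrete, here is a minimal sketch that computes precision and coverage (recall) for a strict and a broad threshold. The scores and labels are made-up toy data, not ORES output, and the function name `precision_recall` is only illustrative.

```python
def precision_recall(scored_edits, threshold):
    """Select edits with score >= threshold and return (precision, recall).

    scored_edits is a list of (score, is_positive) pairs.
    """
    selected = [is_pos for score, is_pos in scored_edits if score >= threshold]
    total_positives = sum(is_pos for _, is_pos in scored_edits)
    true_positives = sum(selected)
    # Convention used here: an empty selection has no false positives.
    precision = true_positives / len(selected) if selected else 1.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# Made-up toy data: (model score, whether the edit really is good faith).
EDITS = [
    (0.99, True), (0.97, True), (0.90, True), (0.80, False),
    (0.70, True), (0.60, True), (0.40, False), (0.20, False),
]

strict = precision_recall(EDITS, 0.85)  # few false positives, lower coverage
broad = precision_recall(EDITS, 0.50)   # full coverage, some false positives

print("strict:", strict)
print("broad:", broad)
```

On this toy data the strict threshold selects only good-faith edits but misses two of the five, while the broad threshold finds all five at the cost of one false positive.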

Halfak moved this task from Parked to Monitor (long term) on the Machine-Learning-Team (Active Tasks) board.Sep 22 2016, 2:23 PM

Very well said, @Pginer-WMF.

Updated table: T146280#2655978

This annotated Sheets document reproduces the number tables above and shows the proposed threshold levels in the context of the full range of score data (levels are approximate and a bit fudged because of a desire to round off numbers).

Below please find the proposed UI text for the various levels. I tried numerous variations on the wording and arrangement. In particular, I tried to see if I could communicate the information without resorting to % numbers. In the end I judged that a failure, in part because the levels themselves aren't symmetrical—there are an unequal number of positive and negative filters—and partly because the accuracy and recall numbers the levels represent vary in asymmetrical and inconsistent ways. Also, we believe it to be the case that vandalism fighters, for example, will in fact want to know what % of vandalism is identified; saying "some" isn't helpful, and saying "a third" is just a less concise way to say 30%.

All of which is to say the numbers are perhaps not the most satisfying feature here, but I believe they serve a purpose. And with a little polishing I hope this will be good enough to test. (I suspect that if we want to improve the descriptions further, some kind of data visualization may be needed—symbols for false positives vs. recall or some such.)

Two last related changes to the previous UI descriptions are worth noting: 1) You'll now find only 1 positive threshold for both tests; the numbers, all in the 90%s, should explain why, and 2) because of that, the order of filters has been reversed, so that the negative qualities are now at the top. Although we prefer to accentuate the positive, with only 1 positive filter for each test, it was much harder to see that the filters are on a continuum when that 1 filter went first.

And with that, here is the suggested language:

CONTRIBUTION QUALITY

Very likely problems
80% prediction accuracy. Finds 30% of flawed or damaging edits.

Likely problems
40% prediction accuracy. Finds 50% of flawed or damaging edits.

May have problems
15% prediction accuracy. Finds 90% of flawed or damaging edits.

Very likely good
99% prediction accuracy. Finds 92% of problem-free edits.

USER INTENT

Likely bad faith
50% prediction accuracy. Finds 25% of bad-faith edits.

May be bad faith
20% prediction accuracy. Finds 75% of bad-faith edits.

Very likely good faith
99% prediction accuracy. Finds 90% of good-faith edits.

Ready for a page on Mediawiki.org?

Use design to emphasize search objects and +/- switches they undergo
I was looking at the version of the ORES language above (which I put there Fri.). And I finally put my finger on something that was bothering me about that arrangement: it doesn't adequately emphasize an incredibly important aspect of the functionality. Which is, that the filters change from BAD to GOOD. The object of the searches literally reverses. E.g., the user is looking for problem edits, then, suddenly, for edits with no problems. And the same for good faith. The results are very different! But the options look similar and are under the same rubric.

By emphasizing the objects of the searches and the -/+ switches they go through, we make it much easier for users to apprehend what the tools do.
@Pginer-WMF, for testing, can you please use design to emphasize the object dimension for both filter groups. Use color, typography, graphics, etc.

E.g., see the example below. There I've used capitalization as a crude but effective way to highlight this important aspect. I think it's already a big improvement.

CONTRIBUTION QUALITY

Very likely PROBLEMS
80% prediction accuracy. Finds 30% of flawed or damaging edits.

Likely PROBLEMS
40% prediction accuracy. Finds 50% of flawed or damaging edits.

May have PROBLEMS
15% prediction accuracy. Finds 90% of flawed or damaging edits.

Very likely GOOD
99% prediction accuracy. Finds 92% of problem-free edits.

USER INTENT

Likely BAD FAITH
50% prediction accuracy. Finds 25% of bad-faith edits.

May be BAD FAITH
20% prediction accuracy. Finds 75% of bad-faith edits.

Very likely GOOD FAITH
99% prediction accuracy. Finds 90% of good-faith edits.

@jmatazzoni I have updated the prototype to reflect the proposed changes. I have questions to better understand the reasons behind these proposals:

1. Why order options from bad to good?

For both groups of filters ("contribution quality" and "user intent"), filters are organised from bad (damaging/bad-faith edits) to good (good/good-faith edits).

Given that one of our project goals is to "Ensure good-faith new editors have more constructive, less discouraging experiences of edit and article review.", we may want to emphasise those filters that help to provide a constructive review.

Considering this, I think it makes sense to order the filters starting with the "good" filters instead, in order to (a) encourage more constructive behaviour and (b) help users discover a new kind of activity they may not have initially in mind or be used to.

2. Why do we avoid explaining the purpose of filters?

The current descriptions expose percentages to users in a consistent but also repetitive and neutral way. This requires users to figure out what these percentages mean and which are the most suitable for their activity.

I think we need to translate the cold numbers into concepts that are more intuitive to understand for our users. In particular, it should be clear which filter is about being strict (e.g., review only the most clear vandalism) and which one is about covering a much wider spectrum of results (e.g., review all potential vandalism). I created some examples emphasising that:

Contribution quality:

Very likely have problems. Edits with a high probability (80%) of being flawed or damaging.
Likely have problems. Edits considered problematic with medium precision, but including more (50%) of all the potential flawed or damaging edits.
May have problems. Edits considered problematic with lower precision, but covering most (90%) of all the potential flawed or damaging edits.
Very likely good. Edits with a high probability (99%) of being good.

User intent:

Likely bad faith. Edits with a medium probability (50%) of being made in bad-faith.
May be bad faith. Edits considered as bad-faith with lower precision, but covering most (75%) of all the potential bad-faith edits.
Very likely good faith. Edits with a high probability (99%) of being made in good-faith.

As you see in the examples I'm not providing all the values for all the filters like in tabular data but providing a different message (either "precision" or "less precision but more coverage") depending on the nature of the filter (with percentages as clarifications, not the main piece of information).

3. Why do we think the confusion is between the good and bad concepts?

All filter groups include either opposed concepts ("Registered" vs. "Unregistered") or a set of concepts in a scale ("Newcomer, Experienced, More experienced").
When you describe the need to emphasise the distinction between good and bad, which are the issues you expect users to run into? Not finding the "good" filter because it is after 3 "bad" filters? Picking the wrong filter by mistake? It would be good to capture those concerns in the research plan to better understand how users are affected by them.

In the prototype I have adjusted the bold style to make the good/bad contrast more apparent, but I'm not sure we need to emphasise more strongly that difference since it does not seem an exception to what happens with other filter groups.

I think that good and bad are clear opposites in meaning. The part I found most problematic, as described above, is how to map the probability qualifiers (likely, very likely, maybe) to the different use cases.

4. "Problems" or "have problems"?

In your proposal I see "Very likely problems" but "May have problems".
In this context, where we have a list of contributions, selecting those that "may have problems" seems to read more naturally than selecting "very likely problems".

In the prototype I kept the "have problems" structure, since I think we aim here more for clarity than brevity. But feel free to provide more details on the intent of the current proposal regarding this.

Getting to Pau's questions above

Why order options from bad to good?

I agree completely with all the points you raise, which is why we've always ordered these with the "good" options first. But up until now, we had two good and two or three bad options. When, based on the data, we dropped down to having only 1 good option, the situation became different.

Previously, the pattern was in a progression, and it was easy to get a feel for how it worked--like this:
+2
+1
-1
-2

It was clear you were on a gradient that went down then up. When there was only one good option then the switch to bad, the pattern was harder to see, like this:

+1
-1
-2

So, to my eye, it made sense to establish that there is an orderly diminution of bad that leads to good. (Also, realistically, I suspect most people are in fact going to be looking for damage rather than its opposite.) See below.

CONTRIBUTION QUALITY [good first]

Very likely good
99% prediction accuracy. Finds 92% of problem-free edits.

May have problems
15% prediction accuracy. Finds 90% of flawed or damaging edits.

Likely problems
40% prediction accuracy. Finds 50% of flawed or damaging edits.

Very likely problems
80% prediction accuracy. Finds 30% of flawed or damaging edits.

CONTRIBUTION QUALITY [bad first]

Very likely problems
80% prediction accuracy. Finds 30% of flawed or damaging edits.

Likely problems
40% prediction accuracy. Finds 50% of flawed or damaging edits.

May have problems
15% prediction accuracy. Finds 90% of flawed or damaging edits.

Very likely good
99% prediction accuracy. Finds 92% of problem-free edits.

Why do we think the confusion is between the good and bad concepts?

You've got someone looking at the list of Damaging filters. She'll see three options, they are all finding damage and the variation is in the likelihood of damage--80%, 40%, 15%. She thinks she knows the pattern. Then, suddenly, there is an option that is not finding damage but in fact its opposite.

I think my concern about this may be partly another result from the switch to just one positive option among a list of negative ones. But actually I had this nagging doubt about our arrangement all along. The thing is, these filters are, uniquely, changing on two axes: the probability varies AND the objective switches. I tried to find ways to emphasize both just with wording but failed. So I think usability improves here when the design can highlight the objectives, "good faith" vs. "bad faith." When I look at the crude versions above, with the all caps for the objects of the searches, my gut tells me that this emphasis is helpful. It's bringing out a key element that one would otherwise have to read carefully to understand.

Why do we avoid explaining the purpose of filters?

There are some good ideas here and we can definitely keep working to make the explanations more helpful. Thanks for including the ones I gave you in the prototype, as a starting point.

The problem I found when I tried to avoid the numbers was that I ended up using words that were just more wordy and less precise ways of saying the numbers. So in the end I felt like I wasn't really helping. But I'll go through these suggestions very carefully. It may well be that we don't need to give both figures for all scores.

• jmatazzoni moved this task from Backlog to Design / Product on the Edit-Review-Improvements board.Sep 26 2016, 7:01 PM

In today's Collab Team discussion there were a lot of comments about the ORES language above. People were put off by the numbers, among other issues. I tried a lot of variations, and have TWO suggested arrangements:

SUGGESTION #1
This is the best I can do with the current groupings, I think. Here is what has changed:

In the section headers, we now define "accuracy" in terms of "false positives." This not only helps explain what we mean by accuracy, it also enables us to remove 8 instances of the term "prediction accuracy," from the various options, cutting verbiage significantly.
I removed all numbers and used more relative terms (e.g., "medium accuracy"). This is arguably friendlier, but on reflection it's also probably more appropriate, since we're not really defining the terms enough to meaningfully use such precise % numbers.
I flipped the order of the info in the descriptions so that the recall data goes first before accuracy. This works better, I think, since it gets more quickly to the user's most obvious question--"why on earth would I use a less accurate filter?" Also it doesn't just repeat the accuracy info already provided in the filter name.
Finally, I rearranged the options so that the positive choices come first. BUT, instead of listing the other choices so that we have a (broken and asymmetrical) progression from good to bad, I'm suggesting we list the most accurate tests first, then the less accurate tests.

Here is what I came up with:

CONTRIBUTION QUALITY (higher prediction “accuracy” = fewer false positives)

Very likely GOOD
Finds almost all problem-free edits with very high accuracy.

Very likely HAVE PROBLEMS
Finds only a third of flawed or damaging edits but with high accuracy.

Likely HAVE PROBLEMS
Finds half of flawed or damaging edits with medium accuracy.

May HAVE PROBLEMS
Finds most flawed or damaging edits but with lower accuracy.

USER INTENT (higher prediction “accuracy” = fewer false positives)

Very likely GOOD FAITH
Finds almost all good-faith edits with a very high accuracy.

Likely BAD FAITH
Finds a quarter of bad-faith edits with medium accuracy.

May be BAD FAITH
Finds most bad-faith edits but with a lower accuracy.

NOTE: See Suggestion #2 below before responding to this.

SUGGESTION #2
This suggestion is more radical--and had one tragic flaw*. Informal testing with my colleagues indicates that the above solution still causes problems because of the way that, under each subhead, the options vary in two dimensions at the same time: the probability varies AND the +/- valence of the objectives reverses. In my tests, people missed the latter--despite its being right there in bold letters (I added the allcaps after).

I continue to believe information design could be helpful here. But the problem seems to be that people read the descriptions, and while their brains are working out the differences between them, they miss the fact that the filter names have reversed so that you're now searching for the opposite quality.

What's the radical solution? In most meaningful ways, the good faith test and the bad faith test are not the same test but varying somewhat, they are different tests. So why not show them that way? Like so:

BAD EDIT QUALITY (higher prediction “accuracy” = fewer false positives)

Very likely BAD
Finds only a third of flawed or damaging edits but with high accuracy.

Likely BAD
Finds half of flawed or damaging edits with medium accuracy.

May be BAD
Finds most flawed or damaging edits but with lower accuracy.

GOOD EDIT QUALITY (higher prediction “accuracy” = fewer false positives)

Very likely GOOD
Finds almost all problem-free edits with very high accuracy.

BAD FAITH (higher prediction “accuracy” = fewer false positives)

Likely BAD FAITH
Finds a quarter of bad-faith edits with medium accuracy.

May be BAD FAITH
Finds most bad-faith edits but with a lower accuracy.

GOOD FAITH (higher prediction “accuracy” = fewer false positives)

Very likely GOOD FAITH
Finds almost all good-faith edits with a very high accuracy.

*Reader, have you guessed the tragic flaw? It's this: because of how we've set this all up, imagine someone who wants to filter for Very Likely Bad Faith. He checks the box, and now the system looks for the intersection of that with, you guessed it, Very Likely Good Faith (result = null).
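The flaw can be sketched in a few lines. This assumes, as an illustration only, that filters within a group are combined with OR and that separate groups are intersected (AND); the edit scores, the score ranges, and the name `apply_groups` are hypothetical.

```python
# Hypothetical edits with a made-up good-faith score in [0, 1].
EDITS = {"a": 0.98, "b": 0.60, "c": 0.10, "d": 0.03}

# Hypothetical, non-overlapping score ranges for two filters.
FILTERS = {
    "very_likely_good_faith": lambda score: score >= 0.95,
    "very_likely_bad_faith": lambda score: score <= 0.05,
}


def apply_groups(edits, groups):
    """Assumed semantics: OR within a group, AND (intersection) across groups."""
    result = set(edits)
    for group in groups:
        if not group:
            continue  # a group with nothing selected imposes no restriction
        matched = {
            edit_id
            for edit_id, score in edits.items()
            if any(FILTERS[name](score) for name in group)
        }
        result &= matched
    return result


# Both filters in ONE group: the union of their results.
same_group = apply_groups(
    EDITS, [["very_likely_good_faith", "very_likely_bad_faith"]]
)

# The same two filters split into TWO groups: the intersection is empty.
split_groups = apply_groups(
    EDITS, [["very_likely_good_faith"], ["very_likely_bad_faith"]]
)

print(sorted(same_group), sorted(split_groups))
```

Under these assumptions, keeping the two filters in one group returns edits from both ends of the scale, whereas splitting them into separate groups returns nothing when both are active.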

But I think there could be a way around this via design. E.g., here is a rough solution that keeps the tests in their respective functional sections but creates a kind of "subsection" to indicate that there are really two tests here.

• Mattflaschen-WMF subscribed.Sep 26 2016, 11:44 PM

In T146333#2669333, @jmatazzoni wrote:

SUGGESTION #1

Thanks for the updates on this @jmatazzoni. I think suggestion #1 works well.
I think it differentiates the levels well and explains the purpose.

My only suggestion is for the case where we have more similar levels ( x-have problems), where I'd propose to emphasise why just one third are selected. Otherwise, it is not obvious that we are selecting one third because we are being very strict on what we select.

Maybe it's a bit verbose but something like this would help:

Very likely HAVE PROBLEMS
The edits most clearly identified as damaging. Finds only one third of all potentially problematic edits but with high precision.

Following up here on @Pginer-WMF's suggestion:

My only suggestion is...to emphasize why just one third are selected.

I think there may be something here. The fact is, neither precision nor recall is THE important factor. Rather, different filters were designed to prioritize one or the other. By emphasizing the key factor for each filter by naming it first, and switching when appropriate, we might signal the purpose more clearly.

E.g., the first two filters below are about accuracy, after that the rest are about finding more stuff.

CONTRIBUTION QUALITY (higher prediction accuracy = fewer false positives)

Very likely GOOD
Highly accurate at finding almost all problem-free edits.

Very likely HAVE PROBLEMS
Highly accurate at finding the top 30% of flawed or damaging edits.

Likely HAVE PROBLEMS
Finds half of flawed or damaging edits with medium accuracy.

May HAVE PROBLEMS
Finds most flawed or damaging edits but with lower accuracy.

USER INTENT (higher prediction accuracy = fewer false positives)

Very likely GOOD FAITH
Highly accurate at finding almost all good-faith edits.

Likely BAD FAITH
Highly accurate at finding the top 20% of bad-faith edits.

May be BAD FAITH
Finds most bad-faith edits but with a lower accuracy.

What do you think? Note that I've used the word "top" in a few of these -- e.g., "top 20%" It may actually mean "most obvious" rather than "most damaging," but I think this makes better, more intuitive sense. We're skimming off the top...

Would it be possible to have an example (prototype) with real examples, based on real contributions? That would make that problem a little bit easier to understand.

In T146333#2673479, @Trizek-WMF wrote:

Would it be possible to have an example (prototype) with real examples, based on real contributions? That would make that problem a little bit easier to understand.

The current prototype provides a set of real contributions, and it makes the filters work realistically for those aspects that were exposed in the current data. Unfortunately, data was not available for all filters to be realistic (in those cases properties are assigned randomly, but trying to be consistent). For example:
"Likely have problems" shows edits that were actually marked as "damaging" by ORES (maybe with a different level than the one we'll be using). For "Very likely have problems" we are just selecting a small subset of the damaging ones, but those do not necessarily have higher ORES scores.
Other filters such as those related to user types are realistic, while other filters such as categories are completely random.

In short: filters are as realistic as we could make them with the current data exposed in recent changes. There is a gap with reality, but it may recreate an experience close enough to reality, at least to learn from it. So it would be very useful if you could use it as if it were real and share your experience.

In T146333#2672431, @jmatazzoni wrote:

Following up here on @Pginer-WMF's suggestion:

What do you think? Note that I've used the word "top" in a few of these -- e.g., "top 20%" It may actually mean "most obvious" rather than "most damaging," but I think this makes better, more intuitive sense. We're skimming off the top...

The proposal works for me. I think "top 20%" reads quite naturally. There is a risk of users understanding it as the "worst vandalism", but I'm ok with taking that risk if the wording helps to easily get the general purpose.

In T146333#2673539, @Pginer-WMF wrote:

So it would be very useful if you could use it as if it were real and share your experience.

I have to say I'm impressed by the level of detail you have added to that prototype.

It is really convenient to see all filtered edits.
I've mostly tried one filter at a time, to work on quality, adding a new filter from time to time to get more accurate results. I'm still confused by the difference between very likely / likely / may. As a user I would prefer a certain outcome: filtering things that are damaging, not things that may have a chance to be but-not-sure.

In T146333#2672431, @jmatazzoni wrote:

What do you think? Note that I've used the word "top" in a few of these -- e.g., "top 20%" It may actually mean "most obvious" rather than "most damaging," but I think this makes better, more intuitive sense. We're skimming off the top...

For posterity, I raised in standup that I am concerned that "top" may not express exactly what we want.

However, @jmatazzoni responded, and now I'm not sure. My understanding is also that it's not "most damaging". But "top" (meaning "most obvious") may be close enough to what we want. @Halfak may also have a comment on this.

Benoit writes:

I'm still confused by the difference between very likely / likely / may.

Well, then, maybe you DO need the "What's This?" text (that you criticized for being overlong) :-) Here's what it says (currently). Does this help you understand? If not, why not?

USING THE "CONTRIBUTION FILTERS"

Accuracy Tradeoff: In general, the more accurate a filter is, the fewer false positives it finds. But there's a tradeoff: stricter filters also find less of the target population than broader, less accurate filters do.
Filtering Levels: Because some users value accuracy while others require more inclusive results, filters with different levels of both are provided. Note that some of these levels overlap: the "May have problems" filter is the broadest; "Likely have problems" finds a subset of those results; and the narrowest filter, "Very likely have problems," finds a subset of "Likely."
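The overlap between levels follows from each filter being a minimum-score cutoff on the same scale. A minimal sketch, assuming hypothetical cutoffs and made-up scores (the real thresholds may differ):

```python
# Hypothetical minimum "damaging" scores per filter level.
THRESHOLDS = {"may": 0.15, "likely": 0.65, "very_likely": 0.85}

# Made-up edit scores for illustration.
SCORES = {"e1": 0.05, "e2": 0.20, "e3": 0.70, "e4": 0.90, "e5": 0.97}


def matching_edits(scores, level):
    """Return the ids of edits whose score is at or above the level's cutoff."""
    return {edit_id for edit_id, score in scores.items() if score >= THRESHOLDS[level]}


may = matching_edits(SCORES, "may")
likely = matching_edits(SCORES, "likely")
very_likely = matching_edits(SCORES, "very_likely")

# Each stricter level is a subset of the broader one.
print(very_likely <= likely <= may)
```

Because a higher cutoff can only remove edits, every stricter level is contained in the broader one, which is the nesting the help text describes.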

I use "most obvious" or "most egregious" vandalism when talking about the kind of edits that ClueBot NG is confident enough to revert. The funny thing is that what we're talking about here is what is most obvious to an AI -- which may or may not jibe with what is "most obvious" to a human. In practice, humans and the AI agree about what "obvious" means.

I meant to leave a note about this earlier.

Accuracy Tradeoff

Accuracy is jargon in machine learning. "Accuracy" is the proportion of predictions that are right. "Precision" is the proportion of true-predictions that are true-positives. "Recall" is the proportion of positives that are in the set of true-predictions.
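
As a small worked example of how the three metrics differ, the sketch below computes them from the four cells of a confusion matrix. The counts are invented for illustration and the `ConfusionCounts` class is a hypothetical helper, not part of ORES.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts of predictions, where 'positive' means 'flagged as damaging'."""
    true_pos: int   # flagged, and actually damaging
    false_pos: int  # flagged, but actually fine
    true_neg: int   # not flagged, and actually fine
    false_neg: int  # not flagged, but actually damaging


def accuracy(c):
    """Proportion of all predictions that are right."""
    total = c.true_pos + c.false_pos + c.true_neg + c.false_neg
    return (c.true_pos + c.true_neg) / total


def precision(c):
    """Proportion of flagged edits that are actually damaging."""
    return c.true_pos / (c.true_pos + c.false_pos)


def recall(c):
    """Proportion of actually damaging edits that were flagged."""
    return c.true_pos / (c.true_pos + c.false_neg)


# Invented counts: 1000 edits, 50 of which are actually damaging.
counts = ConfusionCounts(true_pos=30, false_pos=10, true_neg=940, false_neg=20)

print(f"accuracy={accuracy(counts):.2f}")
print(f"precision={precision(counts):.2f}")
print(f"recall={recall(counts):.2f}")
```

With these made-up counts the accuracy is high mostly because damaging edits are rare, while precision and recall tell a less flattering story. That gap is one reason the general-parlance use of "accurate" can mislead.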

Do you think that Wikipedians may appreciate us using the technical terminology and then linking directly to the article for reference? https://en.wikipedia.org/wiki/Precision_and_recall

I'd love to see how much better that article gets once a bunch of ML noobs start reading it critically ;)

@Halfak points out:

"Accuracy" is the proportion of predictions that are right. "Precision" is the proportion of true-predictions that are true-positives.

I know we're not using the technical ML terms. That's on purpose--and another reason why I think it was right to move away from quoting the exact % numbers. But is the word "accurate"--meaning, in general parlance, "error free," "conforming with the truth" —wrong?

It's at least confusing. I'd side more towards educating our users about the realities of machine prediction rather than glossing over. If users find that it is confusing, that's because it is complicated. We'll only make it more complicated by inventing or misusing terminology.

Now, to answer your actual question, yes "precision" does not equate to "error free" and "conforming to the truth". In this case, we're concerned only with the rate of true positives. The rate of true negatives is also considered when measuring "accuracy" and it should be incorporated with the notion of "conforming to the truth".

Precision is a more apt common word meaning "the state of being precise or exact; exactness."

In T146333#2674980, @Halfak wrote:

Accuracy is jargon in machine learning. "Accuracy" is the proportion of predictions that are right. "Precision" is the proportion of true-predictions that are true-positives. "Recall" is the proportion of positives that are in the set of true-predictions.

Do you think that Wikipedians may appreciate us using the technical terminology and then linking directly to the article for reference? https://en.wikipedia.org/wiki/Precision_and_recall

While we should take care not to use language that would be misleading to people that know ML jargon, the vast majority of users do not know ML jargon. For the detailed explanation saying "this setting has N1% precision and K1% recall, this other one has N2% precision and K2% recall", by all means use the correct jargon words and link to them. But in the more generic, hand-wavy description that doesn't include numbers, I think we should use terms that will be understood by a general audience rather than jargon.

I generally agree, but here we have a word "accuracy" that is both jargon and common language. So it's not trivial to decide what to do. If "accuracy" were not a formally defined metric, I wouldn't be concerned.

In T146333#2674944, @jmatazzoni wrote:

Benoit writes:

Don't forget to keep the task id while quoting. That's a ping.

I'm still confused by the difference between very likely / likely / may.

Well, then, maybe you DO need the "What's This?" text (that you criticized for being overlong) :-)

Overlong does not mean not needed. Just too long and exposed to a TL;DR effect. :)

Here's what it says (currently). Does this help you understand? If not, why not?

USING THE "CONTRIBUTION FILTERS"

Accuracy Tradeoff: In general, the more accurate a filter is, the fewer false positives it finds. But there's a tradeoff: stricter filters also find less of the target population than broader, less accurate filters do.

Filtering Levels: Because some users value accuracy while others require more inclusive results, filters with different levels of both are provided. Note that some of these levels overlap: the "May have problems" filter is the broadest; "Likely have problems" finds a subset of those results; and the narrowest filter, "Very likely have problems," finds a subset of "Likely."

The question is not whether I understand that definition (honestly, I had to re-read it multiple times to get it :/), but how to have simple terms that people will understand immediately (people will not go somewhere else to read docs). It may be related to my English, but the use of different adjectives would be clearer to me than levels associated with "likely". I was unable to understand the concept and translate it into French.
I like @Halfak's idea of using "most obvious" and "most egregious", because they are different terms.

Based on suggestion #2, I would label things like this:

Obviously BAD
Finds only a third of flawed or damaging edits but with high accuracy (a few false-positives).

BAD
Finds half of flawed or damaging edits with medium accuracy (some false-positives).

May be BAD
Finds most flawed or damaging edits but with lower accuracy (more false-positives).

Or something like that.
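
The tradeoff behind these three labels can be sketched as follows: for a target share of bad edits to catch (recall), find the strictest score cut-off that reaches it and see what precision results. The labeled scores and the `threshold_for_recall` helper are invented for illustration; this is not how ORES thresholds are actually chosen.

```python
def threshold_for_recall(scored, target_recall):
    """Return (threshold, precision, recall) for the highest cut-off
    whose recall reaches target_recall, or None if none does.

    scored: list of (score, is_bad) pairs, where is_bad is the true label.
    """
    total_bad = sum(1 for _, is_bad in scored if is_bad)
    # Try cut-offs from strictest to broadest.
    for threshold in sorted({score for score, _ in scored}, reverse=True):
        flagged = [is_bad for score, is_bad in scored if score >= threshold]
        true_pos = sum(flagged)
        recall = true_pos / total_bad
        precision = true_pos / len(flagged)
        if recall >= target_recall:
            return threshold, precision, recall
    return None


# Invented (score, is_bad) pairs; 6 of the 10 edits are actually bad.
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False),
    (0.50, True), (0.40, False), (0.30, True), (0.20, False), (0.10, True),
]

for label, target in [("Obviously BAD", 0.3), ("BAD", 0.5), ("May be BAD", 0.8)]:
    threshold, precision, recall = threshold_for_recall(scored, target)
    print(f"{label}: cut-off={threshold}, precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, precision drops as the target recall rises, which mirrors the "high / medium / lower accuracy" wording in the labels above.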

In T146333#2674980, @Halfak wrote:

Do you think that Wikipedians may appreciate us using the technical terminology and then linking directly to the article for reference? https://en.wikipedia.org/wiki/Precision_and_recall

I would say no, because that article is not available in many languages and quality may be uncertain. If we use that term, let's define it carefully in the glossary and add the link for further reference.

• jmatazzoni edited projects, added Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016); removed Collab-Team-Q1-July-Sep-2016. Oct 4 2016, 10:34 PM

• jmatazzoni moved this task from Untriaged to Product/Design Work on the Collaboration-Team-Triage (Collab-Team-Q2-Oct-Dec-2016) board.

Ordering / Design of ORES filter options cont.

In recent tests, more than one user has commented on being surprised by the order of the ORES filters -- thinking they should go from bad to good on a continuum, rather than the current arrangement, shown here:

[Current Arrangement]
Very likely good
Highly accurate at finding almost all problem-free edits.

Very likely have problems
Highly accurate at finding the top 30% of flawed or damaging edits.

Likely have problems
Finds half of flawed or damaging edits with medium accuracy.

May have problems
Finds most flawed or damaging edits but with lower accuracy.

While it's tempting to just give the users what they ask for, we do well to consider how we got to this point. One thing we've seen very clearly is that users do not READ the labels--even the boldfaced ones. They scan them to find what they think is the pattern, then interpolate. The root problem here is that the ORES filters, uniquely in this toolset, vary on two properties: both likelihood (likely, very likely, etc.) and +/- valence (good faith/bad faith).

Because of this pattern-finding behavior, an arrangement like this would not work. Users would tend to overlook the "Good" option at bottom--because they would assume that the last option was simply the least likely to be bad.

[This wouldn't work because good/bad switch is hidden]
Very likely have problems
Highly accurate at finding the top 30% of flawed or damaging edits.

Likely have problems
Finds half of flawed or damaging edits with medium accuracy.

May have problems
Finds most flawed or damaging edits but with lower accuracy.

Very likely good
Highly accurate at finding almost all problem-free edits.

The following arrangement might solve one of the users' complaints. But it is my belief that it won't fully resolve the primary difficulty. Many users have missed the point of these filters at first because they simply didn't see that there were both BAD and GOOD filters. I tend to agree with the user test subject kylietastic, who observed:

“one idea i've seen, in some programs where they wanted to group the two extremes together at the top the good and the bad, is to have some graphical or color representation, such as a green blob and a long red blob and a shorter blob and so on.”

@Pginer-WMF, can we explore something like that?

[We could try this, but....]
Very likely good
Highly accurate at finding almost all problem-free edits.

May have problems
Finds most flawed or damaging edits but with lower accuracy.

Likely have problems
Finds half of flawed or damaging edits with medium accuracy.

Very likely have problems
Highly accurate at finding the top 30% of flawed or damaging edits.

In terms of ordering I'd consider the following aspects: (a) make the values work like a continuous scale (this is a logical expectation that we are breaking by reversing the "damaging" ones), and (b) start with the value that we want to encourage and that provides more contrast with the rest (i.e., starting with good prevents it from getting lost at the end for users who, after reading the first couple of options, assume this is just about "damaging").

That is:

Very likely good
May have problems
Likely have problems
Very likely have problems

I think this is the simplest ordering, and I'd go with it and evaluate in the beta feature with real data if we need to (a) reduce the number of levels or (b) add additional clarifying elements.

Let's try it, but I still have reservations...

• jmatazzoni updated the task description. Nov 1 2016, 7:04 PM

• jmatazzoni mentioned this in T149734: Implement functionality for RC page 'Contribution Quality' filters (ORES). Nov 1 2016, 7:39 PM

• jmatazzoni mentioned this in T149761: Fine-tune and finalize ORES score ranges for the Quality and Intent filters. Nov 2 2016, 12:21 AM

• jmatazzoni mentioned this in T149853: Implement functionality for RC page 'User Intent' filters (ORES). Nov 2 2016, 9:53 PM

• jmatazzoni closed this task as Resolved. Nov 8 2016, 11:35 PM

Trizek-WMF mentioned this in T146669: Create dedicated pages for ERI Recent Changes Beta project. Nov 9 2016, 2:14 PM

Trizek-WMF added a parent task: T146669: Create dedicated pages for ERI Recent Changes Beta project.

• jmatazzoni mentioned this in T151970: Implement new precision-based test stats for editquality models. Nov 30 2016, 12:35 AM

awight moved this task from Monitor (long term) to Completed on the Machine-Learning-Team (Active Tasks) board. Jul 3 2017, 5:51 PM

Restricted Application added a project: artificial-intelligence. Jul 3 2017, 5:51 PM