
Research how to present ORES scores to users in a way that is understandable and meets their reviewing goals
Closed, Resolved (Public)

Description

By dividing the spectrum of ORES scores into discrete levels and labeling these in a way that makes their predictive value clear, enable users to effectively use the system to meet their reviewing needs. In the immediate future, this system will be used on the ERI revisions to Recent Changes and in the ReviewStream feed.

As of Nov. 1, 2016, here's how we're defining and describing these levels:

CONTRIBUTION QUALITY [Damaging]

Very likely good
Highly accurate at finding almost all problem-free edits.
May have problems
Finds most flawed or damaging edits but with lower accuracy.
Likely have problems
Finds half of flawed or damaging edits with medium accuracy.
Very likely have problems
Highly accurate at finding the most obvious 30% of flawed or damaging edits.

USER INTENT [Good Faith]

Very likely good faith
Highly accurate at finding almost all good-faith edits.
May be bad faith
Finds most bad-faith edits but with a lower accuracy.
Likely bad faith
Highly accurate at finding the most obvious 20% of bad-faith edits.


Event Timeline

These are the CORRECTED versions of the tables @Halfak created in T146280. There are also annotated versions that show where certain thresholds might be set in this Google Sheets document.

Damaging

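| Score | Precision | Neg. prec. | Recall | True-neg. rate |
| --- | --- | --- | --- | --- |
| 95% | 100.0% | 96.2% | 5.5% | 100% |
| 85% | 56.1% | 97.3% | 35.0% | 98.9% |
| 75% | 43.4% | 97.7% | 46.5% | 97.5% |
| 65% | 34.0% | 98.2% | 59.0% | 95.2% |
| 55% | 29.2% | 98.7% | 72.0% | 92.7% |
| 45% | 25.0% | 98.9% | 77.1% | 90.4% |
| 35% | 20.9% | 99.2% | 82.8% | 87.2% |
| 25% | 17.7% | 99.4% | 87.3% | 83.2% |
| 15% | 14.9% | 99.5% | 91.1% | 78.2% |
| 5% | 9.2% | 99.8% | 97.5% | 60.0% |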

Good faith

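| Score | Precision | Neg. prec. | Recall | True-neg. rate |
| --- | --- | --- | --- | --- |
| 95% | 99.9% | 8.1% | 70.7% | 96.0% |
| 85% | 99.7% | 12.4% | 82.5% | 91.9% |
| 75% | 99.6% | 15.2% | 87.5% | 84.8% |
| 65% | 99.4% | 18.8% | 90.9% | 77.8% |
| 55% | 99.2% | 22.9% | 93.7% | 69.7% |
| 45% | 99.1% | 28.3% | 95.5% | 65.7% |
| 35% | 98.9% | 36.3% | 97.2% | 59.2% |
| 25% | 98.6% | 41.4% | 98.3% | 44.4% |
| 15% | 98.1% | 49.1% | 99.3% | 26.0% |
| 5% | 97.7% | 65.8% | 99.9% | 7.4% |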

For those not following the whole discussion, the concepts I found most helpful when thinking about this are:

  • Precision. How accurate or strict we are when selecting contributions given a criterion. The more precise, the fewer false positives we'll have. For example, a 99.9% precision prediction of good-faith edits will give you good-faith edits most of the time, with very few (0.1%) edits that are not good faith. The false positives are labelled as "1-Prec" in the tables.
  • Coverage ("Recall" in the tables). How many of the relevant contributions we capture given a criterion. For example, the 99.9% precision filter is so strict that it will catch only 70.7% of the potential good-faith edits. For activities where users want to explore all possible good-faith edits (even if that implies being exposed to more false positives), a filter with wider coverage may be needed.

Precision and coverage are competing requirements. The stricter you are in selecting only items that you are very sure meet your criteria (higher precision), the more items you leave out of the selection (lower recall).
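For anyone who wants to see the tradeoff concretely, here is a minimal Python sketch. The scores and labels are made up for illustration, and `confusion_metrics` is a hypothetical helper, not part of ORES; it simply computes the four columns of the tables above for a given score threshold.

```python
def confusion_metrics(scores, labels, threshold):
    """Compute the table columns when every edit scoring at or above
    `threshold` is flagged as a positive."""
    tp = fp = tn = fn = 0
    for score, is_positive in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_positive:
            tp += 1
        elif flagged:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Avoid dividing by zero when a category is empty.
        return num / den if den else float("nan")

    return {
        "precision": ratio(tp, tp + fp),      # flagged edits that are real positives
        "recall": ratio(tp, tp + fn),         # real positives that were flagged
        "neg_precision": ratio(tn, tn + fn),  # unflagged edits that are real negatives
        "true_neg_rate": ratio(tn, tn + fp),  # real negatives that were left unflagged
    }


# Made-up data: True means the edit really is damaging.
scores = [0.97, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05, 0.02]
labels = [True, True, False, True, False, True, False, False, False, False]

strict = confusion_metrics(scores, labels, threshold=0.85)
broad = confusion_metrics(scores, labels, threshold=0.25)
print(strict["precision"], strict["recall"])  # strict filter: precise but misses half
print(broad["precision"], broad["recall"])    # broad filter: finds everything, more noise
```

With this toy data the strict threshold flags only true positives but finds half of them, while the broad threshold finds all of them at the cost of some false positives.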

This annotated Sheets document reproduces the number tables above and shows the proposed threshold levels in the context of the full range of score data (levels are approximate and a bit fudged because of a desire to round off numbers).
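As a rough illustration of how a level could be read off the table, the sketch below picks the row with the best recall that still meets a minimum precision. The numbers are copied from the Damaging table above; `pick_threshold` and the precision targets are assumptions for illustration only, and since the proposed levels were rounded by hand this will not necessarily reproduce them.

```python
# (score threshold, precision, recall) copied from the Damaging table in this task.
DAMAGING_ROWS = [
    (0.95, 1.000, 0.055),
    (0.85, 0.561, 0.350),
    (0.75, 0.434, 0.465),
    (0.65, 0.340, 0.590),
    (0.55, 0.292, 0.720),
    (0.45, 0.250, 0.771),
    (0.35, 0.209, 0.828),
    (0.25, 0.177, 0.873),
    (0.15, 0.149, 0.911),
    (0.05, 0.092, 0.975),
]


def pick_threshold(rows, min_precision):
    """Return the row with the highest recall whose precision is at least
    `min_precision`, or None if no row qualifies (hypothetical helper)."""
    candidates = [row for row in rows if row[1] >= min_precision]
    return max(candidates, key=lambda row: row[2]) if candidates else None


# Hypothetical precision targets, not the ones chosen for the product.
for target in (0.50, 0.40, 0.15):
    print(target, pick_threshold(DAMAGING_ROWS, target))
```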

Below please find the proposed UI text for the various levels. I tried numerous variations on the wording and arrangement. In particular, I tried to see if I could communicate the information without resorting to % numbers. In the end I judged that a failure, in part because the levels themselves aren't symmetrical—there are an unequal number of positive and negative filters—and partly because the accuracy and recall numbers the levels represent vary in asymmetrical and inconsistent ways. Also, we believe it to be the case that vandalism fighters, for example, will in fact want to know what % of vandalism is identified; saying "some" isn't helpful, and saying "a third" is just a less concise way to say 30%.

All of which is to say the numbers are perhaps not the most satisfying feature here, but I believe they serve a purpose. And with a little polishing I hope this will be good enough to test. (I suspect that if we want to improve the descriptions further, some kind of data visualization may be needed—symbols for false positives vs. recall or some such.)

Two last related changes to the previous UI descriptions are worth noting: 1) You'll now find only 1 positive threshold for both tests; the numbers, all in the 90%s, should explain why, and 2) because of that, the order of filters has been reversed, so that the negative qualities are now at the top. Although we prefer to accentuate the positive, with only 1 positive filter for each test, it was much harder to see that the filters are on a continuum when that 1 filter went first.

And with that, here is the suggested language:

CONTRIBUTION QUALITY

Very likely problems
80% prediction accuracy. Finds 30% of flawed or damaging edits.

Likely problems
40% prediction accuracy. Finds 50% of flawed or damaging edits.

May have problems
15% prediction accuracy. Finds 90% of flawed or damaging edits.

Very likely good
99% prediction accuracy. Finds 92% of problem-free edits

USER INTENT

Likely bad faith
50% prediction accuracy. Finds 25% of bad-faith edits.

May be bad faith
20% prediction accuracy. Finds 75% of bad-faith edits.

Very likely good faith
99% prediction accuracy. Finds 90% of good faith edits.

Ready for a page on Mediawiki.org?

Use design to emphasize search objects and +/- switches they undergo
I was looking at the version of the ORES language above (which I put there Fri.). And I finally put my finger on something that was bothering me about that arrangement: it doesn't adequately emphasize an incredibly important aspect of the functionality. Which is, that the filters change from BAD to GOOD. The object of the searches literally reverses. E.g., the user is looking for problem edits, then, suddenly, for edits with no problems. And the same for good faith. The results are very different! But the options look similar and are under the same rubric.

By emphasizing the objects of the searches and the -/+ switches they go through, we make it much easier for users to apprehend what the tools do.
@Pginer-WMF, for testing, can you please use design to emphasize the object dimension for both filter groups. Use color, typography, graphics, etc.

E.g., see the example below. There I've used capitalization as a crude but effective way to highlight this important aspect. I think it's already a big improvement.

CONTRIBUTION QUALITY

Very likely PROBLEMS
80% prediction accuracy. Finds 30% of flawed or damaging edits.

Likely PROBLEMS
40% prediction accuracy. Finds 50% of flawed or damaging edits.

May have PROBLEMS
15% prediction accuracy. Finds 90% of flawed or damaging edits.

Very likely GOOD
99% prediction accuracy. Finds 92% of problem-free edits

USER INTENT

Likely BAD FAITH
50% prediction accuracy. Finds 25% of bad-faith edits.

May be BAD FAITH
20% prediction accuracy. Finds 75% of bad-faith edits.

Very likely GOOD FAITH
99% prediction accuracy. Finds 90% of good faith edits.

@jmatazzoni I have updated the prototype to reflect the proposed changes. I have questions to better understand the reasons behind these proposals:

1. Why order options from bad to good?

For both groups of filters ("contribution quality" and "user intent"), filters are organised from bad (damaging/bad-faith edits) to good (good/good-faith edits).

Given that one of our project goals is to "Ensure good-faith new editors have more constructive, less discouraging experiences of edit and article review.", we may want to emphasise those filters that help to provide a constructive review.

Considering this, I think it makes sense to order the filters starting with the "good" filters instead, in order to (a) encourage more constructive behaviour and (b) help users discover a new kind of activity they may not initially have in mind or be used to.

2. Why do we avoid explaining the purpose of filters?

The current descriptions expose percentages to users in a consistent but also repetitive and neutral way. This requires users to figure out what these percentages mean and which are the most suitable for their activity.

I think we need to translate the cold numbers into concepts that are more intuitive to understand for our users. In particular, it should be clear which filter is about being strict (e.g., review only the most clear vandalism) and which one is about covering a much wider spectrum of results (e.g., review all potential vandalism). I created some examples emphasising that:

Contribution quality:

  • Very likely have problems. Edits with a high probability (80%) of being flawed or damaging.
  • Likely have problems. Edits considered problematic with medium precision, but including more (50%) of all the potential flawed or damaging edits.
  • May have problems. Edits considered problematic with lower precision, but covering most (90%) of all the potential flawed or damaging edits.
  • Very likely good. Edits with a high probability (99%) of being good.

User intent:

  • Likely bad faith. Edits with a medium probability (50%) of being made in bad-faith.
  • May be bad faith. Edits considered as bad-faith with lower precision, but covering most (75%) of all the potential bad-faith edits.
  • Very likely good faith. Edits with a high probability (99%) of being made in good-faith.

As you see in the examples I'm not providing all the values for all the filters like in tabular data but providing a different message (either "precision" or "less precision but more coverage") depending on the nature of the filter (with percentages as clarifications, not the main piece of information).

3. Why do we think the confusion is between the good and bad concepts?

All filter groups include either opposed concepts ("Registered" vs. "Unregistered") or a set of concepts in a scale ("Newcomer, Experienced, More experienced").
When you describe the need to emphasise the distinction between good and bad, which are the issues you expect users to get into? Not finding the "good" filter because it is after 3 "bad" filters? Picking the wrong filter by mistake? It would be good to capture those concerns in the research plan to better understand how users are affected by them.

In the prototype I have adjusted the bold style to make the good/bad contrast more apparent, but I'm not sure we need to emphasise more strongly that difference since it does not seem an exception to what happens with other filter groups.

I think that good and bad are clear opposites in meaning. The part I found most problematic, as described above, is how to map the probability qualifiers (likely, very likely, maybe) with the different usecases.

4. "Problems" or "have problems"?

In your proposal I see "Very likely problems" but "May have problems".
In this context, where we have a list of contributions, selecting those that "may have problems" seems to read more naturally than selecting "very likely problems".

In the prototype I kept the "have problems" structure, since I think we aim here more for clarity than brevity. But feel free to provide more details on the intent of the current proposal regarding this.

Getting to Pau's questions above

  1. Why order options from bad to good?

I agree completely with all the points you raise, which is why we've always ordered these with the "good" options first. But up until now, we had two good and two or three bad options. When, based on the data, we dropped down to having only 1 good option, the situation became different.

Previously, the pattern was in a progression, and it was easy to get a feel for how it worked--like this:
+2
+1
-1
-2

It was clear you were on a gradient that went down then up. When there was only one good option then the switch to bad, the pattern was harder to see, like this:

+1
-1
-2

So, to my eye, it made sense to establish that there is an orderly diminution of bad that leads to good. (Also, realistically, I suspect most people are in fact going to be looking for damage rather than its opposite.) See below.

CONTRIBUTION QUALITY [good first]

Very likely good
99% prediction accuracy. Finds 92% of problem-free edits

May have problems
15% prediction accuracy. Finds 90% of flawed or damaging edits.

Likely problems
40% prediction accuracy. Finds 50% of flawed or damaging edits.

Very likely problems
80% prediction accuracy. Finds 30% of flawed or damaging edits.

CONTRIBUTION QUALITY [bad first]

Very likely problems
80% prediction accuracy. Finds 30% of flawed or damaging edits.

Likely problems
40% prediction accuracy. Finds 50% of flawed or damaging edits.

May have problems
15% prediction accuracy. Finds 90% of flawed or damaging edits.

Very likely good
99% prediction accuracy. Finds 92% of problem-free edits

  3. Why do we think the confusion is between the good and bad concepts?

You've got someone looking at the list of Damaging filters. She'll see three options, they are all finding damage and the variation is in the likelihood of damage--80%, 40%, 15%. She thinks she knows the pattern. Then, suddenly, there is an option that is not finding damage but in fact its opposite.

I think my concern about this may be partly another result of the switch to just one positive option among a list of negative ones. But actually I had this nagging doubt about our arrangement all along. The thing is, these filters are, uniquely, changing on two axes: the probability varies AND the objective switches. I tried to find ways to emphasize both just with wording but failed. So I think usability improves here when the design can highlight the objectives, "good faith" vs "bad faith." When I look at the crude versions above, with the all caps for the objects of the searches, my gut tells me that this emphasis is helpful. It's bringing out a key element that one would otherwise have to read carefully to understand.

  2. Why do we avoid explaining the purpose of filters?

There are some good ideas here and we can definitely keep working to make the explanations more helpful. Thanks for including the ones I gave you in the prototype, as a starting point.

The problem I found when I tried to avoid the numbers was that I ended up using words that were just more wordy and less precise ways of saying the numbers. So in the end I felt like I wasn't really helping. But I'll go through these suggestions very carefully. It may well be that we don't need to give both figures for all scores.

In today's Collab Team discussion there were a lot of comments about the ORES language above. People were put off by the numbers, among other issues. I tried a lot of variations, and have TWO suggested arrangements:

SUGGESTION #1
This is the best I can do with the current groupings, I think. Here is what has changed:

  • In the section headers, we now define "accuracy" in terms of "false positives." This not only helps explain what we mean by accuracy, it also enables us to remove 8 instances of the term "prediction accuracy," from the various options, cutting verbiage significantly.
  • I removed all numbers and used more relative terms (e.g., "medium accuracy"). This is arguably friendlier, but on reflection it's also probably more appropriate, since we're not really defining the terms enough to meaningfully use such precise % numbers.
  • I flipped the order of the info in the descriptions so that the recall data goes first, before accuracy. This works better, I think, since it gets more quickly to the user's most obvious question--"why on earth would I use a less accurate filter?" Also it doesn't just repeat the accuracy info already provided in the filter name.
  • Finally, I rearranged the options so that the positive choices come first. BUT, instead of listing the other choices so that we have a (broken and asymmetrical) progression from good to bad, I'm suggesting we list the most accurate tests first, then the less accurate tests.

Here is what I came up with:

CONTRIBUTION QUALITY (higher prediction “accuracy” = fewer false positives)

Very likely GOOD
Finds almost all problem-free edits with very high accuracy.

Very likely HAVE PROBLEMS
Finds only a third of flawed or damaging edits but with high accuracy.

Likely HAVE PROBLEMS
Finds half of flawed or damaging edits with medium accuracy.

May HAVE PROBLEMS
Finds most flawed or damaging edits but with lower accuracy.

USER INTENT (higher prediction “accuracy” = fewer false positives)

Very likely GOOD FAITH
Finds almost all good-faith edits with a very high accuracy.

Likely BAD FAITH
Finds a quarter of bad-faith edits with medium accuracy.

May be BAD FAITH
Finds most bad-faith edits but with a lower accuracy.

NOTE: See Suggestion #2 below before responding to this.

SUGGESTION #2
This suggestion is more radical--and had one tragic flaw*. Informal testing with my colleagues indicates that the above solution still causes problems because of the way that, under each subhead, the options vary in two dimensions at the same time: the probability varies AND the +/- valence of the objectives reverses. In my tests, people missed the latter--despite its being right there in bold letters (I added the allcaps after).

I continue to believe information design could be helpful here. But the problem seems to be that people read the descriptions, and while their brains are working out the differences between them, they miss the fact that the filter names have reversed so that you're now searching for the opposite quality.

What's the radical solution? In most meaningful ways, the good faith test and the bad faith test are not the same test but varying somewhat, they are different tests. So why not show them that way? Like so:

BAD EDIT QUALITY (higher prediction “accuracy” = fewer false positives)

Very likely BAD
Finds only a third of flawed or damaging edits but with high accuracy.

Likely BAD
Finds half of flawed or damaging edits with medium accuracy.

May be BAD
Finds most flawed or damaging edits but with lower accuracy.

GOOD EDIT QUALITY (higher prediction “accuracy” = fewer false positives)

Very likely GOOD
Finds almost all problem-free edits with very high accuracy.

BAD FAITH (higher prediction “accuracy” = fewer false positives)

Likely BAD FAITH
Finds a quarter of bad-faith edits with medium accuracy.

May be BAD FAITH
Finds most bad-faith edits but with a lower accuracy.

GOOD FAITH (higher prediction “accuracy” = fewer false positives)

Very likely GOOD FAITH
Finds almost all good-faith edits with a very high accuracy.

*Reader, have you guessed the tragic flaw? It's this: because of how we've set this all up, imagine someone who wants to filter for Very Likely Bad Faith. He checks the box, and now the system looks for the intersection of that with, you guessed it, Very Likely Good Faith (result = null).
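To make the flaw concrete, here is a small Python sketch, assuming (as described above) that selections in separate groups are intersected. The scores, the score ranges and the helper names are all hypothetical. The OR variant is shown only for contrast.

```python
# Hypothetical good-faith scores for five edits (higher = more likely good faith).
edits = {"A": 0.98, "B": 0.90, "C": 0.55, "D": 0.20, "E": 0.04}

# Hypothetical filters expressed as inclusive (low, high) score ranges.
FILTERS = {
    "very_likely_good_faith": (0.85, 1.00),
    "likely_bad_faith": (0.00, 0.15),
}


def matches(score, name):
    low, high = FILTERS[name]
    return low <= score <= high


def select_all(edits, names):
    """AND semantics: an edit must satisfy every selected filter."""
    return {e for e, s in edits.items() if all(matches(s, n) for n in names)}


def select_any(edits, names):
    """OR semantics: an edit must satisfy at least one selected filter."""
    return {e for e, s in edits.items() if any(matches(s, n) for n in names)}


selected = ["very_likely_good_faith", "likely_bad_faith"]
print(select_all(edits, selected))  # empty: no score can be in both ranges
print(select_any(edits, selected))  # edits matching either filter
```

Because the two ranges cannot overlap, intersecting them can never return anything, whatever the data.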

But I think there could be a way around this via design. E.g., here is a rough solution that keeps the tests in their respective functional sections but creates a kind of "subsection" to indicate that there are really two tests here.

SUGGESTION #1

Thanks for the updates on this @jmatazzoni. I think suggestion #1 works well.
I think it differentiates the levels well and explains the purpose.

My only suggestion is for the case where we have several similar levels (the "X have problems" filters), where I'd propose to emphasise why just one third are selected. Otherwise, it is not obvious that we are selecting one third because we are being very strict about what we select.

Maybe it's a bit verbose but something like this would help:

Very likely HAVE PROBLEMS
The edits most clearly identified as damaging. Finds only one third of all potentially problematic edits but with high precision.

Following up here on @Pginer-WMF's suggestion:

My only suggestion is...to emphasize why just one third are selected.

I think there may be something here. The fact is, neither precision nor recall is THE important factor. Rather, different filters were designed to prioritize one or the other. By emphasizing the key factor for each filter by naming it first, and switching when appropriate, we might signal the purpose more clearly.

E.g., the first two filters below are about accuracy, after that the rest are about finding more stuff.

CONTRIBUTION QUALITY (higher prediction accuracy = fewer false positives)

Very likely GOOD
Highly accurate at finding almost all problem-free edits.

Very likely HAVE PROBLEMS
Highly accurate at finding the top 30% of flawed or damaging edits.

Likely HAVE PROBLEMS
Finds half of flawed or damaging edits with medium accuracy.

May HAVE PROBLEMS
Finds most flawed or damaging edits but with lower accuracy.


USER INTENT (higher prediction accuracy = fewer false positives)

Very likely GOOD FAITH
Highly accurate at finding almost all good-faith edits.

Likely BAD FAITH
Highly accurate at finding the top 20% of bad-faith edits.

May be BAD FAITH
Finds most bad-faith edits but with a lower accuracy.


What do you think? Note that I've used the word "top" in a few of these -- e.g., "top 20%" It may actually mean "most obvious" rather than "most damaging," but I think this makes better, more intuitive sense. We're skimming off the top...

Would it be possible to have an example (a prototype) with real examples, based on real contributions? That would make the problem a little easier to grasp.

Would it be possible to have an example (a prototype) with real examples, based on real contributions? That would make the problem a little easier to grasp.

The current prototype provides a set of real contributions, and it makes filters work realistically for those aspects that were exposed in the current data. Unfortunately, data was not available for all filters to be realistic (in those cases properties are assigned randomly, while trying to be consistent). For example:
"Likely have problems" shows edits that were actually marked as "damaging" by ORES (maybe with a different level than the one we'll be using). For "Very likely have problems" we are just selecting a small subset of the damaging ones, but those do not necessarily have higher ORES scores.
Other filters such as those related to user types are realistic, while other filters such as categories are completely random.

In short: filters are as realistic as we could make them with the current data exposed in Recent Changes. There is a gap with reality, but the prototype may recreate an experience close enough to reality to learn from. So it would be very useful if you could use it as if it were real and share your experience.

Following up here on @Pginer-WMF's suggestion:

What do you think? Note that I've used the word "top" in a few of these -- e.g., "top 20%" It may actually mean "most obvious" rather than "most damaging," but I think this makes better, more intuitive sense. We're skimming off the top...

The proposal works for me. I think "top 20%" reads quite naturally. There is a risk of users understanding it as the "worst vandalism", but I'm ok in taking that risk if the wording helps to easily get the general purpose.

So it would be very useful if you could use it as if it were real and share your experience.

I have to say I'm impressed by the level of detail you have added to that prototype.

It is really convenient to see all filtered edits.
I've mostly tried one filter at a time, to work on quality, adding a new filter from time to time to get more accurate results. I'm still confused by the difference between very likely / likely / may. As a user I would prefer a definite outcome: filtering things that are damaging, not things that may have a chance of being damaging but aren't certain.

What do you think? Note that I've used the word "top" in a few of these -- e.g., "top 20%" It may actually mean "most obvious" rather than "most damaging," but I think this makes better, more intuitive sense. We're skimming off the top...

For posterity, I raised in standup that I am concerned that "top" may not express exactly what we want.

However, @jmatazzoni responded, and now I'm not sure. My understanding is also that it's not "most damaging". But "top" (meaning "most obvious") may be close enough to what we want. @Halfak may also have a comment on this.

Benoit writes:

I'm still confused by the difference between very likely / likely / may.

Well, then, maybe you DO need the "What's This?" text (that you criticized for being overlong) :-) Here's what it says (currently). Does this help you understand? If not, why not?

USING THE "CONTRIBUTION FILTERS"

  • Accuracy Tradeoff In general, the more accurate a filter is, the fewer false positives it finds. But there's a tradeoff: stricter filters also find less of the target population than broader, less accurate filters do.
  • Filtering Levels Because some users value accuracy while others require more inclusive results, filters with different levels of both are provided. Note that some of these levels overlap: the "May have problems" filter is the broadest; "Likely have problems" finds a subset of those results; and the narrowest filter, "Very likely have problems," finds a subset of "Likely."
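The overlap described above follows from each level being a minimum score on the same scale. A minimal Python sketch, using made-up cutoffs and scores (not the thresholds chosen in this task) and a hypothetical `matching_edits` helper:

```python
# Hypothetical minimum "damaging" scores for each level.
LEVELS = {
    "may_have_problems": 0.15,
    "likely_have_problems": 0.50,
    "very_likely_have_problems": 0.85,
}


def matching_edits(scores, level):
    """Return the edits whose score meets the level's minimum."""
    return {edit for edit, score in scores.items() if score >= LEVELS[level]}


# Made-up damaging scores for four edits.
scores = {"A": 0.95, "B": 0.70, "C": 0.30, "D": 0.05}

may = matching_edits(scores, "may_have_problems")
likely = matching_edits(scores, "likely_have_problems")
very_likely = matching_edits(scores, "very_likely_have_problems")

# Each stricter level is a subset of the broader one.
print(very_likely <= likely <= may)
```

Raising the cutoff can only remove edits from the result, which is why the narrower filters are always contained in the broader ones.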

I use "most obvious" or "most egregious" vandalism when talking about the kind of edits that ClueBot NG is confident enough to revert. The funny thing is that what we're talking about here is what is most obvious to an AI -- which may or may not jibe with what is "most obvious" to a human. In practice, humans and the AI agree about what "obvious" means.

I meant to leave a note about this earlier.

Accuracy Tradeoff

Accuracy is jargon in machine learning. "Accuracy" is the proportion of predictions that are right. "Precision" is the proportion of true-predictions that are true-positives. "Recall" is the proportion of actual positives that end up in the set of true-predictions.
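A small worked example may help show why the distinction matters when damaging edits are rare. The counts below are invented for illustration, and the three functions are just the standard textbook definitions written out in Python:

```python
def accuracy(tp, fp, tn, fn):
    """Proportion of all predictions that are right."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Proportion of positive predictions that are true positives."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Proportion of actual positives that were predicted positive."""
    return tp / (tp + fn)


# Invented counts: 1000 edits, 30 of them damaging.
# The filter flags 60 edits, 20 of which are really damaging.
tp, fp, fn = 20, 40, 10
tn = 1000 - tp - fp - fn

print(accuracy(tp, fp, tn, fn))  # high, because most edits are correctly left alone
print(precision(tp, fp))         # low, because most flagged edits are fine
print(recall(tp, fn))            # share of damaging edits that were found
```

With these counts the filter is right about 95% of all edits, yet only a third of what it flags is actually damaging, so "accuracy" in the formal sense would overstate what a reviewer sees.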

Do you think that Wikipedians may appreciate us using the technical terminology and then linking directly to the article for reference? https://en.wikipedia.org/wiki/Precision_and_recall

I'd love to see how much better that article gets once a bunch of ML noobs start reading it critically ;)

@Halfak points out:

"Accuracy" is the proportion of predictions that are right. "Precision" is the proportion of true-predictions that are true-positives.

I know we're not using the technical ML terms. That's on purpose--and another reason why I think it was right to move away from quoting the exact % numbers. But is the word "accurate"--meaning, in general parlance, "error free," "conforming with the truth" —wrong?

It's at least confusing. I'd side more towards educating our users about the realities of machine prediction rather than glossing over. If users find that it is confusing, that's because it is complicated. We'll only make it more complicated by inventing or misusing terminology.

Now, to answer your actual question, yes "precision" does not equate to "error free" and "conforming to the truth". In this case, we're concerned only with the rate of true positives. The rate of true negatives is also considered when measuring "accuracy" and it should be incorporated with the notion of "conforming to the truth".

Precision is a more apt common word meaning "the state of being precise or exact; exactness.".

Accuracy is jargon in machine learning. "Accuracy" is the proportion of predictions that are right. "Precision" is the proportion of true-predictions that are true-positives. "Recall" is the proportion of actual positives that end up in the set of true-predictions.

Do you think that Wikipedians may appreciate us using the technical terminology and then linking directly to the article for reference? https://en.wikipedia.org/wiki/Precision_and_recall

While we should take care not to use language that would be misleading to people that know ML jargon, the vast majority of users do not know ML jargon. For the detailed explanation saying "this setting has N1% precision and K1% recall, this other one has N2% precision and K2% recall", by all means use the correct jargon words and link to them. But in the more generic, hand-wavy description that doesn't include numbers, I think we should use terms that will be understood by a general audience rather than jargon.

I generally agree, but here we have a word "accuracy" that is both jargon and common language. So it's not trivial to decide what to do. If "accuracy" were not a formally defined metric, I wouldn't be concerned.

Benoit writes:

Don't forget to keep the task id while quoting. That's a ping.

I'm still confused by the difference between very likely / likely / may.

Well, then, maybe you DO need the "What's This?" text (that you criticized for being overlong) :-)

Overlong does not mean not needed. Just too long and exposed to a TL;DR effect. :)

Here's what it says (currently). Does this help you understand? If not, why not?

USING THE "CONTRIBUTION FILTERS"

  • Accuracy Tradeoff In general, the more accurate a filter is, the fewer false positives it finds. But there's a tradeoff: stricter filters also find less of the target population than broader, less accurate filters do.
  • Filtering Levels Because some users value accuracy while others require more inclusive results, filters with different levels of both are provided. Note that some of these levels overlap: the "May have problems" filter is the broadest; "Likely have problems" finds a subset of those results; and the narrowest filter, "Very likely have problems," finds a subset of "Likely."

The question is not whether I understand that definition (honestly, I had to re-read it multiple times to get it :/), but whether we have simple terms that people will understand immediately (people will not go somewhere else to read docs). It may be related to my English, but using different adjectives would be clearer to me than levels associated with "likely". I was unable to understand the concept and translate it into French.
I like @Halfak's idea of using "most obvious" and "most egregious", because they are different terms.

Based on suggestion #2, I would label things like this:

Obviously BAD
Finds only a third of flawed or damaging edits but with high accuracy (a few false-positives).

BAD
Finds half of flawed or damaging edits with medium accuracy (some false-positives).

May be BAD
Finds most flawed or damaging edits but with lower accuracy (more false-positives).

Or something like that.

Do you think that Wikipedians may appreciate us using the technical terminology and then linking directly to the article for reference? https://en.wikipedia.org/wiki/Precision_and_recall

I would say no, because that article is not available in many languages and quality may be uncertain. If we use that term, let's define it carefully in the glossary and add the link for further reference.

Ordering / Design of ORES filter options cont.

In recent tests, more than one user has commented on being surprised by the order of the ORES filters -- thinking they should go from bad to good on a continuum, rather than the current arrangement, shown here:

[Current Arrangement]
Very likely good
Highly accurate at finding almost all problem-free edits.

Very likely have problems
Highly accurate at finding the top 30% of flawed or damaging edits.

Likely have problems
Finds half of flawed or damaging edits with medium accuracy.

May have problems
Finds most flawed or damaging edits but with lower accuracy.

While it's tempting to just give the users what they ask for, we do well to consider how we got to this point. One thing we've seen very clearly is that users do not READ the labels--even the boldfaced ones. They scan them to find what they think is the pattern, then interpolate. The root problem here is that the ORES filters, uniquely in this toolset, vary on two properties: both likelihood (likely, very likely, etc.) and +/- valence (good faith/bad faith).

Because of this pattern-finding behavior, an arrangement like this would not work. Users would tend to overlook the "Good" option at bottom--because they would assume that the last option was simply the least likely to be bad.

[This wouldn't work because the good/bad switch is hidden]
Very likely have problems
Highly accurate at finding the top 30% of flawed or damaging edits.

Likely have problems
Finds half of flawed or damaging edits with medium accuracy.

May have problems
Finds most flawed or damaging edits but with lower accuracy.

Very likely good
Highly accurate at finding almost all problem-free edits.

The following arrangement might solve one of the users' complaints. But it is my belief that it won't fully resolve the primary difficulty. Many users have missed the point of these filters at first because they simply didn't see that there were both BAD and GOOD filters. I tend to agree with the user test subject kylietastic, who observed:

“one idea i've seen, in some programs where they wanted to group the two extremes together at the top the good and the bad, is to have some graphical or color representation, such as a green blob and a long red blob and a shorter blob and so on.”

@Pginer-WMF, can we explore something like that?

[We could try this, but....]
Very likely good
Highly accurate at finding almost all problem-free edits.

May have problems
Finds most flawed or damaging edits but with lower accuracy.

Likely have problems
Finds half of flawed or damaging edits with medium accuracy.

Very likely have problems
Highly accurate at finding the top 30% of flawed or damaging edits.

In terms of ordering I'd consider the following aspects: (a) make the values work like a continuous scale (this is a logical expectation that we are breaking by reversing the "damaging" ones), and (b) start with the value that we want to encourage and provides more contrast with the rest (i.e., starting with good prevents it from getting lost at the end for users that after reading the first couple options assume this is just about "damaging").

That is:

  • Very likely good
  • May have problems
  • Likely have problems
  • Very likely have problems

I think this is the simplest ordering, and I'd go with it and evaluate in the beta feature with real data if we need to (a) reduce the number of levels or (b) add additional clarifying elements.

Let's try it, but I still have reservations...