
Damaging levels on Polish Wikipedia overlap too much
Closed, Resolved · Public

Description

  • On Polish Wikipedia, set the Quality filters as shown in the screenshots below

Expected results: Green (Very likely good) should blend only with yellow (May be bad).
Actual results: Orange (Likely bad) frequently blends with green (Very likely good).

Screen Shot 2017-03-28 at 2.33.03 PM.png (699×737 px, 195 KB)

Screen Shot 2017-03-28 at 2.32.33 PM.png (590×997 px, 324 KB)

Roan looked up the thresholds on Polish Wikipedia and they are, in fact, quite different from English Wikipedia's (ranges are percentages of the damaging score):

  • V. likely good: 0-86
  • May be bad: 7-100
  • Likely bad: 37-100
  • V. likely bad: 73-100

The issue here is that the mathematical model doesn't match the user expectations we've set up in the interface. Beyond that, it's actually possible for an edit to be both V. likely good and V. likely bad. What is a user to make of such a classification?
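
To make the overlap concrete, here is a minimal sketch (in Python, purely illustrative) that checks a single damaging score against the plwiki ranges listed above:

# Hypothetical encoding of the plwiki level ranges above, as percentages.
levels = {
    "V. likely good": (0, 86),
    "May be bad": (7, 100),
    "Likely bad": (37, 100),
    "V. likely bad": (73, 100),
}

score = 80  # one damaging score
print([name for name, (lo, hi) in levels.items() if lo <= score <= hi])
# -> all four levels match, including both "V. likely good" and "V. likely bad"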

Clearly our assumptions are based on our experience with en.wiki. Here are the problems/solutions that seem possible here:

  1. There's simply a problem with our math/code, and the thresholds are being set improperly (please let it be this!).
  2. The precision/recall targets we established based on en.wiki don't translate to other wikis, and we need to adjust them on a per-wiki basis.
  3. The interface assumptions we made based on en.wiki don't translate, and we need to customize the interface on a per-wiki basis (e.g., because certain wikis simply won't support three levels of damage).

@Halfak, @SBisson, @Catrope, please comment. I think we need to have a plan here soonest, so we can understand what we're looking at before we roll this out to the next batch of wikis.


Event Timeline


What's the use-case for finding the Very likely good edits? We should tailor our stats/thresholds to that use-case. E.g., by "Very likely good" do we mean "not likely to be bad" (as in not needing review), or do we mean "these are all mostly good"? When you aim for a precision of 0.98, you're not setting the bar very high, because only 5% of edits are damaging. If you just considered all edits "very likely good", that would give you a precision of 0.95.
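
As a quick sanity check on that base-rate point (the 5% figure is from the comment above; the snippet is just illustrative arithmetic):

# If ~5% of edits are damaging, labelling every edit "very likely good"
# already achieves 0.95 precision on the good class.
damage_rate = 0.05
baseline_precision = 1 - damage_rate
print(baseline_precision)  # 0.95 -- so a 0.98 target is only modestly above chance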

If we're looking to support "not likely to be bad" (as in not needing review), then I'd put everything that doesn't fall into the "may be bad" bucket into this filter (which is about 75% of the good stuff). Here, overlap is bad and doesn't make sense.

If we're looking to support "these are all mostly good", we should set a precision threshold that is much higher than the rate of occurrence of non-damage. E.g. 99% or 99.9% precision. Here, overlap is OK because we're doing something different.

If we're reconsidering how we set our thresholds, I'd like to again suggest we consider the recall-based approach, as that's what we have been using for quite a while and it's been working pretty nicely.

Currently we set the damaging thresholds using the following statistics (a rough sketch of a recall-pegged computation follows the list):

  • filter_rate_at_recall(min_recall=0.9) = 0.645
  • filter_rate_at_recall(min_recall=0.75) = 0.854
  • max(recall_at_precision(min_precision=0.9), filter_rate_at_recall(min_recall=0.75)) = 0.854
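
For intuition, here is a rough sketch of how a recall-pegged score cutoff can be computed from test-set labels and scores. The function name and arrays are illustrative, not the actual ORES test-statistics code:

import numpy as np

def threshold_at_recall(y_true, scores, min_recall):
    # Rank edits by damaging score, highest first.
    order = np.argsort(scores)[::-1]
    y = np.asarray(y_true, dtype=float)[order]
    recall = np.cumsum(y) / y.sum()  # recall if we flag the top-k edits
    k = int(np.searchsorted(recall, min_recall))  # smallest k meeting the target
    return np.asarray(scores)[order][k]  # score cutoff achieving that recall

Given such a cutoff, the filter rate is then (roughly) the fraction of edits scoring below it, i.e. the share of the feed a patroller could skip.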

We get a little bit of overlap because our model for plwiki has a very high level of fitness. A PR-AUC of 0.92 is extremely high compared to enwiki's PR-AUC of 0.39.

Oh yes. Just one more note. We can always have custom test statistics on a per-model basis. It gets a little bit hard to maintain, but it's not crazy. We could have a set of statistics for extremely-high-fitness models that differ from other models.

If we're looking to support "not likely to be bad" (as in not needing review), then I'd put everything that doesn't fall into the "may be bad" bucket into this filter (which is about 75% of the good stuff). Here, overlap is bad and doesn't make sense.

This is what makes more sense to me. I think the use cases we are targeting are about edits we are sure enough about, for (a) reviewers looking for vandalism, so they can safely ignore them, and (b) editors welcoming new good editors, so they know whom to focus on and thank.

The overlap (or whatever it is) has been noticed (among other things) on Polish Wikipedia.

@Halfak pertinently asks what the use case is for the Very Likely Good category. He also makes an excellent suggestion about recall-based thresholds. The key point with regard to these questions, I think, is that different filters have different use cases, and may require different types of thresholds. Here's how I break it down:

  • V. likely good: The use case here is focused on precision. No one needs to find all the good. But users would like, for example, to know which edits they can safely ignore. Since good is so easy to find, we should aim for a very high precision of 99% or 99.9%.
  • May be bad: The use case here is focused on recall. The user wants to catch almost all bad while excluding what is clearly not bad. So, the threshold should be approximately 90% recall.
  • Likely bad: This is meant to be the "mama bear" of filters, a middling option. It's more about precision, I think, since reviewers will use it mostly to provide a second cut at prioritizing their efforts. But if that were to yield a very low recall on a particular wiki it might not be so good. In general, this should aim for a precision somewhere in the 40% range, as long as that's consistent with a recall that is also in the 35-50% range.
  • V. likely bad: The use case here is about precision: the user wants to see the worst of the worst, and does not want a lot of false positives. My ideal target would be 80% precision, to allow for a higher recall than the current 8%.

So really, the one clear case for a recall-based filter is May be bad, which is precisely meant to sweep up most of the trash.
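
One way to picture this breakdown is as a per-filter target table (a purely hypothetical encoding, not the actual ORES/MediaWiki configuration format):

# Hypothetical per-filter threshold targets, per the breakdown above.
FILTER_TARGETS = {
    "verylikelygood": {"metric": "precision", "target": 0.99},  # or 0.999
    "maybebad":       {"metric": "recall",    "target": 0.90},
    "likelybad":      {"metric": "precision", "target": 0.40},  # sanity-check recall stays ~35-50%
    "verylikelybad":  {"metric": "precision", "target": 0.80},
}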

@Halfak pertinently asks what the use case is for the Very Likely Good category. He also makes an excellent suggestion about recall-based thresholds. The key point with regard to these questions, I think, is that different filters have different use cases, and may require different types of thresholds. Here's how I break it down:

  • V. likely good: The use case here is focused on precision. No one needs to find all the good. But users would like, for example, to know which edits they can safely ignore. Since good is so easy to find, we should aim for a very high precision of 99% or 99.9%.

Agreed. The current model stats only let us do 98%, but if @Halfak et al. were to add 99% or 99.9% stats to the stats output, I'd use those in a heartbeat.

  • May be bad: The use case here is focused on recall. The user wants to catch almost all bad while excluding what is clearly not bad. So, the threshold should be approximately 90% recall.

Agreed, though we may have to tweak that 90% number for high fitness vs low fitness filters.

  • Likely bad: This is meant to be the "mama bear" of filters, a middling option. It's more about precision, I think, since reviewers will use it mostly to provide a second cut at prioritizing their efforts. But if that were to yield a very low recall on a particular wiki it might not be so good. In general, this should aim for a precision somewhere in the 40% range, as long as that's consistent with a recall that is also in the 35-50% range.

We could define this as choosing either 40% precision or 50% recall, whichever is stricter (or looser? not sure yet).
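
A self-contained sketch of that rule, using scikit-learn on exported test-set labels and scores (the function name and the choice of library are illustrative; it assumes both floors are attainable for the model):

import numpy as np
from sklearn.metrics import precision_recall_curve

def strict_cutoff(y_true, scores, min_precision=0.40, min_recall=0.50):
    p, r, t = precision_recall_curve(y_true, scores)
    p, r = p[:-1], r[:-1]  # drop the sentinel point that has no threshold
    t_prec = t[int(np.argmax(p >= min_precision))]  # lowest cutoff meeting the precision floor
    t_rec = t[int((r >= min_recall).sum()) - 1]     # highest cutoff still meeting the recall floor
    return max(t_prec, t_rec)  # "stricter" = the higher cutoff; min() would be the looser variant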

  • V. likely bad: The use case here is about precision: the user wants to see the worst of the worst, and does not want a lot of false positives. My ideal target would be 80% precision, to allow for a higher recall than the current 8%.

Note that this 8% recall for 96% precision that we have for enwiki is an artefact of the enwiki model being low fitness. The plwiki model gets 91.5% precision with 87.5% recall (technical term: with both of its hands tied behind its back). So here we might either want to have different thresholds for low vs high fitness models, or have a recall-based threshold as a proxy (like, peg recall at 10% or 15%). I think I would prefer the former, because for high fitness models a low recall peg would gratuitously sacrifice lots of recall for a very small increase in precision.

So really, the one clear case for a recall-based filter is May be bad, which is precisely meant to sweep up most of the trash.

@Halfak, just to know, is it posible for communities to do the ORES training again, to adjust possible evolution? It may be a question I'll have to answer.

@jmatazzoni, put a task on our board and we'll update the test statistics.

@Catrope, for the high recall condition, I don't think we want that to be model-dependent. Regardless, patrollers need to catch (nearly) all of the damage.

@Trizek-WMF yes. We can always add more train/test observations.

I will put a task on your board to ask for some new stats. However, is there a script or something I can use to explore many possible values for precision/recall minima and what the stats output (precision, recall, threshold) would be for those values? That would allow us to decide between e.g. 99% and 99.9% and various other things without having to bother you.
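
Something along these lines would do it, assuming you can export per-edit true labels and damaging scores from the model's test set (everything here, including scikit-learn as the tool, is illustrative rather than the actual ORES tooling):

import numpy as np
from sklearn.metrics import precision_recall_curve

def stats_at_min_precision(y_true, scores, min_precision):
    """Precision, recall, and score threshold at the loosest cutoff
    that still meets the given precision floor."""
    p, r, t = precision_recall_curve(y_true, scores)
    p, r = p[:-1], r[:-1]  # drop the sentinel point with no threshold
    ok = p >= min_precision
    if not ok.any():
        return None  # floor unattainable for this model
    i = int(np.argmax(ok))  # first (lowest) qualifying threshold
    return {"precision": p[i], "recall": r[i], "threshold": t[i]}

# Fake data just to make the sketch runnable; substitute real test-set exports.
rng = np.random.default_rng(0)
y_true = rng.random(10_000) < 0.05                      # ~5% damaging base rate
scores = np.clip(0.5 * y_true + 0.6 * rng.random(10_000), 0, 1)

for floor in (0.98, 0.99, 0.999):
    print(floor, stats_at_min_precision(y_true, scores, floor))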

  • Likely bad: This is meant to be the "mama bear" of filters, a middling option. It's more about precision, I think, since reviewers will use it mostly to provide a second cut at prioritizing their efforts. But if that were to yield a very low recall on a particular wiki it might not be so good. In general, this should aim for a precision somewhere in the 40% range, as long as that's consistent with a recall that is also in the 35-50% range.

I think this should be more strongly about precision. If a user looking for vandalism with a filter named "Likely bad" gets a list of 100 edits where half of them are not vandalism, they may think the filter just does not work. I think we should aim for users getting a list where most of the edits are vandalism, even if that implies a low recall.

I don't think the low recall is a problem since (a) users will have the option of using the recall-based "may be bad" filter and (b) even with a high precision we have enough edits to fill the list of recent changes for users to review.

I think ["Likely bad"] should be more strongly about precision.

Fair point. "Precision" literally means "the likelihood that something flagged as bad is actually bad". Then again, maybe it's a bad name for what we're trying to achieve. Currently, ORES Review Tool users are getting "needs review" language, and that seems to make sense for everyone involved. I think the thresholds are all about a tradeoff between two competing metrics, and by trying to simplify that away we run into these weird logical corners. IMO patrolling is about "reviewing the things that need review", and indicators of the precision of the prediction are helpful in prioritizing work.

even with a high precision we have enough edits to fill the list of recent changes for users to review.

Is this true for all wikis? I'm sure it is true for the big ones.

@Catrope, can you take a look at these cases? Why do we have such discrepancies in ORES scoring?

  1. On plwiki, there are quite a few seemingly normal edits that do not have ORES scores. These edits are not in ores_classification and the UI correctly presents them as unscored, but why were they not scored?
  2. Some Wikidata edits are ORES-scored, but the majority are not. Same as with the previous unmarked edits: the Wikidata ores_classification table has some of these edits as scored, while others that display as not scored are not in the ORES table.

Screen Shot 2017-05-04 at 2.50.53 PM.png (534×1 px, 278 KB)

  3. The 'Very likely good' and 'Very likely bad faith' filters together return 18 results for a 30-day selection (with 500 results per page), so the overlap between those filters is quite noticeable.

Screen Shot 2017-05-04 at 3.34.12 PM.png (681×1 px, 261 KB)

@Catrope, can you take a look at these cases? Why do we have such discrepancies in ORES scoring?

  1. On plwiki, there are quite a few seemingly normal edits that do not have ORES scores. These edits are not in ores_classification and the UI correctly presents them as unscored, but why were they not scored?

There is now anti-overlap between the damaging categories on plwiki, so there are edits that are not in any category:

> $stats = ORES\Stats::newFromGlobalState();
> var_dump($stats->getThresholds('damaging'));
array(3) {
  ["likelybad"]=>
  array(2) {
    ["min"]=>
    float(0.617)
    ["max"]=>
    int(1)
  }
  ["verylikelybad"]=>
  array(2) {
    ["min"]=>
    float(0.852)
    ["max"]=>
    int(1)
  }
  ["likelygood"]=>
  array(2) {
    ["min"]=>
    int(0)
    ["max"]=>
    float(0.472)
  }
}

So for example, an edit with a damaging score of 0.5 would not be likelygood (because that's 0.472 and below) but would also not be likelybad (because that's 0.617 and up).
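
The gap is easy to check mechanically (ranges copied from the var_dump output above; the Python encoding is just for illustration):

# plwiki damaging-filter ranges, per the dump above.
thresholds = {
    "likelybad":     (0.617, 1.0),
    "verylikelybad": (0.852, 1.0),
    "likelygood":    (0.0, 0.472),
}
score = 0.5
print([n for n, (lo, hi) in thresholds.items() if lo <= score <= hi])  # [] -- no category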

  2. Some Wikidata edits are ORES-scored, but the majority are not. Same as with the previous unmarked edits: the Wikidata ores_classification table has some of these edits as scored, while others that display as not scored are not in the ORES table.

Screen Shot 2017-05-04 at 2.50.53 PM.png (534×1 px, 278 KB)

Yes, that's a known (separate, but I think unfiled?) bug: we try to look up the ORES score for a Wikidata edit by taking the Wikidata revid and treating it as if it were a revid on the local wiki, which doesn't work well at all. Until that is fixed (T158025), it may be better not to show scores for Wikidata edits at all.

  3. The 'Very likely good' and 'Very likely bad faith' filters together return 18 results for a 30-day selection (with 500 results per page), so the overlap between those filters is quite noticeable.

Screen Shot 2017-05-04 at 3.34.12 PM.png (681×1 px, 261 KB)

OK, but that's more related to T163995; it's not really about this bug.

@Catrope, can you take a look at these cases? Why do we have such discrepancies in ORES scoring?

  1. On plwiki, there are quite a few seemingly normal edits that do not have ORES scores. These edits are not in ores_classification and the UI correctly presents them as unscored, but why were they not scored?

There is now anti-overlap between the damaging categories on plwiki, so there are edits that are not in any category:

If, however, there are a significant number of edits that are completely unscored (as in have no score in the DB), that would be a problem.

@jmatazzoni

plwiki has only three 'Contribution quality prediction' filters: "Very likely good", "Likely have problems", and "Very likely have problems". During my testing:

  • no edits were simultaneously in the result set of "Very likely good" and "Likely have problems" filters
  • no edits were simultaneously in the result set of "Very likely good" and "Very likely have problems" filters
  • there are edits marked simultaneously as "Likely have problems" and "Very likely have problems", because it's a natural overlap: "Very likely have problems" is a subset of "Likely have problems".

QA Recommendation: Resolve

  • there are edits marked simultaneously as "Likely have problems" and "Very likely have problems" - not many though.

All edits that are "Very likely have problems" should always also be "Likely have problems", because the former is a subset of the latter.

jmatazzoni claimed this task.

@Trizek-WMF, have you let the Polish users know that their levels should be optimized now? It might be good to tell them, so they can report if they're still seeing issues (though what we'd do then, I'm not sure...).