
Damaging levels on Polish Wikipedia overlap too much
Closed, Resolved · Public

Description

  • On Polish Wikipedia, set the Quality filters as shown in the screenshots below

Expected results: Green (Very likely good) should blend only with yellow (May be bad).
Actual results: Orange (Likely bad) frequently blends with green (Very likely good).

Screen Shot 2017-03-28 at 2.33.03 PM.png (699×737 px, 195 KB)

Screen Shot 2017-03-28 at 2.32.33 PM.png (590×997 px, 324 KB)

Roan looked up the thresholds on Polish Wikipedia and they are, in fact, quite different from English Wikipedia's (ranges are percentages of the damaging score):

  • V. likely good: 0-86
  • May be bad: 7-100
  • Likely bad: 37-100
  • V. likely bad: 73-100

The issue here is that the mathematical model doesn't match the user expectations we've set up in the interface. Beyond that, it's actually possible for an edit to be both V. likely good and V. likely bad. What is a user to make of such a classification?
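
To make the overlap concrete, here is a minimal sketch (in Python, purely illustrative) that checks a single damaging score against the plwiki ranges listed above:

# Hypothetical encoding of the plwiki level ranges above, as percentages.
levels = {
    "V. likely good": (0, 86),
    "May be bad": (7, 100),
    "Likely bad": (37, 100),
    "V. likely bad": (73, 100),
}

score = 80  # one damaging score
print([name for name, (lo, hi) in levels.items() if lo <= score <= hi])
# -> all four levels match, including both "V. likely good" and "V. likely bad"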

Clearly our assumptions are based on our experience with en.wiki. Here are the problems/solutions that seem possible here:

  1. There's simply a problem with our math/code, and the thresholds are being set improperly (please let it be this!).
  2. The precision/recall targets we established based on en.wiki don't translate to other wikis, and we need to adjust them on a per-wiki basis.
  3. The interface assumptions we made based on en.wiki don't translate, and we need to customize the interface on a per-wiki basis (e.g., because certain wikis simply won't support three levels of damage).

@Halfak, @SBisson, @Catrope, please comment. I think we need to have a plan here soonest, so we can understand what we're looking at before we roll this out to the next batch of wikis.


Event Timeline


What's the use-case for finding the Very likely good edits? We should tailor our stats/thresholds to that use-case. E.g., by "Very likely good" do we mean "not likely to be bad" (as in not needing review), or do we mean "these are all mostly good"? When you aim for a precision of 0.98, you're not setting the bar very high, because only 5% of edits are damaging. If you just considered all edits "very likely good", that would give you a precision of 0.95.
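
As a quick sanity check on that base-rate point (the 5% figure is from the comment above; the snippet is just illustrative arithmetic):

# If ~5% of edits are damaging, labelling every edit "very likely good"
# already achieves 0.95 precision on the good class.
damage_rate = 0.05
baseline_precision = 1 - damage_rate
print(baseline_precision)  # 0.95 -- so a 0.98 target is only modestly above chance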

If we're looking to support "not likely to be bad" (as in not needing review), then I'd put everything that doesn't fall into the "may be bad" bucket into this filter (which is about 75% of the good stuff). Here, overlap is bad and doesn't make sense.

If we're looking to support "these are all mostly good", we should set a precision threshold that is much higher than the rate of occurrence of non-damage. E.g. 99% or 99.9% precision. Here, overlap is OK because we're doing something different.

If we're reconsidering how we set our thresholds, I'd like to again suggest we consider the recall-based approach, as that's what we have been using for quite a while and it's been working pretty nicely.

Currently we set the damaging thresholds using the following statistics (a rough sketch of a recall-pegged computation follows the list):

  • filter_rate_at_recall(min_recall=0.9) = 0.645
  • filter_rate_at_recall(min_recall=0.75) = 0.854
  • max(recall_at_precision(min_precision=0.9), filter_rate_at_recall(min_recall=0.75)) = 0.854
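
For intuition, here is a rough sketch of how a recall-pegged score cutoff can be computed from test-set labels and scores. The function name and arrays are illustrative, not the actual ORES test-statistics code:

import numpy as np

def threshold_at_recall(y_true, scores, min_recall):
    # Rank edits by damaging score, highest first.
    order = np.argsort(scores)[::-1]
    y = np.asarray(y_true, dtype=float)[order]
    recall = np.cumsum(y) / y.sum()  # recall if we flag the top-k edits
    k = int(np.searchsorted(recall, min_recall))  # smallest k meeting the target
    return np.asarray(scores)[order][k]  # score cutoff achieving that recall

Given such a cutoff, the filter rate is then (roughly) the fraction of edits scoring below it, i.e. the share of the feed a patroller could skip.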

We get a little bit of overlap because our model for plwiki has a very high level of fitness. A PR-AUC of 0.92 is extremely high compared to enwiki's PR-AUC of 0.39.

Oh yes. Just one more note. We can always have custom test statistics on a per-model basis. It gets a little bit hard to maintain, but it's not crazy. We could have a set of statistics for extremely-high-fitness models that differ from other models.

If we're looking to support "not likely to be bad" (as in not needing review), then I'd put everything that doesn't fall into the "may be bad" bucket into this filter (which is about 75% of the good stuff). Here, overlap is bad and doesn't make sense.

This is what makes more sense to me. I think the use cases we are targeting are about edits we are sure enough about, for (a) reviewers looking for vandalism, so they can safely ignore them, and (b) editors welcoming new good editors, so they know whom to focus on and thank.

The overlap (or whatever it is) has been noticed (among other things) on Polish Wikipedia.

@Halfak pertinently asks what the use case is for the Very Likely Good category. He also makes an excellent suggestion about recall-based thresholds. The key point with regard to these questions, I think, is that different filters have different use cases, and may require different types of thresholds. Here's how I break it down:

  • V. likely good: The use case here is focused on precision. No one needs to find all the good. But users would like, for example, to know which edits they can safely ignore. Since good is so easy to find, we should aim for a very high precision of 99% or 99.9%.
  • May be bad: The use case here is focused on recall. The user wants to catch almost all bad while excluding what is clearly not bad. So, the threshold should be approximately 90% recall.
  • Likely bad: This is meant to be the "mama bear" of filters, a middling option. It's more about precision, I think, since reviewers will use it mostly to provide a second cut at prioritizing their efforts. But if that were to yield a very low recall on a particular wiki it might not be so good. In general, this should aim for a precision somewhere in the 40% range, as long as that's consistent with a recall that is also in the 35-50% range.
  • V. likely bad: The use case here is about precision: the user wants to see the worst of the worst, and does not want a lot of false positives. My ideal target would be 80% precision, to allow for a higher recall than the current 8%.

So really, the one clear case for a recall-based filter is May be bad, which is precisely meant to sweep up most of the trash.
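
One way to picture this breakdown is as a per-filter target table (a purely hypothetical encoding, not the actual ORES/MediaWiki configuration format):

# Hypothetical per-filter threshold targets, per the breakdown above.
FILTER_TARGETS = {
    "verylikelygood": {"metric": "precision", "target": 0.99},  # or 0.999
    "maybebad":       {"metric": "recall",    "target": 0.90},
    "likelybad":      {"metric": "precision", "target": 0.40},  # sanity-check recall stays ~35-50%
    "verylikelybad":  {"metric": "precision", "target": 0.80},
}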

@Halfak pertinently asks what the use case is for the Very Likely Good category. He also makes an excellent suggestion about recall-based thresholds. The key point with regard to these questions, I think, is that different filters have different use cases, and may require different types of thresholds. Here's how I break it down:

  • V. likely good: The use case here is focused on precision. No one needs to find all the good. But users would like, for example, to know which edits they can safely ignore. Since good is so easy to find, we should aim for a very high precision of 99% or 99.9%.

Agreed. The current model stats only let us do 98%, but if @Halfak et al. were to add 99% or 99.9% stats to the stats output, I'd use those in a heartbeat.

  • May be bad: The use case here is focused on recall. The user wants to catch almost all bad while excluding what is clearly not bad. So, the threshold should be approximately 90% recall.

Agreed, though we may have to tweak that 90% number for high fitness vs low fitness filters.

  • Likely bad: This is meant to be the "mama bear" of filters, a middling option. It's more about precision, I think, since reviewers will use it mostly to provide a second cut at prioritizing their efforts. But if that were to yield a very low recall on a particular wiki it might not be so good. In general, this should aim for a precision somewhere in the 40% range, as long as that's consistent with a recall that is also in the 35-50% range.

We could define this as choosing either 40% precision or 50% recall, whichever is stricter (or looser? not sure yet).
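
A self-contained sketch of that rule, using scikit-learn on exported test-set labels and scores (the function name and the choice of library are illustrative; it assumes both floors are attainable for the model):

import numpy as np
from sklearn.metrics import precision_recall_curve

def strict_cutoff(y_true, scores, min_precision=0.40, min_recall=0.50):
    p, r, t = precision_recall_curve(y_true, scores)
    p, r = p[:-1], r[:-1]  # drop the sentinel point that has no threshold
    t_prec = t[int(np.argmax(p >= min_precision))]  # lowest cutoff meeting the precision floor
    t_rec = t[int((r >= min_recall).sum()) - 1]     # highest cutoff still meeting the recall floor
    return max(t_prec, t_rec)  # "stricter" = the higher cutoff; min() would be the looser variant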

  • V. likely bad: The use case here is about precision: the user wants to see the worst of the worst, and does not want a lot of false positives. My ideal target would be 80% precision, to allow for a higher recall than the current 8%.

Note that this 8% recall for 96% precision that we have for enwiki is an artefact of the enwiki model being low fitness. The plwiki model gets 91.5% precision with 87.5% recall (technical term: with both of its hands tied behind its back). So here we might either want to have different thresholds for low vs high fitness models, or have a recall-based threshold as a proxy (like, peg recall at 10% or 15%). I think I would prefer the former, because for high fitness models a low recall peg would gratuitously sacrifice lots of recall for a very small increase in precision.

So really, the one clear case for a recall-based filter is May be bad, which is precisely meant to sweep up most of the trash.

@Halfak, just to know, is it posible for communities to do the ORES training again, to adjust possible evolution? It may be a question I'll have to answer.

@jmatazzoni, put a task on our board and we'll update the test statistics.

@Catrope, for the high recall condition, I don't think we want that to be model-dependent. Regardless, patrollers need to catch (nearly) all of the damage.

@Trizek-WMF yes. We can always add more train/test observations.

I will put a task on your board to ask for some new stats. However, is there a script or something I can use to explore many possible values for precision/recall minima and what the stats output (precision, recall, threshold) would be for those values? That would allow us to decide between e.g. 99% and 99.9% and various other things without having to bother you.
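
Something along these lines would do it, assuming you can export per-edit true labels and damaging scores from the model's test set (everything here, including scikit-learn as the tool, is illustrative rather than the actual ORES tooling):

import numpy as np
from sklearn.metrics import precision_recall_curve

def stats_at_min_precision(y_true, scores, min_precision):
    """Precision, recall, and score threshold at the loosest cutoff
    that still meets the given precision floor."""
    p, r, t = precision_recall_curve(y_true, scores)
    p, r = p[:-1], r[:-1]  # drop the sentinel point with no threshold
    ok = p >= min_precision
    if not ok.any():
        return None  # floor unattainable for this model
    i = int(np.argmax(ok))  # first (lowest) qualifying threshold
    return {"precision": p[i], "recall": r[i], "threshold": t[i]}

# Fake data just to make the sketch runnable; substitute real test-set exports.
rng = np.random.default_rng(0)
y_true = rng.random(10_000) < 0.05                      # ~5% damaging base rate
scores = np.clip(0.5 * y_true + 0.6 * rng.random(10_000), 0, 1)

for floor in (0.98, 0.99, 0.999):
    print(floor, stats_at_min_precision(y_true, scores, floor))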

  • Likely bad: This is meant to be the "mama bear" of filters, a middling option. It's more about precision, I think, since reviewers will use it mostly to provide a second cut at prioritizing their efforts. But if that were to yield a very low recall on a particular wiki it might not be so good. In general, this should aim for a precision somewhere in the 40% range, as long as that's consistent with a recall that is also in the 35-50% range.

I think this should be more strongly about precision. If a user looking for vandalism with a filter named "Likely bad" gets a list of 100 edits where half of them are not vandalism, they may think the filter just does not work. I think we should aim for users getting a list where most of the edits are vandalism, even if that implies a low recall.

I don't think the low recall is a problem since (a) users will have the option of using the recall-based "may be bad" filter and (b) even with a high precision we have enough edits to fill the list of recent changes for users to review.

I think ["Likely bad"] should be more strongly about precision.

Fair point. "Precision" literally means "the likelihood that something flagged as bad is actually bad". Then again, maybe it's a bad name for what we're trying to achieve. Currently, ORES Review Tool users are getting "needs review" language, and that seems to make sense for everyone involved. I think the thresholds are all about a tradeoff between two competing metrics, and by trying to simplify that away we run into these weird logical corners. IMO patrolling is about "reviewing the things that need review", and indicators of the precision of the prediction are helpful in prioritizing work.

even with a high precision we have enough edits to fill the list of recent changes for users to review.

Is this true for all wikis? I'm sure it is true for the big ones.

@Catrope, can you take a look at these cases? Why do we have such discrepancies in ORES scoring?

  1. On plwiki, there are quite a few seemingly normal edits that do not have ORES scores. These edits are not in ores_classification and the UI correctly presents them as unscored, but why were they not scored?
  2. Some Wikidata edits are ORES-scored, but the majority are not. Same as with the previous unmarked edits: the Wikidata ores_classification table has some of these edits as scored, while others that display as not scored are not in the ORES table.

Screen Shot 2017-05-04 at 2.50.53 PM.png (534×1 px, 278 KB)

  3. The 'Very likely good' and 'Very likely bad faith' filters together return 18 results for a 30-day selection (with 500 results per page), so the overlap between those filters is quite noticeable.

Screen Shot 2017-05-04 at 3.34.12 PM.png (681×1 px, 261 KB)

@Catrope, can you take a look at these cases? Why do we have such discrepancies in ORES scoring?

  1. On plwiki, there are quite a few seemingly normal edits that do not have ORES scores. These edits are not in ores_classification and the UI correctly presents them as unscored, but why were they not scored?

There is now anti-overlap between the damaging categories on plwiki, so there are edits that are not in any category:

> $stats = ORES\Stats::newFromGlobalState();
> var_dump($stats->getThresholds('damaging'));
array(3) {
  ["likelybad"]=>
  array(2) {
    ["min"]=>
    float(0.617)
    ["max"]=>
    int(1)
  }
  ["verylikelybad"]=>
  array(2) {
    ["min"]=>
    float(0.852)
    ["max"]=>
    int(1)
  }
  ["likelygood"]=>
  array(2) {
    ["min"]=>
    int(0)
    ["max"]=>
    float(0.472)
  }
}

So for example, an edit with a damaging score of 0.5 would not be likelygood (because that's 0.472 and below) but would also not be likelybad (because that's 0.617 and up).
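
The gap is easy to check mechanically (ranges copied from the var_dump output above; the Python encoding is just for illustration):

# plwiki damaging-filter ranges, per the dump above.
thresholds = {
    "likelybad":     (0.617, 1.0),
    "verylikelybad": (0.852, 1.0),
    "likelygood":    (0.0, 0.472),
}
score = 0.5
print([n for n, (lo, hi) in thresholds.items() if lo <= score <= hi])  # [] -- no category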

  2. Some Wikidata edits are ORES-scored, but the majority are not. Same as with the previous unmarked edits: the Wikidata ores_classification table has some of these edits as scored, while others that display as not scored are not in the ORES table.

Screen Shot 2017-05-04 at 2.50.53 PM.png (534×1 px, 278 KB)

Yes, that's a known (separate, but I think unfiled?) bug: we try to look up the ORES score for a Wikidata edit by taking the Wikidata revid and treating it as if it were a revid on the local wiki, which doesn't work well at all. Until that is fixed (T158025), it may be better not to show scores for Wikidata edits at all.

  3. The 'Very likely good' and 'Very likely bad faith' filters together return 18 results for a 30-day selection (with 500 results per page), so the overlap between those filters is quite noticeable.

Screen Shot 2017-05-04 at 3.34.12 PM.png (681×1 px, 261 KB)

OK, but that's more related to T163995; it's not really about this bug.

@Catrope, can you take a look at these cases? Why do we have such discrepancies in ORES scoring?

  1. On plwiki, there are quite a few seemingly normal edits that do not have ORES scores. These edits are not in ores_classification and the UI correctly presents them as unscored, but why were they not scored?

There is now anti-overlap between the damaging categories on plwiki, so there are edits that are not in any category:

If, however, there are a significant number of edits that are completely unscored (as in have no score in the DB), that would be a problem.

@jmatazzoni

plwiki has only three 'Contribution quality prediction' filters: "Very likely good", "Likely have problems", and "Very likely have problems". During my testing:

  • no edits were simultaneously in the result set of "Very likely good" and "Likely have problems" filters
  • no edits were simultaneously in the result set of "Very likely good" and "Very likely have problems" filters
  • there are edits marked simultaneously as "Likely have problems" and "Very likely have problems", because it's a natural overlap: "Very likely have problems" is a subset of "Likely have problems".

QA Recommendation: Resolve

  • there are edits marked simultaneously as "Likely have problems" and "Very likely have problems" - not many though.

All edits that are "Very likely have problems" should always also be "Likely have problems", because the former is a subset of the latter.

jmatazzoni claimed this task.

@Trizek-WMF, have you let the Polish users know that their levels should be optimized now? It might be good to tell them, so they can report if they're still seeing issues (though what we'd do then, I'm not sure...).