Implement new precision-based test stats for editquality models
Closed, Resolved, Public

Description

In T149761, the collaboration team finalized the score ranges that define the 7 ORES filters evolved in T146333. In creating the 7 standardized filtering options for the ORES Quality and Intent filters, we strove to balance users' desires for accuracy versus breadth of coverage. To balance these factors, we used the numerical tables @Halfak created in T146280, which correlate ranges of ORES damaging and good-faith scores with their predicted precision and coverage stats. (This spreadsheet provides a more full-featured display of this data.)

We need to know these score ranges for every wiki where the ORES extension (MediaWiki-extensions-ORES) is deployed.

Below find the following for each of the 7 ORES filters on en.wiki: in square brackets, the score ranges we've settled on; in parentheses the precision and coverage stats a given range will produce. The goal of this task is to find the score ranges that will produce similar results in each of the 8 target wikis (see list below).

Quality filters

  • Very likely good [0%-55%] (98.7% precision, 92.7% coverage)
  • May Have problems [16%-100%] (14.9% precision, 91.1% coverage)
  • Likely Have Problems [75%-100%] (43% precision, 46% coverage)
  • Very Likely Have Problems [??-100%] (??)

User Intent Filters

  • Very likely good faith [35%-100%] (98.9% accuracy, 97.2% coverage)
  • May be bad faith [0%-65%] (18.8% accuracy, 77% coverage)
  • Likely bad faith [0%-15%] (49.1% precision, 26% coverage)
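
As a rough illustration of how these ranges behave as filters, here is a minimal Python sketch. The ranges are copied from the two lists above; the names `QUALITY_RANGES`, `INTENT_RANGES`, and `matching_filters` are made up for this example, and treating both ends of a range as inclusive is an assumption, since the task does not say. "Very Likely Have Problems" is left out because its lower bound is still unknown.

```python
# Score ranges from the en.wiki lists above, written as fractions of 1.
# Quality filters apply to the damaging score, intent filters to the
# good-faith score. These names are hypothetical, not part of ORES.
QUALITY_RANGES = {
    "Very likely good": (0.00, 0.55),
    "May Have problems": (0.16, 1.00),
    "Likely Have Problems": (0.75, 1.00),
}
INTENT_RANGES = {
    "Very likely good faith": (0.35, 1.00),
    "May be bad faith": (0.00, 0.65),
    "Likely bad faith": (0.00, 0.15),
}


def matching_filters(score, ranges):
    """Return the name of every filter whose range contains the score.

    Both ends of each range are treated as inclusive (an assumption).
    """
    return [name for name, (low, high) in ranges.items() if low <= score <= high]


# The ranges overlap, so one edit can match more than one filter.
print(matching_filters(0.30, QUALITY_RANGES))
print(matching_filters(0.10, INTENT_RANGES))
```

Because the ranges overlap by design, an edit with a damaging score of 0.30 matches both "Very likely good" and "May Have problems".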

Target Wikis

  • English Wikipedia
  • Persian Wikipedia
  • Dutch Wikipedia
  • Polish Wikipedia
  • Portuguese Wikipedia
  • Russian Wikipedia
  • Turkish Wikipedia
  • Wikidata

Event Timeline

Aaron, did I describe the task properly? And may I assign it to you?

Halfak renamed this task from Calculate ORES score thresholds for 7 more wikis to Implement new precision-based test stats for editquality models. Nov 30 2016, 1:01 AM
Halfak claimed this task.
Halfak updated the task description.

@jmatazzoni may I suggest that we use the already implemented 90% precision threshold for "Very Likely Have Problems"?

For the rest of the thresholds, we can implement them using the recall_at_precision metric.

Quality filter

  • Very likely good -- recall_at_precision(0.98) (False)
  • May Have problems -- recall_at_precision(.15) (True)
  • Likely Have Problems -- recall_at_precision(.43) (True)
  • Very Likely Have Problems -- recall_at_precision(.90) (True)

User Intent Filters

  • Very likely good faith recall_at_precision(.99) (True)
  • May be bad faith recall_at_precision(.18) (False)
  • Likely bad faith recall_at_precision(.49) (False)
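
For readers unfamiliar with the metric, the idea behind recall_at_precision can be sketched in a few lines of Python. This is a toy illustration, not the code that generates the editquality test statistics: the function below, its return shape, and the sample data are all assumptions made for the example, and it only handles the (True) case. It scans candidate thresholds and keeps the one with the highest recall among those that reach the requested precision.

```python
def recall_at_precision(scores, labels, min_precision):
    """Return (threshold, precision, recall) for the threshold with the
    highest recall whose precision is at least min_precision.

    scores: predicted probability of the True class for each edit
    labels: observed booleans for the same edits
    Returns None when no threshold reaches min_precision.
    """
    positives = sum(labels)
    best = None
    for threshold in sorted(set(scores)):
        # Labels of the edits that would be flagged at this threshold.
        flagged = [label for score, label in zip(scores, labels) if score >= threshold]
        if not flagged or positives == 0:
            continue
        precision = sum(flagged) / len(flagged)
        recall = sum(flagged) / positives
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (threshold, precision, recall)
    return best


# Made-up scores and labels, purely for illustration.
scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
labels = [False, False, True, False, True, True]
print(recall_at_precision(scores, labels, min_precision=0.90))
```

With this sample, asking for 90% precision selects the threshold 0.8, which flags two of the three True edits.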

Implementing this should only require an edit at the top of the editquality Makefile and regenerating all of the models. See https://github.com/wiki-ai/editquality/blob/master/Makefile#L57

It seems like we're employing different test statistics for damaging and goodfaith models. I don't advise this as it'll clutter the test statistics we generate. Maybe we could find a happy compromise between these sets.

  • Very likely good (98%), Very likely good faith (99%), Suggested compromise (98%)
  • May Have problems (15%), May be bad faith (18%), Suggested compromise (15%)
  • Likely Have Problems (43%), Likely bad faith (49%), Suggested compromise (45%)

If no compromise is possible, we can split the variable for test_statistics into damaging_statistics and goodfaith_statistics and go from there.

@Halfak writes

Maybe we could find a happy compromise between these sets.

If this makes implementation and maintenance easier, it seems a reasonable compromise. The biggest change would seem to be that "Likely bad faith" would go from 26% coverage to, what, about 35%, but with a precision loss of only 4 points. I don't see a problem.

Questions:

  • Does this mean that when specifying these filters, I no longer need to state a score range, as above?
  • Does this relate to (help solve?) the problem stated in T152161?
Halfak moved this task from Done to Active on the Scoring-platform-team (Current) board.

This should allow you to automatically update thresholds for use in a UI. I think it does help solve that problem.

Re Aaron's suggestion:

may I suggest that we use the already implemented 90% precision threshold for "Very Likely Have Problems"?

We discussed this in T149761, where @Halfak wrote:

To get 90% precision with the damaging model, set the threshold at 0.94. That will capture 8.3% of damaging edits

I'm having a hard time answering. On one hand, the other "very likely" filters are in the 90s for precision. And, if we imagine a user highlighting results using a filter with the quoted 8.3% recall rate, on a results page of 500 edits the user would see 42 colored edits. Which is not insubstantial.

On the other hand, I can't help but feel a recall rate below 10% is a little low. What we're promising with this filter is that the results will be "highly accurate." For me, being right 8 out of 10 times (as opposed to 9 out of 10) would still qualify. And my guess is that users would appreciate the increase in recall--assuming it would be substantial enough. (In my mind, the perfect balance would be more like 80% precision and 20% recall.)

I suppose one response to Aaron's suggestion is to ask what you mean by saying the 90% rate is "already implemented". If making a new rate is a lot of work, then I suppose we can try the 90% level. But if the work involved in making a change is not so great, I'd still be interested in knowing, as asked in T149761, what kind of stats we'd get at thresholds of .93, .92 or .91. (Or, conversely, what the recall rate would be if we hit that 80% precision mark.)

Halfak added a comment. Edited Dec 7 2016, 10:58 PM

@jmatazzoni, I understand that you want to squeeze more fitness out of the models than we can currently provide. I'm not sure what to tell you. We're working hard to improve the fitness of the models but we can only do so well.

When it comes down to it, I've recommended in the past that you not use the word "accurate" as it is an inappropriate use of technical jargon. I believe you are now running into the issue of using the technical definition to try to interpret a non-technical use of the terminology. If we instead say that the "very likely" thresholds are "very precise", the language issues you are worried about don't exist.

It's also important to note that models for wikis other than English Wikipedia will have different precision/recall dynamics, so you'll have to redo this exercise on those wikis. The jargon that has been created for describing the fitness of classifiers is not arbitrary. It was invented specifically to resolve the mental backflips you're trying to perform now.

what kind of stats we'd get at thresholds of .93, .92 or .91?

I think it's a waste of time for me to answer this question as it will change as soon as we make the next improvement to the models. I'd much rather spend my very limited time improving the models than to produce more statistics that will be immediately outdated with the next model we deploy.

may I suggest that we use the already implemented 90% precision threshold for "Very Likely Have Problems"?

OK. I updated the definitions at T149761 to reflect this figure.

If you feel that the interface language we're using (see T149385 ) is inaccurate or misleading, now is the time to talk about it. I'd be happy to set up a meeting with Pau to discuss.

I'm looking at implementing the change described above in the editquality Makefile.

If I understand, we should replace

test_statistics = \
		-s 'table' -s 'accuracy' -s 'precision' -s 'recall' \
		-s 'pr' -s 'roc' \
		-s 'recall_at_fpr(max_fpr=0.10)' \
		-s 'filter_rate_at_recall(min_recall=0.90)' \
		-s 'filter_rate_at_recall(min_recall=0.75)'

with

damaging_statistics = \
		-s 'table' -s 'accuracy' -s 'precision' -s 'recall' \
		-s 'pr' -s 'roc' \
		-s 'recall_at_fpr(max_fpr=0.10)' \
		-s 'recall_at_precision(min_recall=0.98)' \
		-s 'recall_at_precision(min_recall=0.90)' \
		-s 'recall_at_precision(min_recall=0.43)' \
		-s 'recall_at_precision(min_recall=0.15)'

and

goodfaith_statistics = \
		-s 'table' -s 'accuracy' -s 'precision' -s 'recall' \
		-s 'pr' -s 'roc' \
		-s 'recall_at_fpr(max_fpr=0.10)' \
		-s 'recall_at_precision(min_recall=0.99)' \
		-s 'recall_at_precision(min_recall=0.49)' \
		-s 'recall_at_precision(min_recall=0.18)'

Then, we'll fetch the thresholds using the following URLs?

https://ores.wikimedia.org/v2/scores/enwiki/damaging/?model_info=damaging_stats
https://ores.wikimedia.org/v2/scores/enwiki/goodfaith/?model_info=goodfaith_stats

In the examples in T151970#2833839, the lines have (true) or (false) at the end. Does it affect the configuration or usage of the thresholds?

Very likely good faith recall_at_precision(.99) (True)
May be bad faith recall_at_precision(.18) (False)

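A couple of changes: the keyword argument is min_precision, not min_recall, so recall_at_precision(min_recall=0.15) should be recall_at_precision(min_precision=0.15), and likewise for the other recall_at_precision lines.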

Test statistics will come via "https://ores.wikimedia.org/v2/scores/enwiki/damaging/?model_info=test_stats" regardless of the variable name in the Makefile :)
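
A small Python sketch of building and fetching that URL, in case it helps. The helper names `stats_url` and `fetch_test_stats` are made up for this example. The fetch needs network access and returns the decoded JSON untouched, since the layout of the response is not described in this task.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ORES_HOST = "https://ores.wikimedia.org"


def stats_url(wiki, model, host=ORES_HOST):
    """Build the v2 URL that requests a model's test statistics."""
    query = urlencode({"model_info": "test_stats"})
    return f"{host}/v2/scores/{wiki}/{model}/?{query}"


def fetch_test_stats(wiki, model, host=ORES_HOST):
    """Download and decode the JSON response (requires network access).

    The decoded document is returned as-is; no particular structure
    is assumed here.
    """
    with urlopen(stats_url(wiki, model, host)) as response:
        return json.load(response)


# Same query parameter for both models, whatever the Makefile variable is called.
print(stats_url("enwiki", "damaging"))
print(stats_url("enwiki", "goodfaith"))
```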

In the examples in T151970#2833839, the lines have (true) or (false) at the end. Does it affect the configuration or usage of the thresholds?

Oh yes, so there are going to be different thresholds for true and false. E.g. when we want precision=0.90 for catching damage, we'll look at the statistic for the precision of the "true" outcome. But when we want to have precision=0.90 for avoiding damaging edits, we'll look at the precision of the "false" outcome. Does that make sense?
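
To make the true/false distinction concrete, here is a toy Python sketch. It assumes that predicting "true" means the score is at or above the threshold and predicting "false" means it is below. The function name and the sample data are invented for the example and are not taken from revscoring.

```python
def precision_for_outcome(scores, labels, threshold, outcome):
    """Precision of predicting `outcome` at the given threshold.

    scores: predicted probability of the True class for each edit
    labels: observed booleans for the same edits
    Predicting True selects scores >= threshold; predicting False
    selects scores < threshold (an assumption for this sketch).
    Returns None when nothing is selected.
    """
    if outcome:
        selected = [label for score, label in zip(scores, labels) if score >= threshold]
    else:
        selected = [label for score, label in zip(scores, labels) if score < threshold]
    if not selected:
        return None
    correct = sum(1 for label in selected if label == outcome)
    return correct / len(selected)


# Made-up damaging scores and labels, purely for illustration.
scores = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
labels = [False, False, True, False, True, True]

# The same threshold gives a different precision for each outcome.
print(precision_for_outcome(scores, labels, 0.3, True))   # catching damage
print(precision_for_outcome(scores, labels, 0.3, False))  # avoiding damage
```

In this sample, a threshold of 0.3 is 75% precise for the "true" outcome and 100% precise for the "false" outcome, which is why each filter has to name the outcome it targets.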

Halfak reassigned this task from Halfak to SBisson. Dec 21 2016, 6:31 PM

@SBisson, I hope you don't mind me assigning this to you. I figured it made sense because you've started work. Please ping here when you have a PR and I'll review right away.

FYI: PR is here: https://github.com/wiki-ai/editquality/pull/54

I'm rebuilding models with the test stats now. I expect this to be done tomorrow. Then I'll merge and do a deployment on ores.wmflabs.org.

BTW, this is now deployed on Wikimedia Labs. See https://ores.wmflabs.org/v2/scores/?model_info=test_stats

We'll need to wait for a deployment window to deploy to ores.wikimedia.org.

@Halfak What's needed for the v2 api to be deployed everywhere? Any way I can help?

Right now, we'll need to put a lot of work into a substantial deployment change. It's mostly maintenance scripts to run to re-populate database tables in MediaWiki. I need @Ladsgroup's help in order to run that, so it'll depend on his schedule and a deploy window to get the work done.

If the new system only changes the thresholds and not the scores of edits, we don't need to run the maintenance script. We do still need to deploy config changes to MediaWiki, which we can do in our deployment window.

Halfak closed this task as Resolved. Feb 7 2017, 8:31 PM