
Implement "thresholds", deprecate "pile of tests_stats"
Closed, ResolvedPublic

Description

Problem: We have a pile of test_stats defined for the Makefile in order to help users pick thresholds.

Solution:

Implement a new model_info field for ScoringModel called "thresholds".

Put the full range of score thresholds in there, along with key test statistics:

  • precision
  • recall
  • filter rate

Allow clients to choose how to optimize for specific thresholds.
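
As a rough illustration of the idea (not a final schema), a client holding a per-label thresholds table could pick its own operating point after the fact. The field layout and the helper function below are hypothetical; the numbers are copied from the example "true" thresholds table pasted further down in this task.

# A hedged sketch, not the actual revscoring API: a per-threshold table for
# the "true" label, and a client-side pick of an operating point from it.
# Values are taken from the example model info later in this task.
thresholds_true = [
    {"threshold": 0.015, "precision": 0.889, "recall": 1.000, "filter_rate": 0.902},
    {"threshold": 0.821, "precision": 1.000, "recall": 1.000, "filter_rate": 0.913},
    {"threshold": 0.873, "precision": 1.000, "recall": 0.958, "filter_rate": 0.916},
    {"threshold": 0.999, "precision": 1.000, "recall": 0.875, "filter_rate": 0.924},
]


def max_recall_at_precision(rows, min_precision):
    """Pick the row that maximizes recall subject to a precision floor.

    Returns None if no threshold satisfies the constraint.
    """
    candidates = [r for r in rows if r["precision"] >= min_precision]
    return max(candidates, key=lambda r: r["recall"]) if candidates else None


# "maximum recall @ precision >= 0.9", computed client-side from the table.
print(max_recall_at_precision(thresholds_true, 0.9))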

[17:29:13] <halfak> OK so I have an idea that I'm going to brain about real quick. 
[17:29:31] <halfak> We should split test_stats from thresholds in revscoring. 
[17:30:39] <halfak> thresholds are proving very useful to the ERI development team (e.g. RoanKattouw) as we had expected.  Threshold statistics are currently included with basic stats (like ROC-AUC) in test_stats.
[17:31:30] <halfak> But thresholds look different.  For example, we'll configure many different "filter_rate_at_recall" stats because we'd like to set different thresholds at different recall levels. 
[17:32:04] <halfak> So I think we should create a new thing called a "threshold". 
[17:32:19] <RoanKattouw> Don't mind me bloating your Makefile with all sorts of different recall_at_precision() stats... ;)
[17:32:47] <halfak> :D  that's a thing too.  If we want that many stats, we should have them.  However, they shouldn't be considered bloating. 
[17:33:12] <RoanKattouw> Are you suggesting that the APIs should be separate so I can get thresholds without ROC-AUC noise and vice versa?
[17:33:13] <halfak> I don't mind them in the makefile, but I think the JSON response in ORES is overwhelming to read through and I'd rather it wasn't. 
[17:33:20] <RoanKattouw> Right
[17:33:27] <halfak> Vice versa. 
[17:33:42] <halfak> I want ROC-AUC to be easy to get separate from the thresholds. 
[17:34:19] <halfak> The cool thing about "thresholds" is that it makes sense to set them manually, or tie them to a statistic.  Or ask them to optimize some statistic. 
[17:35:04] <halfak> OK, idea 2.  There's a thing called "thresholds" in the fitness metric methods as well. 
[17:35:21] <halfak> This stores a set of thresholds at which the model will make predictions with a few nice guarantees. 
[17:35:54] <RoanKattouw> What kind of guarantees?
[17:36:01] <RoanKattouw> (BTW, https://github.com/wiki-ai/editquality/pull/63 )
[17:36:19] <halfak> Not too long of a vector, but adequately covers the real values a model produces. 
[17:36:41] <halfak> This is complex because a model's score is not guaranteed to be normally or uniformly distributed. 
[17:36:49] <RoanKattouw> Right
[17:37:17] <halfak> So, we store all the reasonable thresholds in some sort of array along with all of the 4 interesting test statistics. 
[17:38:07] <halfak> So a user could optimize after-the-fact.
[17:39:06] <halfak> So at every threshold, we store precision, recall, and filter_rate
[17:39:17] <halfak> You optimize how you like with the client. 
[17:39:21] <RoanKattouw> Ooh, that sounds great
[17:39:23] <halfak> :) 
[17:39:38] <halfak> I think it'll make it all easier. 
[17:39:45] <RoanKattouw> Are you saying we'd get that info for every threshold at an interval like 0.1 or 0.05, subject to excluding ranges that the model doesn't really ever reach?
[17:40:08] <halfak> Basically, yeah. 
[17:40:16] <RoanKattouw> Oh that would be excellent
[17:40:38] <halfak> So, in the short term, RoanKattouw, I suggest you continue to do what you are doing. 
[17:40:43] <RoanKattouw> I would basically not have to bother you ever again, and would have enough granularity to do all sorts of things
[17:41:36] <halfak> \o/ better for everyone.  Except it's fun to have you join us in this channel. :D
[17:42:01] <halfak> Anyway, I think you should continue to propose a mess of test stats and in the meantime, I'm going to try to put this idea together in some tasks. 
[17:42:07] <RoanKattouw> Excellent
...
[18:31:02] <RoanKattouw> halfak: Shorter term, you said you could rebuild the models with my new stats tonight, when would they finish building?
[18:32:14] <halfak> Maybe tomorrow. :)  Assuming we didn't mess anything up when updating the file in the meantime. :) 
[18:32:45] <halfak> Oh man.  That's another benefit.  We won't need to rebuild models to incorporate new threshold-level test statistics.
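
To make the chat above concrete: one way to choose "all the reasonable thresholds" when the score distribution is neither normal nor uniform is to take quantiles of the scores the model actually produced on the test set, then compute the key statistics at each point. This is only a sketch of the idea under that assumption, not necessarily how revscoring will implement it, and the function name is made up.

import numpy as np

# Hedged sketch: derive a compact threshold grid from the scores the model
# actually produces (quantiles of the observed score distribution), then
# compute match_rate, filter_rate, precision, and recall at each point.

def threshold_table(scores, labels, n_points=20):
    """scores: array of P(label=True); labels: array of booleans."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # Quantiles adapt to skewed score distributions better than a fixed
    # 0.05 step would -- most grid points land where the scores actually are.
    grid = np.unique(np.quantile(scores, np.linspace(0, 1, n_points)))
    rows = []
    for t in grid:
        predicted = scores >= t
        tp = np.sum(predicted & labels)
        rows.append({
            "threshold": float(t),
            "match_rate": float(predicted.mean()),
            "filter_rate": float(1 - predicted.mean()),
            "precision": float(tp / predicted.sum()) if predicted.any() else None,
            "recall": float(tp / labels.sum()) if labels.any() else None,
        })
    return rows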

Event Timeline

I was just looking at T159196. I think that we can use this to re-scale our outputs to intuitive values so that 50% really means "50% precision". Really, we could scale the threshold to any scale we like or just give the user the thresholds and let them re-scale it. Linear interpolation functions are common (a small numpy.interp sketch follows the links below), but it'd annoy the user to have to deal with that.

Python: https://docs.scipy.org/doc/numpy/reference/generated/numpy.interp.html
Javascript: https://bl.ocks.org/mbostock/3310323
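
For example, with numpy.interp a client could map a raw score onto the precision implied by thresholding at that score (or invert the pairs to go from a target precision back to a threshold). A minimal sketch, assuming a monotonically increasing threshold column; the numbers are the "true" rows from the thresholds table below.

import numpy as np

# Hedged sketch: rescale a raw score so the displayed value tracks precision,
# via linear interpolation over (threshold, precision) pairs.
# numpy.interp requires the x-coordinates (thresholds) to be increasing.
thresholds = [0.000, 0.015, 0.821, 0.873, 0.999]
precisions = [0.087, 0.889, 1.000, 1.000, 1.000]

raw_score = 0.45
rescaled = np.interp(raw_score, thresholds, precisions)
print(rescaled)  # the precision implied by using 0.45 as the cut-off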

I just pushed a bunch of stuff to the branch. Here's what model information looks like now:

GradientBoosting(max_depth=3, scale=false, min_samples_split=2, presort="auto", min_weight_fraction_leaf=0.0, max_leaf_nodes=null, loss="deviance", random_state=null, init=null, max_features=null, center=false, min_samples_leaf=1, n_estimators=100, verbose=0, subsample=1.0, learning_rate=0.1, warm_start=false):
 - version: None
 - trained: 2017-04-18T14:12:55.361410

Environment:
	 - platform: 'Linux-4.4.0-71-generic-x86_64-with-Ubuntu-16.04-xenial'
	 - machine: 'x86_64'
	 - version: '#92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017'
	 - system: 'Linux'
	 - processor: 'x86_64'
	 - python_build: ('default', 'Mar 30 2016 22:46:26')
	 - python_compiler: 'GCC 5.3.1 20160330'
	 - python_branch: ''
	 - python_implementation: 'CPython'
	 - python_revision: ''
	 - python_version: '3.5.1+'
	 - release: '4.4.0-71-generic'
	

Statistics:
	counts (n=275):
		label      n         ~False    ~True
		-------  ---  ---  --------  -------
		False    251  -->       251        0
		True      24  -->         0       24
	
	rates:
		          False    True
		------  -------  ------
		sample    0.913   0.087
	
	filter_rate (micro=0.5, macro=0.5):
		  False    True
		-------  ------
		  0.087   0.913
	
	precision (micro=1.0, macro=1.0):
		  False    True
		-------  ------
		      1       1
	
	recall (micro=1.0, macro=1.0):
		  False    True
		-------  ------
		      1       1
	
	accuracy (micro=1.0, macro=1.0):
		  False    True
		-------  ------
		      1       1
	
	f1 (micro=1.0, macro=1.0):
		  False    True
		-------  ------
		      1       1
	
	roc_auc (micro=1.0, macro=1.0):
		  False    True
		-------  ------
		      1       1
	
	pr_auc (micro=0.998, macro=0.995):
		  False    True
		-------  ------
		  0.999   0.991
	
	thresholds:
		False
			  threshold    match_rate    filter_rate    precision    !precision    recall    !recall    accuracy    fpr     f1    !f1
			-----------  ------------  -------------  -----------  ------------  --------  ---------  ----------  -----  -----  -----
			      0.001         1              0            0.913                   1          0           0.913  1      0.954
			      0.127         0.924          0.076        0.988         1         1          0.875       0.989  0.125  0.994  0.933
			      0.179         0.916          0.084        0.996         1         1          0.958       0.996  0.042  0.998  0.979
			      0.985         0.913          0.087        1             1         1          1           1      0      1      1
			      1             0.902          0.098        1             0.889     0.988      1           0.989  0      0.994  0.941
		True
			  threshold    match_rate    filter_rate    precision    !precision    recall    !recall    accuracy    fpr     f1    !f1
			-----------  ------------  -------------  -----------  ------------  --------  ---------  ----------  -----  -----  -----
			      0             1              0            0.087                   1          0           0.087  1      0.161
			      0.015         0.098          0.902        0.889         1         1          0.988       0.989  0.012  0.941  0.994
			      0.821         0.087          0.913        1             1         1          1           1      0      1      1
			      0.873         0.084          0.916        1             0.996     0.958      1           0.996  0      0.979  0.998
			      0.999         0.076          0.924        1             0.988     0.875      1           0.989  0      0.933  0.994
	
	maximum recall @ precision >= 0.9 (micro=1.0, macro=1.0):
		  False    True
		-------  ------
		      1       1

I'm thinking that we'll want to provide access to specific parts of the formatted JSON model information, so someone could select exactly the data they want to see in the response.

E.g.: https://ores/v3/scores/enwiki/?models=damaging&model_info=statistics.thresholds.true

or: https://ores/v3/scores/enwiki/?models=damaging&model_info=statistics."maximum recall @ precision >= 0.9"
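
On the server side this could be as simple as walking the dotted path into the nested model information. A minimal sketch, assuming plain dict traversal; the helper name is made up, and quoting keys that themselves contain dots (like the "maximum recall @ precision >= 0.9" example) would still need to be worked out.

# Hedged sketch of how an ORES-like endpoint might resolve a model_info path
# such as "statistics.thresholds.true". Not the actual ORES code; a naive
# split on "." would break keys that contain dots, so quoting is left open.

def lookup_model_info(model_info, path):
    """Walk a dot-separated path into a nested dict of model information."""
    node = model_info
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError("No such model_info field: {0!r}".format(path))
        node = node[key]
    return node


model_info = {"statistics": {"thresholds": {"true": [{"threshold": 0.821,
                                                      "precision": 1.0}]}}}
print(lookup_model_info(model_info, "statistics.thresholds.true"))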

I had an idea that I think is worth exploring. Let me take another pass at this. I've re-added "(WIP)" to the PR title.