Problem: We have a pile of test_stats defined for the Makefile in order to help users pick thresholds.
Solution:
Implement a new model_info field for ScoringModel called "thresholds".
Store the full range of score thresholds there, each with the key test statistics:
- precision
- recall
- filter rate
Allow clients to choose a threshold by optimizing for the statistics they care about.
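As a sketch of what the proposed "thresholds" field and client-side optimization could look like. The field layout, the numbers, and the `optimize` helper below are hypothetical illustrations, not revscoring's actual API:

```python
# Hypothetical layout for the proposed "thresholds" model_info field:
# one entry per score threshold, carrying the key test statistics.
# All names and values are illustrative, not revscoring's API.
thresholds = [
    {"threshold": 0.1, "precision": 0.15, "recall": 0.98, "filter_rate": 0.40},
    {"threshold": 0.5, "precision": 0.55, "recall": 0.85, "filter_rate": 0.80},
    {"threshold": 0.9, "precision": 0.92, "recall": 0.40, "filter_rate": 0.97},
]


def optimize(table, maximize, at_least):
    """Client-side optimization: return the entry that maximizes one
    statistic subject to minimums on others, e.g. the best filter_rate
    achievable while keeping recall >= 0.8."""
    candidates = [
        row for row in table
        if all(row[stat] >= minimum for stat, minimum in at_least.items())
    ]
    return max(candidates, key=lambda row: row[maximize]) if candidates else None


best = optimize(thresholds, maximize="filter_rate", at_least={"recall": 0.8})
```

With a table like this shipped in model_info, a client such as the ERI tools can pick its own operating point without any new server-side stats.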
[17:29:13] <halfak> OK so I have an idea that I'm going to brainstorm about real quick.
[17:29:31] <halfak> We should split test_stats from thresholds in revscoring.
[17:30:39] <halfak> thresholds are proving very useful to the ERI development team (e.g. RoanKattouw), as we had expected. Threshold statistics are currently included with basic stats (like ROC-AUC) in test_stats.
[17:31:30] <halfak> But thresholds look different. For example, we'll configure many different "filter_rate_at_recall" stats because we'd like to set different thresholds at different recall levels.
[17:32:04] <halfak> So I think we should create a new thing called a "threshold".
[17:32:19] <RoanKattouw> Don't mind me bloating your Makefile with all sorts of different recall_at_precision() stats... ;)
[17:32:47] <halfak> :D that's a thing too. If we want that many stats, we should have them. However, they shouldn't be considered bloat.
[17:33:12] <RoanKattouw> Are you suggesting that the APIs should be separate so I can get thresholds without ROC-AUC noise and vice versa?
[17:33:13] <halfak> I don't mind them in the Makefile, but I think the JSON response in ORES is overwhelming to read through and I'd rather it weren't.
[17:33:20] <RoanKattouw> Right
[17:33:27] <halfak> Vice versa.
[17:33:42] <halfak> I want ROC-AUC to be easy to get separately from the thresholds.
[17:34:19] <halfak> The cool thing about "thresholds" is that it makes sense to set them manually, tie them to a statistic, or ask them to optimize some statistic.
[17:35:04] <halfak> OK, idea 2. There's a thing called "thresholds" in the fitness metric methods as well.
[17:35:21] <halfak> This stores a set of thresholds at which the model will make predictions, with a few nice guarantees.
[17:35:54] <RoanKattouw> What kind of guarantees?
[17:36:01] <RoanKattouw> (BTW, https://github.com/wiki-ai/editquality/pull/63 )
[17:36:19] <halfak> Not too long a vector, but one that adequately covers the real values a model produces.
[17:36:41] <halfak> This is complex because a model's score is not guaranteed to be normally or uniformly distributed.
[17:36:49] <RoanKattouw> Right
[17:37:17] <halfak> So, we store all the reasonable thresholds in some sort of array along with all of the interesting test statistics.
[17:38:07] <halfak> So a user could optimize after the fact.
[17:38:54] * AndyRussG is now known as AndyRussG|bassoo
[17:39:06] <halfak> So at every threshold, we store precision, recall, and filter_rate.
[17:39:17] <halfak> You optimize how you like with the client.
[17:39:21] <RoanKattouw> Ooh, that sounds great
[17:39:23] <halfak> :)
[17:39:38] <halfak> I think it'll make it all easier.
[17:39:45] <RoanKattouw> Are you saying we'd get that info for every threshold at an interval like 0.1 or 0.05, subject to excluding ranges that the model doesn't really ever reach?
[17:40:08] <halfak> Basically, yeah.
[17:40:16] <RoanKattouw> Oh that would be excellent
[17:40:38] <halfak> So, in the short term, RoanKattouw, I suggest you continue to do what you are doing.
[17:40:43] <RoanKattouw> I would basically not have to bother you ever again, and would have enough granularity to do all sorts of things
[17:41:36] <halfak> \o/ better for everyone. Except it's fun to have you join us in this channel. :D
[17:42:01] <halfak> Anyway, I think you should continue to propose a mess of test stats and in the meantime, I'm going to try to put this idea together in some tasks.
[17:42:07] <RoanKattouw> Excellent
...
[18:31:02] <RoanKattouw> halfak: Shorter term, you said you could rebuild the models with my new stats tonight; when would they finish building?
[18:32:14] <halfak> Maybe tomorrow. :) Assuming we didn't mess anything up when updating the file in the meantime. :)
[18:32:45] <halfak> Oh man. That's another benefit.
We won't need to rebuild models to incorporate new threshold-level test statistics.
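That last point holds because threshold-level statistics derive entirely from held-out (score, label) pairs, which can be re-scanned at any time without retraining. A minimal sketch of that computation; the function name and table layout are assumptions for illustration, not revscoring code:

```python
def threshold_stats(scores, labels, thresholds):
    """Compute precision, recall, and filter_rate at each threshold from
    held-out (score, label) pairs -- no model retraining required."""
    n = len(scores)
    positives = sum(labels)
    table = []
    for t in thresholds:
        flagged = [score >= t for score in scores]
        true_pos = sum(1 for f, label in zip(flagged, labels) if f and label)
        n_flagged = sum(flagged)
        table.append({
            "threshold": t,
            "precision": true_pos / n_flagged if n_flagged else None,
            "recall": true_pos / positives if positives else None,
            # fraction of items a reviewer never has to look at
            "filter_rate": 1 - n_flagged / n,
        })
    return table
```

Under this scheme, adding another recall_at_precision-style statistic just means re-reading the stored scores, not rebuilding the model.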