Also, we should start new labeling campaigns where the labeled data is too old.
I wrote a script that goes through all models, checks their ROC AUC, and sorts them from worst to best. This is the result:
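A minimal sketch of what such a ranking step could look like (the function name and the toy data are my own; the actual script isn't shown here), using scikit-learn's `roc_auc_score`:

```python
# Hypothetical sketch: given labeled edits and model scores per model,
# compute ROC AUC and list models from worst to best.
from sklearn.metrics import roc_auc_score


def rank_models_by_auc(models):
    """models: dict of name -> (y_true, y_score); returns [(auc, name)] ascending."""
    ranked = [(roc_auc_score(y_true, y_score), name)
              for name, (y_true, y_score) in models.items()]
    return sorted(ranked)


# Toy labels/scores standing in for real labeled edits:
models = {
    "fiwiki-damaging": ([0, 0, 1, 1], [0.6, 0.4, 0.35, 0.8]),
    "hewiki-damaging": ([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]),
}
for auc, name in rank_models_by_auc(models):
    print(f"{name}: {auc:.2f}")
```

Models near the top of that worst-to-best list would be the candidates for retraining or relabeling.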
Now it's time to think about what we should do with the fiwiki, plwiki, and hewiki damaging models :/
On hewiki the ORES damaging model works OK, but there is still room for improvement.
I think it is somewhat too conservative (by too conservative I mean false negatives: damaging edits that get a low probability of around 0.4–0.5).
Currently both the goodfaith and damaging models give scores weighted toward the good end. For example, goodfaith is pretty much always 0.95 or better. The same is true of damaging, though its weighting doesn't seem as bad.
However, the weighting seems to be systematic, and if we set it aside, the damaging model is currently better at detecting actual damage than it was in summer 2017, when it was more biased against IP editors. The damaging model is currently also better than the goodfaith model.
As for practical use, my seulojabot currently approves popular culture edits if the ORES goodfaith true score is >0.95 and the damaging true score is <0.15, and non-BLP edits if goodfaith true is >0.99 and damaging true is <0.015, and those limits seem to be unproblematic.
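The approval rule above can be sketched as a small function (the function name and structure are my own; only the thresholds come from the comment):

```python
# Sketch of seulojabot's ORES threshold gate, as described above.
def seulojabot_approves(goodfaith_true, damaging_true, is_pop_culture):
    """Return True if an edit passes the ORES score limits."""
    if is_pop_culture:
        # popular culture edits: looser limits
        return goodfaith_true > 0.95 and damaging_true < 0.15
    # non-BLP edits: stricter limits
    return goodfaith_true > 0.99 and damaging_true < 0.015


print(seulojabot_approves(0.96, 0.10, True))    # passes the pop-culture limits
print(seulojabot_approves(0.96, 0.10, False))   # fails the stricter non-BLP limits
```

Note how the non-BLP path demands an order of magnitude lower damaging probability, which matches the observation that the limits have been unproblematic in practice.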
Ping @4shadoww, how about stabilizerbot stats?
I think the models have improved since summer 2017. In my experience the damaging model produces few, if any, false positives, which I think is a good thing. On the other hand, it produces a lot of false negatives, so it doesn't detect damaging edits properly. As far as I know, the goodfaith model is just not very accurate, as it makes both false positives and false negatives.
Stabilizerbot's mistakes mostly come from detection methods other than ORES. ORES itself seems to produce false positives pretty rarely, as the bot requires strict scores from both models: damaging true < 0.15 and false > 0.825, and goodfaith false < 0.15 and true > 0.825.