
Updated ORES models can no longer satisfy configured threshold requirements
Open, Normal, Public


For example, the fiwiki goodfaith model now performs so poorly that it no longer has any available threshold that satisfies precision >= 0.15. This has caused some filters to disappear from Recentchanges completely, and others to become useless. The only reason we noticed is that Special:ORESModels throws notices when it encounters this situation (see T205228).
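To make the failure condition concrete, here is a minimal sketch of what "no available threshold satisfies precision >= 0.15" means. It assumes each model exposes a table of per-threshold statistics; the row layout, the `best_threshold` helper, and all numbers are hypothetical illustrations, not real ORES output or the actual ORES API.

```python
# Hypothetical per-threshold statistics: each row gives the precision and
# recall a model achieves at one score threshold. The numbers are invented.
healthy = [
    {"threshold": 0.1, "precision": 0.10, "recall": 0.95},
    {"threshold": 0.5, "precision": 0.20, "recall": 0.60},
    {"threshold": 0.9, "precision": 0.45, "recall": 0.15},
]
degraded = [
    {"threshold": 0.1, "precision": 0.02, "recall": 0.90},
    {"threshold": 0.9, "precision": 0.08, "recall": 0.10},
]


def best_threshold(rows, min_precision):
    """Return the row with the highest recall among rows whose precision is
    at least min_precision, or None when no row qualifies."""
    candidates = [row for row in rows if row["precision"] >= min_precision]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["recall"])


# A healthy model yields a usable threshold; a degraded one yields None,
# which is the case where a filter has nothing to be configured with.
print(best_threshold(healthy, 0.15))
print(best_threshold(degraded, 0.15))
```

In this sketch, a `None` result corresponds to the situation described above, where a filter cannot be given any threshold at all.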

Based on the error log entries produced by T205228, the following models are affected at minimum:

  • fiwiki goodfaith (stats)
  • hewiki goodfaith (stats)
  • fawiki damaging (stats)
  • ruwiki goodfaith (stats)

Really, we should reevaluate the thresholds of all models; we have never done that since the initial configuration of each model.

Event Timeline

Catrope created this task. Sep 24 2018, 6:34 PM
Restricted Application added a project: Scoring-platform-team. Sep 24 2018, 6:34 PM
Restricted Application added a subscriber: Aklapper.

Adding to @Catrope's point: we could add a step to the Makefile, after the model stats are written, that looks for obvious shortcomings in the built models and intervenes in some way. I'm not sure it should be a prompt, however, because we run this as an ad-hoc batch process and want it to complete without blocking on any one model. For reference, I think we currently crash the entire build on an error anywhere in the make pipeline, so the bar is set pretty low.
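A rough sketch of what such a non-blocking check could look like follows. It assumes the build step can hand over per-model threshold statistics as plain dictionaries; the `check_models` function, the requirement names, and the sample data are all hypothetical and do not reflect the real Makefile or model-stats format.

```python
def check_models(model_stats, requirements):
    """Check every model against every requirement without stopping early.

    model_stats: {model name: list of rows with a "precision" key}
    requirements: {requirement name: minimum precision}
    Returns a list of (model, requirement, message) tuples, one per problem.
    """
    problems = []
    for model, rows in sorted(model_stats.items()):
        for name, min_precision in sorted(requirements.items()):
            try:
                satisfied = any(row["precision"] >= min_precision for row in rows)
            except (KeyError, TypeError) as exc:
                # Malformed stats for one model must not block the others.
                problems.append((model, name, f"unreadable stats: {exc!r}"))
                continue
            if not satisfied:
                message = f"no threshold reaches precision >= {min_precision}"
                problems.append((model, name, message))
    return problems


# Invented sample data; the model names are only labels here.
sample_stats = {
    "examplewiki.damaging": [{"precision": 0.30}, {"precision": 0.55}],
    "examplewiki.goodfaith": [{"precision": 0.04}, {"precision": 0.09}],
    "otherwiki.damaging": None,  # simulates a model whose stats failed to load
}
sample_requirements = {"hypothetical_filter": 0.15}

for model, name, message in check_models(sample_stats, sample_requirements):
    print(f"WARNING {model} [{name}]: {message}")
```

Collecting problems into a report, rather than raising on the first one, keeps the batch run going; whether the make target should then exit non-zero is a separate policy decision.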

Just taking a look at this again. I can explain where the problem may have come from with fiwiki (using flaggedrevs as observations), but the others have surprised me. It could be that by re-tuning and re-training we can get a more reasonable split. It's really in that case that I'm seeing a *serious* problem.

Generally, it seems likely that we'll continue to sometimes be able to satisfy strict statistics and struggle at other times. This is due to non-deterministic effects in model training. In reality, the model will be a bit better than the statistics suggest. Our statistics will get more and more exact as we add new observations to training and testing. This is a big reason why we want to get Jade out. It will be a huge source of data beyond the limited Wikilabels campaigns we run now.

That said, for some of these communities, we're still working with data from 2015/2016, so running a new labeling campaign to get more data wouldn't be out of the question.

Halfak triaged this task as Normal priority. Feb 5 2019, 10:28 PM
Halfak moved this task from Untriaged to Maintenance/cleanup on the Scoring-platform-team board.
awight removed a subscriber: awight. Mar 21 2019, 4:04 PM