More/better model information and "threshold optimizations"

Today, I'm writing to announce a breaking change in ORES that will come out about a month from now. It will only change how information about prediction models is stored and reported. This information is used by some tools to set thresholds at specified levels of confidence (e.g. "give me the threshold that gives 90% recall"). In this blog post, I'll explain how this is currently done and how it will be done once we deploy the change.

While you read through these examples, you can experiment with https://ores.wikimedia.org (current behavior) and https://ores-misc.wmflabs.org (new behavior). These systems will stay in this state until we deploy the newer version to production (probably around Sept. 20th).

Why you need model_info

So, let's say you are going to use ORES to supply your counter-vandalism tool with "damaging" edit predictions. A prediction looks like this:

"damaging": {
  "score": {
    "prediction": true,
    "probability": {
      "false": 0.04445904933523648,
      "true": 0.9555409506647635
    }
  }
}

That "probability" looks interesting. You'd be tempted to assume that it corresponds to some operational metric of model fitness. E.g. "There's a 95.5% chance that this edit is damaging!" but regretfully, you'd be wrong. This "probability" is a useful measure of the model's confidence but not a useful measure of how the model will work against a stream of new edits from the recent changes feed. It turns out that operational metrics for classifiers like this one are all drawn around thresholds. In truth, you get ~95% precision when you set a threshold at 93% "probability".

This gets even more complicated when you want to set thresholds based on other statistics, e.g. "recall", which is the measure of how much of a target class you match. In vandal patrolling work, we want to make sure that we catch most (if not all) of the vandalism. There are steep tradeoffs in classifiers if we ask for perfection, so let's just set a high bar at 90% recall -- catching 90% of the most egregious vandalism. Where should you set your "probability" threshold in order to do that? It turns out that you should set your threshold at 0.09. Using this, you'll have to review less than a quarter of the incoming edits and you can expect to catch about 90% of the damaging edits.

The act of finding the confidence threshold at a specified fitness level is something that we call a threshold optimization and it's something that all of our users want to be able to do. We've been providing this information in a limited and inflexible way for a long time. But this change will make gathering information about a model in a machine-readable way much much easier.
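To make the idea concrete, here's a minimal sketch of a threshold optimization over a tiny labeled sample. The sample data and the function name `threshold_at_recall` are invented for illustration; this is not how ORES or revscoring implements it, just the same idea in a few lines.

```python
# Hypothetical sample: (probability of "true", actual label) pairs.
sample = [
    (0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.30, False),
    (0.20, False), (0.10, True), (0.05, False), (0.03, False), (0.01, False),
]


def threshold_at_recall(scored, min_recall):
    """Return stats for the highest threshold whose recall is >= min_recall.

    An edit "matches" when its probability is >= the threshold. Returns
    None if no candidate threshold reaches the requested recall.
    """
    positives = sum(1 for _, label in scored if label)
    if positives == 0:
        raise ValueError("need at least one positive observation")
    # Try the observed probabilities as thresholds, highest first. Recall
    # only grows as the threshold drops, so the first hit filters the most.
    for threshold in sorted({p for p, _ in scored}, reverse=True):
        matched = [label for p, label in scored if p >= threshold]
        true_pos = sum(1 for label in matched if label)
        recall = true_pos / positives
        if recall >= min_recall:
            return {
                "threshold": threshold,
                "recall": recall,
                "precision": true_pos / len(matched),
                "filter_rate": 1 - len(matched) / len(scored),
            }
    return None


result = threshold_at_recall(sample, min_recall=0.75)
print(result["threshold"], result["recall"], result["precision"])
```

With this made-up sample, asking for 75% recall lands on a threshold of 0.4, which matches four of the ten edits and filters out the rest.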

Current model_info behavior

Currently, model_info is static. You can request it by adding ?model_info to your URLs (e.g. https://ores.wikimedia.org/v2/scores/enwiki/damaging?model_info). This model information is generated at the time that the model is trained and includes a static set of statistics and threshold optimizations. Here's an example of a threshold optimization for the English Wikipedia damaging model:

"filter_rate_at_recall(min_recall=0.9)": {
  "false": {
    "filter_rate": 0.121,
    "recall": 0.9,
    "threshold": 0.547
  },
  "true": {
    "filter_rate": 0.743,
    "recall": 0.908,
    "threshold": 0.148
  }
}

This block of data says that you can select all edits that score above 0.148 "probability" and expect to catch 91% of the damaging edits.
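As a rough sketch of how a tool might consume the block above, the snippet below parses it and flags edits whose "true" probability clears the stored threshold. The function name `needs_review` is hypothetical, and treating the threshold as exclusive ("above 0.148") is my assumption about the boundary case.

```python
import json

# The block shown above, wrapped in braces so that it parses as JSON.
OLD_STATS = json.loads("""
{
  "filter_rate_at_recall(min_recall=0.9)": {
    "false": {"filter_rate": 0.121, "recall": 0.9, "threshold": 0.547},
    "true": {"filter_rate": 0.743, "recall": 0.908, "threshold": 0.148}
  }
}
""")


def needs_review(probability_true, stats=OLD_STATS):
    """Return True when an edit scores above the stored "true" threshold."""
    optimization = stats["filter_rate_at_recall(min_recall=0.9)"]
    # Assumption: an edit exactly at the threshold is not flagged.
    return probability_true > optimization["true"]["threshold"]


# The prediction from the start of this post is well above the threshold.
print(needs_review(0.9555409506647635))
```

Note that the tool has to know the exact key string ahead of time, which is part of why the static list kept growing.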

In order to provide useful thresholds for ORES users, we'd specify them at the time of model train/test. At first, we specified three threshold optimizations: filter_rate_at_recall(min_recall=0.9), filter_rate_at_recall(min_recall=0.75), and recall_at_precision(min_precision=0.9). These corresponded roughly to "needs review", "likely damaging", and "almost certainly damaging" respectively.

After working with the Collaboration Team on the new RC Filters system for patrolling Special:RecentChanges, the list of threshold optimizations ballooned to include: recall_at_fpr(max_fpr=0.1), recall_at_precision(min_precision=0.15), recall_at_precision(min_precision=0.45), recall_at_precision(min_precision=0.6), recall_at_precision(min_precision=0.75), recall_at_precision(min_precision=0.98), recall_at_precision(min_precision=0.99), and recall_at_precision(min_precision=0.995). This was getting out of control.

So I started work on a new task T162217: Implement "thresholds", deprecate "pile of tests_stats". See the description for a discussion I had with @Catrope to make sure I understood what he and his team needed.

New model_info behavior

So, I hadn't planned on this work, but I thought dealing with it was a really good idea. After all, it would make our users' lives easier and my life easier because I wouldn't need to re-train the models every time that a new threshold optimization was needed. I could also take this opportunity to implement some important revscoring stuff I'd been putting off, e.g. T160223: Store the detailed system information inside of model files, T172566: Include label-specific schemas with model_info, and T163711: Use our own scoring models in `tune` utility. A couple weekends, a holiday, and a hackathon later, I had something that worked. Fun story: I actually fully implemented the system several times and each time decided to refactor and re-engineer the model_info system entirely. This allowed me to iteratively reduce complexity and coupling.

The new system can currently be tested at https://ores-misc.wmflabs.org. When we ask for ?model_info, we see something that's a little different. I'll make some time in other blog posts to talk about 'environment' and 'score_schema'. For now, I just want to talk about 'statistics', which replaces 'test_stats'.

Digging into "statistics"

The first thing that is different is that we now generate aggregate statistics across output labels.

old (query):

"f1": {
  "OK": 0.99,
  "attack": 0.136,
  "spam": 0.586,
  "vandalism": 0.341
}

new (query):

"f1": {
  "labels": {
    "OK": 0.974,
    "attack": 0.136,
    "spam": 0.586,
    "vandalism": 0.341
  },
  "macro": 0.509,
  "micro": 0.962
}

A macro-average of the label statistics is just a simple average across the reported statistic for each label: (0.974 + 0.136 + 0.586 + 0.341) / 4 = 0.509. The micro-average is weighted by the number of observations. Since the "OK" class is far more common than any other and gets a relatively high f1 score, the micro-average is much higher than the macro-average.
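Here's a small sketch of the two aggregates as described above. The f1 values are the ones from the example, but the per-label observation counts are invented (the post doesn't list them), so the weighted result illustrates the effect rather than reproducing 0.962.

```python
f1_by_label = {"OK": 0.974, "attack": 0.136, "spam": 0.586, "vandalism": 0.341}

# Hypothetical observation counts per label; not taken from the real model.
counts = {"OK": 9500, "attack": 100, "spam": 200, "vandalism": 200}


def macro_average(stat_by_label):
    """Simple mean of the per-label statistic; every label counts equally."""
    return sum(stat_by_label.values()) / len(stat_by_label)


def weighted_average(stat_by_label, weights):
    """Mean of the per-label statistic, weighted by observations per label."""
    total = sum(weights[label] for label in stat_by_label)
    weighted_sum = sum(stat_by_label[label] * weights[label]
                       for label in stat_by_label)
    return weighted_sum / total


print(round(macro_average(f1_by_label), 3))             # 0.509
print(round(weighted_average(f1_by_label, counts), 3))  # 0.945
```

Because "OK" dominates the made-up counts, the weighted figure sits close to the "OK" f1 score, while the macro-average is dragged down by the rarer labels.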

All types of statistics now have these aggregates by default.

Digging into "thresholds"

OK so what about the thresholds thing that is the whole premise of this blog post? Well, I think you're going to like this. I've built a light-weight querying system into the abstract concept of "thresholds" that will allow you to get whatever threshold you like -- so long as your strategy for getting it involves optimizing one statistic ("maximum filter_rate") and holding another constant ("@ recall >= 0.9").

?model_info=statistics.thresholds.true."maximum filter_rate @ recall >= 0.9":

"thresholds": {
  "true": [
    {
      "!f1": 0.883,
      "!precision": 0.996,
      "!recall": 0.794,
      "accuracy": 0.797,
      "f1": 0.233,
      "filter_rate": 0.77,
      "fpr": 0.206,
      "match_rate": 0.23,
      "precision": 0.134,
      "recall": 0.901,
      "threshold": 0.09295862121864444
    }
  ]
}

Here, you can see that we get the same information back, but we're allowed to choose arbitrary optimizations and have the system report back to us where we should place our thresholds.
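To give a feel for what such a query means, here's a minimal sketch that evaluates one against a table of per-threshold statistics. The grammar in `PATTERN`, the `<=` operator, and the function name `optimize` are my assumptions based on the examples in this post, not the actual revscoring implementation. The two rows are trimmed copies of threshold entries shown in this post; a real model would have one row per candidate threshold.

```python
import re

ROWS = [
    {"threshold": 0.09295862121864444, "recall": 0.901,
     "precision": 0.134, "filter_rate": 0.77},
    {"threshold": 0.14750910213671917, "recall": 0.838,
     "precision": 0.151, "filter_rate": 0.81},
]

# Assumed grammar: "maximum <target> @ <held> >= <bound>" (or "<=").
PATTERN = re.compile(r"^maximum (\S+) @ (\S+) (>=|<=) ([0-9.]+)$")


def optimize(rows, query):
    """Return the row that maximizes one statistic while another is bounded."""
    match = PATTERN.match(query)
    if match is None:
        raise ValueError("unsupported query: " + query)
    target, held, operator, bound = match.groups()
    bound = float(bound)
    if operator == ">=":
        candidates = [row for row in rows if row[held] >= bound]
    else:
        candidates = [row for row in rows if row[held] <= bound]
    if not candidates:
        return None  # No threshold satisfies the constraint.
    return max(candidates, key=lambda row: row[target])


best = optimize(ROWS, "maximum filter_rate @ recall >= 0.9")
print(best["threshold"])
```

The point is that the constraint narrows the candidate thresholds and the optimization picks one of them, so any pair of statistics can be combined without re-training the model.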

I asked @Catrope to put together a task for me to demo how I'd use this system to get the optimizations he needs. See T173019. This will require me to request multiple optimizations at the same time. Here's the full query string:

?model_info=statistics.thresholds.true."maximum filter_rate @ recall >= 0.9"|statistics.thresholds.true."maximum recall @ precision >= 0.15"

Which gives us:

"thresholds": {
  "true": [
    {
      "!f1": 0.883,
      "!precision": 0.996,
      "!recall": 0.794,
      "accuracy": 0.797,
      "f1": 0.233,
      "filter_rate": 0.77,
      "fpr": 0.206,
      "match_rate": 0.23,
      "precision": 0.134,
      "recall": 0.901,
      "threshold": 0.09295862121864444
    },
    {
      "!f1": 0.906,
      "!precision": 0.993,
      "!recall": 0.833,
      "accuracy": 0.834,
      "f1": 0.256,
      "filter_rate": 0.81,
      "fpr": 0.167,
      "match_rate": 0.19,
      "precision": 0.151,
      "recall": 0.838,
      "threshold": 0.14750910213671917
    }
  ]
}
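These query strings contain quotes, spaces, and a pipe, so a tool will probably want to build and escape them programmatically. Here's a minimal sketch. The helper names are hypothetical, and the endpoint path is an assumption that reuses the /v2/scores/enwiki/damaging pattern from earlier in this post; check it against the server you're targeting.

```python
from urllib.parse import quote


def thresholds_path(label, optimization):
    """Build one path such as statistics.thresholds.true."maximum ..."."""
    return 'statistics.thresholds.{0}."{1}"'.format(label, optimization)


def model_info_url(endpoint, paths):
    """Join the paths with "|" and percent-encode them into a query string."""
    value = "|".join(paths)
    return "{0}?model_info={1}".format(endpoint, quote(value, safe=""))


# Assumed endpoint, following the URL pattern shown earlier in the post.
ENDPOINT = "https://ores-misc.wmflabs.org/v2/scores/enwiki/damaging"

url = model_info_url(ENDPOINT, [
    thresholds_path("true", "maximum filter_rate @ recall >= 0.9"),
    thresholds_path("true", "maximum recall @ precision >= 0.15"),
])
print(url)
```

Keeping the optimization strings in one place like this also makes it easy to add another threshold later without touching the rest of the tool.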

So there you have it! There's lots more you can do with this model_info system, but we'll need to save that for another blog post. For now, let us know if you have concerns with the new threshold optimization scheme.

The deployment plan

This announcement blog post is the first step of our deployment plan. Over the next week, we'll be reaching out to @Catrope, @Petrb, @Ragesoss, and other developers who use ORES to make sure that they know this change is coming. A week from now (Sept. 5th), we'll deploy the new model_info system to https://ores.wmflabs.org and https://ores-beta.wmflabs.org. Then we'll wait at least two weeks and confirm that adaptations have been made to the tools that we know about before finally deploying to https://ores.wikimedia.org (~Sept. 20th).

Written by Halfak on Aug 29 2017, 10:41 PM.
Principal Research Scientist
awight added a comment. Edited Aug 30 2017, 7:10 AM

@Halfak Great introduction, and thanks for helping figure out how to diagram. Here are some example illustrations that we can copy into the post once the backend has settled.

https://github.com/adamwight/thresholds_diagrams